Apr 8, 2026
Learn Python internet scraping step by step, from HTML basics and BeautifulSoup to pagination, data cleaning, dynamic JavaScript pages, and common beginner mistakes.
Python internet scraping, also called Python web scraping, is the process of collecting public information from websites automatically with Python and turning it into structured data you can use. Instead of copying and pasting data by hand, a script can gather titles, prices, links, tables, and other visible page content in seconds. That makes scraping useful for research, price tracking, content monitoring, directory building, and data analysis.
In this guide, we aim to help you build your first scraper, covering how to inspect a page, fetch HTML, extract data, clean it, save it, handle pagination, and recognize when a site needs browser automation instead of plain requests.
At its simplest, scraping follows this workflow:
1. Your Python script sends a request to a webpage.
2. The website returns HTML.
3. Python reads that HTML.
4. You extract the parts you need.
5. You save the results in a structured format.
This works best when the data you want is already present in the raw HTML. Product listings, article titles, prices, and tables are often easy to collect this way.
Python is a strong choice for web scraping because its syntax is simple and its ecosystem of scraping libraries (requests, BeautifulSoup, pandas, Playwright) is mature and well documented.
Before writing code, it helps to understand the overall logic.
A basic scraper does three jobs:
Fetch the webpage → Parse the HTML → Extract the data
Then, if needed, it can also:
Clean the data
Paginate through multiple pages
Save the output to CSV, JSON, or a database
If you understand that flow, the rest of scraping becomes much easier to follow.
Create a clean virtual environment:
python -m venv scraping-env
# Windows
scraping-env\Scripts\activate
# macOS/Linux
source scraping-env/bin/activate
Install the core packages:
pip install requests beautifulsoup4 pandas lxml
Here is what each package does:
requests fetches web pages over HTTP
beautifulsoup4 parses HTML so you can search and extract from it
pandas turns extracted data into tables and saves them as CSV
lxml is a fast parser backend for BeautifulSoup
We will start with a simple static website so you can focus on the core scraping workflow first, and add Playwright later only when we need it for JavaScript pages.
Good scraping starts with good habits. Before writing code, check these things on every target site:
robots.txt
Many websites publish a robots.txt file that tells automated tools which pages should not be accessed.
Terms of Service
A site may allow public viewing but still restrict automated collection. Always read the rules first.
Data type
Avoid scraping personal data, private accounts, login-protected pages, or anything behind a paywall unless you have explicit permission.
Request speed
Do not send a large number of requests too quickly. Add delays so your scraper behaves politely.
APIs
If an official API exists, use it first. APIs are often more stable, cleaner, and easier to maintain than scraping.
A respectful scraper is more reliable, easier to debug, and less likely to cause problems.
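As a concrete example, the robots.txt check can be automated with Python’s standard-library urllib.robotparser. The rules below are a made-up example; a real site serves its own file at /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules -- fetch the real file from the target site.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() tells you whether a given user agent may access a URL.
print(parser.can_fetch("*", "https://example.com/catalogue/page-1.html"))  # True
print(parser.can_fetch("*", "https://example.com/private/data.html"))      # False
```

In a real scraper you would call parser.set_url("https://example.com/robots.txt") followed by parser.read() to load the live file, then check each URL before requesting it.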
Before you write code, inspect the page in your browser.
Open the page, right-click, and choose Inspect. Then look for:
the repeating container element that wraps each item
the tag names, classes, and attributes that identify each field
whether the data you want actually appears in the page source
This matters because scraping is easier when you know exactly where the data lives in the page structure.
Pro tip: Fetch the page once in Python and compare the printed HTML with what you see in DevTools. That quickly shows whether the data is static or loaded dynamically.
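A minimal sketch of that comparison, with an inline snippet standing in for the HTML that requests would fetch:

```python
# If text you can see in the browser is missing from the fetched HTML,
# the page is probably rendered by JavaScript. The string below stands
# in for response.text from requests.
fetched_html = "<article class='product_pod'><h3>A Light in the Attic</h3></article>"

def looks_static(html: str, marker: str) -> bool:
    """True if text visible in the browser is also present in the raw HTML."""
    return marker in html

print(looks_static(fetched_html, "A Light in the Attic"))  # True: static content
print(looks_static(fetched_html, "Loading..."))            # False: likely loaded later
```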
We’ll use the excellent practice site books.toscrape.com — it’s designed for beginners.

The code below:
sends a request to the page
uses a simple User-Agent header
fails early if the server returns an error
Code:
import requests
url = "https://books.toscrape.com/catalogue/page-1.html"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=15)
response.raise_for_status()
html = response.text
print(html[:500])
BeautifulSoup turns raw HTML into something you can search and extract from more easily.
Code:
from bs4 import BeautifulSoup
from urllib.parse import urljoin

soup = BeautifulSoup(html, "lxml")

books = []
for book in soup.select("article.product_pod"):
    title = book.h3.a.get("title")
    price = book.select_one(".price_color").get_text(strip=True)
    link = urljoin("https://books.toscrape.com", book.h3.a.get("href"))
    books.append({
        "title": title,
        "price": price,
        "link": link,
    })

print(f"Scraped {len(books)} books")
print(books[:3])
Key beginner insight: A page often contains many repeating items, and your job is to identify the repeating pattern.
For beginners, saving to CSV is a great first goal because it is easy to open in spreadsheet tools or use later in analysis.
The code below:
extracts a numeric price
converts the data into a table
removes duplicates
saves the result as CSV
Code:
import re
import pandas as pd

def clean_price(text):
    match = re.search(r"[\d,.]+", text)
    return float(match.group().replace(",", "")) if match else None

df = pd.DataFrame(books)
df["price_value"] = df["price"].apply(clean_price)
df = df.drop_duplicates()
df.to_csv("books.csv", index=False, encoding="utf-8")
print(df.head())
When data is split across several pages, that is called pagination. Real sites almost never put everything on one page. Here’s how to loop through all pages politely.
Code:
import time
import random
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

all_books = []
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
headers = {"User-Agent": "Mozilla/5.0"}

for page in range(1, 51):
    url = base_url.format(page)
    response = requests.get(url, headers=headers, timeout=15)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, "lxml")
    books_on_page = soup.select("article.product_pod")
    if not books_on_page:
        break
    for book in books_on_page:
        title = book.h3.a.get("title")
        price = book.select_one(".price_color").get_text(strip=True)
        link = urljoin("https://books.toscrape.com", book.h3.a.get("href"))
        all_books.append({
            "title": title,
            "price": price,
            "link": link,
        })
    print(f"Page {page}: {len(books_on_page)} books")
    time.sleep(random.uniform(1, 3))
For bigger projects, or when a site starts limiting or blocking your IP after many pages, many scrapers add rotating proxies that switch the outgoing IP address automatically, staying under the radar while remaining polite to the website.
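A minimal sketch of that idea, using the proxies parameter that requests supports. The proxy URLs here are placeholders; substitute real ones from your provider.

```python
import random

# Hypothetical proxy pool -- replace with real proxy URLs from your provider.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def pick_proxy() -> dict:
    """Return a requests-style proxies mapping with a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}

# Usage with requests (not executed here, since the proxies are fake):
# response = requests.get(url, headers=headers, proxies=pick_proxy(), timeout=15)
print(pick_proxy())
```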
Important habits here:
set a timeout on every request
stop when a page fails or comes back empty
sleep a random delay between requests
Some sites do not place all the content in the initial HTML. Instead, JavaScript loads the data after the page opens. In those cases, requests alone return empty or incomplete HTML.
Solution: Use Playwright (still the best beginner-friendly browser automation tool).
Install:
pip install playwright
playwright install chromium
Code:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    page.wait_for_selector(".item-card", timeout=10000)
    html = page.content()
    browser.close()
When to switch to Playwright:
the data you see in the browser is missing from the raw HTML
content appears only after scrolling, clicking, or other interaction
the page is rendered client-side by a JavaScript framework
Beginners tip: Always start with requests + BeautifulSoup. Only move to Playwright when you prove the data isn’t in the raw HTML.
Many useful pages are not simple lists. They are search pages, filter pages, or forms.
Pro technique:
1. Open DevTools → Network tab.
2. Perform the search on the site.
3. Look for the actual XHR/fetch request (usually JSON).
4. Reproduce that exact request in Python with requests.get() or requests.post().
This is often easier than scraping the rendered page itself, because the real data may come from a cleaner endpoint behind the scenes.
Remember:
GET is often used for search and filter URLs
POST is often used for form submissions
the visible page is not always the real data source
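To illustrate, requests can build both kinds of request without sending them, which makes it easy to verify the URL and body you are about to reproduce. The endpoint and parameter names below are hypothetical stand-ins for what you would copy from the Network tab.

```python
import requests

# A GET search request: parameters end up in the query string.
search = requests.Request(
    "GET",
    "https://example.com/api/search",          # hypothetical endpoint
    params={"q": "python", "page": 2},
).prepare()
print(search.url)   # https://example.com/api/search?q=python&page=2

# A POST form submission: parameters end up in the request body.
form = requests.Request(
    "POST",
    "https://example.com/api/filter",          # hypothetical endpoint
    data={"category": "books", "max_price": "20"},
).prepare()
print(form.body)    # category=books&max_price=20

# To actually send either one: requests.Session().send(search)
```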
These are common mistakes causing first-time scraping problems:
1. Empty results
Check whether the selector is correct and print part of the HTML to confirm the data is really there.
2. Broken links
Use urljoin() whenever links are relative.
3. Blocked requests / 403 error
Add delays, use a realistic User-Agent, and avoid sending too many requests too quickly. For persistent blocks, using rotating proxies is the standard next step.
4. Missing elements
Do not assume every item has every field. Use .select_one() + if checks before extracting values.
5. Different HTML when logged in
Many sites show different content based on login state. Test while logged out first.
6. Scraping the wrong thing
Some data is visible in the browser but not in the HTML. In that case, the source may be JavaScript-loaded or coming from an API call.
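Two of the fixes above, broken links and missing elements, can be sketched in a few lines:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Fix 2: relative links only make sense relative to the page they came from.
base = "https://books.toscrape.com/catalogue/page-1.html"
link = urljoin(base, "page-2.html")
print(link)  # https://books.toscrape.com/catalogue/page-2.html

# Fix 4: guard before extracting, so missing fields become None instead of crashing.
html = (
    "<article class='product_pod'><p class='price_color'>£51.77</p></article>"
    "<article class='product_pod'></article>"  # this one has no price
)
soup = BeautifulSoup(html, "html.parser")  # html.parser is enough for this snippet

prices = []
for item in soup.select("article.product_pod"):
    el = item.select_one(".price_color")
    prices.append(el.get_text(strip=True) if el else None)

print(prices)  # ['£51.77', None]
```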
Good habits:
Keep your scraper modular, test one page before scaling, log errors, save partial results, and record the scrape date.
If your scraper will run more than once, reliability matters as much as extraction. A small script that works once is good. A script that still works next month is much better.
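One possible shape for that, sketched with a stand-in scrape_page function: log failures instead of crashing, keep going, and save whatever was collected with the scrape date attached.

```python
import csv
import logging
from datetime import date

logging.basicConfig(level=logging.INFO)

def scrape_page(page):
    """Stand-in for a real per-page scraper; assumed to raise on failure."""
    if page == 3:
        raise ValueError("simulated broken page")
    return [{"title": f"Book {page}", "page": page}]

rows = []
for page in range(1, 6):
    try:
        rows.extend(scrape_page(page))
    except Exception as exc:
        # Log and continue: one bad page should not lose the whole run.
        logging.error("page %s failed: %s", page, exc)

# Record when the data was collected, then save partial results.
for row in rows:
    row["scraped_on"] = date.today().isoformat()

with open("books_partial.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "page", "scraped_on"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Saved {len(rows)} rows")
```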
Scraping is useful, but it is not always the best option.
Use an API or public dataset when:
the site offers an official API for the data you need
a government or research portal already publishes the dataset
you need guarantees about stability and accuracy over time
Pro tip: Before writing any scraper, open DevTools → Network tab and search for JSON responses. Many sites expose a clean internal API behind the scenes; it is often faster and more stable than scraping rendered HTML, though the site’s terms still apply.
1. Is Python internet scraping legal?
It depends on the site’s terms, robots.txt, the data type, your country’s laws, and how you collect it. Public data is not automatically free to scrape. Always check the rules first.
2. Should I use requests or Playwright?
requests + BeautifulSoup for static pages. Playwright only when the data is loaded by JavaScript.
3. Why does my scraper return empty results?
Wrong selector, page structure changed, or data is dynamic.
4. Why are my links broken?
The page may use relative URLs. Use urljoin() to convert them into absolute links.
Python internet scraping becomes much easier once you follow a simple workflow:
inspect → fetch → parse → extract → clean → save → paginate → handle dynamic pages when needed
Start small. Make your first scraper work on one simple page. Then add pagination, data cleaning, and better error handling. Once that foundation clicks, scraping becomes a practical skill you can use in many real projects.
Happy scraping — and remember: respect the sites you visit, and they’ll let you keep visiting.