
How to Build a Reliable eBay Web Scraper (Python)

Post Time: 2025-12-03 Update Time: 2025-12-03

Accessing timely data from eBay can give you a competitive edge. However, eBay's anti-bot measures and terms of service make scraping challenging. This guide provides a complete, step-by-step approach covering quick proof-of-concept scrapers (requests + BeautifulSoup), robust scrapers for dynamic content (Playwright + proxies), and production practices (proxy pools, monitoring, legal checks).

Build a Reliable eBay Web Scraper with Python and Proxies

Important Legal Note: eBay’s Terms of Service restrict automated access in many situations. This guide is technical only. Before scraping at scale or for commercial use, check https://www.ebay.com/robots.txt, review the eBay Developer Program (developer.ebay.com), and consult legal counsel if needed. Only collect public listing data; avoid private buyer/seller personal data.

Why Scrape eBay?

Scraping eBay can provide valuable insights without relying on manual browsing:

Price Monitoring and Competitive Analysis: Track listings to adjust pricing dynamically. For example, resellers can spot underpriced items daily.

Market Research: Analyze trends in product popularity, seller ratings, and review sentiments to guide inventory decisions.

Personal Use: Set up custom alerts for rare items like vintage collectibles.

Development Projects: Build apps or bots for aggregated data, ensuring scalability and compliance.

Legal and Ethical Check First

Before coding, ensure your approach is responsible:

Check robots.txt and TOS: Visit https://www.ebay.com/robots.txt and follow its disallow rules as a courtesy (see the sketch after this list). Even if allowed, review eBay's full Terms of Service.

Prefer Official APIs When Possible: Register at developer.ebay.com for stable, compliant data feeds. For commercial use, this reduces legal risks. See the "Alternatives to Scraping" section for a quick API example.

Collect Public Data Only: Focus on listings; avoid buyer info, personal details, or anything behind logins. Comply with GDPR/CCPA if handling user data.

Rate-Limit Your Requests: Do not overload servers—aim for 1-3 requests per second with random delays. Be transparent and ethical.

Record Good Faith Efforts: Keep logs of your requests and request rates, and be ready to pause crawls promptly if contacted.
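A minimal robots.txt check using only the Python standard library (the URL tested below is just an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.ebay.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may fetch a URL
# according to the site's robots.txt rules.
print(rp.can_fetch("*", "https://www.ebay.com/sch/i.html?_nkw=wireless+earbuds"))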

Quick Method Decision

Goal | Tooling | Cost | Time
Learn / one-off | requests + BeautifulSoup4 | Free | 1–4 hours
JS / scale | Playwright + proxies | Medium–High | Days–Weeks
No infra | Scraper API | Monthly fee | Hours

After building the POC (Steps 1–3), test the success rate. If it is below 50% (e.g., due to blocks or missing data), add proxies (Step 7). If more than 20% of JS-rendered content is missing, switch to Playwright (Step 8).

What & Where: Data to Scrape

Before coding, inspect pages in your browser DevTools (right click → Inspect) and locate fields. Focus on public pages only.

Typical targets

Search / Listing Pages (Bulk Scanning): item_id, title, price, list_price (strikethrough), shipping, url, thumbnail, sold_count, condition, seller_location.

Product / Detail Pages (Per-Item Depth): title, price, currency (often itemprop="price"), item_specifics (brand/model/MPN), description, images[], seller_info, return_policy, shipping_options.

Hidden JSON / Variants: eBay often embeds JSON (e.g., in <script type="application/ld+json"> or window.__INITIAL_STATE__). Variant data (SKUs/pricing) often loads via XHR.

Useful query strings

_nkw= sets the search keywords; _pgn= sets the page number for pagination.
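As a small illustration, a helper (hypothetical name build_search_url) that assembles these parameters with the standard library, so keywords with spaces are encoded correctly:

from urllib.parse import urlencode

def build_search_url(keywords, page=1):
    # urlencode handles spaces and special characters in the keyword
    return "https://www.ebay.com/sch/i.html?" + urlencode({"_nkw": keywords, "_pgn": page})

print(build_search_url("wireless earbuds", page=2))
# -> https://www.ebay.com/sch/i.html?_nkw=wireless+earbuds&_pgn=2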

Quick checklist before coding

1. Open a listing results page and a product page.

2. Note 3 selectors (title, price, link) and copy a short HTML snippet for each.

3. Decide if the data is in static HTML (use requests) or loaded dynamically (use Playwright/XHR); the sketch below shows a quick way to check.
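One rough heuristic for item 3: fetch the page without a browser and check whether a selector or value you saw in DevTools appears in the raw HTML.

import requests

url = "https://www.ebay.com/sch/i.html?_nkw=wireless+earbuds"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15).text

# If the class name (or a price you saw in the browser) is absent from the raw
# HTML, the field is probably rendered by JavaScript: plan for Playwright/XHR.
print("s-item__price" in html)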

Prerequisites

You’ll need Python familiarity (variables, functions, loops) and DevTools basics.

1. Install basics

python -m venv venv

source venv/bin/activate       # macOS/Linux; Windows: .\venv\Scripts\activate

pip install requests beautifulsoup4 playwright

python -m playwright install

2. Optional libraries: httpx, parsel, pytest (for tests). If using a Scraper API, install their SDK.

3. Create a project folder with a src/ subfolder for code files.

Test: Run python --version in your env.

Step-by-step: Build Your eBay Scraper

Start simple and iterate. All code is in Python; files are under src/. Run each in your activated env.

1. Set up the environment

Follow the Prerequisites section. If you hit issues, first make sure the venv is activated.

2. Understand eBay's structure

  • Search URLs: https://www.ebay.com/sch/i.html?_nkw=keyword.
  • Product URLs: /itm/item-id.
  • Key elements: Titles in .x-item-title__mainTitle or .s-item__title; prices in [itemprop="price"], .s-item__price, or .x-price-primary.

Inspect a page now to confirm.

3. Try a single search page (requests + BeautifulSoup)

Goal: Extract title, price, url from a search results page.

File: src/poc_search.py

import requests

from bs4 import BeautifulSoup

 

HEADERS = {

    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",

    "Accept-Language": "en-US,en;q=0.9"

}

 

def scrape_search_page(url):

    try:

        r = requests.get(url, headers=HEADERS, timeout=15)

        r.raise_for_status()

    except requests.exceptions.RequestException as e:

        print(f"Error fetching page: {e}")

        return []

    soup = BeautifulSoup(r.text, "html.parser")

    items = []

    for el in soup.select('.s-item'):

        title_el = el.select_one('.s-item__title')

        # Fallback selectors for price

        price_el = el.select_one('.s-item__price, .x-price-primary, span[itemprop="price"]')

        link_el = el.select_one('.s-item__link')

        if title_el and price_el:

            items.append({

                "title": title_el.get_text(strip=True),

                "price": price_el.get_text(strip=True),

                "url": link_el['href'] if link_el else None

            })

    return items

 

if __name__ == "__main__":

    url = "https://www.ebay.com/sch/i.html?_nkw=wireless+earbuds"

    results = scrape_search_page(url)

    print(f"Found {len(results)} items")

    print(results[:3])

Checkpoint 1: Run it. Expect dozens of items; verify the first 3 have a title and price. If the list is empty, view r.text (add print(r.text[:500])), update the selectors, or try a different UA. If more than ~20% of requests fail, add proxies (Step 7).

4. Handle pagination & de-duplication

Goal: Scrape multiple pages with delays; dedupe by URL.

File: src/pagination.py

import time

import random

from poc_search import scrape_search_page  # Import from previous file

 

def scrape_paginated(keyword, pages=3, delay=(2, 5)):

    all_items = []

    kw = keyword.replace(" ", "+")

    for page in range(1, pages + 1):

        url = f"https://www.ebay.com/sch/i.html?_nkw={kw}&_pgn={page}"

        print(f"Scraping {url}")

        items = scrape_search_page(url)

        all_items.extend(items)

        time.sleep(random.uniform(*delay))  # Random delay to mimic human behavior

    # Dedupe by url (or title as fallback)

    seen = set()

    dedup = []

    for it in all_items:

        key = it.get("url") or it.get("title")

        if key and key not in seen:

            dedup.append(it)

            seen.add(key)

    return dedup

 

if __name__ == "__main__":

    res = scrape_paginated("wireless earbuds", pages=3)

print("Total unique items:", len(res))

Checkpoint 2: Run with pages=2-3. You should get a non-duplicated set. If you see repeated items, adjust the dedupe or pagination logic. If blocks occur, proceed to proxies.

5. Extract product details (item specifics & price)

Goal: Fetch a product URL for detailed fields, including JSON.

File: src/product_detail.py

import requests

import json

from bs4 import BeautifulSoup

 

HEADERS = {  # Same as before

    "User-Agent": "Mozilla/5.0 ...", "Accept-Language": "en-US,en;q=0.9"

}

 

def fetch_product(url):

    try:

        r = requests.get(url, headers=HEADERS, timeout=15)

        r.raise_for_status()

    except requests.exceptions.RequestException as e:

        print(f"Error: {e}")

        return {}

    soup = BeautifulSoup(r.text, "html.parser")

    # Fallback for title

    title = soup.select_one(".x-item-title__mainTitle, .s-item__title")

    # Fallback for price

    price = soup.select_one('span[itemprop="price"], .s-item__price, .x-price-primary')

    shipping = soup.select_one(".s-item__shipping, .x-shipping__primary") or None

    item_specifics = {}

    for row in soup.select("#viTabs_0_is .itemAttr tr, .ux-layout-section__row"):

        tds = row.select("td, .ux-labels-values__labels")

        if len(tds) >= 2:

            key = tds[0].get_text(strip=True).rstrip(':')

            val = tds[1].get_text(strip=True)

            item_specifics[key] = val

    # Parse embedded JSON

    ld_json = None

    sd = soup.find('script', type='application/ld+json')

    if sd:

        try:

            ld_json = json.loads(sd.string)

        except json.JSONDecodeError:

            print("JSON parse error")

    return {

        "title": title.get_text(strip=True) if title else None,

        "price": price.get_text(strip=True) if price else None,

        "shipping": shipping.get_text(strip=True) if shipping else None,

        "item_specifics": item_specifics,

        "ld_json": ld_json

    }

 

if __name__ == "__main__":

    # Use a URL from previous results

    test_url = "https://www.ebay.com/itm/EXAMPLE_ITEM_ID"  # Replace with real

    print(fetch_product(test_url))

Checkpoint 3: Test on a real product URL from Step 3. Confirm title, price, and item specifics. If fields are missing, update the fallback selectors or re-inspect the page with DevTools.

6. Extracting variants & hidden JSON / XHRs

Goal: Find JSON with variant pricing (MSKU / SKU lists).

How: Open the product page → DevTools → Network → XHR tab. Reload and filter for JSON responses (e.g., "item", "product"). Extract the endpoints or embedded scripts.

Example: Building on Step 5, parse ld_json for variants. For window.__INITIAL_STATE__ (if present):

# Add to fetch_product, after soup is created:
initial_state = None
for script in soup.find_all('script'):
    if script.string and '__INITIAL_STATE__' in script.string:
        # Take everything after the first '=' and drop the trailing semicolon
        # (may need extra cleanup if the script contains more code)
        json_text = script.string.split('=', 1)[1].strip().rstrip(';')
        try:
            initial_state = json.loads(json_text)
            # Example: variants = initial_state.get('item', {}).get('variations', [])
        except json.JSONDecodeError:
            print("JSON parse error")
        break

For example, on a clothing item with variants, test the page and extract the price per SKU.

Checkpoint 4: On a variant-heavy item (e.g., shoes), confirm the variant prices. If they are not found, browse the XHR tab for direct GET endpoints and call them with requests using the same headers.

7. Add Proxies For Reliability

Why: Prevent IP bans and rate limits. Obtain localized results (country-specific pricing/shipping).

Types:

Residential: harder to detect, better for eBay at scale.

Mobile (4G/5G): most human-like; expensive.

Datacenter: cheap, fast, more detectable.

Sticky vs rotating:

Sticky = same IP for a session (necessary if cookies/login matter).

Rotating = new IP per request (good for breadth scraping).

Simple rotating proxy pool (requests)

File: src/proxy_pool.py

import itertools

import random

import time

import requests

HEADERS = {  # Same headers as in Step 3
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

 

PROXIES = [  # Add your own proxy URLs, e.g. "http://user:pass@host:port"

    "http://user:[email protected]:8000",

    "http://user:[email protected]:8000",

]

proxy_iter = itertools.cycle(PROXIES)  # Cycles through list endlessly

 

def get_proxy_dict(proxy_url):

    return {"http": proxy_url, "https": proxy_url}

 

def fetch_with_rotation(url, max_retries=3):

    for attempt in range(max_retries):

        proxy = next(proxy_iter)

        try:

            r = requests.get(url, headers=HEADERS, proxies=get_proxy_dict(proxy), timeout=15)

            if r.status_code == 200:

                return r.text, proxy

            print(f"[WARN] {r.status_code} via {proxy}")

        except Exception as e:

            print(f"[WARN] proxy {proxy} failed: {e}")

        time.sleep((2 ** attempt) + random.random())  # Exponential backoff

    raise RuntimeError("All proxies failed")

Best Practices:

Track failures, health-check proxies (e.g., fetch a known URL through each proxy), and evict bad ones; see the sketch after this list.

For 50 workers, consider using ≥500 residential IPs.

Choose reputable proxy services offering a free trial, like GoProxy.
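A minimal health-check sketch (the test URL is just an example; any lightweight endpoint you trust works):

import requests

def healthy_proxies(proxies, test_url="https://httpbin.org/ip", timeout=10):
    """Return only the proxies that respond with HTTP 200 within the timeout."""
    good = []
    for proxy in proxies:
        try:
            r = requests.get(test_url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
            if r.status_code == 200:
                good.append(proxy)
        except requests.exceptions.RequestException:
            pass  # Any error counts as unhealthy; the proxy is evicted
    return good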

Checkpoint 5: Run 100 requests; aim for ≥95% success. If low, evict failing proxies or buy residential.

Integrate: Replace requests.get in earlier functions with fetch_with_rotation.
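For example, the Step 3 function could be rewritten like this (a sketch, assuming proxy_pool.py sits next to poc_search.py in src/):

from bs4 import BeautifulSoup
from proxy_pool import fetch_with_rotation

def scrape_search_page(url):
    try:
        html, proxy = fetch_with_rotation(url)  # Rotates proxies and retries with backoff
    except RuntimeError as e:
        print(f"Error fetching page: {e}")
        return []
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for el in soup.select(".s-item"):
        title_el = el.select_one(".s-item__title")
        price_el = el.select_one('.s-item__price, .x-price-primary, span[itemprop="price"]')
        link_el = el.select_one(".s-item__link")
        if title_el and price_el:
            items.append({
                "title": title_el.get_text(strip=True),
                "price": price_el.get_text(strip=True),
                "url": link_el["href"] if link_el else None,
            })
    return items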

8. Move to Playwright for JS rendering & Stealth

When: Pages rely on JS for content (prices load dynamically), or you face frequent blocks/CAPTCHAs with requests.

Key tips

Use one browser instance per worker and one context per proxy (isolates cookies).

Block images/fonts to save bandwidth.

Add human-like behavior: randomized typing, scrolls, delays.

Use realistic viewport & Accept-Language matching proxy country.

In 2025, consider stealth add-ons like playwright-stealth to reduce detection by modern anti-bot systems.

Example (sync)

File: src/playwright_scraper.py

from playwright.sync_api import sync_playwright

import random
import time

 

def scrape_with_playwright(search_url, proxy_server=None):

    with sync_playwright() as p:

        browser = p.chromium.launch(headless=True)

        context_kwargs = {}

        if proxy_server:

            context_kwargs["proxy"] = {"server": proxy_server}

        context = browser.new_context(**context_kwargs)

        page = context.new_page()

        # Block heavy assets for speed

        page.route("**/*.{png,jpg,jpeg,svg,woff,woff2,ttf}", lambda route: route.abort())

        page.goto(search_url, timeout=60000)

        time.sleep(random.uniform(1, 3))  # Human-like delay

        page.wait_for_selector("li.s-item", timeout=15000)

        items = page.query_selector_all("li.s-item")

        out = []

        for it in items:

            title = it.query_selector(".s-item__title")

            price = it.query_selector(".s-item__price, .x-price-primary")  # Fallback

            out.append({

                "title": title.inner_text().strip() if title else None,

                "price": price.inner_text().strip() if price else None

            })

        context.close()

        browser.close()

    return out

 

To integrate a proxy from the Step 7 pool, you can append a small driver to the bottom of this file (a sketch; the proxy address is a placeholder):
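if __name__ == "__main__":
    # Placeholder proxy address; for authenticated proxies, Playwright expects
    # separate "username"/"password" keys in the proxy dict rather than
    # credentials embedded in the URL.
    url = "https://www.ebay.com/sch/i.html?_nkw=wireless+earbuds"
    results = scrape_with_playwright(url, proxy_server="http://203.0.113.10:8000")
    print(f"Found {len(results)} items")
    print(results[:3])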

Checkpoint 6: Run on a JS-heavy page and compare the results to the requests version. If you hit CAPTCHAs, use residential proxies or integrate a solver (e.g., a CAPTCHA API service).

Scale, Parallelism & Production

Worker model: Use a queue (Redis/RabbitMQ) with worker processes. Each worker uses one Playwright context tied to a proxy.

Concurrency: Favor multiple processes (multiprocessing) over many contexts in one process.

Rate limiting: Apply a per-proxy rate limit (1–3 req/sec) with jitter; see the sketch after this list.

Cache & incremental crawling: Avoid re-fetching pages too often; store timestamps and scrape deltas.

Data modeling & storage

Canonical fields: item_id, title, price, currency, shipping, condition, item_url, seller_id, timestamp, market.

Store raw HTML/JSON in a DB (e.g., SQLite/MongoDB) for replay & parsing fixes.
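A minimal SQLite sketch that stores the canonical fields plus the raw HTML for later re-parsing (field names follow the list above; adapt to your schema):

import sqlite3
import time

conn = sqlite3.connect("ebay.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS listings (
        item_id TEXT, title TEXT, price TEXT, currency TEXT,
        item_url TEXT, market TEXT, scraped_at REAL, raw_html TEXT
    )
""")

def save_listing(item, raw_html, market="ebay.com"):
    conn.execute(
        "INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (item.get("item_id"), item.get("title"), item.get("price"),
         item.get("currency"), item.get("url"), market, time.time(), raw_html),
    )
    conn.commit()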

Monitoring, Health Checks & KPIs

Per-request logs (capture):

timestamp, url, status_code, latency_ms, proxy_id, worker_id, parsed_ok(bool), error_type
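One simple way to capture these fields is a JSON-lines log, one record per request (a sketch; any structured logger works):

import json
import time

def log_request(url, status_code, latency_ms, proxy_id, worker_id,
                parsed_ok, error_type=None, path="requests.log"):
    record = {
        "timestamp": time.time(), "url": url, "status_code": status_code,
        "latency_ms": latency_ms, "proxy_id": proxy_id, "worker_id": worker_id,
        "parsed_ok": parsed_ok, "error_type": error_type,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")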

KPIs & thresholds:

KPI | Threshold | Action
403 rate (1 hour) | >5% | Alert, rotate proxies
CAPTCHA rate (24 h) | >1% | Investigate solvers
Proxy failure rate | >20% | Evict proxies
p95 latency | >5 s | Check network/proxies

Daily selector health job:

Take 10 representative URLs (mix markets).

Fetch and verify title and price exist.

If >30% fail, send alert + raw HTML.

Troubleshooting Common Errors

HTTP 403: Rotate proxies, slow down, change UA, check robots.txt.

Selectors missing: Open page in browser, update selectors, or use fallback XPath/regex.

High CAPTCHA: Switch to residential/mobile proxies or a solver API; reduce request rate.

Timeouts: Increase timeout, check proxy health and latency.

FAQs

Q: What proxy type should I buy first?

A: Start with a small residential proxy plan or trial from a reputable provider if you expect blocks; datacenter for low-budget experiments.

Q: Can I just use VPN + requests?

A: VPNs limit to one IP — not suitable for parallel scraping; proxies (pool) are better.

Q: Will headless browsers reduce blocks?

A: Playwright/Puppeteer helps with JS, but without good proxies and stealth techniques you’ll still get blocked.

Q: How do I handle anti-bot measures in 2025?

A: Use playwright-stealth with good proxies, keep request rates low, and monitor for changes in detection systems.

Final Thoughts

Start small: build a POC, validate fields, then add proxies and Playwright only when needed. Logging and daily health checks will save you much time. If you plan to provide scraped data commercially, switch to eBay's official APIs to reduce legal and operational risk. And remember, sites change—test selectors regularly.

Looking for reliable web scraping proxies? Get your free trial today: full experience before payment, worry-free! A customized proxy pool supports your specific demands.

Next >

Beginner → Pro: Ecommerce Data Scraping in 2025
Start Your 7-Day Free Trial Now!
GoProxy Cancel anytime
GoProxy No credit card required