Step-by-step guide to building an eBay web scraper with Python: requests and Playwright examples, proxy setup, scaling tips, monitoring, and legal cautions.
Accessing timely data from eBay can give you a competitive edge. However, eBay's anti-bot measures and terms of service make scraping challenging. This guide provides a complete, step-by-step approach covering quick proof-of-concept scrapers (requests + BeautifulSoup), robust scrapers for dynamic content (Playwright + proxies), and production practices (proxy pools, monitoring, legal checks).

Important Legal Note: eBay’s Terms of Service restrict automated access in many situations. This guide is technical only. Before scraping at scale or for commercial use, check https://www.ebay.com/robots.txt, review the eBay Developer Program (developer.ebay.com), and consult legal counsel if needed. Only collect public listing data; avoid private buyer/seller personal data.
Scraping eBay can provide valuable insights without relying on manual browsing:
Price Monitoring and Competitive Analysis: Track listings to adjust pricing dynamically. For example, resellers can spot underpriced items daily.
Market Research: Analyze trends in product popularity, seller ratings, and review sentiments to guide inventory decisions.
Personal Use: Set up custom alerts for rare items like vintage collectibles.
Development Projects: Build apps or bots for aggregated data, ensuring scalability and compliance.
Before coding, ensure your approach is responsible:
Check robots.txt and TOS: Visit https://www.ebay.com/robots.txt and follow disallow rules as a courtesy. Even if allowed, review eBay's full Terms of Service.
Prefer Official APIs When Possible: Register at developer.ebay.com for stable, compliant data feeds. For commercial use, this reduces legal risk; see the quick API sketch after this list.
Collect Public Data Only: Focus on listings; avoid buyer info, personal details, or anything behind logins. Comply with GDPR/CCPA if handling user data.
Rate-Limit Your Requests: Do not overload servers—aim for 1-3 requests per second with random delays. Be transparent and ethical.
Record Good Faith Efforts: Keep logs that show your request rates stay low and that you can pause crawls if contacted.
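If you go the official route, the call itself is small. Here is a minimal sketch, assuming you have registered an app at developer.ebay.com and obtained an OAuth application access token; the endpoint, headers, and response fields follow the Browse API's item_summary/search call, so verify them against the current developer docs before relying on them.

import requests

EBAY_OAUTH_TOKEN = "YOUR_APPLICATION_ACCESS_TOKEN"  # placeholder; obtain via eBay's OAuth client-credentials flow

def search_ebay_api(keyword, limit=5):
    # Browse API item_summary/search: returns public listing summaries as JSON
    url = "https://api.ebay.com/buy/browse/v1/item_summary/search"
    headers = {
        "Authorization": f"Bearer {EBAY_OAUTH_TOKEN}",
        "X-EBAY-C-MARKETPLACE-ID": "EBAY_US",
    }
    params = {"q": keyword, "limit": limit}
    r = requests.get(url, headers=headers, params=params, timeout=15)
    r.raise_for_status()
    summaries = r.json().get("itemSummaries", [])
    return [
        {"title": it.get("title"), "price": it.get("price", {}).get("value")}
        for it in summaries
    ]

if __name__ == "__main__":
    print(search_ebay_api("wireless earbuds"))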
| Goal | Tooling | Cost | Time |
| --- | --- | --- | --- |
| Learn / one-off | requests + BeautifulSoup4 | Free | 1–4 hours |
| JS / scale | Playwright + proxies | Medium–High | Days–Weeks |
| No infra | Scraper API | Monthly fee | Hours |
After building the POC (Steps 1-3), measure its success rate (a quick check is sketched below). If it falls below 50% (e.g., due to blocks or missing data), add proxies (Step 7). If more than 20% of JS-loaded content is missing, switch to Playwright (Step 8).
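One way to measure that success rate, assuming the POC search scraper built below (scrape_search_page in src/poc_search.py) and a handful of sample keywords of your own choosing:

from poc_search import scrape_search_page  # the POC scraper defined below

SAMPLE_SEARCHES = ["wireless earbuds", "usb c cable", "vintage camera"]  # pick your own

def poc_success_rate(keywords=SAMPLE_SEARCHES):
    ok = 0
    for kw in keywords:
        url = f"https://www.ebay.com/sch/i.html?_nkw={kw.replace(' ', '+')}"
        if scrape_search_page(url):  # a non-empty result list counts as success
            ok += 1
    return ok / len(keywords)

# e.g. if poc_success_rate() < 0.5, add proxies (Step 7)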
Before coding, inspect pages in your browser DevTools (right click → Inspect) and locate fields. Focus on public pages only.
Search / Listing Pages (Bulk Scanning): item_id, title, price, list_price (strikethrough), shipping, url, thumbnail, sold_count, condition, seller_location.
Product / Detail Pages (Per-Item Depth): title, price, currency (often itemprop="price"), item_specifics (brand/model/MPN), description, images[], seller_info, return_policy, shipping_options.
Hidden JSON / Variants: eBay often embeds JSON (e.g., in <script type="application/ld+json"> or window.INITIAL_STATE). Variants load via XHR for SKUs/pricing.
URL parameters: _nkw= holds the search keywords; _pgn= is the page number used for pagination.
1. Open a listing results page and a product page.
2. Note 3 selectors (title, price, link) and copy a short HTML snippet for each.
3. Decide if data is in static HTML (use Requests) or loaded dynamically (use Playwright/XHR).
You’ll need Python familiarity (variables, functions, loops) and DevTools basics.
1. Install basics
python -m venv venv
source venv/bin/activate # macOS/Linux; Windows: .\venv\Scripts\activate
pip install requests beautifulsoup4 playwright
python -m playwright install
2. Optional libraries: httpx, parsel, pytest (for tests). If using a Scraper API, install their SDK.
3. Create a project folder with a src/ subfolder for code files.
Test: Run python --version in your env.
Start simple and iterate. All code is in Python; files are under src/. Run each in your activated env.
Use the same environment as in Prerequisites; if you run into issues, check that the venv is activated.
Inspect a page in DevTools now to confirm the selectors you noted still match.
Goal: Extract title, price, url from a search results page.
File: src/poc_search.py
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

def scrape_search_page(url):
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching page: {e}")
        return []
    soup = BeautifulSoup(r.text, "html.parser")
    items = []
    for el in soup.select('.s-item'):
        title_el = el.select_one('.s-item__title')
        # Fallback selectors for price
        price_el = el.select_one('.s-item__price, .x-price-primary, span[itemprop="price"]')
        link_el = el.select_one('.s-item__link')
        if title_el and price_el:
            items.append({
                "title": title_el.get_text(strip=True),
                "price": price_el.get_text(strip=True),
                "url": link_el['href'] if link_el else None
            })
    return items

if __name__ == "__main__":
    url = "https://www.ebay.com/sch/i.html?_nkw=wireless+earbuds"
    results = scrape_search_page(url)
    print(f"Found {len(results)} items")
    print(results[:3])
Checkpoint 1: Run it. Expect dozens of items; verify the first 3 have a title and price. If the list is empty, print part of the response (add print(r.text[:500])), update selectors, or try a different User-Agent. If more than 20% of runs fail, add proxies (Step 7).
Goal: Scrape multiple pages with delays; dedupe by URL.
File: src/pagination.py
import time
import random

from poc_search import scrape_search_page  # Import from the previous file

def scrape_paginated(keyword, pages=3, delay=(2, 5)):
    all_items = []
    kw = keyword.replace(" ", "+")
    for page in range(1, pages + 1):
        url = f"https://www.ebay.com/sch/i.html?_nkw={kw}&_pgn={page}"
        print(f"Scraping {url}")
        items = scrape_search_page(url)
        all_items.extend(items)
        time.sleep(random.uniform(*delay))  # Random delay to mimic human browsing
    # Dedupe by url (or title as a fallback)
    seen = set()
    dedup = []
    for it in all_items:
        key = it.get("url") or it.get("title")
        if key and key not in seen:
            dedup.append(it)
            seen.add(key)
    return dedup

if __name__ == "__main__":
    res = scrape_paginated("wireless earbuds", pages=3)
    print("Total unique items:", len(res))
Checkpoint 2: Run with pages=2-3. You should get a non-duplicated set. If you see repeated items, adjust the dedupe or pagination logic. If you get blocked, proceed to proxies (Step 7).
Goal: Fetch a product URL for detailed fields, including JSON.
File: src/product_detail.py
import requests
import json
from bs4 import BeautifulSoup

HEADERS = {  # Same as before
    "User-Agent": "Mozilla/5.0 ...", "Accept-Language": "en-US,en;q=0.9"
}

def fetch_product(url):
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return {}
    soup = BeautifulSoup(r.text, "html.parser")
    # Fallback selectors for title
    title = soup.select_one(".x-item-title__mainTitle, .s-item__title")
    # Fallback selectors for price
    price = soup.select_one('span[itemprop="price"], .s-item__price, .x-price-primary')
    shipping = soup.select_one(".s-item__shipping, .x-shipping__primary") or None
    item_specifics = {}
    for row in soup.select("#viTabs_0_is .itemAttr tr, .ux-layout-section__row"):
        tds = row.select("td, .ux-labels-values__labels")
        if len(tds) >= 2:
            key = tds[0].get_text(strip=True).rstrip(':')
            val = tds[1].get_text(strip=True)
            item_specifics[key] = val
    # Parse embedded JSON-LD
    ld_json = None
    sd = soup.find('script', type='application/ld+json')
    if sd:
        try:
            ld_json = json.loads(sd.string)
        except json.JSONDecodeError:
            print("JSON parse error")
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
        "shipping": shipping.get_text(strip=True) if shipping else None,
        "item_specifics": item_specifics,
        "ld_json": ld_json
    }

if __name__ == "__main__":
    # Use a URL from previous results
    test_url = "https://www.ebay.com/itm/EXAMPLE_ITEM_ID"  # Replace with a real item URL
    print(fetch_product(test_url))
Checkpoint 3: Test on a real product URL from Step 3. Confirm title, price, and item specifics. If anything is missing, update the fallback selectors using DevTools.
Goal: Find JSON with variant pricing (MSKU / SKU lists).
How: Open a product page → DevTools → Network → XHR tab. Reload and filter for JSON (e.g., "item", "product"). Extract the endpoints or embedded scripts.
Example: Building on Step 5, parse ld_json for variants. For window.INITIAL_STATE (if present):
# Add to fetch_product after building soup:
initial_state = None
for script in soup.find_all('script'):
    if '__INITIAL_STATE__' in script.text:
        # Take everything after the first '=' and drop the trailing ';' (clean manually if needed)
        json_text = script.text.split('=', 1)[1].strip().rstrip(';')
        try:
            initial_state = json.loads(json_text)
            # Example: variants = initial_state.get('item', {}).get('variations', [])
        except json.JSONDecodeError:
            print("Parse error")
        break
For example, with a clothing item: test a page that has size/color variants and extract the price per SKU.
Checkpoint 4: On a variant-heavy item (e.g., shoes), confirm you can read per-variant prices. If none are found, browse the XHR tab for direct GET endpoints and replay them with requests, copying the request headers.
Why: Prevent IP bans and rate limits. Obtain localized results (country-specific pricing/shipping).
Types:
Residential: harder to detect, better for eBay at scale.
Mobile (4G/5G): most human-like; expensive.
Datacenter: cheap, fast, more detectable.
Sticky vs rotating:
Sticky = same IP for a session (necessary if cookies/login matter).
Rotating = new IP per request (good for breadth scraping).
Simple rotating proxy pool (requests)
File: src/proxy_pool.py
import itertools
import random
import time
import requests

from poc_search import HEADERS  # reuse the headers from src/poc_search.py

PROXIES = [  # Add your proxies, e.g. ["http://proxy1:port", "http://proxy2:port"]
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
proxy_iter = itertools.cycle(PROXIES)  # Cycles through the list endlessly

def get_proxy_dict(proxy_url):
    return {"http": proxy_url, "https": proxy_url}

def fetch_with_rotation(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = next(proxy_iter)
        try:
            r = requests.get(url, headers=HEADERS, proxies=get_proxy_dict(proxy), timeout=15)
            if r.status_code == 200:
                return r.text, proxy
            print(f"[WARN] {r.status_code} via {proxy}")
        except Exception as e:
            print(f"[WARN] proxy {proxy} failed: {e}")
        time.sleep((2 ** attempt) + random.random())  # Exponential backoff between retries
    raise RuntimeError("All proxies failed")
Best Practices:
Track failures, health-check proxies (e.g., fetch google.com via each proxy), and evict bad ones; a sketch follows below.
For 50 workers, consider using ≥500 residential IPs.
Choose reputable proxy services offering a free trial, like GoProxy.
Checkpoint 5: Run 100 requests; aim for ≥95% success. If low, evict failing proxies or buy residential.
Integrate: Replace requests.get in earlier functions with fetch_with_rotation.
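The health-check and eviction advice can be automated with a small job. A minimal sketch, assuming the PROXIES list from src/proxy_pool.py; the test URL and failure threshold are illustrative choices, not requirements.

import requests

from proxy_pool import PROXIES  # the pool defined above

FAIL_LIMIT = 3
failure_counts = {p: 0 for p in PROXIES}

def check_proxy(proxy_url, test_url="https://www.google.com", timeout=10):
    # A proxy counts as healthy if a simple GET through it succeeds
    try:
        r = requests.get(test_url, proxies={"http": proxy_url, "https": proxy_url}, timeout=timeout)
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

def prune_pool():
    # Return only proxies that have not failed FAIL_LIMIT checks in a row
    healthy = []
    for proxy in failure_counts:
        if check_proxy(proxy):
            failure_counts[proxy] = 0
            healthy.append(proxy)
        else:
            failure_counts[proxy] += 1
            if failure_counts[proxy] < FAIL_LIMIT:
                healthy.append(proxy)
    return healthy

Run prune_pool() on a schedule (e.g., every few minutes) and rebuild proxy_iter from its result.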
When: Pages rely on JS for content (prices load dynamically), or you face frequent blocks/CAPTCHAs with requests.
Key tips
Use one browser instance per worker and one context per proxy (isolates cookies).
Block images/fonts to save bandwidth.
Add human-like behavior: randomized typing, scrolls, delays.
Use realistic viewport & Accept-Language matching proxy country.
In 2025, consider stealth add-ons like playwright-stealth to reduce detection by modern, AI-assisted anti-bot systems.
Example (sync)
File: src/playwright_scraper.py
from playwright.sync_api import sync_playwright
import random
import time

def scrape_with_playwright(search_url, proxy_server=None):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context_kwargs = {}
        if proxy_server:
            context_kwargs["proxy"] = {"server": proxy_server}
        context = browser.new_context(**context_kwargs)
        page = context.new_page()
        # Block heavy assets for speed
        page.route("**/*.{png,jpg,jpeg,svg,woff,woff2,ttf}", lambda route: route.abort())
        page.goto(search_url, timeout=60000)
        time.sleep(random.uniform(1, 3))  # Human-like delay
        page.wait_for_selector("li.s-item", timeout=15000)
        items = page.query_selector_all("li.s-item")
        out = []
        for it in items:
            title = it.query_selector(".s-item__title")
            price = it.query_selector(".s-item__price, .x-price-primary")  # Fallback
            out.append({
                "title": title.inner_text().strip() if title else None,
                "price": price.inner_text().strip() if price else None
            })
        context.close()
        browser.close()
        return out

# Example: Integrate a proxy from the pool (see the usage sketch below)
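To tie the pool from Step 7 into this scraper, pass one proxy per call. A sketch, assuming proxy_pool.py from Step 7 and proxies whitelisted by IP (Playwright takes credentials via separate username/password keys in the proxy dict rather than embedded in the server URL):

if __name__ == "__main__":
    import random
    from proxy_pool import PROXIES  # pool from Step 7

    url = "https://www.ebay.com/sch/i.html?_nkw=wireless+earbuds"
    proxy = random.choice(PROXIES)  # or next(proxy_iter) for round-robin rotation
    results = scrape_with_playwright(url, proxy_server=proxy)
    print(f"Found {len(results)} items via {proxy}")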
Checkpoint 6: Run on a JS-heavy page and compare results to the requests version. If CAPTCHAs appear, use residential proxies or integrate a solver (e.g., a CAPTCHA-solving API service).
Worker model: Use a queue (Redis/RabbitMQ) with worker processes. Each worker uses one Playwright context tied to a proxy.
Concurrency: Favor multiple processes (multiprocessing) over many contexts in a single process.
Rate limiting: Apply a per-proxy rate limit (1–3 req/sec) with jitter; a sketch follows this list.
Cache & incremental crawling: Avoid re-fetching pages too often; store timestamps and scrape deltas.
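A minimal sketch of the per-proxy rate limit with jitter; the interval bounds are illustrative and correspond to roughly 1–3 requests per second per proxy.

import random
import time

class ProxyRateLimiter:
    def __init__(self, min_interval=0.33, max_interval=1.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.last_request = {}  # proxy_url -> time of its last request

    def wait(self, proxy_url):
        # Sleep until a jittered interval has passed since this proxy's last request
        interval = random.uniform(self.min_interval, self.max_interval)
        elapsed = time.monotonic() - self.last_request.get(proxy_url, 0.0)
        if elapsed < interval:
            time.sleep(interval - elapsed)
        self.last_request[proxy_url] = time.monotonic()

# In a worker loop: limiter.wait(proxy) before each fetch through that proxy.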
Canonical fields: item_id, title, price, currency, shipping, condition, item_url, seller_id, timestamp, market.
Store raw HTML/JSON in a DB (e.g., SQLite/MongoDB) for replay & parsing fixes.
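A minimal SQLite sketch covering the canonical fields plus raw HTML; the schema and file name are illustrative choices.

import sqlite3
import time

def init_db(path="ebay_items.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            item_id TEXT, title TEXT, price TEXT, currency TEXT,
            shipping TEXT, condition TEXT, item_url TEXT, seller_id TEXT,
            market TEXT, timestamp REAL, raw_html TEXT
        )
    """)
    return conn

def save_item(conn, item, raw_html=None):
    # Keep the raw HTML alongside parsed fields so parsers can be re-run later
    conn.execute(
        "INSERT INTO items VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            item.get("item_id"), item.get("title"), item.get("price"),
            item.get("currency"), item.get("shipping"), item.get("condition"),
            item.get("url"), item.get("seller_id"), item.get("market"),
            time.time(), raw_html,
        ),
    )
    conn.commit()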
Per-request logs (capture):
timestamp, url, status_code, latency_ms, proxy_id, worker_id, parsed_ok(bool), error_type
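These fields map naturally to one JSON line per request. A sketch; the log file path is arbitrary.

import json
import time

def log_request(url, status_code, latency_ms, proxy_id, worker_id,
                parsed_ok, error_type=None, path="requests.log"):
    # Append one JSON object per request; easy to grep and aggregate later
    entry = {
        "timestamp": time.time(),
        "url": url,
        "status_code": status_code,
        "latency_ms": latency_ms,
        "proxy_id": proxy_id,
        "worker_id": worker_id,
        "parsed_ok": parsed_ok,
        "error_type": error_type,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")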
KPIs & thresholds:
| KPI | Threshold | Action |
| --- | --- | --- |
| 403 Rate (1 hour) | >5% | Alert, rotate proxies |
| CAPTCHA Rate (24h) | >1% | Investigate solvers |
| Proxy Failure Rate | >20% | Evict proxies |
| p95 Latency | >5s | Check network/proxies |
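The KPIs can be computed directly from the JSON-lines log sketched above. For example, the hourly 403 rate:

import json
import time

def rate_403(path="requests.log", window_s=3600):
    # Share of requests in the last hour that returned HTTP 403
    now = time.time()
    total = blocked = 0
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if now - entry["timestamp"] <= window_s:
                total += 1
                if entry["status_code"] == 403:
                    blocked += 1
    return blocked / total if total else 0.0

# e.g. if rate_403() > 0.05: alert and rotate proxies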
Daily selector health job:
Take 10 representative URLs (mix markets).
Fetch and verify title and price exist.
If >30% fail, send alert + raw HTML.
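A sketch of that daily job, reusing fetch_product from Step 5; the sample URL list and the print-based alert are placeholders for your own URLs and alerting channel.

from product_detail import fetch_product  # from Step 5

SAMPLE_URLS = [
    # 10 representative product URLs across markets, e.g.
    # "https://www.ebay.com/itm/...",
]

def selector_health_check(urls=SAMPLE_URLS, fail_threshold=0.3):
    failures = []
    for url in urls:
        data = fetch_product(url)
        if not data.get("title") or not data.get("price"):
            failures.append(url)
    fail_rate = len(failures) / len(urls) if urls else 0.0
    if fail_rate > fail_threshold:
        # Replace with real alerting (email/Slack) and attach the raw HTML
        print(f"ALERT: {fail_rate:.0%} of sample pages failed selector checks: {failures}")
    return fail_rate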
HTTP 403: Rotate proxies, slow down, change UA, check robots.txt.
Selectors missing: Open page in browser, update selectors, or use fallback XPath/regex.
High CAPTCHA: Switch to residential/mobile proxies or a solver API; reduce request rate.
Timeouts: Increase timeout, check proxy health and latency.
Q: What proxy type should I buy first?
A: Start with a small residential proxy plan or trial from a reputable provider if you expect blocks; datacenter for low-budget experiments.
Q: Can I just use VPN + requests?
A: A VPN routes everything through a single IP, which is not suitable for parallel scraping; a rotating proxy pool works better.
Q: Will headless browsers reduce blocks?
A: Playwright/Puppeteer helps with JS, but without good proxies and stealth techniques you’ll still get blocked.
Q: How do I handle anti-bot systems in 2025?
A: Use stealth add-ons such as playwright-stealth and keep monitoring your block and CAPTCHA rates so you notice when detection changes.
Start small: build a POC, validate fields, then add proxies and Playwright only when needed. Logging and daily health checks will save you a lot of time. If you plan to provide scraped data commercially, switch to eBay's official APIs to reduce legal and operational risk. And remember, sites change, so test selectors regularly.
Looking for reliable web scraping proxies? Get your free trial today and test the full experience before paying. A customized proxy pool can support your specific needs.