Legal and scalable guide to scraping eBay: stepwise methods, proxy strategy, code samples, and monitoring.
Scraping eBay is a powerful way to collect pricing, availability, seller, and historical “sold” data for market research, repricing, inventory intelligence, and analytics. But eBay is a large, actively defended marketplace — you must balance legal/ethical constraints, technical reliability, and anti-bot defenses. This guide gives a practical, beginner→advanced path: concrete steps, example code patterns, and a full proxy strategy using GoProxy.
Short takeaway: Prefer eBay’s official APIs whenever they meet your needs. If you must scrape HTML, follow a clear, safe workflow: Plan → Provision GoProxy → Build & Test Minimal Scraper → Harden (retries & pacing) → Monitor → Scale slowly.
Who this article is for
Beginners: want a safe way to extract a few items quickly.
Developers: need robust patterns (retries, concurrency, JSON parsing, proxies).
Teams / Ops: building production scrapers with monitoring, cost planning, and legal control.
Fetch a product page through GoProxy and print the main <h1> (replace credentials & item ID):
curl -s -x "http://user:[email protected]:8000" \
-H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
"https://www.ebay.com/itm/315889834153" | htmlq -t 'h1'
Note: htmlq (or pup) is a small HTML query tool; the canonical Python quick start is below in Step 3.
1. Test the eBay Developer APIs for the fields you need.
2. Create a GoProxy account and get test credentials.
3. Run the one-line proxy health check (TL;DR).
4. Run the minimal Python scraper (Step 3) for one product.
5. Inspect results; if OK, add retry/backoff (Step 4) and conservative pacing before scaling.
eBay has ~1B+ listings across marketplaces — price history, sold items, variants, and seller metadata are valuable signals for pricing engines, market research, product discovery, and analytics.
API note (2025): eBay has modernized its APIs and moved functionality to newer REST endpoints. Older legacy endpoints have been deprecated in waves; always check eBay’s developer docs for the current recommended endpoints and rate limits before choosing scraping vs API. If an API covers your fields (search, listing data, sold/analytics), prefer it — it reduces legal/maintenance risk and improves stability.
Prefer APIs when they supply the fields you need.
Public page scraping is not automatically illegal, but it may violate eBay’s Terms of Service — violating TOS can trigger IP blocks or account suspensions. Evaluate risk, especially for commercial use.
Do not scrape login-only pages unless you have explicit permission.
If you hit CAPTCHAs or login walls, pause and choose compliant alternatives (APIs, manual review). Do not attempt to circumvent protections.
Document your data purpose, retention policy, and access controls. Avoid harvesting or storing PII.
Monitor eBay policy updates periodically — developer policies and TOS change over time.
Need | Approach | Proxy guidance
Single items / price alerts | No-code or one-file Python script | Small residential/sticky proxy pool
Weekly research (1k–50k items) | Async httpx or Scrapy + JSON parsing | Mixed residential + datacenter, geo selection
Production (100k+ items) | Distributed workers + headless for JS | Large rotating residential/mobile pool, session control, unlimited traffic plans
If you picked an approach above: follow the numbered Setup → Operate → Maintain steps in Section 7. Beginners can stop at Step 4; pros continue to Step 8+.
Decide before you build what you will do if you get blocked:
Start → hit 403/CAPTCHA? → reduce speed & switch proxy → still blocked? → pause and use API/manual review.
1. Sign up and get credentials.
2. Choose a hybrid pool: small residential set for product pages + datacenter set for low-risk search pages.
3. Enable sticky sessions (per-worker session tokens) for multi-step flows.
4. Geo-target IPs to match marketplaces (e.g., UK pages via UK IPs).
5. Run health check:
curl -I -x "http://user:pass@your-goproxy-host:8000" https://httpbin.org/get
Security best practice: never hardcode credentials. Use environment variables, secret managers, or container secrets (examples below).
Install
pip install httpx parsel
Set secret via env var
export GOPROXY_URL="http://user:pass@your-goproxy-host:8000"
minimal_scraper.py
# minimal_scraper.py
import os, time, re, json, random
import httpx
from parsel import Selector

PROXY = os.getenv("GOPROXY_URL")
if not PROXY:
    raise RuntimeError("Please set GOPROXY_URL env variable")

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9"
}

def fetch(url, proxy=PROXY, timeout=20):
    # httpx < 0.28 syntax; on newer httpx pass proxy=proxy instead of the proxies dict
    with httpx.Client(proxies={"http://": proxy, "https://": proxy}, headers=HEADERS, timeout=timeout) as client:
        r = client.get(url)
        r.raise_for_status()
        return r.text

def parse_product(html):
    s = Selector(text=html)
    title = s.css('h1::text').get() or s.css('.it-ttl::text').get()
    price = s.css('[itemprop="price"]::attr(content)').get() or s.css('.x-price-primary span::text').get()
    variants = None
    for script in s.css('script::text').getall():
        if 'MSKU' in script or 'itemVariants' in script or 'INITIAL_STATE' in script.upper():
            # quick-and-dirty greedy capture; see the tolerant extractor later in this guide
            m = re.search(r'(\{.+\})', script, re.S)
            if m:
                try:
                    variants = json.loads(m.group(1))
                    break
                except Exception:
                    pass
    return {"title": title.strip() if title else None, "price": price, "variants": variants}

if __name__ == "__main__":
    url = "https://www.ebay.com/itm/315889834153"  # Replace with your target
    html = fetch(url)
    print(parse_product(html))
    time.sleep(2 + random.random())
Run
export GOPROXY_URL="http://user:pass@your-goproxy-host:8000"
python minimal_scraper.py
Notes
Use saved HTML fixtures to run parser unit tests without hitting eBay.
Keep requests polite (random sleep + jitter).
Treat this script as a test harness; production uses a queued worker model.
Safe fetch pattern
import random, time, httpx

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # reuse the headers from Step 3

PROXIES = [
    "http://user:pass@your-goproxy-host-1:8000",
    "http://user:pass@your-goproxy-host-2:8000",
]

def get_with_retry(url, max_retries=5, proxies=PROXIES):
    backoff = 1
    for attempt in range(1, max_retries + 1):
        proxy = random.choice(proxies)
        try:
            with httpx.Client(proxies={"http://": proxy, "https://": proxy}, headers=HEADERS, timeout=20) as c:
                r = c.get(url)
                if r.status_code == 200:
                    return r.text
        except httpx.HTTPError:
            pass
        time.sleep(backoff + random.random())
        backoff = min(backoff * 2, 30)
    raise RuntimeError("Failed after retries")
Defaults / guidance
max_retries: 3–5
Backoff: start at 1s, exponential, cap ~30s
Per-proxy concurrency: 1–4 for residential IPs (conservative).
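To make that concurrency guidance concrete, here is a minimal async sketch that caps in-flight requests per proxy with a semaphore. It assumes the async httpx approach from the comparison table; the proxy URLs and the limit of 2 are placeholders.

# concurrency_sketch.py — cap in-flight requests per proxy (illustrative limits)
import asyncio, random
import httpx

PROXIES = ["http://user:pass@proxy-a:8000", "http://user:pass@proxy-b:8000"]  # placeholders
LIMITS = {p: asyncio.Semaphore(2) for p in PROXIES}  # 1–4 per residential IP; 2 used here

async def fetch(url: str) -> str:
    proxy = random.choice(PROXIES)
    async with LIMITS[proxy]:  # wait if this proxy already has 2 requests in flight
        # httpx < 0.28 accepts a proxy URL via proxies=; newer versions use proxy=
        async with httpx.AsyncClient(proxies=proxy, timeout=20) as client:
            r = await client.get(url)
            r.raise_for_status()
            return r.text

async def main(urls):
    return await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)

# asyncio.run(main(["https://www.ebay.com/itm/315889834153"]))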
Key URL params: _nkw (keyword), _sacat (category), _pgn (page), _ipg (items per page), _sop (sort).
Tip: Inspect Network tab for XHR JSON endpoints — they’re usually more stable and easier to parse.
Stop rule: stop when no new item IDs are found or after a max pages limit.
Incremental crawl: track item_id with timestamps; re-crawl only when the last checked time is older than your update window.
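Putting the search parameters, stop rule, and polite pacing together, a minimal sketch (it reuses fetch() from the Step 3 script; the keyword, sort code, and page limit are placeholders):

# search_crawl_sketch.py — paginate eBay search results and stop when no new IDs appear
import re, time, random
from urllib.parse import urlencode
from minimal_scraper import fetch

def crawl_search(keyword, max_pages=10, items_per_page=60):
    seen = set()
    for page in range(1, max_pages + 1):
        params = {"_nkw": keyword, "_pgn": page, "_ipg": items_per_page, "_sop": 12}  # _sop: sort order code
        html = fetch("https://www.ebay.com/sch/i.html?" + urlencode(params))
        ids = set(re.findall(r"/itm/(\d+)", html))   # extract item IDs from result links
        new_ids = ids - seen
        if not new_ids:                              # stop rule: no new item IDs found
            break
        seen |= new_ids
        time.sleep(1 + random.random())              # polite pacing between pages
    return seen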
1. Search <script> tags for tokens like MSKU, itemVariants, INITIAL_STATE.
2. Extract JSON carefully (avoid greedy regex). Use a small tolerant JSON extractor (a sketch follows this list).
3. Map variant attributes (color/size) → price/stock.
4. Fallback: if no JSON, parse DOM option lists and make per-variant requests selectively.
Caution: eBay’s HTML can change; maintain unit tests and sample fixtures.
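Here is one possible shape for the tolerant extractor mentioned in step 2: scan forward from a marker and return the first balanced JSON object. It is a sketch, not a full JavaScript parser.

# json_extract_sketch.py — return the first balanced JSON object after a marker
import json

def extract_json_after(text, marker):
    start = text.find(marker)
    if start == -1:
        return None
    brace = text.find("{", start)
    if brace == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i, ch in enumerate(text[brace:], start=brace):
        if in_string:                    # track quoted strings so braces inside them are ignored
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:               # object closed: attempt to parse the slice
                try:
                    return json.loads(text[brace:i + 1])
                except json.JSONDecodeError:
                    return None
    return None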
Do
Emulate real browsers via headers (User-Agent, Accept-Language, Referer).
Use sticky proxies for session flows.
Add randomized jitter in delays; avoid fixed intervals.
Rotate user agents and proxies; keep concurrency low per IP.
Log proxy used, status code, response size, and parse result.
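A simple way to capture those per-request fields as structured log lines (the field names are illustrative, and credentials are stripped from the proxy URL before logging):

# request_logging_sketch.py — structured per-request logging
import json, logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("scraper")

def log_request(url, proxy, status_code, body, parsed):
    log.info(json.dumps({
        "url": url,
        "proxy": proxy.split("@")[-1],   # log host:port only, never credentials
        "status": status_code,
        "bytes": len(body) if body else 0,
        "parsed_ok": bool(parsed and parsed.get("title")),
    }))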
Never
Never attempt to bypass CAPTCHAs or login walls. If you encounter them, stop and evaluate APIs, manual verification, or reduce scope.
Task queue: Celery/RabbitMQ or Kafka.
Workers: stateless fetch → parse → normalize → store.
Dedup & schedule: Redis for seen IDs & scheduling windows.
Storage: Postgres for normalized data + object store for raw HTML.
Headless pool: separate fleet for JS-only pages.
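A stripped-down sketch of one such worker, assuming a local Redis instance and the fetch()/parse_product() helpers from Step 3; the queue and key names are illustrative, and Redis stands in for the Postgres/object-store layer:

# worker_sketch.py — minimal stateless worker: fetch → parse → store, with Redis dedup
import json
import redis
from minimal_scraper import fetch, parse_product

r = redis.Redis()

def process_one():
    raw = r.lpop("ebay:queue")               # pull the next item ID from the work queue
    if raw is None:
        return False
    item_id = raw.decode()
    if not r.sadd("ebay:seen", item_id):     # dedup: skip IDs already processed
        return True
    url = f"https://www.ebay.com/itm/{item_id}"
    record = parse_product(fetch(url))
    r.set(f"ebay:item:{item_id}", json.dumps(record))  # stand-in for Postgres/object store
    return True

if __name__ == "__main__":
    while process_one():
        pass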
Hybrid pool: datacenter for searches, residential for product/detail pages.
Sticky session tokens: create one token per worker, reuse 5–20 requests, refresh on failures (see the sketch below).
Geo mapping: map TLD/marketplace → local IP geo.
Health checks: probe proxies and evict slow/failed IPs.
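The sticky-session reuse pattern can be as simple as the sketch below: pin one proxy per worker for a bounded number of requests and rotate on failure. It is provider-agnostic; the exact way GoProxy encodes session tokens is documented on their side.

# sticky_session_sketch.py — reuse one proxy per worker, rotate after N uses or on failure
import random

class StickyProxy:
    def __init__(self, pool, max_uses=15):   # reuse 5–20 requests; 15 chosen here
        self.pool = pool
        self.max_uses = max_uses
        self._rotate()

    def _rotate(self):
        self.current = random.choice(self.pool)
        self.uses = 0

    def get(self):
        if self.uses >= self.max_uses:
            self._rotate()
        self.uses += 1
        return self.current

    def report_failure(self):
        self._rotate()                        # refresh the session on failures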
Estimate your per-1k-request cost and add 20–30% for retries; actual costs vary widely depending on your residential vs datacenter mix. Contact us with your project details for personalized pricing.
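A back-of-envelope version of that estimate (the per-1k rate here is purely hypothetical; substitute your plan's actual pricing):

# cost_estimate.py — rough spend check before scaling
requests_planned = 100_000
cost_per_1k = 2.50                       # hypothetical rate, USD per 1,000 requests
retry_overhead = 0.25                    # add 20–30% for retries; 25% used here
estimate = requests_planned / 1000 * cost_per_1k * (1 + retry_overhead)
print(f"Estimated spend: ${estimate:,.2f}")  # -> Estimated spend: $312.50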
scraper_requests_total
scraper_errors_total{code} (403/429/5xx)
proxy_health_failures_total
parser_missing_fields_total
Request latency percentiles (p50/p95/p99)
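If you use Prometheus, these counters map directly onto prometheus_client; the latency histogram name and the port are assumptions:

# metrics_sketch.py — expose the metrics above with prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("scraper_requests_total", "Total fetch attempts")
ERRORS = Counter("scraper_errors_total", "Fetch errors by status code", ["code"])
PROXY_FAILURES = Counter("proxy_health_failures_total", "Failed proxy health probes")
MISSING_FIELDS = Counter("parser_missing_fields_total", "Records missing required fields")
LATENCY = Histogram("scraper_request_latency_seconds", "Request latency")

start_http_server(9100)  # scrape endpoint for Prometheus

def record(status_code, elapsed):
    REQUESTS.inc()
    LATENCY.observe(elapsed)
    if status_code >= 400:
        ERRORS.labels(code=str(status_code)).inc()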
Blocked rate (403/429): alert if >2% over 10 min or if doubling in 10 min.
Proxy failure: alert if >10% proxies fail in 1 hour.
Parser failure: alert if >5% missing required fields for 24 hours.
Cost burn: alert on unexpected spikes.
1. Pause new tasks in the affected queue.
2. Reduce global QPS by 50%.
3. Switch to a standby proxy pool (residential only) or rotate to spare proxies.
4. Run parser unit tests on a sample of 100 pages.
5. Resume at 10% throughput; ramp slowly.
Keep selector/JSON path manifest and unit tests with saved HTML fixtures. Run nightly and fail your CI on significant regressions.
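A minimal pytest sketch of that fixture-based check (the fixtures/ directory and required fields are examples; it imports parse_product from the Step 3 script):

# test_parser.py — run parse_product() against saved HTML fixtures so CI catches selector drift
from pathlib import Path
import pytest
from minimal_scraper import parse_product

FIXTURES = sorted(Path("fixtures").glob("*.html"))

@pytest.mark.parametrize("fixture", FIXTURES, ids=lambda p: p.name)
def test_required_fields_present(fixture):
    record = parse_product(fixture.read_text(encoding="utf-8"))
    assert record["title"], f"title missing in {fixture.name}"
    assert record["price"], f"price missing in {fixture.name}"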
Example Prometheus alert expression for the overall error rate (>2% over 10 minutes):
sum(rate(scraper_errors_total[10m])) / sum(rate(scraper_requests_total[10m])) > 0.02
Revalidate selectors monthly or more often for volatile categories.
Purge raw HTML with PII after a retention window (e.g., 30 days).
Keep an internal audit trail of data usage and access.
For commercial use, consult legal counsel and consider partnerships or licensed data options.
Sudden 403s: reduce concurrency, use residential sticky proxies, increase delays.
Frequent CAPTCHAs: pause, consider API/manual review, or reduce scope.
Missing fields: re-run selector tests; update CSS/XPath or JSON paths.
High cost: cache, sample low-priority categories, use delta updates.
If you’re non-technical and need a handful of items:
=IMPORTXML("https://www.ebay.com/itm/315889834153", "//h1")
Warning: IMPORTXML is fragile and will break with small HTML changes. No-code tools are fine for small experiments but are not suitable for scale.
Q: Is scraping eBay legal?
A: Not automatically illegal — but it may violate eBay’s Terms of Service. Prefer official APIs for commercial projects and document your usage.
Q: How fast can I scrape?
A: Conservative defaults: 1–4 concurrent workers per residential IP. For search pages: 0.3–1 req/sec. Always monitor and adjust.
Q: Do I need a headless browser?
A: Only if data is rendered entirely by JS and no XHR provides the data. Headless is more expensive and may increase detection risk.
Q: How do I get sold/completed listing data?
A: Check eBay’s APIs and marketplace analytics endpoints first; scraping sold listings can be more fragile and may require additional handling.
This step-marked workflow (Plan → Provision GoProxy → Minimal Scrape → Harden → Monitor → Scale) gets you from a test scrape to a production workflow while minimizing risk. Start small, instrument everything, prefer APIs when possible, and treat proxies, monitoring, and legal review as first-class parts of your system.
Ready to try? Grab a tool, set up proxies, and transform eBay data into insights. Sign up and start a test today!