Sep 5, 2025
Learn why your web scraper is slow and get step-by-step fixes like async I/O, concurrency, proxies, and 2025 trends for faster data extraction.
Web scraping is powerful for data extraction, but nothing kills productivity like slow speeds. Projects often start fast and then slow down as page counts, data volume, or anti-bot measures grow. In this guide, we'll start with diagnosis, move to quick fixes, and escalate to advanced techniques, whether you're using Python with Selenium or simpler libraries like Requests.
Slow scraping can lead to incomplete data sets, higher costs for cloud resources, and even IP bans from target sites. Here are common reasons:
Emulating a full browser loads JavaScript, CSS, images, and other elements that aren't necessary. This can add seconds per page, multiplying into hours for large-scale scrapes.
Many basic scrapers handle one request at a time. If you're scraping thousands of pages, waiting for each response sequentially creates massive bottlenecks.
Slow internet, distant servers, high latency, and anti-scraping measures like rate limiting or CAPTCHAs can all drag down performance.
Using slow parsers (e.g., BeautifulSoup's default HTML parser) or poorly optimized selectors can chew up CPU time. For dynamic sites, rendering JavaScript adds extra overhead.
Running on a local machine with limited CPU cores or memory means your scraper can't parallelize tasks effectively. At scale (e.g., millions of pages), this becomes a showstopper.
Websites detect scraping through patterns like rapid requests from the same IP, leading to blocks that require manual intervention or slowdowns to mimic human behavior.
As users report in scraping communities, jobs that should take hours can stretch into days. For instance, a Python Selenium setup might take 30 seconds per page due to full rendering, while a streamlined alternative could drop that to under a second (often a 5-20x speedup when switching to raw HTTP).
To make this guide more relevant, consider your project type; the scenarios below address real user concerns like avoiding blocks or handling JS.
Each scenario has a tailored path—use the "Choose Your Path" table below to jump to the right sections.
Here's a table to help you navigate based on your bottleneck and scenario.
Scenario | Likely Bottleneck | Recommended Sections | Expected Speedup |
Small Project | Network/Overhead | Diagnose, Quick Wins | 2-5x |
Medium Scraping | Sequential/I/O | Quick Wins, Introduce Concurrency | 5-10x |
Large Scraping | Scale/Resource | Introduce Concurrency, Large Scale | 10x+ |
JS-Heavy Pages | Browser Rendering | Handling JS-Heavy Pages | 5-20x |
Login/Session Scrapes | Anti-Bot/Sessions | Anti-Bot Strategies (in Concurrency) | 3-8x |
Measure first: Capture per-page p50/p95 and profile request vs. parse vs. write times.
Quick wins: Switch to HTTP requests when possible, reuse sessions, use lxml, batch writes, and add checkpointing.
Goal: know which part of your pipeline to optimize so you don't waste time on the wrong fixes.
Per-page elapsed: Log each page's fetch → parse → save durations with timestamps (see the timing sketch after this list).
Aggregate metrics: p50, p95, throughput (pages/min), error rates (timeouts, 429s).
Resource usage: CPU%, memory, open fds, network I/O.
Server responses: Status codes, TTFB (Time to First Byte), response size.
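A minimal sketch of that per-page timing log, assuming Requests for fetching; parse_page and save_record are placeholders for your own logic, and the total time is appended to times.csv (one value per line) so it feeds the percentile snippet further down:
import time
import requests

session = requests.Session()

def parse_page(html):
    return len(html)                               # placeholder for your real parsing

def save_record(data):
    pass                                           # placeholder for your real storage

def timed_scrape(url):
    t0 = time.perf_counter()
    html = session.get(url, timeout=20).text       # fetch
    t1 = time.perf_counter()
    data = parse_page(html)                        # parse
    t2 = time.perf_counter()
    save_record(data)                              # save
    t3 = time.perf_counter()
    print(f"fetch={t1 - t0:.2f}s parse={t2 - t1:.2f}s save={t3 - t2:.2f}s")
    with open('times.csv', 'a') as f:              # one total per line for the percentile snippet
        f.write(f"{t3 - t0}\n")

timed_scrape('https://example.com/page')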
curl RTT/TTFB
Create curl-format.txt with %{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n:
curl -w "@curl-format.txt" -o /dev/null -s "https://example.com/page"
Python profiler (cProfile)
python -m cProfile -o profile.out my_scraper.py
python - <<'PY'
import pstats
p = pstats.Stats('profile.out')
p.sort_stats('cumtime').print_stats(30)
PY
Compute percentiles
After logging per-request times to times.csv:
import numpy as np
times = np.loadtxt('times.csv') # one numeric value per line (seconds)
print("p50:", np.percentile(times, 50))
print("p95:", np.percentile(times, 95))
Network/wait > 50% → I/O-bound (use async/threading/proxies).
Parsing CPU > 50% → CPU-bound (use multiprocessing).
Browser render time dominates → Rendering overhead; consider API extraction or targeted headless usage.
High 429/5xx → Server throttling/anti-bot—slow down, rotate IPs (GoProxy).
Verification: Run on a sample of 100 pages; if p95 > 5s, prioritize that bottleneck.
These usually give the best ROI with minimal code changes.
Prerequisites: Basic Python setup; install libs via pip if needed (e.g., pip install requests lxml).
Why: Browsers add seconds per page; HTTP fetches in milliseconds.
Steps: If the data is in plain HTML/JSON, use requests instead of Selenium (see the sketch below).
Pitfall: Won't work for JS-rendered content—check with browser dev tools.
Test: Time 10 pages before/after switch.
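A rough sketch of the switch, assuming the data is present in the static HTML; the URL, headers, and CSS selector are illustrative:
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com/page', timeout=15,
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'lxml')   # fast parser, see Quick Wins below
titles = [h.get_text(strip=True) for h in soup.select('h2.title')]   # illustrative selector
print(titles)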
Why: Avoids repeated TCP/TLS handshakes.
Snippet (Requests example with pooling + retries):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(total=3, backoff_factor=0.5, status_forcelist=(429, 500, 502, 503))
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=100, max_retries=retries)
session = requests.Session()
session.mount('http://', adapter)
session.mount('https://', adapter)
Pitfall: Overly large pools can hit OS limits—start with 50.
Expected Gain: 20-50% for repeated requests
Block images/fonts in headless browsers or set appropriate headers like Accept: text/html. Smaller payloads = faster transfers and parsing.
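If you do need a headless browser, one way to block heavy resources in Playwright, as a sketch; which resource types to block is a per-site judgment call:
from playwright.sync_api import sync_playwright

def block_heavy(route):
    # Abort images, fonts, and media before they download
    if route.request.resource_type in ("image", "font", "media"):
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", block_heavy)
    page.goto("https://example.com/page")
    html = page.content()
    browser.close()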
Switch BeautifulSoup(..., 'html.parser') to BeautifulSoup(..., 'lxml') (install: pip install lxml).
Use direct CSS/XPath selectors instead of manual DOM traversal (example below).
Gain: 6-10% on parsing.
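For example (the HTML and selector are illustrative):
from bs4 import BeautifulSoup

html = "<div><span class='price'>9.99</span></div>"   # stand-in for a fetched page
soup = BeautifulSoup(html, 'lxml')                     # lxml backend instead of html.parser
prices = [el.get_text(strip=True) for el in soup.select('span.price')]
print(prices)   # ['9.99']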
Flush DB/writes every 500-5,000 records; use pickle for checkpoints (sketch below).
Pitfall: Ensure idempotency to avoid duplicates on resume.
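A minimal sketch of batching plus pickle checkpointing; BATCH_SIZE, CHECKPOINT, scrape_one, and write_batch are placeholders to adapt to your own fetch logic and storage:
import os, pickle

BATCH_SIZE = 1000
CHECKPOINT = 'checkpoint.pkl'

def scrape_one(url):
    return {'url': url}               # placeholder for your fetch + parse

def write_batch(batch):
    pass                              # placeholder: bulk insert into your DB or append to a file

def load_done():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, 'rb') as f:
            return pickle.load(f)     # resume: URLs already processed
    return set()

def save_done(done):
    with open(CHECKPOINT, 'wb') as f:
        pickle.dump(done, f)

urls = ['https://example.com/1', 'https://example.com/2']   # illustrative
done, batch = load_done(), []
for url in urls:
    if url in done:
        continue                      # idempotent resume: skip already-processed pages
    batch.append(scrape_one(url))
    done.add(url)
    if len(batch) >= BATCH_SIZE:
        write_batch(batch)            # flush every BATCH_SIZE records
        save_done(done)               # checkpoint after each flush
        batch = []
write_batch(batch)                    # flush the remainder
save_done(done)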
As of September 2025, web scraping trends incorporate lighter tools and AI for efficiency:
Use Requests-HTML for JS without full browsers (pip install requests-html). Example:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(url)
r.html.render() # Renders JS
Gain: 30-50% faster than Selenium for dynamic sites.
Pitfall: Still heavier than pure HTTP—use only when needed.
Leverage ML for auto-optimizing selectors or parsing.
Example: Use Hugging Face models (via the transformers library) for entity extraction, which can offload parsing CPU by around 20% (see the snippet below).
Why: Reduces manual tuning; ideal for unstructured data.
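An illustrative sketch using the transformers pipeline API for named-entity extraction on scraped text; the default model download and any CPU savings are assumptions to validate on your own data:
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")   # downloads a default NER model on first run
text = "Acme Corp opened a new office in Berlin in March 2025."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))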
2x faster than Selenium for automation. Install: pip install playwright, then run playwright install to download browser binaries. Example:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    html = page.content()   # grab the rendered HTML
    browser.close()
Pitfall: Use the async API (playwright.async_api) for better concurrency.
The right model depends on your bottleneck. Start with profiling to confirm.
Prerequisites: pip install aiohttp (concurrent.futures is part of the Python standard library).
Scales to thousands; best for high-throughput.
Example with proxy rotation:
import asyncio, aiohttp, random
from asyncio import Semaphore

CONCURRENCY = 100
sema = Semaphore(CONCURRENCY)
PROXIES = ["http://user:pass@p1:port", "http://user:pass@p2:port"]  # From GoProxy

async def fetch(session, url):
    proxy = random.choice(PROXIES)
    async with sema:
        async with session.get(url, proxy=proxy, timeout=20) as resp:
            resp.raise_for_status()
            return await resp.text()

async def main(urls):
    timeout = aiohttp.ClientTimeout(total=30)
    conn = aiohttp.TCPConnector(limit=0)
    async with aiohttp.ClientSession(connector=conn, timeout=timeout) as session:
        tasks = [fetch(session, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(main(urls))
Pitfall: Ensure async-compatible libs; handle exceptions in gather.
Test: On 100 URLs; reduce CONCURRENCY if errors >5%.
Gain: 10x+ (e.g., 126s to 7s for 100 pages).
Easier to adopt with blocking clients like Requests. Snippet:
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_url(url):
    return session.get(url).text  # Use the pooled session from Quick Wins

with ThreadPoolExecutor(max_workers=20) as ex:
    futures = [ex.submit(scrape_url, u) for u in urls]
    for f in as_completed(futures):
        data = f.result()
        # Parse & save
Tuning: Start workers = min(32, 2 * cpu_count()); increment by 10, monitor p95/errors.
Gain: 50-100%.
Use multiprocessing.Pool for parsing. Gain: 100-150% on multi-core.
Basic Idea: Fetch async/threaded → Queue raw HTML → Parser processes.
Snippet:
from multiprocessing import Pool

def parse_html(html):
    # Your parsing logic
    return parsed_data

with Pool(processes=4) as pool:
    results = pool.map(parse_html, html_list)
Pitfall: High memory; use queues for backpressure (e.g., multiprocessing.Queue).
Test: Monitor CPU%; add error handling.
Async fetching → process pool parsing → batched writes. This separates I/O and CPU concerns and scales well but requires more orchestration (queues, backpressure handling).
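A compact sketch of that hybrid shape, assuming aiohttp for async fetching and a ProcessPoolExecutor for CPU-bound parsing; parse_html and the example URLs are placeholders:
import asyncio, aiohttp
from concurrent.futures import ProcessPoolExecutor

def parse_html(html):
    return len(html)                              # placeholder for CPU-bound parsing

async def fetch(session, url):
    async with session.get(url, timeout=20) as resp:
        resp.raise_for_status()
        return await resp.text()

async def run_pipeline(urls):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=4) as pool:
        async with aiohttp.ClientSession() as session:
            pages = await asyncio.gather(*(fetch(session, u) for u in urls),
                                         return_exceptions=True)
        html_list = [p for p in pages if not isinstance(p, Exception)]
        # Fan parsing out across worker processes
        parsed = await asyncio.gather(
            *(loop.run_in_executor(pool, parse_html, h) for h in html_list))
    return parsed                                 # batch-write these in one go

if __name__ == '__main__':
    urls = ['https://example.com/1', 'https://example.com/2']   # illustrative
    print(asyncio.run(run_pipeline(urls)))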
Throttle & Randomize: time.sleep(random.uniform(0.5, 2.0)).
Rotate IPs: Use a reliable proxy service for managed pools, like GoProxy. Fetch: proxies = requests.get('https://api.goproxy.com/get-proxies?key=your_key').json()['proxies'] (see the sketch after this list).
Session Reuse: Persist cookies to avoid relogins.
Pitfall: Monitor for blocks; respect robots.txt.
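A small sketch combining throttling, rotation, and session reuse for a blocking Requests client; the proxy URLs and backoff numbers are illustrative:
import random, time
import requests

PROXIES = ["http://user:pass@p1:port", "http://user:pass@p2:port"]   # e.g., from GoProxy

session = requests.Session()                      # reuses cookies across requests

def polite_get(url, max_attempts=4):
    for attempt in range(max_attempts):
        proxy = random.choice(PROXIES)            # rotate per request
        resp = session.get(url, proxies={"http": proxy, "https": proxy}, timeout=20)
        if resp.status_code == 429:
            time.sleep(2 ** attempt)              # exponential backoff on throttling
            continue
        resp.raise_for_status()
        time.sleep(random.uniform(0.5, 2.0))      # randomized delay between pages
        return resp.text
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")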
1. Discover API Calls: Use the DevTools Network tab to find XHR/Fetch JSON endpoints; prefer calling these over rendering (see the sketch after this list).
2. If No API: Use headless (Playwright preferred) only for needed pages.
3. Best Practices: Reuse instances, block extras, stealth headers, random delays.
4. Authenticated Flows: Use sticky proxies to maintain sessions (GoProxy supports sessions of up to 60 minutes; customized proxy pools can extend this to 120 minutes).
Note: Browser is 5-20x slower than HTTP; use sparingly.
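An illustration of step 1: once DevTools shows the page's own JSON endpoint, call it directly instead of rendering. The endpoint, parameters, and response keys below are hypothetical:
import requests

# Hypothetical endpoint discovered in the DevTools Network tab (XHR/Fetch)
API_URL = "https://example.com/api/products"
resp = requests.get(API_URL,
                    params={"page": 1, "per_page": 100},
                    headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
                    timeout=15)
resp.raise_for_status()
items = resp.json().get("items", [])   # adapt keys to the real response shape
print(len(items))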
If you’re aiming for massive scale, design for reliability and ops:
1. Scheduler — store list of targets, priority queues, and retry metadata.
2. Fetcher fleet — horizontally scalable workers (containers/VMs) performing HTTP calls or headless browsing; controlled concurrency per worker.
3. Parser workers — CPU-bound processes or services separating parsing from fetching.
4. Proxy management (GoProxy) — pool management, IP rotation, sticky sessions, geo-targeting. Unlimited-traffic rotating residential proxy plans are a strong fit for scraping at this scale.
5. Storage / DB — bulk writes, partitioning, dedup keys.
6. Monitoring & alerting — per-page latency, error rates, queue depth, worker health.
7. Checkpointing & resume — persistent job state to resume interrupted runs.
Shard target lists across worker groups to avoid overlap.
Autoscale fetcher fleet based on queue depth and target error rate.
Keep producers rate-limited to avoid bursts that trigger server defenses.
Implement idempotency keys for safe retries and dedup.
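One way to implement idempotency keys, as a sketch: derive a stable key from the URL and let the storage layer ignore duplicates, so retries never create duplicate rows. The sqlite schema here is illustrative:
import hashlib, sqlite3

db = sqlite3.connect("scrape.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (key TEXT PRIMARY KEY, url TEXT, body TEXT)")

def idempotency_key(url):
    return hashlib.sha256(url.encode()).hexdigest()   # stable key per target

def save_page(url, body):
    # INSERT OR IGNORE makes retries safe: a duplicate key is a no-op
    db.execute("INSERT OR IGNORE INTO pages VALUES (?, ?, ?)",
               (idempotency_key(url), url, body))
    db.commit()

save_page("https://example.com/1", "<html>...</html>")
save_page("https://example.com/1", "<html>...</html>")   # retry: no duplicate row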
If speed doesn’t improve:
Many 429/5xx? → Slow down + rotate proxies.
CPU pegged on parsing? → Subprocesses.
Retries spiking? → Proxy/network check.
Out of memory/handles? → Scale horizontally.
High browser time? → API alternatives.
Q: Is Selenium always bad for speed?
A: No—necessary for some JS, but switch to Playwright (2x faster) or Requests-HTML (30-50% gain) when possible.
Q: How many proxies do I need?
A: Start with tens for mid-scale; scale while monitoring errors. Use sticky for sessions via GoProxy.
Q: Threading vs. Asyncio—which to choose?
A: Threads for easy adoption with blocking code; Asyncio for high-throughput (better scaling).
Q: How to avoid getting blocked?
A: Rotation, delays, session reuse, backoff on 429s, monitor patterns.
Q: What's new in 2025 for speed?
A: AI for auto-parsing (20% CPU offload), lighter JS tools like Requests-HTML.
Q: Can I use no-code tools?
A: For beginners/small jobs, yes—browser extensions or platforms for quick setups.
Q: How to measure overall improvement?
A: Rerun profiler post-changes; aim for <5% error rate and 2x+ throughput.
If your question is "why is my web scraper slow?", the answer depends on your bottleneck, but most scrapers are fixable with measurement and the right tooling. Start with diagnosis and quick wins, profile your bottlenecks, apply concurrency and proxy strategies (GoProxy for rotation and sticky sessions; sign up and get your free trial today), and scale as needed. Users report 2-10x speedups; follow these steps for efficient data extraction.