
Master Selenium Web Scraping: 2025 Step-by-Step Guide for Beginners and Experts

Post Time: 2025-09-22 Update Time: 2025-09-22

Selenium stands out for web scraping, especially on dynamic, JavaScript-heavy websites. This guide is a step-by-step path from beginner to expert: environment setup, a runnable starter scraper, proxy integration, anti-detection tactics, retries & checkpointing, scaling patterns, and production hardening.

Selenium Web Scraping Guide

Who This Is For

Beginners: Follow the Quick Start, then the modular starter, to get your first scraper running. Focus on the basics: setup and simple extraction.

Intermediate / Pros: Jump to sections on proxies, rotation, stealth, scaling, and ops for advanced features like handling CAPTCHAs or deploying to the cloud.

When & Why Use Selenium for Web Scraping?

Use Selenium when the target site requires real browser behavior (heavy JavaScript, user interaction, forms, infinite scroll, or content loaded after events). Unlike simpler libraries like BeautifulSoup or Requests, Selenium handles dynamic content where pages load data via AJAX or require clicks/forms.

When content is static or provided via an API, prefer requests + BeautifulSoup for speed and simplicity.
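If you are unsure which camp a page falls into, a quick static fetch like the sketch below (assuming requests and beautifulsoup4 are installed; books.toscrape.com is just the demo target used throughout this guide) shows whether the data is already in the HTML. If it is, you may not need Selenium at all:

# static_check.py - does the raw HTML already contain the data?
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://books.toscrape.com", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Same selector idea as the Selenium examples below: book title links
for a in soup.select("article.product_pod h3 a"):
    print(a.get("title"))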

Core Concepts Overview

| Concept | Description | Tips |
| --- | --- | --- |
| WebDriver | Programmatic controller for a real browser. | Use webdriver-manager for auto-syncing versions. |
| Locators | By.ID, By.CSS_SELECTOR, By.XPATH. | Prefer stable CSS or well-targeted XPath; test in browser dev tools (F12). |
| Waits | Implicit waits are global and can cause subtle bugs; explicit waits (WebDriverWait + expected_conditions) are the better default. | Explicit waits prevent flakiness on slow loads. |
| Headless | Faster and less resource-heavy, but sometimes more detectable. | In 2025, combine with stealth libraries like SeleniumBase for better evasion. |
| Proxy | Routes browser traffic; used for IP rotation, geo-targeting, and evasion. | Residential proxies are key for tough sites. |
| Resource Blocking | Blocking images/fonts/CSS speeds runs but may break JS-heavy pages. | Test per site; start with images only. |
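For reference, a minimal explicit wait looks like the snippet below (a sketch; driver is assumed to be an already-created WebDriver, and the selector matches the books.toscrape.com demo used later):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block up to 15 seconds until at least one product link is present, instead of sleeping blindly
wait = WebDriverWait(driver, 15)
links = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article.product_pod h3 a")))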

Security, Ethics & Legal Checklist

Respect robots.txt and site Terms (robots.txt is advisory but informative).

Do not scrape personal or protected data unlawfully; comply with current privacy regulations such as GDPR and CCPA by anonymizing data and obtaining consent where required.

Use a secrets manager or CI secret variables for credentials; never commit .env.

Set up alerts on spikes in errors or abnormal behavior.

Add rate-limiting and exponential backoff to avoid accidentally overwhelming targets.
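As a rough illustration, rate-limiting plus exponential backoff can be as small as the sketch below (polite_get is a hypothetical helper; the delay values are placeholders to tune per target):

import random
import time

def polite_get(driver, url, max_attempts=4, base_delay=1.0):
    """Fetch a URL with a small random pause and exponential backoff on failure (sketch)."""
    for attempt in range(1, max_attempts + 1):
        time.sleep(random.uniform(0.5, 1.5))          # baseline rate limit between requests
        try:
            driver.get(url)
            return True
        except Exception:
            # back off 1s, 2s, 4s, ... plus jitter before the next attempt
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.random())
    return False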

Step 1. Prerequisites

Let's start with the basics. We will use Python as the example language (the most popular choice for scraping).

1. Install Chrome or Chromium and confirm it runs.

2. Install Python 3.8+ (3.11 recommended). Verify: python --version

3. Create project folder and virtual environment:

python -m venv venv

source venv/bin/activate   # Windows: venv\Scripts\activate

4. Create requirements.txt (pinned versions in the Appendix) and install:

pip install -r requirements.txt

5. Create .env from .env.example in the project root and edit credentials if using proxies (sample files are shown after this list).

6. Run quick_start.py (next step) to verify environment.
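For reference, the examples in this guide rely on the packages below (pin exact versions per the Appendix), and the .env keys match what scraper.py in Step 3 reads; adjust the names to your proxy provider:

requirements.txt (unpinned here; see Appendix for pinned versions):

selenium
selenium-wire
webdriver-manager
python-dotenv
tenacity
requests

.env.example:

USE_PROXY=false
GOPROXY_USER=your_username
GOPROXY_PASS=your_password
GOPROXY_HOST=proxy.goproxy.com
GOPROXY_PORT=8000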

Cross-platform env var notes

macOS / Linux:

export GOPROXY_USER="user"

export GOPROXY_PASS="pass"

Windows (PowerShell):

$env:GOPROXY_USER="user"

$env:GOPROXY_PASS="pass"

or edit .env and rely on python-dotenv for local development.

Step 2. Quick Start: Verify Environment (No Proxy)

Save this as quick_start.py and run it. This proves Python + Selenium are installed and working.

# quick_start.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

def quick_start():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1280,800")

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get("https://books.toscrape.com")
        books = driver.find_elements(By.CSS_SELECTOR, "article.product_pod h3 a")
        for b in books:
            print(b.get_attribute("title"))
    finally:
        driver.quit()

if __name__ == "__main__":
    quick_start()

Run:

python quick_start.py

Expected output: a list of book titles printed to console. If you see SessionNotCreatedException, update Chrome or let webdriver-manager handle it (it will download a compatible driver).

Step 3. Starter Scraper

This is a single-file starter you can copy, customize, and run. It demonstrates:

  • .env config (via python-dotenv)
  • GoProxy integration via selenium-wire
  • robust pagination using DOM-change detection (not blind sleeps)
  • resource blocking & request interception
  • cookie save/load for sessions
  • tenacity retries & exponential backoff
  • CSV checkpointing and logging
  • debug mode toggles for easier troubleshooting

Save as scraper.py.

# scraper.py
import os
import csv
import time
import random
import logging
import json

from dotenv import load_dotenv
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from seleniumwire import webdriver   # pip install selenium-wire
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException

# --- Load environment
load_dotenv()
USE_PROXY = os.getenv("USE_PROXY", "false").lower() == "true"
GP_USER = os.getenv("GOPROXY_USER")
GP_PASS = os.getenv("GOPROXY_PASS")
GP_HOST = os.getenv("GOPROXY_HOST", "proxy.goproxy.com")
GP_PORT = os.getenv("GOPROXY_PORT", "8000")

# --- Logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper")

# --- Proxy health check (quick)
def test_proxy_httpbin(host, port, user=None, pw=None, timeout=8):
    import requests
    proxy = f"http://{user}:{pw}@{host}:{port}" if user else f"http://{host}:{port}"
    proxies = {"http": proxy, "https": proxy}
    try:
        r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=timeout)
        return True, r.json()
    except Exception as e:
        return False, str(e)

# --- Driver factory (selenium-wire) with debug toggle and resource blocking
from selenium.webdriver.chrome.options import Options as ChromeOptions

def make_driver(proxy=None, block_images=True, debug=False):
    options = ChromeOptions()
    if not debug:
        options.add_argument("--headless=new")
    options.add_argument("--window-size=1280,800")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36")
    if block_images:
        prefs = {"profile.managed_default_content_settings.images": 2,
                 "profile.managed_default_content_settings.fonts": 2}
        options.add_experimental_option("prefs", prefs)

    seleniumwire_opts = None
    if proxy:
        auth = f"{proxy['user']}:{proxy['pass']}@{proxy['host']}:{proxy['port']}"
        seleniumwire_opts = {
            "proxy": {
                "http":  f"http://{auth}",
                "https": f"https://{auth}",
                "no_proxy": "localhost,127.0.0.1"
            }
        }
        logger.info("Using proxy: %s:%s", proxy['host'], proxy['port'])

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                              options=options, seleniumwire_options=seleniumwire_opts)

    # Basic interceptor: abort obvious static assets (disabled in debug mode)
    def interceptor(request):
        if not debug and request.path.endswith(('.png', '.jpg', '.jpeg', '.gif', '.woff2', '.woff')):
            request.abort()
    driver.request_interceptor = interceptor
    return driver

# --- Robust pagination + extraction with tenacity retries
@retry(reraise=True, stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=8),
       retry=retry_if_exception_type((TimeoutException, StaleElementReferenceException)))
def extract_titles(driver, start_url):
    driver.get(start_url)
    wait = WebDriverWait(driver, 15)
    rows = []

    while True:
        # wait for product elements
        items = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article.product_pod h3 a")))
        # extract
        for it in items:
            rows.append({"title": it.get_attribute("title")})

        # attempt to find a next button; if absent, finish
        try:
            next_btn = driver.find_element(By.CSS_SELECTOR, "li.next a")
        except Exception:
            break

        # record first item to detect page change
        first_title = items[0].get_attribute("title") if items else None
        next_btn.click()

        # wait for either URL change or first item change; fallback to a small wait
        try:
            wait.until(lambda d: d.execute_script("return document.readyState") == "complete")
            wait.until(lambda d: d.find_element(By.CSS_SELECTOR, "article.product_pod h3 a").get_attribute("title") != first_title)
        except Exception:
            time.sleep(random.uniform(1, 2))
            continue

    return rows

# --- Cookie utilities
def save_cookies(driver, path="cookies.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)

def load_cookies(driver, path="cookies.json"):
    with open(path, "r", encoding="utf-8") as f:
        cookies = json.load(f)
    for c in cookies:
        try:
            driver.add_cookie(c)
        except Exception:
            pass

# --- CSV append helper (checkpointing)
def append_rows_csv(path, rows, fieldnames):
    exists = os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not exists:
            writer.writeheader()
        writer.writerows(rows)

# --- Proxy picker
def pick_proxy_from_env():
    if not GP_USER or not GP_PASS:
        return None
    return {"host": GP_HOST, "port": GP_PORT, "user": GP_USER, "pass": GP_PASS}

def main():
    url = "https://books.toscrape.com"
    proxy = pick_proxy_from_env() if USE_PROXY else None

    # Optional: quick proxy health test
    if proxy:
        ok, info = test_proxy_httpbin(proxy['host'], proxy['port'], proxy['user'], proxy['pass'])
        logger.info("Proxy test ok=%s info=%s", ok, info)
        if not ok:
            logger.warning("Proxy health check failed. Proceeding anyway may cause driver errors.")

    # Toggle debug=True to see browser and disable interceptor for easier debugging
    debug_mode = False
    driver = make_driver(proxy=proxy, block_images=True, debug=debug_mode)
    start = time.time()
    try:
        rows = extract_titles(driver, url)
        append_rows_csv("output.csv", rows, fieldnames=["title"])
        logger.info("Saved %d rows in %.2fs", len(rows), time.time() - start)
    except Exception:
        logger.exception("Failed to scrape: %s", url)
    finally:
        driver.quit()

if __name__ == "__main__":
    main()

How to run

1. cp .env.example .env and edit .env if using proxies.

2. Activate venv and pip install -r requirements.txt.

3. python scraper.py — scrapes all pages and writes results to output.csv (the append_rows_csv helper supports incremental, per-batch checkpointing if you call it inside your page loop).

4. If something fails, set debug_mode = True near the call to make_driver(...) to see the browser and disable asset blocking.

Expected output.csv sample

title

"A Light in the Attic"

"Tipping the Velvet"

...

Step 4. Add Proxies with Selenium

Why use proxies? Avoid IP blocks, geo-target content, scale across many IPs.

Proxy Types & Choices

Residential: harder to detect, higher cost and latency. Use for anti-bot sensitive targets (marketplaces, ticketing).

Datacenter: cheap and fast; easier to detect. Use for news, public listings.

Mobile: highest evasion, highest cost; rarely necessary. Use only for high-stakes targets such as social media.

Sticky vs Rotating

Sticky sessions: Same IP for an entire logical session (e.g., logins/carts).

Per-session rotation: Assign a fresh proxy per worker/task (recommended for most scrapers).

Choose rotation frequency based on sensitivity: start with per-session, and for high-sensitivity targets experiment with N requests per IP (N=1..10) and monitor blocks.

Setup Steps (GoProxy)

1. Sign up, choose a rotating proxy plan as needed, and get your credentials from the dashboard.

2. Store credentials in .env or a secrets store; never hardcode.

3. Health check: Use the test_proxy_httpbin helper above to verify connectivity before creating a driver.

4. For larger jobs, maintain a small proxy pool: randomize selection, log failures, and remove bad nodes (see the sketch below).

Tip: For geo-targeting (e.g., regional prices), specify the country in the GoProxy dashboard.
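A per-session pool can stay very small. The sketch below is a hypothetical helper: pick() returns a dict you can pass straight to make_driver() from scraper.py, and report_failure() drops nodes that keep failing:

import random
import logging

logger = logging.getLogger("scraper")

class ProxyPool:
    """Tiny in-memory pool: random pick per session, drop nodes after repeated failures (sketch)."""
    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)            # each entry: {"host", "port", "user", "pass"}
        self.failures = {p["host"]: 0 for p in self.proxies}
        self.max_failures = max_failures

    def pick(self):
        # Random selection per worker/session; returns None when the pool is exhausted
        return random.choice(self.proxies) if self.proxies else None

    def report_failure(self, proxy):
        host = proxy["host"]
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.max_failures:
            logger.warning("Removing bad proxy node: %s", host)
            self.proxies = [p for p in self.proxies if p["host"] != host]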

Immediate check using curl

# quick shell check of proxy connectivity (replace USER/PASS with your credentials)

curl -x http://USER:[email protected]:8000 https://httpbin.org/ip

Implementation Note: selenium-wire accepts http://user:pass@host:port for proxy auth, which makes it one of the cleanest ways to use authenticated proxies with Selenium in Python.

Step 5. Anti-detection & Resource Optimization

These tactics are pragmatic, ordered from simple to complex. In 2025, consider integrating SeleniumBase for advanced stealth: pip install seleniumbase; then from seleniumbase import Driver; driver = Driver(uc=True).

1. User-Agent rotation: Change UA per session. Example:

options.add_argument("user-agent=Your User Agent string")

2. Language & timezone: --lang=en-US; or inject JS for timezone if needed.

3. Block heavy assets only after testing. Block images/fonts only if visible content still loads. Use the interceptor in scraper.py.

4. Human-like interactions: Add random small sleeps and scrolls before clicks:

import random, time

time.sleep(random.uniform(0.5, 1.8))

driver.execute_script("window.scrollBy(0, arguments[0]);", random.randint(100, 400))

5. Cookie reuse: Save cookies after a successful login and load them for subsequent sessions.

6. Avoid honeypots: Ignore hidden elements (display:none or zero size); see the sketch after this list.

7. Captcha handling: If lawful and permitted, reduce rate, use residential proxies, route to human-in-the-loop or compliant solver services.
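For item 6, a minimal honeypot filter simply skips elements that are not visible (a sketch; adjust the selector to your target):

from selenium.webdriver.common.by import By

links = driver.find_elements(By.CSS_SELECTOR, "a")
visible_links = [
    el for el in links
    if el.is_displayed() and el.size.get("width", 0) > 0 and el.size.get("height", 0) > 0
]
# Interact only with visible_links; hidden links are often honeypots planted for bots.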

Decision Rule for Blocking

  • If page renders content with images blocked → safe to block images.
  • If key content is missing after blocking (e.g., site is SPA) → do not block CSS/JS.

Step 6. Enhance: Retries, Checkpoints & Logging

Retries

Use tenacity for transient network/DOM issues. Example already in extract_titles().

Checkpointing

Write partial results to output.csv after each page/batch using append_rows_csv to avoid data loss.

Logging — capture

timestamp, url, worker_id, proxy_id, attempt, status, error_message, duration.
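One lightweight way to capture those fields is one JSON line per attempt (a sketch; log_attempt is a hypothetical helper, and the field names mirror the list above):

import json
import logging
import time

logger = logging.getLogger("scraper")

def log_attempt(url, worker_id, proxy_id, attempt, status, error_message="", duration=0.0):
    """Emit one JSON line per attempt so logs are easy to aggregate later (sketch)."""
    logger.info(json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "url": url,
        "worker_id": worker_id,
        "proxy_id": proxy_id,
        "attempt": attempt,
        "status": status,
        "error_message": error_message,
        "duration": round(duration, 2),
    }))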

Common exception handling patterns

TimeoutException: increase wait or validate selector.

StaleElementReferenceException: re-find element or retry.

Proxy fail: remove proxy from pool and retry job.
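Put together, the call-site handling might look roughly like this (a sketch; driver, url, logger, and extract_titles come from scraper.py, and proxy_pool is the hypothetical pool from Step 4):

from selenium.common.exceptions import TimeoutException, StaleElementReferenceException

try:
    rows = extract_titles(driver, url)              # already retried internally by tenacity
except TimeoutException:
    logger.warning("Timed out on %s: raise the wait or re-check the selector", url)
except StaleElementReferenceException:
    logger.warning("Stale element on %s: re-find the element and retry the page", url)
except Exception:
    logger.exception("Unexpected failure on %s: dropping proxy and requeueing job", url)
    if proxy:
        proxy_pool.report_failure(proxy)            # hypothetical pool from the Step 4 sketch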

Step 7. Scaling: Single → Grid → Cloud

Small (single machine)

Sequential or small multi-threaded runs; rotate proxy per run.

Medium (workers)

Use a job queue (Celery, RQ) where each worker does the following (see the sketch after this list):

  • Picks a proxy
  • Spins up an ephemeral Chrome process
  • Scrapes assigned pages and quits
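A worker body built from the scraper.py helpers might look like the sketch below (scrape_job is hypothetical; proxy_pool is the pool sketched in Step 4):

def scrape_job(url, proxy_pool=None):
    """One queue task = one ephemeral browser: pick a proxy, scrape, quit (sketch using scraper.py helpers)."""
    proxy = proxy_pool.pick() if proxy_pool else None
    driver = make_driver(proxy=proxy, block_images=True)
    try:
        rows = extract_titles(driver, url)
        append_rows_csv("output.csv", rows, fieldnames=["title"])
        return len(rows)
    except Exception:
        logger.exception("Job failed for %s", url)
        if proxy and proxy_pool:
            proxy_pool.report_failure(proxy)
        raise   # let the queue's retry policy handle re-delivery
    finally:
        driver.quit()

# With RQ this function can be enqueued as-is, e.g. queue.enqueue(scrape_job, url, pool);
# with Celery, wrap it in @app.task.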

Large (Grid / container farm)

Selenium Grid or cloud-managed browser farm. Use ephemeral tasks, central logs/metrics, and k8s for orchestration.

Practical tips

Prefer ephemeral browser processes per job (spin up → scrape → quit) to avoid memory leaks.

Centralize logs (ELK/Fluentbit) and metrics (Prometheus/Grafana).

Monitor: success rate, proxy health, CPU/memory, CAPTCHA frequency.

Minimal Grid notes

Use official/maintained Selenium images; replace example tags with current stable versions.
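Once a hub is reachable, workers connect with webdriver.Remote instead of starting a local chromedriver (a sketch; http://localhost:4444 assumes a default local Grid and should be replaced with your hub address):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")

# Connect to a Selenium Grid hub instead of a local driver binary
driver = webdriver.Remote(command_executor="http://localhost:4444", options=options)
try:
    driver.get("https://books.toscrape.com")
    print(driver.title)
finally:
    driver.quit()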

Testing Checklist Before Scaling

1. Selector stability test: Run the scraper for 10–50 pages and measure failures.

2. Proxy health test: Sanity-check proxies and measure latency; remove slow ones.

3. Headless vs headed: Compare results — if headless is detected, use headed nodes or stealth options.

4. Rate limit test: Slowly ramp request rate to find safe throttle.

5. Resource test: Measure CPU/memory per browser instance; set worker concurrency accordingly.

6. CAPTCHA frequency: Log CAPTCHA encounters and reduce rate.

Quick Troubleshooting

SessionNotCreatedException → Update Chrome/browser or use webdriver-manager for auto-sync.

TimeoutException → Increase WebDriverWait timeout (e.g., to 20s) or verify selector in dev tools.

StaleElementReferenceException → Re-find the element after page changes or wrap in retry.

Proxy auth fails → Check .env credentials; test with curl: curl -x http://user:pass@host:port https://httpbin.org/ip.

Frequent CAPTCHAs → Slow down with random delays, switch to residential proxies via GoProxy, or add more human-like behavior such as random scrolls.

Debug tips

If something fails unexpectedly: set debug=True in make_driver(...), disable interceptor, and run headful (not headless) to visually inspect the page and selectors.

Final Thoughts

Selenium helps you scrape modern JS-heavy sites — but with complexity: cost, detection risk, and operational burden. This guide gives you a linear, runnable path from local proof-of-concept to hardened scraper: verify environment, run Quick Start, run Starter Scraper with optional GoProxy, add anti-detection measures, and scale. Keep everything modular — one function per page or job makes parallelism and debugging easier. Test thoroughly and instrument logs/metrics before you scale.
