
Build Your Amazon Web Scraper with Python & GoProxy: Beginner → Pro

Post Time: 2025-07-16 Update Time: 2025-07-16

Scraping Amazon for product data—like prices, reviews, and ratings—can unlock powerful insights for e-commerce sellers, marketers, and data enthusiasts. However, users often worry about legality, getting blocked, or the technical know-how required.

Before diving into complex pipelines, we begin with a minimal 4-step scraper. This helps you understand the fundamental building blocks—HTTP requests, CSS selectors, parsing, and data saving—so every later enhancement (proxies, headless fallback, distributed jobs) builds on solid ground.

Build Amazon Web Scraper with Python

Stage 0: Key Concepts & User Scenarios

Legality & TOS

Scraping publicly visible data isn’t illegal in itself, but it violates Amazon’s Terms of Service → expect IP/account blocks rather than lawsuits

Thresholds: under ~200 requests/day per IP is usually safe; over ~1,000 almost guarantees blocks

Top Use Cases

1. Price Monitoring: 50–100 ASINs hourly

2. Review Analysis: Sentiment from thousands of reviews

3. Stock Alerts: Restock notifications

4. Competitive Intelligence: Best‑seller rank & ratings

GoProxy Overview

90M+ residential IPs across 200+ locations worldwide

Rotating vs. sticky sessions (up to 60 min); free country/city targeting

Helps you avoid IP bans and CAPTCHAs

Stage 1 (Beginner): 4-Step Amazon Scraper

You’ll see how to fetch a page through GoProxy, inspect HTML, parse key fields, and save results—forming the core loop of any scraper.

Step 1. Setup Environment & GoProxy

Prepare your local environment and obtain GoProxy credentials.

1. Create & activate a virtual env

bash

python3 -m venv venv
source venv/bin/activate    # macOS/Linux
venv\Scripts\activate       # Windows
pip install requests beautifulsoup4 lxml pandas

2. Sign up at GoProxy and choose a rotating residential proxy plan, then copy your Bearer token from the dashboard.

Step 2. Fetch with GoProxy

Each request appears to come from a fresh residential IP, dramatically reducing blocks.

High-level algorithm:

1. Call GoProxy’s API to generate a rotating proxy endpoint.

2. Configure requests to use that endpoint.

3. Retry on failure with exponential back‑off

python

import requests, random, time

# Rotate through several real browser User-Agent strings
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0; ...)"]  # add more real UA strings here
API_KEY     = "YOUR_GOPROXY_BEARER_TOKEN"
API_BASE    = "https://api.goproxy.com/v1"

def get_proxy_endpoint(country=None):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    body    = {"session_type": "rotating"}
    if country:
        body["country"] = country
    resp = requests.post(f"{API_BASE}/endpoints:generate", json=body, headers=headers)
    resp.raise_for_status()
    return resp.json()["endpoint"]  # e.g. "user:pass@host:port"

def fetch_page(url):
    endpoint = get_proxy_endpoint()
    proxies  = {"http": f"http://{endpoint}", "https": f"http://{endpoint}"}
    headers  = {"User-Agent": random.choice(USER_AGENTS), "Accept-Language": "en-US"}
    for attempt in range(3):
        try:
            r = requests.get(url, headers=headers, proxies=proxies, timeout=15)
            r.raise_for_status()
            return r.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # exponential back-off: 1s, 2s, 4s
    return ""

Step 3. Inspect Selectors

Learn to map HTML elements to the data you need. Open the URL in a browser, right-click, and select “Inspect” (F12). Look for:

  • Title: #productTitle
  • Price: .a-price .a-offscreen
  • Rating: .a-icon-star span
  • Reviews: #acrCustomerReviewText
  • Specs: #feature-bullets li
  • Image: use a regex (re) to extract the hiRes URL from the page’s embedded JSON

Beginners: Practice clicking elements and noting their selectors; they may change, so always verify.
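
To confirm what these selectors currently return before wiring them into a parser, a quick check like the sketch below can help (it reuses fetch_page from Step 2; the ASIN is just an example):

python

from bs4 import BeautifulSoup

# Minimal selector sanity check; reuses fetch_page() from Step 2
html = fetch_page("https://amazon.com/dp/B0BSHF7WHW")
soup = BeautifulSoup(html, "lxml")
checks = {
    "title":   "#productTitle",
    "price":   ".a-price .a-offscreen",
    "rating":  ".a-icon-star span",
    "reviews": "#acrCustomerReviewText",
}
for name, sel in checks.items():
    node = soup.select_one(sel)
    print(f"{name:8s} -> {node.get_text(strip=True) if node else 'NOT FOUND'}")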

Step 4. Parse & Save

python

from bs4 import BeautifulSoup
import re

def parse_product(html):
    soup = BeautifulSoup(html, "lxml")
    get  = lambda sel: (soup.select_one(sel).get_text(strip=True) if soup.select_one(sel) else "N/A")
    img  = re.search(r'"hiRes":"([^"]+)"', html)  # hi-res image URL lives in the page's embedded JSON
    data = {
        "title":   get("#productTitle"),
        "price":   get(".a-price .a-offscreen"),
        "rating":  get(".a-icon-star span"),
        "reviews": get("#acrCustomerReviewText"),
        "image":   img.group(1) if img else "N/A"
    }
    return data

# Test
print(parse_product(fetch_page("https://amazon.com/dp/B0BSHF7WHW")))
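
To cover the “Save” half of this step, here is one minimal sketch that writes parsed results to CSV with pandas (installed in Step 1); the filename products.csv is just an example:

python

import pandas as pd

# Save one or more parsed products to a CSV file (filename is illustrative)
rows = [parse_product(fetch_page("https://amazon.com/dp/B0BSHF7WHW"))]
pd.DataFrame(rows).to_csv("products.csv", index=False)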

Next: We’ll expand this to crawl entire search results and review pages.

Stage 2 (Intermediate): Crawling Search & Review Pages

Automate data collection beyond one product—crawl listings and customer reviews for broader insights.

Search-Page Scraping

python

def scrape_search(keyword, pages=2):
    """Collect product URLs from search results."""
    results = []
    for p in range(1, pages+1):
        url  = f"https://amazon.com/s?k={keyword}&page={p}"
        html = fetch_page(url)
        soup = BeautifulSoup(html, "lxml")
        for a in soup.select(".s-result-item h2 a"):
            results.append("https://amazon.com" + a["href"].split("?")[0])
        time.sleep(1)
    return results

# Example
urls = scrape_search("wireless+earbuds", pages=2)
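
To turn those URLs into structured product records, you can feed them back through fetch_page and parse_product from Stage 1; a rough sketch (the limit of 10 and the delay range are illustrative):

python

# Fetch and parse each product found in the search results
records = []
for url in urls[:10]:                     # limit is illustrative
    product = parse_product(fetch_page(url))
    product["url"] = url
    records.append(product)
    time.sleep(random.uniform(1, 3))      # polite delay between requests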

Review-Page Scraping

python

def scrape_reviews(asin, pages=2):
    """Extract ratings, text, and dates from reviews."""
    reviews = []
    for p in range(1, pages+1):
        url  = f"https://amazon.com/product-reviews/{asin}/?pageNumber={p}"
        html = fetch_page(url)
        soup = BeautifulSoup(html, "lxml")
        for r in soup.select(".review"):
            # Guard against missing elements so one malformed review doesn't crash the run
            get = lambda sel: (r.select_one(sel).get_text(strip=True) if r.select_one(sel) else "N/A")
            reviews.append({
                "rating": get(".a-icon-alt"),
                "text":   get(".review-text-content span"),
                "date":   get(".review-date")
            })
        time.sleep(1)
    return reviews

# Example
print(scrape_reviews("B0BSHF7WHW", pages=2))

Stage 3 (Advanced): Resilience & Scale

Integrate geotargeting, handle JavaScript‑heavy pages, scale to hundreds of ASINs, and orchestrate distributed crawls with proven reliability.

1. Leveraging GoProxy’s API & Geotargeting

python

# Generate a US-only proxy endpoint
us_ep   = get_proxy_endpoint(country="US")
proxies = {"http": f"http://{us_ep}", "https": f"http://{us_ep}"}
html_us = requests.get("https://amazon.com/dp/B0BSHF7WHW",
                       headers={"User-Agent": random.choice(USER_AGENTS)},
                       proxies=proxies).text

You can also specify "regions", "cities", or "postal_code" in the POST body to fine-tune location.
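
Assuming the same endpoints:generate call accepts those extra fields (check GoProxy’s API documentation for the exact schema), a city-targeted request might look like this sketch:

python

# Hypothetical sketch: the "cities" field name is an assumption based on the
# note above; verify the exact body schema in GoProxy's documentation.
headers = {"Authorization": f"Bearer {API_KEY}"}
body    = {"session_type": "rotating", "country": "US", "cities": ["New York"]}
resp    = requests.post(f"{API_BASE}/endpoints:generate", json=body, headers=headers)
resp.raise_for_status()
city_ep = resp.json()["endpoint"]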

2. Hybrid Headless Fallback

python

def fetch_with_fallback(url):
    html = fetch_page(url)
    if not html or "captcha" in html.lower():
        # Fall back to Playwright when proxies alone can't render the page
        from playwright.sync_api import sync_playwright
        with sync_playwright() as pw:
            b = pw.chromium.launch()
            p = b.new_page()
            p.goto(url, timeout=30000)
            html = p.content()
            b.close()
    return html

3. High-Volume Alerts & Case Study


Case Study

A U.S. retailer scraped Buy Box prices every 30 minutes across 5 markets.

Before: Basic proxies → 72% success, 200+ manual retries/day.

After: GoProxy rotating IPs + geotargeting + headless fallback → 98% success, 20 manual retries/day.

Full blog here → Case Study: Automating 98% Reliable Amazon Buy Box Scraping with GoProxy


python

import csv, random, time

asins = open("asins.txt").read().split()
with open("alerts.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["asin", "price", "time"])
    w.writeheader()
    for a in asins:
        data = parse_product(fetch_with_fallback(f"https://amazon.com/dp/{a}"))
        w.writerow({"asin": a, "price": data["price"], "time": time.time()})
        time.sleep(random.uniform(1, 3))

4. Distributed & Scheduled Crawls

Task Queues: Use RabbitMQ/Redis to parcel out ASIN jobs (see the Redis sketch after this list).

Scheduler: Cron or Apache Airflow DAGs for automated runs.

Serverless: AWS Lambda + GoProxy + S3 for near-zero ops overhead.
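
As a concrete illustration of the task-queue idea, here is a minimal Redis-backed sketch (the queue name, Redis location, and use of the redis-py client are assumptions): a producer pushes ASINs onto a list, and any number of workers pop and scrape them.

python

import redis

# Minimal Redis task-queue sketch; host/port and queue name are illustrative
r = redis.Redis(host="localhost", port=6379, db=0)

def enqueue_asins(path="asins.txt"):
    for asin in open(path).read().split():
        r.rpush("asin_queue", asin)

def worker():
    while True:
        item = r.blpop("asin_queue", timeout=5)   # block until a job arrives
        if item is None:
            break                                 # queue drained, stop this worker
        asin = item[1].decode()
        data = parse_product(fetch_with_fallback(f"https://amazon.com/dp/{asin}"))
        print(asin, data["price"])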

Anti-Bot & Ethical Best Practices

Residential IP Rotation: GoProxy’s 90M+ IP pool evades blocks.

User-Agent Cycling: Rotate 20+ real browser strings each request.

Throttling: Sleep 1–4 s between requests and honor robots.txt crawl-delay (see the sketch after this list).

Stay Logged Out: Always scrape public pages.

Volume Limits: Aim for ≤ 1,000 requests/day/IP.
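
For the throttling point above, Python’s built-in robotparser can read any crawl-delay the site declares; a small sketch (the 2-second fallback is an assumption):

python

from urllib import robotparser
import random, time

# Read robots.txt once and respect any declared crawl-delay
rp = robotparser.RobotFileParser("https://www.amazon.com/robots.txt")
rp.read()
delay = rp.crawl_delay("*") or 2          # fall back to ~2 s if none is declared
time.sleep(delay + random.uniform(0, 2))  # add jitter so requests are not uniform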

Scaling Your Data Pipeline

1. CSV Prototyping: Quick checks with pandas.

2. SQL Databases: PostgreSQL/MySQL for relational analysis (see the SQLite sketch after this list).

3. NoSQL Stores: MongoDB for flexible review/spec documents.

4. Cloud Storage: AWS S3 for raw HTML/JSON backups.

5. Visualization: Grafana or Power BI dashboards for real‑time insights.
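
As a bridge from CSV prototyping to a relational store, here is a minimal sketch using SQLite (standard library) as a stand-in for PostgreSQL/MySQL; the table and column names are illustrative:

python

import sqlite3, time

# Persist parsed products in a local SQLite file (stand-in for PostgreSQL/MySQL)
conn = sqlite3.connect("amazon.db")
conn.execute("""CREATE TABLE IF NOT EXISTS products
                (asin TEXT, title TEXT, price TEXT, rating TEXT, scraped_at REAL)""")
data = parse_product(fetch_page("https://amazon.com/dp/B0BSHF7WHW"))
conn.execute("INSERT INTO products VALUES (?, ?, ?, ?, ?)",
             ("B0BSHF7WHW", data["title"], data["price"], data["rating"], time.time()))
conn.commit()
conn.close()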

FAQs & Resources

1. Is scraping Amazon legal?

Public scraping isn’t illegal, but breaches Amazon’s TOS—expect IP/account bans, not lawsuits.

2. How many requests per day are safe?

Under ~200 req/day/IP seldom triggers blocks; above ~1,000 almost guarantees them.

3. What if Amazon changes its layout?

Log parsing failures. Update CSS selectors or use regex fallbacks.

4. What if the site shows CAPTCHAs?

Use GoProxy’s IP pool and hybrid headless fallback. Consider an external CAPTCHA solver.

5. How do I debug failed requests?

Log full HTML of failures to disk and inspect manually in your browser.
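
A minimal way to do that, reusing fetch_page from Stage 1 (the output filename is illustrative):

python

# Dump the HTML of a failed or suspicious fetch for manual inspection
html = fetch_page("https://amazon.com/dp/B0BSHF7WHW")
if not html or "captcha" in html.lower():
    with open("failed_page.html", "w", encoding="utf-8") as f:
        f.write(html)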


Final Thoughts

You’ve journeyed from a simple 4-step scraper to a production-grade pipeline: geotargeting, hybrid headless fallback, high-volume alerts, and distributed scheduling—backed by GoProxy’s vast residential IP network. Next, explore ML-driven sentiment analysis and real-time dashboards to stay ahead of Amazon’s evolving defenses.
