Playwright Web Scraping with Proxies: 2026 Guide
Step-by-step Playwright guide for web scraping (Node/Python/C#) with proxies, resource blocking, scaling, and production tips for beginners to pros.
Feb 24, 2026
Step-by-step Python guide to scrape product & price data: inspection, __NEXT_DATA__ parsing, anti-bot hardening, scaling, data modeling, testing, and ethics (beginner → production).
Accessing real-time data from e-commerce giants like Walmart can provide a competitive edge in pricing, product research, and market analysis. This guide takes you from beginner basics to advanced techniques for scraping Walmart search results, product pages, and reviews. We'll cover ethical practices, source identification, reliable extraction, anti-bot strategies, scaling, data modeling, and testing, with code examples, checklists, troubleshooting, and best practices for a full implementation.
For 2026: Walmart's site still relies heavily on __NEXT_DATA__ for JSON blobs, but anti-bot measures like CAPTCHA, IP fingerprinting, and ML-based detection have ramped up. Always prioritize compliance and start with official APIs where possible.
How to use this guide
Important Note! Walmart's robots.txt disallows automated access on paths like /search, and their Terms of Service (TOS) prohibit scraping. While scraping public data may be legal in the US (e.g., HiQ v. LinkedIn precedent), violating TOS risks IP bans, account suspension, or legal action. This guide is for educational and personal research only—not commercial use. Always consult a lawyer for production scenarios. We recommend official APIs as your first choice.
Archive timestamped artifacts: Use Wayback Machine for robots.txt, TOS pages, and sample headers.
Document intent: Write a one-page note (e.g., "Personal price tracking for laptops") and keep it for audits.
Ethics: Scrape public pages only; add 0.5-2s delays; avoid service disruption or PII (e.g., user names in reviews).
Alternatives: Walmart Marketplace API for partners; no-code tools like Apify Walmart actors (handles ethics/proxies).
Commercial: Comply with GDPR/CCPA if handling data; switch to licensed providers for scale.
2026 Tip: ML-based detection flags non-human behavior—mimic it with tools like Playwright. If in doubt, stop and use no-code options.
Need guaranteed, stable access + legal safety? → API (e.g., Walmart's official endpoints).
Quick exploratory work/research? → Small, polite scraping (ethical only).
Large commercial usage? → Partner APIs or licensed data providers.
Scraping helps with tasks like price tracking or ML datasets. Here are common scenarios:
| Scenario | User Type | Key Benefit | Example Use Case | Tools Involved | Difficulty Level |
| --- | --- | --- | --- | --- | --- |
| Price Monitoring | Small Business Owner | Dynamic Adjustments | Track laptop prices to optimize your store. | Requests + Pandas | Beginner |
| Product Research | Marketer/Dropshipper | Trend Spotting | Analyze electronics reviews for AI gadgets. | BeautifulSoup + JSON | Intermediate |
| Competitive Analysis | Consultant/Analyst | Market Reports | Compare stock with rivals for supply insights. | Proxies + Async | Pro |
| Inventory Management | Supplier/Logistics | Stock Checks | Monitor availability by ZIP for planning. | Playwright for JS pages | Intermediate |
| Academic/Hobby Projects | Student/Enthusiast | ML Datasets | Build models on EV trends. | Scrapy for crawling | Pro |
Prerequisites: Basic Python (variables, loops). No scraping experience needed. Time: 30-60 mins basics; 2-4 hours full. Tools: Free (Python/libraries).
Key Steps:
1. Setup: Install Python and libraries.
2. Inspect a Page: Use browser DevTools to find data sources.
3. Extract Basics: Scrape search results for "laptops" and save to CSV.
4. Add Reliability: Parse __NEXT_DATA__ for structured data.
5. Scale: Add anti-bot measures (covered later).
Quick Test Code (test.py): Start with Playwright to avoid early blocks.
If you're a total beginner, try Requests first (see tweaks below).
```python
from playwright.sync_api import sync_playwright
import json

url = 'https://www.walmart.com/search?q=laptops'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, timeout=30000)
    script = page.query_selector('#__NEXT_DATA__')
    if script:
        data = json.loads(script.inner_text())
        items = (data.get('props', {}).get('pageProps', {})
                     .get('initialData', {}).get('data', {})
                     .get('search', {}).get('items', []))
        print(f"Found {len(items)} items." if items else "No items found.")
    browser.close()
```
Run: python test.py. Expected: "Found 40+ items." If blocked, see Anti-Bot section.
1. Install Python 3.12+ from python.org. Verify: python --version.
2. Create a virtual environment: python -m venv scrape_env. Activate: Mac/Linux source scrape_env/bin/activate; Windows scrape_env\Scripts\activate.
3. Install: pip install requests beautifulsoup4 pandas aiohttp lxml playwright aiolimiter; then playwright install.
4. Test: Run the Quick Test Code above.
Pro Tip: Use Jupyter Notebook (pip install jupyter; jupyter notebook) for interactive testing.
Common Errors: "pip not found"? Add Python to PATH. Install fails? Upgrade pip: pip install --upgrade pip.
Before coding, understand Walmart's structure using browser DevTools. This helps you adapt to site changes.
1. Visit walmart.com and search for something (e.g., "laptops").
2. Open DevTools (F12), go to Network tab, filter for XHR/Fetch—look for API endpoints or GraphQL calls returning JSON.
3. Check HTML for <script id="__NEXT_DATA__">—this holds structured JSON.
4. Note parameters: q for query, page for pagination, min_price, max_price, sort, facets.
Extraction Order:
1. JSON endpoints (fastest, structured). Pros: Efficient; Cons: Brittle if endpoints change.
2. __NEXT_DATA__ parsing (reliable for initial loads). Pros: Structured, no extra requests; Cons: Requires HTML fetch.
3. Structured DOM selectors (e.g., data-automation-id). Pros: Simple; Cons: Prone to UI changes.
4. Headless browser rendering (last resort for heavy JS). Pros: Handles dynamics; Cons: Slow, resource-heavy.
Tip: Screenshot your DevTools findings for quick reference when sites change.
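The query parameters noted in step 4 can be assembled programmatically once you've confirmed them in DevTools. A minimal sketch (the helper name `build_search_url` and its defaults are my own; verify the parameter names against your own capture, since they may change):

```python
from urllib.parse import urlencode

def build_search_url(query, page=1, min_price=None, max_price=None, sort=None):
    """Assemble a Walmart search URL from the parameters observed in DevTools.

    Parameter names (q, page, min_price, max_price, sort) are the ones noted
    during inspection; re-verify them before relying on this in practice.
    """
    params = {'q': query, 'page': page}
    if min_price is not None:
        params['min_price'] = min_price
    if max_price is not None:
        params['max_price'] = max_price
    if sort:
        params['sort'] = sort
    return 'https://www.walmart.com/search?' + urlencode(params)

print(build_search_url('laptops', page=2, min_price=200, sort='price_low'))
```

Centralizing URL construction like this means a renamed parameter is a one-line fix rather than a hunt through f-strings.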

Start with Playwright to handle blocks. Use proxies early if blocked (see Advanced).
```python
from playwright.sync_api import sync_playwright
import json
import random
import pandas as pd

# Add more user agents for rotation
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]
url = 'https://www.walmart.com/search?q=laptops'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent=random.choice(user_agents))
    page = context.new_page()
    page.goto(url, timeout=30000)
    script = page.query_selector('#__NEXT_DATA__')
    if not script:
        raise RuntimeError('__NEXT_DATA__ not found -- likely blocked or the layout changed')
    data = json.loads(script.inner_text())
    items = (data.get('props', {}).get('pageProps', {})
                 .get('initialData', {}).get('data', {})
                 .get('search', {}).get('items', []))
    extracted = [{'Title': item.get('name'), 'Price': item.get('price')} for item in items]
    pd.DataFrame(extracted).to_csv('walmart_laptops.csv', index=False)
    print(f"Saved {len(extracted)} items.")
    browser.close()
```
Expected Output: CSV with 40+ rows. Common Tweaks: Change q=laptops to your query. If no data, re-inspect selectors in DevTools.
Add a loop:
```python
import random
import time

for page_num in range(1, 6):
    url = f'https://www.walmart.com/search?q=laptops&page={page_num}'
    # Fetch code here...
    time.sleep(random.uniform(1, 3))  # Ethical delay
```
Expected Output: Multiple pages appended to CSV.
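Putting the loop together, here's one way to accumulate rows across pages and write the CSV once at the end. The `fetch_page` callable is a placeholder for the Playwright extraction shown above; the stop-on-empty-page heuristic is my own assumption about how the result set ends:

```python
import csv
import random
import time

def scrape_pages(fetch_page, query='laptops', max_pages=5,
                 out_path='walmart_laptops.csv'):
    """Collect rows across pages and write a single CSV.

    fetch_page(url) should return a list of {'Title': ..., 'Price': ...}
    dicts -- a placeholder for the Playwright fetch shown earlier.
    """
    rows = []
    for page_num in range(1, max_pages + 1):
        url = f'https://www.walmart.com/search?q={query}&page={page_num}'
        items = fetch_page(url)
        if not items:  # an empty page usually means the end of results
            break
        rows.extend(items)
        time.sleep(random.uniform(1, 3))  # ethical delay between pages
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['Title', 'Price'])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Writing once at the end avoids re-opening the CSV per page and makes retries idempotent.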
For a product: Replace url with 'https://www.walmart.com/ip/{product_id}'. Parse __NEXT_DATA__ for details.
Reviews example (limit to top 10 for ethics):
```python
# After fetching the product page and parsing __NEXT_DATA__ into `data`...
reviews = (data.get('props', {}).get('pageProps', {})
               .get('initialData', {}).get('data', {})
               .get('reviews', []))
review_data = [{'Rating': r.get('rating'), 'Text': r.get('reviewText')} for r in reviews[:10]]
pd.DataFrame(review_data).to_csv('reviews.csv', index=False)
```
For full pagination: Find reviews endpoint in DevTools (e.g., /reviews?page=), loop responsibly.
Expected Output: CSV with ratings/text. Common Tweaks: Paginate ethically—e.g., for page in range(1, 3) for small sets.
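Per the ethics note earlier, strip reviewer names and other PII before storing anything. A minimal sketch (the field names, including `userNickname`, are illustrative; check the actual keys in your DevTools capture):

```python
def sanitize_reviews(raw_reviews, limit=10):
    """Keep only non-personal fields from raw review dicts.

    The whitelist below is an assumption about the JSON shape; adjust it
    to the keys you actually see in __NEXT_DATA__.
    """
    keep = ('rating', 'reviewText', 'reviewSubmissionTime')
    return [{k: r.get(k) for k in keep} for r in raw_reviews[:limit]]
```

Whitelisting fields is safer than blacklisting: a new PII field added by the site is dropped by default instead of silently stored.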
Walmart's 2026 anti-bot tech blocks basic requests—start with residential proxies.
Respect and archive robots.txt (timestamped).
Rate limits: 0.5-2s delays; exponential backoff on 429.
Proxies: Rotate from paid residential pools (e.g., GoProxy, which offers health-checked pools with automated, customizable rotation).
2026 Tip: Rotate TLS fingerprints with curl_cffi.
Example:
```python
import requests

def check_proxy(proxy):
    """Return True if the proxy can reach an IP-echo service."""
    try:
        response = requests.get(
            'https://ipinfo.io',
            proxies={'http': proxy, 'https': proxy},
            timeout=10,
        )
        return response.status_code == 200
    except requests.RequestException:
        return False

# Usage: if check_proxy('http://user:pass@ip:port'): ...
```
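The exponential-backoff advice above can be wrapped into a small retry helper. A sketch under my own assumptions (the `RateLimitError` class and the delay constants are illustrative; your fetch function decides when to raise it, e.g. on a 429 response):

```python
import random
import time

class RateLimitError(Exception):
    """Raise this from your fetch function when the server returns 429."""

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0):
    """Call fetch(url); on RateLimitError wait base_delay * 2**attempt
    plus random jitter, then retry. Re-raises after max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

The jitter spreads retries out so a fleet of workers doesn't hammer the server in lockstep after a shared 429.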
Scaling: For 10k+ items, use async (aiohttp + aiolimiter). Add queues (Celery) + database for production.
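To make the concurrency pattern concrete, here is an illustrative sketch using the stdlib's `asyncio.Semaphore` as a stand-in (aiolimiter's `AsyncLimiter` would cap requests per second rather than concurrent requests); `fetch` is a placeholder coroutine for your aiohttp call:

```python
import asyncio

async def scrape_many(fetch, urls, concurrency=5):
    """Run fetch(url) coroutines with at most `concurrency` in flight.

    Swap the semaphore for aiolimiter's AsyncLimiter if you want a
    requests-per-second cap instead of a concurrency cap.
    """
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```

Bounding concurrency keeps memory and open-connection counts predictable even when the URL list grows to tens of thousands.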
Suggested Postgres schema:
```sql
CREATE TABLE products (
    product_id TEXT PRIMARY KEY,
    title TEXT,
    url TEXT,
    current_price NUMERIC,
    currency TEXT,
    last_seen TIMESTAMPTZ
);

CREATE TABLE price_history (
    id SERIAL PRIMARY KEY,
    product_id TEXT REFERENCES products(product_id),
    price NUMERIC,
    currency TEXT,
    fetched_at TIMESTAMPTZ DEFAULT NOW()
);
```
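For local development, the same shape works in SQLite with an upsert into `products` and an append into `price_history`. A sketch mirroring the Postgres schema above (types adapted to SQLite; the `record_price` helper is my own):

```python
import sqlite3

# SQLite mirror of the Postgres schema above, handy for local development.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    product_id TEXT PRIMARY KEY,
    title TEXT,
    url TEXT,
    current_price REAL,
    currency TEXT,
    last_seen TEXT
);
CREATE TABLE IF NOT EXISTS price_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id TEXT REFERENCES products(product_id),
    price REAL,
    currency TEXT,
    fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

def record_price(conn, product_id, title, url, price, currency='USD'):
    """Upsert the product row and append one price_history row."""
    conn.execute("""
        INSERT INTO products (product_id, title, url, current_price, currency, last_seen)
        VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(product_id) DO UPDATE SET
            current_price = excluded.current_price,
            last_seen = excluded.last_seen
    """, (product_id, title, url, price, currency))
    conn.execute(
        "INSERT INTO price_history (product_id, price, currency) VALUES (?, ?, ?)",
        (product_id, price, currency),
    )
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.executescript(SCHEMA)
```

The upsert keeps `products` as a "latest snapshot" table while `price_history` preserves every observation for trend analysis.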
Cleaning Example:
```python
import re

def clean_price(text):
    """Strip currency symbols and commas; return a float or None."""
    if not text:
        return None
    num = re.sub(r'[^\d.]', '', text)
    return float(num) if num else None
```
Storage: CSV/SQLite for dev; Postgres + S3 for pro. Analyze with Pandas: df.plot(x='date', y='price').
Track metrics (e.g., success rate >95%) in Prometheus/CloudWatch. Use pytest:
```python
def test_parser_basic():
    with open("tests/fixtures/search_snapshot.html", encoding="utf-8") as f:
        html = f.read()
    items = parse_search_items(html)  # your parser function
    assert len(items) > 0
```
CI: Run tests on PRs. Alert on drops (Slack/PagerDuty).
Monitor Walmart dev blogs via RSS. Use diff tools on HTML snapshots. Update parsers with fresh DevTools data. Trend: Possible API-only shift—prep no-code backups.
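Snapshot diffing can be as simple as `difflib` over saved HTML. A sketch (the function name and the 0.95 similarity threshold are my own choices; tune the threshold against a few known-good snapshot pairs):

```python
import difflib
from pathlib import Path

def snapshot_changed(old_path, new_path, threshold=0.95):
    """Return True if two HTML snapshots differ by more than the
    similarity threshold -- a cheap signal that parsers may need updating."""
    old = Path(old_path).read_text(encoding='utf-8').splitlines()
    new = Path(new_path).read_text(encoding='utf-8').splitlines()
    ratio = difflib.SequenceMatcher(None, old, new).ratio()
    return ratio < threshold
```

Run this in a scheduled job against fresh fetches; a True result can trigger the Slack/PagerDuty alerting mentioned above before the parser silently returns empty rows.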
429 Too Many Requests: Increase delays, rotate proxies, backoff: time.sleep(2**retry).
403 Forbidden: New proxy, rotate user-agents.
Missing JSON: Site changed—update keys via DevTools.
Capped Pagination: Split by facets (e.g., &facet=brand:Apple).
Blocked Early: Start with residential proxies; try no-code tools.
Q: Is scraping Walmart legal?
A: Public data scraping can be legal in some jurisdictions but Terms of Service may prohibit it. Consult counsel for commercial projects.
Q: How often should I run a price scraper?
A: Depends on business need — every 15m–24h. Balance freshness vs cost and rate limits.
Q: Do I need proxies?
A: For moderate to large scraping or repeat requests, paid residential proxies are recommended for reliability; free trials are available (e.g., GoProxy).
Q: How to scrape reviews specifically?
A: Inspect product page __NEXT_DATA__ or reviews API endpoints in DevTools. Paginate responsibly and respect rate limits.
This guide equips you to scrape Walmart data ethically and effectively.
Try the code above, and adapt it for your needs. For commercial scale, prefer partnering with official data sources or reputable platforms that handle compliance.