Feb 15, 2026
Step-by-step Playwright guide for web scraping (Node/Python/C#) with proxies, resource blocking, scaling, and production tips for beginners to pros.
This guide provides a step-by-step approach to building reliable web scrapers using Playwright and residential proxies. It includes code examples in Node.js, Python, and C#, proxy patterns (per-context vs. per-browser), production tips like proxy health checks and concurrency, and a simple proxy-health microservice. We'll focus on scraping JavaScript-heavy sites, addressing common challenges like rendering SPAs, avoiding blocks, and scaling ethically.

How to Use This Guide:
Use the guide in three passes: (1) Quickstart for basics, (2) add proxies and basic hardening for reliability, (3) move to production scale with monitoring.
Obey robots.txt and site terms when required.
Avoid scraping personal data or protected content.
Rate-limit to avoid service disruption.
For commercial activity, consult legal counsel; GDPR/CCPA-like regulations and the EU AI Act require careful handling of personal data in automated collection.
Playwright runs real browser engines (Chromium 143.0.7499.4, Firefox, WebKit as of v1.58.2), executes JavaScript natively, and supports robust browser-context isolation. In 2026, anti-bot systems are more sophisticated with AI-driven detection—combining Playwright with residential proxies and best practices delivers the highest success rates for JS-heavy sites and session-based workflows. Recent updates like improved Trace Viewer help debug failed sessions, and Chrome for Testing builds ensure headless stability.
Real Browsers = Real Rendering: Playwright uses actual browser engines, solving SPA issues by executing JS and supporting interactions like clicks and scrolls.
Contexts vs. Browsers: A browser can host multiple contexts. Contexts are lightweight isolation units for cookies, storage, user agents—and proxies. Use them for session affinity without heavy browser relaunches.
Proxy Roles: Proxies provide IP diversity, geo-targeting, and rate-limit evasion. Residential proxies (from real devices) are harder to detect than datacenter ones.
Render → Parse: Let Playwright render the page, then parse HTML with tools like BeautifulSoup (Python), cheerio (Node.js), or HtmlAgilityPack (C#) for efficiency.
Stealth = Cumulative Measures: No magic bullet—combine proxies, delays, resource blocking, UA rotation, and consistent sessions.
Proxy Rotation Decision Tree
Need session affinity (logins)?
→ Yes: Per-context sticky proxies
→ No: High-volume stateless? → Per-browser with occasional relaunches.
Start simple to verify your setup. We'll add proxies later.
Node.js 18+ | Python 3.8+ | .NET SDK 6+
Install Playwright per language (see code below).
Test target: https://httpbin.org/ip to verify outbound IP and basic navigation.
Note: Confirm your environment against the test site before adding proxy complexity.
npm init -y
npm install playwright
npx playwright install
// quickstart.js
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://httpbin.org/ip', { waitUntil: 'networkidle' });
  console.log(await page.textContent('body'));
  await browser.close();
})();
pip install playwright
playwright install chromium
# quickstart.py
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://httpbin.org/ip", wait_until="networkidle")
    print(page.text_content("body"))
    browser.close()
dotnet new console -o PlaywrightQuickstart
cd PlaywrightQuickstart
dotnet add package Microsoft.Playwright
// Program.cs
using System;
using System.Threading.Tasks;
using Microsoft.Playwright;

class Program {
    static async Task Main() {
        using var playwright = await Playwright.CreateAsync();
        await using var browser = await playwright.Chromium.LaunchAsync(new() { Headless = true });
        var context = await browser.NewContextAsync();
        var page = await context.NewPageAsync();
        await page.GotoAsync("https://httpbin.org/ip", new() { WaitUntil = WaitUntilState.NetworkIdle });
        Console.WriteLine(await page.TextContentAsync("body"));
    }
}
If the quickstart prints a JSON object containing your IP, your basic environment is working — proceed to proxies. If installation fails, check your PATH (Python) or SDK version (.NET).
With setup verified, add proxies for stealth.
Residential proxies mimic real consumer IPs; better for e-commerce and sites that aggressively block datacenter ranges. Datacenter proxies are faster/cheaper but easier to detect.
Per-Browser (launch-level): Proxy applies to all contexts. Lower flexibility; to rotate, relaunch browser (higher overhead).
Per-Context (recommended): Spin many contexts in one browser, each with its own proxy—fast, low overhead, good for sticky sessions (logins).
Decision rule:
Need session affinity (logins)? → Per-context.
High-volume stateless scraping? → Per-browser with occasional relaunches.
Playwright supports both a proxy dict and inline-auth URLs; we show both to avoid 407 confusion.
from playwright.sync_api import sync_playwright
import os

PROXY_USER = os.getenv("PROXY_USER")
PROXY_PASS = os.getenv("PROXY_PASS")

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(proxy={
        "server": "http://proxy.goproxy.com:8000",
        "username": PROXY_USER,
        "password": PROXY_PASS
    })
    page = context.new_page()
    page.goto("https://httpbin.org/ip")
    print(page.text_content("body"))
    browser.close()
Alternative inline: server = "http://user:pass@proxy-host:8000"
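If you use the inline form, credentials containing special characters (`@`, `:`, `/`) must be URL-encoded or the proxy URL will parse incorrectly. A minimal sketch (the host, port, and credentials below are placeholders):

```python
from urllib.parse import quote

def inline_proxy_url(host: str, port: int, user: str, password: str) -> str:
    """Build an inline-auth proxy URL, URL-encoding the credentials.

    Special characters like '@' or ':' in credentials break naive
    string concatenation, so quote them first.
    """
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

# Hypothetical credentials with special characters:
url = inline_proxy_url("proxy.goproxy.com", 8000, "user-1", "p@ss:word")
print(url)  # http://user-1:p%40ss%3Aword@proxy.goproxy.com:8000
```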
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    proxy: {
      server: 'http://proxy.goproxy.com:8000',
      username: process.env.PROXY_USER,
      password: process.env.PROXY_PASS
    }
  });
  const page = await context.newPage();
  await page.goto('https://httpbin.org/ip');
  console.log(await page.textContent('body'));
  await browser.close();
})();
var context = await browser.NewContextAsync(new() {
    Proxy = new() { Server = "http://proxy.goproxy.com:8000", Username = "user", Password = "pass" }
});
Tip: Some providers expect credentials in the server URL; test with curl -x before Playwright. On a 407 error, verify your credentials and auth format in your provider dashboard.
Rotating Pool: New proxy per request/context; good for public data that requires no login.
Sticky Sessions: One proxy per context for session lifecycle (login/checkout).
Hybrid: Sticky for auth, rotating for broad scraping.
Start: Rotate every 10–50 requests for e-commerce; per-context sticky for accounts.
TTL: Reassign sticky proxies every 30–120 minutes.
Avoid reusing IPs with recent 403/429.
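The rotation rules above can be combined into a small pool manager. This is a minimal sketch, not a production implementation: the class name, defaults, and the dict-shaped proxy entries (matching Playwright's proxy option) are assumptions, and timings should be tuned for your target.

```python
import random
import time

class ProxyPool:
    """Sketch of a sticky-proxy pool: a session keeps its proxy until a
    TTL expires, and proxies that return 403/429 are quarantined so
    recently blocked IPs are not reused. Defaults follow the guide's
    suggested ranges and are tunable."""

    def __init__(self, proxies, ttl_seconds=3600, quarantine_seconds=600):
        self.proxies = list(proxies)
        self.ttl = ttl_seconds
        self.quarantine_seconds = quarantine_seconds
        self.quarantine_until = {}   # server -> timestamp when usable again
        self.assigned = {}           # session_id -> (proxy, assigned_at)

    def _healthy(self, now):
        return [p for p in self.proxies
                if self.quarantine_until.get(p["server"], 0) <= now]

    def get(self, session_id):
        now = time.time()
        entry = self.assigned.get(session_id)
        if entry:
            proxy, assigned_at = entry
            # Sticky: reuse the same proxy while the TTL holds and it is healthy
            if now - assigned_at < self.ttl and \
                    self.quarantine_until.get(proxy["server"], 0) <= now:
                return proxy
        candidates = self._healthy(now)
        if not candidates:
            raise RuntimeError("no healthy proxies available")
        proxy = random.choice(candidates)
        self.assigned[session_id] = (proxy, now)
        return proxy

    def report_block(self, proxy):
        """Call on 403/429 so this IP rests before being reused."""
        self.quarantine_until[proxy["server"]] = time.time() + self.quarantine_seconds
```

The returned dict can be passed straight to browser.new_context(proxy=pool.get(session_id)).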
Render the page in Playwright, then parse HTML with a lightweight parser (BeautifulSoup / cheerio / HtmlAgilityPack) to avoid brittle DOM scripting.
Python + BeautifulSoup
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = page.content()
soup = BeautifulSoup(html, 'html.parser')
for product in soup.select('.product'):
    title_el = product.select_one('.title')
    title = title_el.get_text(strip=True) if title_el else ''
    # Add to list or CSV
Node.js + cheerio
const cheerio = require('cheerio'); // npm install cheerio

const html = await page.content();
const $ = cheerio.load(html);
$('.product').each((i, el) => {
  const title = $(el).find('.title').text().trim() || '';
  // Add to array
});
C# + HtmlAgilityPack
using HtmlAgilityPack; // dotnet add package HtmlAgilityPack

var html = await page.ContentAsync();
var doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches, so guard before iterating
var products = doc.DocumentNode.SelectNodes("//div[@class='product']");
if (products != null) {
    foreach (var product in products) {
        var title = product.SelectSingleNode(".//span[@class='title']")?.InnerText.Trim() ?? "";
        // Add to list
    }
}
Tip: Use Playwright for dynamic rendering and context.request for consistent downloads (images/files) reusing cookies and proxy.
Human-like scrolling: evaluate document.body.scrollHeight, scroll, wait randomized time, detect no new content.
Pattern: Scroll to bottom, Wait 1.5s + random(0-1s), Repeat until height unchanged or timeout.
Python example (with error handling)
import random
from playwright.sync_api import TimeoutError

try:
    last_height = page.evaluate("document.body.scrollHeight")
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500 + random.randint(0, 1000))
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
except TimeoutError:
    print("Timeout – adjust wait or check proxy/network")
Use context.request so downloads inherit the same cookies & proxy:
Python example
resp = context.request.get("https://example.com/image.jpg")
if resp.ok:
    with open('image.jpg', 'wb') as f:
        f.write(resp.body())  # body() returns the raw response bytes
Blocking non-essential resources reduces bandwidth and fingerprint surface area.
from playwright.sync_api import Route, Request

def handle_route(route: Route, request: Request):
    if request.resource_type in ["image", "stylesheet", "font", "media"]:
        route.abort()
    else:
        route.continue_()

page.route("**/*", handle_route)
# page.unroute("**/*") when finished
Node.js and C# offer equivalent route-interception APIs (page.route / page.RouteAsync).
Warning: Some sites require JS/CSS for critical content—always test.
Keep UA/timezone/language/viewport consistent per session.
Add small mouse movements (page.mouse.move()), typing delays (page.type()), and human flows (visit listing → click item → back).
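The interaction tips above can be collected into a small helper. This is a sketch, assuming a Playwright sync-API page; the selector and text are caller-supplied, the movement counts and delays are arbitrary starting points, and page.type is the older-style API (newer code may prefer locator.press_sequentially).

```python
import random
import time

def humanize_type(page, selector: str, text: str):
    """Sketch: add small human-like signals before typing into a field,
    using Playwright's page.mouse.move / page.click / page.type APIs."""
    # A few small, randomized mouse movements across the viewport
    for _ in range(random.randint(2, 4)):
        page.mouse.move(random.randint(100, 800), random.randint(100, 600))
        time.sleep(random.uniform(0.05, 0.2))
    page.click(selector)  # focus the field the way a user would
    # Per-keystroke delay in milliseconds, randomized per session
    page.type(selector, text, delay=random.randint(80, 160))
```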
Start small: 3–5 browsers, 10–20 contexts per machine (adjust by RAM/CPU). Each headless browser ~200–500MB; measure yours.
import asyncio
import random  # For jitter if needed
from asyncio import Semaphore
from playwright.async_api import async_playwright

sem = Semaphore(5)  # Limit concurrency

async def scrape_with_proxy(proxy):
    async with sem:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context(proxy=proxy)
            page = await context.new_page()
            # Scraping logic here
            await browser.close()

async def main():
    proxy_list = [  # GoProxy dicts
        {"server": "http://proxy.goproxy.com:8000", "username": "user1", "password": "pass1"},
    ]
    await asyncio.gather(*[scrape_with_proxy(proxy_list[i % len(proxy_list)]) for i in range(10)])

asyncio.run(main())
Periodically test proxies: GET https://httpbin.org/ip via the proxy.
Python example (aiohttp)
# proxy_health.py (pip install aiohttp)
import asyncio
import time

import aiohttp

PROXIES = [
    {"server": "http://proxy.goproxy.com:8000", "username": "user", "password": "pass"},
]

async def test_proxy(session, proxy):
    t0 = time.time()
    proxy_url = f"http://{proxy['username']}:{proxy['password']}@{proxy['server'].split('://')[1]}"
    try:
        async with session.get("https://httpbin.org/ip", proxy=proxy_url,
                               timeout=aiohttp.ClientTimeout(total=10)) as r:
            ok = (r.status == 200)
    except Exception:
        ok = False
    return proxy['server'], ok, time.time() - t0

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(test_proxy(session, p) for p in PROXIES))
        for server, ok, latency in results:
            print(f"{server}: OK={ok}, Latency={latency:.2f}s")
        # Integrate with Redis for pool management

if __name__ == "__main__":
    asyncio.run(main())
Track: scraper.requests_total, scraper.success_total, scraper.error_403_total, proxy.latency_avg, proxy.failure_rate. Use Prometheus + Grafana.
Simple exporter example (Python with prometheus_client):
from prometheus_client import start_http_server, Counter

# In your scraper
requests_total = Counter('scraper_requests_total', 'Total requests')
start_http_server(8000)  # Expose metrics on :8000/metrics
# Call requests_total.inc() after each request
Alert on success_rate < 85% or 403 spikes.
Use exponential backoff with jitter for retries (e.g., 2^n + rand).
If a proxy yields repeated 403/5xx, mark it suspicious and move to quarantine.
Implement a circuit-breaker to reduce request rate on aggressive 429 responses.
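The backoff rule above ("2^n + rand") can be sketched as a small retry wrapper. The function and parameter names are illustrative; `fetch` stands for any callable returning an object with a `.status` attribute, such as a thin wrapper around context.request.get.

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=5, base=1.0, cap=30.0):
    """Sketch: exponential backoff with full jitter.

    Retries on 429 and 5xx; returns immediately on success or on a
    non-retryable 4xx. The random jitter spreads retries out so many
    workers don't hammer the target in lockstep."""
    for attempt in range(max_retries):
        resp = fetch(url)
        if resp.status < 400 or (400 <= resp.status < 500 and resp.status != 429):
            return resp  # success, or a client error retrying won't fix
        # 2^n growth, capped, with full jitter
        delay = random.uniform(0, min(cap, base * (2 ** attempt)))
        time.sleep(delay)
    return fetch(url)  # final attempt; caller inspects resp.status
```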
| Setup | Avg time / page (s) | Success rate (example) |
| --- | --- | --- |
| No proxy, no blocking | 2.8 | 60% |
| Block images/fonts + datacenter proxy | 1.2 | 70% |
| Block resources + residential sticky proxy | 1.5 | 92% |
| Rotating residential pool (high concurrency) | 1.6 | 88% |
Residential proxies generally improve success rates despite slightly higher latency. These figures are averages from user feedback; benchmark your own setup.
403 / Blocked page: Switch to residential proxy; reduce concurrency; add human-like navigation.
407 Proxy Auth Required: Wrong auth format — try username:password@host:port or Playwright username/password fields.
Timeouts: Proxy overloaded — retire or lower concurrency; increase wait time; verify target health.
CAPTCHA: Session risk signals triggered — warm up session, use human-in-loop solving if allowed, or slow down.
Empty selectors / missing content: Dynamic rendering not complete — use wait_for_selector or wait_for_load_state('networkidle').
Q: Should I use headless mode?
A: Test both. Some sites detect headless; if you see blocks, test headful or hide headless indicators.
Q: How often rotate proxies?
A: Depends on the target — start with every 10–50 requests for e-commerce; for logins, use per-context sticky proxies.
Q: How to store credentials safely?
A: Use environment variables, secret managers (Vault, AWS Secrets Manager) or platform-native secrets.
Q: How to debug a blocked session?
A: Capture a screenshot, save page.content(), check headers and proxy IP reputation, and validate with curl through the proxy.
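The debug steps above can be wrapped in a small helper that snapshots a session for offline inspection. A minimal sketch, assuming a Playwright sync-API page; the function name and output directory are arbitrary.

```python
import pathlib
import time

def debug_snapshot(page, out_dir="debug"):
    """Sketch: save a timestamped screenshot and the rendered HTML of a
    (possibly blocked) session so it can be inspected offline."""
    d = pathlib.Path(out_dir)
    d.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Full-page screenshot shows CAPTCHA walls / block pages at a glance
    page.screenshot(path=str(d / f"{stamp}.png"), full_page=True)
    # Rendered HTML lets you re-check selectors against what was served
    html_path = d / f"{stamp}.html"
    html_path.write_text(page.content(), encoding="utf-8")
    return html_path
```

Pair the saved HTML with a curl request through the same proxy to separate rendering problems from IP-reputation problems.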
Q: What's new in Playwright 2026?
A: Enhanced Trace Viewer for debugging, better compatibility with Chrome for Testing.
This guide offers a path from zero to production for scraping JS-heavy sites with Playwright + residential proxies in 2026. Start minimal, measure everything, and prioritize session realism over raw speed — that’s the best way to reduce blocks while collecting reliable data. Try GoProxy's free trial for residential IPs.