A scalable JavaScript web scraper must fetch static HTML, execute client-side code, and dodge IP blocks, rate limits, and geo-restrictions. This guide delivers three hands-on methods—Axios + Cheerio for server-rendered pages, Puppeteer/Playwright for dynamic sites, and GoProxy’s API for managed scraping—each integrated with residential proxies for seamless IP rotation, geo-targeting, and session control. You’ll find clear setup steps, code examples, tables comparing tools, best practices, and an FAQ list to address real-world scraping scenarios.

What You’ll Build
This tutorial covers:
- Static Scraping with Axios + Cheerio to extract data from server-rendered HTML.
- Dynamic Scraping with Puppeteer & Playwright to automate browsers and capture client‑side content.
- Managed Scraping via GoProxy API for enterprise-scale projects with minimal code.
By the end, you’ll have working scripts that fetch product listings, handle pagination, perform infinite scroll, and submit jobs to GoProxy’s service—all through residential proxies to avoid detection and maximize reliability.
Prerequisites & Setup
Make sure you have:
- Node.js v18+ installed (includes native fetch)
- npm or Yarn for package management
- A GoProxy account with proxy credentials (GOPROXY_USER, GOPROXY_PASS, GOPROXY_HOST, GOPROXY_PORT, GOPROXY_API_KEY)
- A basic understanding of JavaScript/Node.js and a code editor (e.g., VS Code)
Project initialization:
```bash
mkdir js-scraper && cd js-scraper
npm init -y
npm install axios cheerio puppeteer playwright dotenv
```
Create a .env file:
```ini
GOPROXY_USER=your_user
GOPROXY_PASS=your_pass
GOPROXY_HOST=proxy.goproxy.com
GOPROXY_PORT=8000
GOPROXY_API_KEY=your_api_key
```
Load with require('dotenv').config() in your scripts.
Core Concepts Recap
1. Node.js Event Loop
Node.js runs JavaScript on a single thread via an event loop that offloads I/O to the system kernel, allowing non‑blocking operations and efficient concurrency.
2. Async/Await & Promises
Use async/await to pause execution until a promise resolves—crucial for sequential scraping tasks. Forgetting await can lead to unfulfilled network calls and empty data.
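A minimal sketch of that pitfall, using a stand-in promise instead of a real network call:

```javascript
// Stand-in for a network call: resolves with fake data.
function fetchTitle() {
  return Promise.resolve('Widget');
}

// Forgetting await: `title` holds a pending Promise, not the string.
function buggy() {
  const title = fetchTitle(); // missing await
  return title;               // Promise, not 'Widget'
}

// Awaiting unwraps the resolved value before the next line runs.
async function fixed() {
  return await fetchTitle();  // 'Widget'
}
```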
3. Residential Proxies
Residential proxies route requests through real-user IPs, reducing blocks and enabling geo-targeting. Choose rotating sessions (new IP per request) or sticky sessions (same IP for multi-step flows).
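Many residential providers switch between rotating and sticky sessions by encoding a session ID in the proxy username. The format below is illustrative only, not GoProxy's documented scheme; check your provider's dashboard for the real syntax. A small helper makes the choice explicit:

```javascript
// Hypothetical username convention: `user-session-<id>` pins a sticky
// session; omitting the ID yields a rotating IP per request.
function buildProxyUrl({ user, pass, host, port, sessionId = null }) {
  const username = sessionId ? `${user}-session-${sessionId}` : user;
  return `http://${username}:${pass}@${host}:${port}`;
}

// Rotating: a fresh IP on every request
const rotating = buildProxyUrl({
  user: 'u1', pass: 'p1', host: 'proxy.goproxy.com', port: 8000
});

// Sticky: reuse one IP across a multi-step flow (login, checkout)
const sticky = buildProxyUrl({
  user: 'u1', pass: 'p1', host: 'proxy.goproxy.com', port: 8000,
  sessionId: 'abc123'
});
```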
Choosing the Right Tools
HTTP Client Comparison
Efficient HTTP clients are necessary for static scraping. Here’s how three popular options compare:
| Feature | Fetch API | Axios | SuperAgent |
|---|---|---|---|
| Built-in | Yes (Node v18+) | No | No |
| JSON auto-parse | No | Yes | No |
| Interceptors | No | Yes | No |
| Cancellation support | Experimental | Yes | Yes |
| Proxy integration | Environment vars | Built-in | Plugin |
| Ease of use | Moderate | High | Moderate |
- Fetch API: Global in Node 18+, simple but lacks advanced hooks.
- Axios: Offers request/response interceptors, automatic JSON transforms, and cancellation tokens—ideal for proxy setups.
- SuperAgent: Stream‑based, chaining API, best for large payloads.
Editor’s Recommendation: Start with Axios for a balance of power and simplicity.
DOM Parsing: Cheerio vs. jsdom
After fetching HTML, choose the right parser:
- Cheerio: Fast, jQuery-like API for server-side HTML traversal. Doesn’t execute JavaScript. Use for simple extraction tasks.
- jsdom: Full DOM emulation with CSSOM and HTML parsing. Slower and heavier—use when you need true browser APIs (e.g., document.createElement).
Example (Cheerio):
```js
const cheerio = require('cheerio');

const $ = cheerio.load(html); // `html` is a previously fetched page string
const titles = $('h2.title').map((i, el) => $(el).text()).get();
```
Headless Browser Options
Use browser automation when your page relies on client-side JavaScript.
| Feature | Puppeteer | Playwright |
|---|---|---|
| Browser support | Chromium, Firefox | Chromium, Firefox, WebKit |
| Auto-wait | No (manual) | Yes |
| Parallel contexts | Limited | Multiple |
| Test runner | No | @playwright/test |
| Ease of setup | High | Moderate |
- Puppeteer: Controls browsers via DevTools Protocol; supports screenshots, PDFs.
- Playwright: Adds auto‑waiting and multi‑browser support out of the box.
Editor’s Recommendation: Use Playwright for cross‑browser needs, and Puppeteer for quick Chrome‑only automation.
Method 1: Static Scraping with Axios & Cheerio
Use when: Pages serve data in initial HTML without requiring JavaScript.
1. Install & Boilerplate
```bash
npm install axios cheerio dotenv
```
Create scrape-cheerio.js:
```js
require('dotenv').config();
const axios = require('axios');
const cheerio = require('cheerio');

const proxy = {
  host: process.env.GOPROXY_HOST,
  port: +process.env.GOPROXY_PORT,
  auth: { username: process.env.GOPROXY_USER, password: process.env.GOPROXY_PASS }
};

async function fetchPage(url) {
  const { data } = await axios.get(url, { proxy, timeout: 10000 });
  return data;
}

function parseItems(html) {
  const $ = cheerio.load(html);
  return $('.item').map((i, el) => ({
    title: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim()
  })).get();
}

(async () => {
  try {
    const html = await fetchPage('https://example.com/products');
    console.log(parseItems(html));
  } catch (e) {
    console.error('Error:', e.message);
  }
})();
```
- Cheerio implements a subset of jQuery for fast DOM parsing (no JS execution).
- Axios offers interceptors, JSON transforms, and built‑in proxy support.
2. Pagination Handling
```js
async function scrapeAll(url) {
  let next = url, results = [];
  while (next) {
    const html = await fetchPage(next);
    results.push(...parseItems(html));
    const $ = cheerio.load(html);
    const href = $('.next-page').attr('href');
    // Resolve relative hrefs against the current page URL
    next = href ? new URL(href, next).href : null;
  }
  return results;
}
```
Common Pitfalls
- Timeouts: Increase the timeout or verify proxy connectivity.
- Empty arrays: Check your CSS selectors against the site’s HTML.
Method 2: Dynamic Scraping with Puppeteer & Playwright

Use when: Pages render content via JavaScript (SPAs, infinite scroll).
Why Headless Browsers?
They execute JS exactly like real users, enabling data capture from client-side-rendered pages.
Puppeteer Example
```js
require('dotenv').config();
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${process.env.GOPROXY_HOST}:${process.env.GOPROXY_PORT}`]
  });
  const page = await browser.newPage();
  await page.authenticate({
    username: process.env.GOPROXY_USER,
    password: process.env.GOPROXY_PASS
  });
  await page.goto('https://example.com/dynamic', { waitUntil: 'networkidle2' });

  // Scroll until the page height stops growing (infinite scroll).
  // Note: the scroll target must be computed inside the page context;
  // Node-scope variables are not visible there.
  let prevHeight;
  do {
    prevHeight = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise(r => setTimeout(r, 1000)); // give new content time to load
  } while ((await page.evaluate(() => document.body.scrollHeight)) > prevHeight);

  const data = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.item')).map(el => ({
      title: el.querySelector('.title')?.innerText,
      price: el.querySelector('.price')?.innerText
    }))
  );
  console.log(data);
  await browser.close();
})();
```
- networkidle2 waits for no more than 2 network connections for ≥500 ms.
- Use page.screenshot() to debug selector issues.
Playwright Example
```js
require('dotenv').config();
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: `http://${process.env.GOPROXY_HOST}:${process.env.GOPROXY_PORT}`,
      username: process.env.GOPROXY_USER,
      password: process.env.GOPROXY_PASS
    }
  });
  const page = await browser.newPage();
  await page.goto('https://example.com/dynamic', { waitUntil: 'networkidle' });
  const items = await page.$$eval('.item', els =>
    els.map(el => ({ title: el.querySelector('.title')?.innerText }))
  );
  console.log(items);
  await browser.close();
})();
```
- Playwright’s auto-waits reduce the need for manual timeouts.
Common Pitfalls
- Proxy Errors: Verify .env credentials.
- Missing Data: Use page.screenshot() to debug rendering.
Method 3: Managed Scraping with GoProxy API
Use when: Minimal code and infrastructure—ideal for large-scale or multi-site jobs.
Code Example
```js
require('dotenv').config();
const axios = require('axios');

const headers = { Authorization: `Bearer ${process.env.GOPROXY_API_KEY}` };

async function submitJob(url, selectors) {
  const res = await axios.post(
    'https://api.goproxy.com/scraping/jobs',
    { url, selectors },
    { headers }
  );
  return res.data.jobId;
}

async function fetchResults(jobId) {
  const res = await axios.get(
    `https://api.goproxy.com/scraping/jobs/${jobId}`,
    { headers }
  );
  return res.data;
}

(async () => {
  const selectors = [
    { name: 'title', path: '.item .title' },
    { name: 'price', path: '.item .price' }
  ];
  const jobId = await submitJob('https://example.com/products', selectors);
  console.log('Job ID:', jobId);

  // Poll every 2 s until the job finishes; bail out on failure
  // so the loop can't spin forever.
  let result;
  do {
    await new Promise(r => setTimeout(r, 2000));
    result = await fetchResults(jobId);
    if (result.status === 'failed') throw new Error('Scraping job failed');
  } while (result.status !== 'completed');
  console.log(result.data);
})();
```
GoProxy handles proxy rotation, retries, and structured JSON output out of the box.
Best Practices & Troubleshooting
1. Proxy Strategies
- Rotating sessions: New IP per request for breadth-first crawls.
- Sticky sessions: Same IP for multi-step interactions (login flows).
- Geo-targeting: Use GoProxy’s dashboard to select countries or cities; verify via https://ipinfo.io/json.
2. Rate Limits & Backoff
Use exponential backoff (1s → 2s → 4s) on HTTP 429/503 errors. Insert random delays (2–5 s) between actions to mimic human behavior.
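A sketch of that schedule in plain Node.js. `withBackoff` assumes the HTTP client attaches the response status to the thrown error (axios exposes it as `err.response.status`); adapt the check to your client.

```javascript
// Exponential backoff: 1 s, 2 s, 4 s for attempts 0, 1, 2
function backoffDelay(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt;
}

// Random human-like pause between 2 and 5 seconds
function humanDelay() {
  return 2000 + Math.random() * 3000;
}

// Retry an async task on HTTP 429/503, waiting longer each time.
async function withBackoff(task, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      const status = err.response?.status ?? err.status;
      if ((status !== 429 && status !== 503) || attempt >= maxRetries) throw err;
      await new Promise(r => setTimeout(r, backoffDelay(attempt)));
    }
  }
}
```

Wrap each `fetchPage`-style call in `withBackoff`, and sprinkle `humanDelay()` waits between page actions.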
3. CAPTCHA & Bot Defenses
Switch between headless and headful modes. Integrate a CAPTCHA solver for high‑security sites.
4. Logging & Monitoring
Log proxy ID, latency, status codes. Use GoProxy webhooks for error alerts and automated fallback.
Method Comparison at a Glance
| Method | Dynamic JS | Setup Complexity | Speed | Best Use Case |
|---|---|---|---|---|
| Axios + Cheerio | No | Low | Fast | Static pages, bulk data extraction |
| Puppeteer | Yes | Medium | Moderate | Interactive SPAs, infinite scroll |
| Playwright | Yes | Medium | Moderate | Cross-browser scenarios |
| GoProxy API | Yes | Very Low | High | Enterprise-scale, low-dev overhead |
Further Reading
- Node.js Event Loop
- Axios Interceptors
- Cheerio Guide
- Puppeteer API
- Playwright Docs
- GoProxy Scraping Proxies
- GoProxy Web Scraping Service
FAQs
1. Which method is best for beginners?
Start with Axios + Cheerio for minimal setup and fast static‑HTML scraping.
2. How many proxies do I need?
A pool of 5–10 rotating residential IPs typically handles hundreds of pages per hour.
3. Can proxies eliminate all CAPTCHA challenges?
Rotation reduces CAPTCHA triggers but doesn’t guarantee avoidance. Add a CAPTCHA‑solving service for full coverage.
4. What’s the difference between rotating and sticky sessions?
- Rotating: New IP each request—ideal for breadth‑first data collection.
- Sticky: One IP per session—necessary for login flows and checkout processes.
5. How do I verify proxy geo‑location?
Call https://ipinfo.io/json through each proxy; inspect the country and city fields.
Final Thoughts
Web scraping is an invaluable skill for gathering data from the internet, but it’s not without its hurdles, especially when tackling dynamic, JavaScript-heavy websites or navigating anti-scraping protections. In this blog, we’ve walked through three powerful methods for web scraping with Node.js. Each approach has its strengths, and the best choice depends on your project’s needs.
But no matter the method, one thing remains constant: the need for reliable, undetectable proxies to bypass blocks, manage geo-restrictions, and keep your scraping running smoothly.
With over 90 million rotating IPs sourced from real residential devices, GoProxy delivers the anonymity and flexibility you need to scrape successfully. Whether you’re a beginner testing the waters or a pro scaling up your operations, our residential proxies integrate seamlessly into your Node.js workflows, offering both rotating and sticky sessions to suit your needs.

We’d love to invite you to experience GoProxy’s residential proxies and web scraping service. See firsthand how easy it is to set up, how reliable our IPs are, and how they can simplify even the toughest scraping challenges. Sign up today for a free trial and 24/7 technical support!