Master JavaScript Web Scraping with Node.js & Residential Proxies

A scalable JavaScript web scraper must fetch static HTML, execute client-side code, and dodge IP blocks, rate limits, and geo-restrictions. This guide delivers three hands-on methods—Axios + Cheerio for server-rendered pages, Puppeteer/Playwright for dynamic sites, and GoProxy’s API for managed scraping—each integrated with residential proxies for seamless IP rotation, geo-targeting, and session control. You’ll find clear setup steps, code examples, tables comparing tools, best practices, and an FAQ list to address real-world scraping scenarios.

What You’ll Build

This tutorial covers:

  • Static Scraping with Axios + Cheerio to extract data from server-rendered HTML.
  • Dynamic Scraping with Puppeteer & Playwright to automate browsers and capture client‑side content.
  • Managed Scraping via GoProxy API for enterprise-scale projects with minimal code.

By the end, you’ll have working scripts that fetch product listings, handle pagination, perform infinite scroll, and submit jobs to GoProxy’s service—all through residential proxies to avoid detection and maximize reliability.

Prerequisites & Setup

Make sure you have:

  • Node.js v18+ installed (includes native fetch) 
  • npm or Yarn for package management
  • A GoProxy account with proxy credentials (GOPROXY_USER, GOPROXY_PASS, GOPROXY_HOST, GOPROXY_PORT, GOPROXY_API_KEY)
  • A basic understanding of JavaScript/Node.js and a code editor (e.g., VS Code)

Project initialization:

```bash
mkdir js-scraper && cd js-scraper
npm init -y
npm install axios cheerio puppeteer playwright dotenv
```

Create a .env file:

```ini
GOPROXY_USER=your_user
GOPROXY_PASS=your_pass
GOPROXY_HOST=proxy.goproxy.com
GOPROXY_PORT=8000
GOPROXY_API_KEY=your_api_key
```

Load with require('dotenv').config() in your scripts.

Core Concepts Recap

1. Node.js Event Loop

Node.js runs JavaScript on a single thread via an event loop that offloads I/O to the system kernel, allowing non‑blocking operations and efficient concurrency.
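
A quick illustration: the synchronous log below always prints first, because the file read is handed off to the system and its callback runs on a later event-loop turn.

```js
const fs = require('fs');

// The read is offloaded to the kernel; its callback fires on a later event-loop turn.
fs.readFile(__filename, () => console.log('2: I/O callback fires after sync code'));
console.log('1: synchronous code keeps running, nothing blocks');
```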

2. Async/Await & Promises

Use async/await to pause execution until a promise resolves—crucial for sequential scraping tasks. Forgetting await can lead to unfulfilled network calls and empty data.
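
A minimal sketch of the pitfall (fetchHtml here is a stand-in for any promise-returning call):

```js
const axios = require('axios');

const fetchHtml = url => axios.get(url).then(res => res.data);

// Wrong: without await you get a pending Promise, not the HTML.
const pending = fetchHtml('https://example.com');
console.log(pending); // Promise { <pending> }

// Right: await inside an async context yields the actual data.
(async () => {
  const html = await fetchHtml('https://example.com');
  console.log(html.length);
})();
```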

3. Residential Proxies

Residential proxies route requests through real-user IPs, reducing blocks and enabling geo-targeting. Choose rotating sessions (new IP per request) or sticky sessions (same IP for multi-step flows).
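
Many residential providers control session behavior through the proxy credentials; the username suffix below is a hypothetical format for illustration only — check your GoProxy dashboard for the exact syntax.

```js
// Hypothetical illustration — the real GoProxy session syntax may differ.
const rotating = {
  username: process.env.GOPROXY_USER,                     // new IP per request
  password: process.env.GOPROXY_PASS
};
const sticky = {
  username: `${process.env.GOPROXY_USER}-session-a1b2c3`, // keep one IP across requests
  password: process.env.GOPROXY_PASS
};
```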

Choosing the Right Tools

HTTP Client Comparison

Efficient HTTP clients are necessary for static scraping. Here’s how three popular options compare:

| Feature | Fetch API | Axios | SuperAgent |
| --- | --- | --- | --- |
| Built-in | Yes (Node v18+) | No | No |
| JSON auto-parse | No | Yes | No |
| Interceptors | No | Yes | No |
| Cancellation support | Experimental | Yes | Yes |
| Proxy integration | Environment vars | Built-in | Plugin |
| Ease of use | Moderate | High | Moderate |

  • Fetch API: Global in Node 18+, simple but lacks advanced hooks.
  • Axios: Offers request/response interceptors, automatic JSON transforms, and cancellation tokens—ideal for proxy setups.
  • SuperAgent: Stream‑based, chaining API, best for large payloads.

Editor’s Recommendation: Start with Axios for a balance of power and simplicity.
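
For example, a minimal sketch of a shared Axios instance that routes everything through the proxy and uses a request interceptor to attach a browser-like User-Agent (the header string is just an example value):

```js
require('dotenv').config();
const axios = require('axios');

// One shared instance keeps proxy settings and headers in a single place.
const client = axios.create({
  proxy: {
    host: process.env.GOPROXY_HOST,
    port: +process.env.GOPROXY_PORT,
    auth: { username: process.env.GOPROXY_USER, password: process.env.GOPROXY_PASS }
  },
  timeout: 10000
});

// Interceptor: every request leaves with a realistic User-Agent.
client.interceptors.request.use(config => {
  config.headers['User-Agent'] =
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36';
  return config;
});
```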

DOM Parsing: Cheerio vs. jsdom

After fetching HTML, choose the right parser:

  • Cheerio: Fast, jQuery-like API for server-side HTML traversal. Doesn’t execute JavaScript. Use for simple extraction tasks.
  • jsdom: Full DOM emulation with CSSOM and HTML parsing. Slower and heavier—use when you need true browser APIs (e.g., document.createElement).

Example (Cheerio):

```js
const cheerio = require('cheerio');

const $ = cheerio.load(html);
const titles = $('h2.title').map((i, el) => $(el).text()).get();
```
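
For comparison, the same extraction with jsdom (a sketch; requires npm install jsdom, which isn't in the setup list above):

```js
const { JSDOM } = require('jsdom');

// Full DOM emulation: you get real querySelectorAll, createElement, etc.
const dom = new JSDOM(html);
const titles = [...dom.window.document.querySelectorAll('h2.title')]
  .map(el => el.textContent.trim());
```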

Headless Browser Options

Use browser automation when your page relies on client-side JavaScript.

| Feature | Puppeteer | Playwright |
| --- | --- | --- |
| Browser support | Chromium, Firefox | Chromium, Firefox, WebKit |
| Auto-wait | No (manual) | Yes |
| Parallel contexts | Limited | Multiple |
| Test runner | No | @playwright/test |
| Ease of setup | High | Moderate |

  • Puppeteer: Controls browsers via DevTools Protocol; supports screenshots, PDFs.
  • Playwright: Adds auto‑waiting and multi‑browser support out of the box.

Editor’s Recommendation: Use Playwright for cross‑browser needs and Puppeteer for quick Chrome‑only automation.

Method 1: Static Scraping with Axios & Cheerio

Use when: Pages serve data in initial HTML without requiring JavaScript.

1. Install & Boilerplate

```bash
npm install axios cheerio dotenv
```

Create scrape-cheerio.js:

```js
require('dotenv').config();
const axios = require('axios');
const cheerio = require('cheerio');

// Route every request through the residential proxy.
const proxy = {
  host: process.env.GOPROXY_HOST,
  port: +process.env.GOPROXY_PORT,
  auth: { username: process.env.GOPROXY_USER, password: process.env.GOPROXY_PASS }
};

async function fetchPage(url) {
  const { data } = await axios.get(url, { proxy, timeout: 10000 });
  return data;
}

// Extract title/price pairs from each .item element.
function parseItems(html) {
  const $ = cheerio.load(html);
  return $('.item').map((i, el) => ({
    title: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim()
  })).get();
}

(async () => {
  try {
    const html = await fetchPage('https://example.com/products');
    console.log(parseItems(html));
  } catch (e) {
    console.error('Error:', e.message);
  }
})();
```

  • Cheerio implements a subset of jQuery for fast DOM parsing (no JS execution).
  • Axios offers interceptors, JSON transforms, and built‑in proxy support.

2. Pagination Handling
```js
async function scrapeAll(url) {
  let next = url, results = [];
  while (next) {
    const html = await fetchPage(next);
    results.push(...parseItems(html));
    const $ = cheerio.load(html);
    const href = $('.next-page').attr('href');
    // Resolve relative "next" links against the current page URL.
    next = href ? new URL(href, next).href : null;
  }
  return results;
}
```

Common Pitfalls

  • Timeouts: Increase the timeout or verify proxy connectivity; a retry wrapper helps with transient failures (see the sketch below).
  • Empty arrays: Check your CSS selectors against the site’s HTML.
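
A minimal retry sketch around the fetchPage helper from above (three attempts, surfacing the last error):

```js
// Retry transient failures (timeouts, resets) a few times before giving up.
async function fetchWithRetry(url, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fetchPage(url);
    } catch (e) {
      if (attempt === retries) throw e;
      console.warn(`Attempt ${attempt} failed (${e.message}), retrying...`);
    }
  }
}
```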

Method 2: Dynamic Scraping with Puppeteer & Playwright

Use when: Pages render content via JavaScript (SPAs, infinite scroll).

Why Headless Browsers?

They execute JavaScript just as a real user’s browser does, enabling data capture from client-side-rendered pages.

Puppeteer Example

```js
require('dotenv').config();
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${process.env.GOPROXY_HOST}:${process.env.GOPROXY_PORT}`]
  });
  const page = await browser.newPage();
  await page.authenticate({
    username: process.env.GOPROXY_USER,
    password: process.env.GOPROXY_PASS
  });

  await page.goto('https://example.com/dynamic', { waitUntil: 'networkidle2' });

  // Scroll until the page height stops growing (infinite scroll).
  let prevHeight;
  do {
    prevHeight = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // page.waitForTimeout was removed in recent Puppeteer versions; a plain setTimeout works everywhere.
    await new Promise(r => setTimeout(r, 1000));
  } while ((await page.evaluate(() => document.body.scrollHeight)) > prevHeight);

  const data = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.item')).map(el => ({
      title: el.querySelector('.title')?.innerText,
      price: el.querySelector('.price')?.innerText
    }))
  );

  console.log(data);
  await browser.close();
})();
```

  • networkidle2 waits for no more than 2 network connections for ≥500 ms.
  • Use page.screenshot() to debug selector issues.

Playwright Example

```js
require('dotenv').config();
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: `${process.env.GOPROXY_HOST}:${process.env.GOPROXY_PORT}`,
      username: process.env.GOPROXY_USER,
      password: process.env.GOPROXY_PASS
    }
  });
  const page = await browser.newPage();
  await page.goto('https://example.com/dynamic', { waitUntil: 'networkidle' });
  const items = await page.$$eval('.item', els =>
    // Optional chaining avoids a throw when an item is missing its title node.
    els.map(el => ({ title: el.querySelector('.title')?.innerText }))
  );
  console.log(items);
  await browser.close();
})();
```

  • Playwright’s auto-waits reduce the need for manual timeouts.
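
One caveat: page.$$eval grabs whatever matches at that instant and does not auto-wait, so for late-loading content add an explicit wait first, e.g.:

```js
// Wait until at least one .item is attached before scraping.
await page.waitForSelector('.item');
const items = await page.$$eval('.item', els =>
  els.map(el => ({ title: el.querySelector('.title')?.innerText }))
);
```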

Common Pitfalls

  • Proxy errors: Verify the credentials in your .env file.
  • Missing data: Use page.screenshot() to debug rendering (see below).
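
For example, dump a full-page screenshot right before the extraction step (the call works in both Puppeteer and Playwright):

```js
// Inspect debug.png to see what the headless browser actually rendered.
await page.screenshot({ path: 'debug.png', fullPage: true });
```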

Method 3: Managed Scraping with GoProxy API

Use when: Minimal code and infrastructure—ideal for large-scale or multi-site jobs.

Code Example

```js
require('dotenv').config();
const axios = require('axios');

async function submitJob(url, selectors) {
  const res = await axios.post('https://api.goproxy.com/scraping/jobs',
    { url, selectors },
    { headers: { 'Authorization': `Bearer ${process.env.GOPROXY_API_KEY}` } }
  );
  return res.data.jobId;
}

async function fetchResults(jobId) {
  const res = await axios.get(`https://api.goproxy.com/scraping/jobs/${jobId}`,
    { headers: { 'Authorization': `Bearer ${process.env.GOPROXY_API_KEY}` } }
  );
  return res.data;
}

(async () => {
  const selectors = [
    { name: 'title', path: '.item .title' },
    { name: 'price', path: '.item .price' }
  ];
  const jobId = await submitJob('https://example.com/products', selectors);
  console.log('Job ID:', jobId);

  // Poll every 2 seconds until the job reports completion.
  let result;
  do {
    await new Promise(r => setTimeout(r, 2000));
    result = await fetchResults(jobId);
  } while (result.status !== 'completed');

  console.log(result.data);
})();
```

GoProxy handles proxy rotation, retries, and structured JSON output out of the box.

Best Practices & Troubleshooting

1. Proxy Strategies

  • Rotating sessions: New IP per request for breadth-first crawls.
  • Sticky sessions: Same IP for multi-step interactions (login flows).
  • Geo-targeting: Use GoProxy’s dashboard to select countries or cities; verify via https://ipinfo.io/json (a sketch follows this list).
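
A quick sketch for the geo check, reusing the Axios proxy object from Method 1:

```js
// Request ipinfo.io through the proxy and print the exit IP's location.
async function checkProxyGeo() {
  const { data } = await axios.get('https://ipinfo.io/json', { proxy, timeout: 10000 });
  console.log(`Exit IP ${data.ip} — ${data.city}, ${data.country}`);
}
```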

2. Rate Limits & Backoff

Use exponential backoff (1s → 2s → 4s) on HTTP 429/503 errors. Insert random delays (2–5 s) between actions to mimic human behavior.
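
A sketch of both ideas, again reusing the Axios proxy object from Method 1:

```js
const sleep = ms => new Promise(r => setTimeout(r, ms));

// Random 2–5 s pause between actions to mimic human pacing.
const humanPause = () => sleep(2000 + Math.random() * 3000);

// Retry HTTP 429/503 with exponential backoff: 1 s → 2 s → 4 s.
async function getWithBackoff(url, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await axios.get(url, { proxy, timeout: 10000 });
    } catch (e) {
      const status = e.response && e.response.status;
      if (attempt >= maxRetries || (status !== 429 && status !== 503)) throw e;
      await sleep(1000 * 2 ** attempt);
    }
  }
}
```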

3. CAPTCHA & Bot Defenses

Switch between headless and headful modes. Integrate a CAPTCHA solver for high‑security sites.

4. Logging & Monitoring

Log proxy ID, latency, status codes. Use GoProxy webhooks for error alerts and automated fallback.
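
On an Axios instance (like the client sketched earlier), interceptors are a convenient place to record latency and status codes:

```js
// Stamp each request with a start time; log timing and status on response.
client.interceptors.request.use(config => {
  config.metadata = { start: Date.now() };
  return config;
});
client.interceptors.response.use(
  res => {
    console.log(`${res.status} ${res.config.url} (${Date.now() - res.config.metadata.start} ms)`);
    return res;
  },
  err => {
    console.error(`FAIL ${err.config && err.config.url}: ${err.message}`);
    return Promise.reject(err);
  }
);
```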

Method Comparison at a Glance

| Method | Dynamic JS | Setup Complexity | Speed | Best Use Case |
| --- | --- | --- | --- | --- |
| Axios + Cheerio | No | Low | Fast | Static pages, bulk data extraction |
| Puppeteer | Yes | Medium | Moderate | Interactive SPAs, infinite scroll |
| Playwright | Yes | Medium | Moderate | Cross-browser scenarios |
| GoProxy API | Yes | Very Low | High | Enterprise-scale, low-dev overhead |

Further Reading

  • Node.js Event Loop
  • Axios Interceptors
  • Cheerio Guide
  • Puppeteer API
  • Playwright Docs
  • GoProxy Scraping Proxies
  • GoProxy Web Scraping Service

FAQs

1. Which method is best for beginners?

Start with Axios + Cheerio for minimal setup and fast static‑HTML scraping.

2. How many proxies do I need?

A pool of 5–10 rotating residential IPs typically handles hundreds of pages per hour.

3. Can proxies eliminate all CAPTCHA challenges?

Rotation reduces CAPTCHA triggers but doesn’t guarantee avoidance. Add a CAPTCHA‑solving service for full coverage.

4. What’s the difference between rotating and sticky sessions?

Rotating: New IP each request—ideal for breadth‑first data collection.

Sticky: One IP per session—necessary for login flows and checkout processes.

5. How do I verify proxy geo‑location?

Call https://ipinfo.io/json through each proxy; inspect the country and city fields.

Final Thoughts

Web scraping is an invaluable skill for gathering data from the internet, but it’s not without its hurdles, especially when tackling dynamic, JavaScript-heavy websites or navigating anti-scraping protections. In this blog, we’ve walked through three powerful methods for web scraping with Node.js. Each approach has its strengths, and the best choice depends on your project’s needs. 

But no matter the method, one thing remains constant: the need for reliable, undetectable proxies to bypass blocks, manage geo-restrictions, and keep your scraping running smoothly.

With over 90 million rotating IPs sourced from real residential devices, GoProxy delivers the anonymity and flexibility you need to scrape successfully. Whether you’re a beginner testing the waters or a pro scaling up your operations, our residential proxies integrate seamlessly into your Node.js workflows, offering both rotating and sticky sessions to suit your needs.

We’d love to invite you to experience GoProxy’s residential proxies and web scraping service. See firsthand how easy it is to set up, how reliable our IPs are, and how they can simplify even the toughest scraping challenges. Sign up today for a free trial and 24/7 technical support!
