
Step-by-Step Guide to Web Scraping Job Postings in 2026

Post Time: 2026-01-07 Update Time: 2026-01-07

This comprehensive guide covers how to scrape job listings reliably and ethically using Python—from a single static site to thousands of heterogeneous pages. You'll get concrete workflows, code examples, troubleshooting tips, data schema design, and scaling strategies to build a full end-to-end pipeline, plus worked examples for LinkedIn and Indeed. It is written with production-level depth for professionals while staying accessible to beginners.

Important Note: This article is educational, not legal advice. Laws and site policies vary by jurisdiction and by site. If you plan large-scale or commercial scraping, consult legal counsel and the target site’s Terms of Service. Don’t attempt to access private or authenticated data without permission.

Web Scraping Job Postings

Common user questions this article solves

"How do I pull listings from one site (e.g., Indeed) quickly without coding errors?" 

"How to handle JavaScript-loaded content or infinite scroll?"

"How to scale across hundreds of varied job pages?"

"How to avoid blocks, clean data, and ensure legality?"

We'll solve these with actionable steps, starting simple and scaling up.

Why Scrape Job Postings?

Typical scenarios include:

Job Hunters: Aggregate postings from sites like Indeed or LinkedIn into a spreadsheet, filter for remote roles or keywords, and apply faster.  

Market Researchers: Gather data on trends, salaries, or skills in sectors like tech or finance.  

Career Coaches: Identify emerging titles or qualifications for client guidance.  

Students/Developers: Build scraping skills for portfolios or projects.

With AI-driven hiring, scraping uncovers patterns like rising demand for AI ethics roles. For pros, it enables custom aggregators for competitive intelligence.

Legal & Ethical Checklist Before You Start

Check robots.txt and the site’s Terms of Service. If a site prohibits scraping, don’t scrape.  

Avoid logging into other people’s accounts or using tokens you don’t own.  

Don’t collect private/personal data you don’t need; comply with GDPR/CCPA if applicable.  

Use polite request rates and per-domain concurrency limits; be prepared to stop if the site blocks or issues CAPTCHA.  

If the site requires login or access control to view job data, prefer the official API or ask for permission/licensing.  

If in doubt, slow down and ask—public APIs or dataset partnerships are common and often cheaper and more legal than trying to scrape at scale.
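As a quick programmatic aid (it complements, but does not replace, reading robots.txt and the Terms of Service yourself), here is a minimal check using Python's built-in robotparser; the domain, path, and user-agent string are placeholders:

# robots_check.py (sketch: domain, path, and user agent are placeholders)
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse robots.txt

# can_fetch() returns True if this user agent is allowed to request the path
allowed = rp.can_fetch("job-scraper/1.0", "https://www.example.com/jobs?q=python")
print("Allowed by robots.txt:", allowed)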

Quick Approach Decision

Site type | Easiest approach | Difficulty | When to choose
Static HTML job board | requests + BeautifulSoup | Low | Single site, few pages
JSON/XHR endpoint | Call JSON endpoint with requests | Low → Medium | Stable structured fields available
JS-rendered / infinite scroll | Playwright headless | Medium | No JSON endpoint; small scale
Many heterogeneous sites | Async fetch + template grouping + proxies | High | Production aggregator / large-scale research

What to Extract: Standard Job Schema

Design output early for consistency:

  • job_id (hash of title + company + location + posted_date)  
  • title  
  • company  
  • location (parse into city/region/country if needed)  
  • posted_date (ISO format)  
  • employment_type (e.g., full-time)  
  • salary_raw / salary_min / salary_max (parsed)  
  • description (text)  
  • job_url  
  • remote (boolean)  
  • raw_html (optional for audits)  
  • scraped_at (timestamp)

Use job_id for deduplication. For pros, add source_site and confidence_score (e.g., for AI-parsed fields).
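A minimal sketch of this schema as a Python TypedDict; the field list mirrors the bullets above, and the Optional markers are assumptions about which fields may be missing:

# job_schema.py (sketch of the record shape described above)
from typing import Optional, TypedDict

class JobRecord(TypedDict, total=False):
    job_id: str                    # sha1 of title|company|location|posted_date
    title: str
    company: str
    location: str
    posted_date: Optional[str]     # ISO format, e.g. "2026-01-07"
    employment_type: Optional[str]
    salary_raw: Optional[str]
    salary_min: Optional[int]
    salary_max: Optional[int]
    description: str
    job_url: str
    remote: Optional[bool]
    raw_html: Optional[str]        # optional, for audits
    scraped_at: str                # UTC timestamp
    source_site: Optional[str]     # pro extras
    confidence_score: Optional[float]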

Tools & Setup: What You'll Need

1. Install Python 3.10+ from python.org.  

2. Libraries: Run pip install requests beautifulsoup4 lxml pandas playwright aiohttp dateparser matplotlib in a terminal, then playwright install to download browser binaries (the examples below use the lxml parser and Playwright's Chromium).  

3. Environment: VS Code or Jupyter for testing.  

4. Optional: Proxies for scaling (paid residential for reliability); AI libraries such as the openai package for selector generation.

Pro Tip: Start small—scrape 10 jobs to test compliance. For pros, set up virtualenv and logging from the start. 

Step 1. Find JSON Endpoints (DevTools Method)

Before coding:

1. Open DevTools (F12) → Network → filter by XHR/Fetch.

2. Perform a search or scroll “load more” on the job page.

3. Look for requests returning JSON; inspect hits / elements arrays.

4. Replay the JSON request with requests (copy safe headers only — do not reuse private auth tokens).

JSON endpoints usually include structured fields (title, company, location, postedDate, applyUrl) and are far more stable than scraping DOM strings.

Quick test: copy the request's full URL and run curl or requests.get(url, headers=...) to confirm you can receive the data unauthenticated. What if it fails? Re-inspect in DevTools and plan to re-check endpoints periodically (e.g., monthly); sites change them often. Beginner Milestone: validate that one endpoint returns data.
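A minimal sketch of replaying a discovered endpoint with requests; the endpoint URL, query parameters, and response keys (hits, title, companyName, postedDate) are hypothetical placeholders you would take from your own DevTools inspection:

# replay_endpoint.py (sketch: endpoint URL and field names are hypothetical)
import requests

ENDPOINT = "https://jobs.example.com/api/search"  # copied from DevTools (Network > XHR/Fetch)
params = {"q": "software engineer", "location": "New York", "page": 1}
headers = {"User-Agent": "job-scraper/1.0", "Accept": "application/json"}

r = requests.get(ENDPOINT, params=params, headers=headers, timeout=15)
r.raise_for_status()
data = r.json()

# Print a few structured fields; inspect the real payload to learn its actual keys
for hit in data.get("hits", []):
    print(hit.get("title"), "|", hit.get("companyName"), "|", hit.get("postedDate"))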

Step 2. Beginner: Basic Scraping (requests + BeautifulSoup)

Use a requests.Session() with retries and polite per-page delays. This example is production-safe for small projects.

# robust_requests.py

import requests  

import time  

import logging  

import hashlib  

import random  

from bs4 import BeautifulSoup  

from urllib.parse import urljoin  

from requests.adapters import HTTPAdapter, Retry  

import pandas as pd  

import math  

 

logging.basicConfig(level=logging.INFO)  

BASE = "https://www.indeed.com/jobs?q=software+engineer&l=New+York"  

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; job-scraper/1.0)"}  

 

session = requests.Session()  

retries = Retry(total=3, backoff_factor=1, status_forcelist=(429,500,502,503,504))  

session.mount("https://", HTTPAdapter(max_retries=retries))  

 

def make_job_id(title, company, location):  

    key = f"{title}|{company}|{location}".strip().lower()  

    return hashlib.sha1(key.encode()).hexdigest()  

 

def parse_job_card(card, base_url):  

    title = card.select_one("h2").get_text(strip=True) if card.select_one("h2") else ""  

    company = card.select_one("span.companyName").get_text(strip=True) if card.select_one("span.companyName") else ""  

    location = card.select_one("div.companyLocation").get_text(strip=True) if card.select_one("div.companyLocation") else ""  

    salary = card.select_one("div.salary-snippet").get_text(strip=True) if card.select_one("div.salary-snippet") else "N/A"  

    summary = card.select_one("div.job-snippet").get_text(strip=True) if card.select_one("div.job-snippet") else ""  

    a = card.select_one("a")  

    url = urljoin(base_url, a["href"]) if a and a.get("href") else ""  

    return {  

        "job_id": make_job_id(title, company, location), "title": title, "company": company, "location": location,  

        "salary_raw": salary, "summary": summary, "job_url": url  

    }  

 

def scrape_page(url):  

    try:  

        r = session.get(url, headers=HEADERS, timeout=15)  

        r.raise_for_status()  

    except Exception as e:  

        logging.error("Request failed %s: %s", url, e)  

        return []  

    soup = BeautifulSoup(r.text, "lxml")  

    cards = soup.select("div.job_seen_beacon")  

    return [parse_job_card(c, base_url=url) for c in cards]  

 

rows = []  

pages = 5  # Or dynamic: total_jobs = int(soup.find('div', id='searchCountPages').text.split()[-3]); pages = math.ceil(total_jobs / 15)  

for p in range(pages):  

    url = f"{BASE}&start={p*10}"  

    rows.extend(scrape_page(url))  

    time.sleep(1 + random.random())  # polite jitter  

 

pd.DataFrame(rows).to_csv("jobs.csv", index=False)  

logging.info("Saved jobs.csv (%d rows)", len(rows)) 

Notes: Use session to reuse TCP connections. Retries + backoff handle transient failures. Keep per-domain load low and monitor responses.

For anti-blocks, add residential proxies:

def rotate_proxies(proxies_list):  

    return random.choice(proxies_list)  

 

proxies = [{'http': 'http://user:pass@ip1:port', 'https': 'http://user:pass@ip1:port'}, {'http': 'http://user:pass@ip2:port', 'https': 'http://user:pass@ip2:port'}]  # Your list; map both schemes so HTTPS requests go through the proxy too  

proxy = rotate_proxies(proxies)  

r = session.get(url, headers=HEADERS, proxies=proxy, timeout=15)

Step 3. Intermediate: Playwright for JS-rendered Pages

Only use Playwright when JSON endpoints aren’t available. It’s heavier, so prefer it for smaller, necessary tasks.

# playwright_scraper.py

import asyncio  

import hashlib  

import logging  

from playwright.async_api import async_playwright  

from urllib.parse import urljoin  

import pandas as pd  

import random  

import time  

 

logging.basicConfig(level=logging.INFO)  

 

def make_job_id(title, company, location):  

    return hashlib.sha1(f"{title}|{company}|{location}".encode()).hexdigest()  

 

async def fetch(url, item_selector='div.job_seen_beacon'):  

    async with async_playwright() as p:  

        browser = await p.chromium.launch(headless=True)  

        page = await browser.new_page()  

        await page.goto(url, wait_until="networkidle")  

        try:  

            await page.wait_for_selector(item_selector, timeout=10000)  

        except Exception as e:  

            logging.warning("No items found or timeout: %s", e)  

            await browser.close()  

            return []  

        cards = await page.locator(item_selector).all()  

        rows = []  

        for c in cards:  

            title = await c.locator('h2').inner_text() if await c.locator('h2').count() else ""  

            company = await c.locator('span.companyName').inner_text() if await c.locator('span.companyName').count() else ""  

            loc = await c.locator('div.companyLocation').inner_text() if await c.locator('div.companyLocation').count() else ""  

            salary = await c.locator('div.salary-snippet').inner_text() if await c.locator('div.salary-snippet').count() else "N/A"  

            summary = await c.locator('div.job-snippet').inner_text() if await c.locator('div.job-snippet').count() else ""  

            href = await c.locator('a').get_attribute('href') if await c.locator('a').count() else ""  

            rows.append({  

                "job_id": make_job_id(title, company, loc), "title": title, "company": company, "location": loc,  

                "salary_raw": salary, "summary": summary, "job_url": urljoin(url, href) if href else ""  

            })  

        await browser.close()  

        return rows  

 

# Run with: asyncio.run(fetch("https://www.indeed.com/jobs?q=software+engineer&l=New+York"))  

# Save: rows = asyncio.run(fetch(url)); pd.DataFrame(rows).to_csv('jobs_playwright.csv')  

Tips

Use wait_for_selector and guard counts to avoid None failures.

Avoid launching too many browsers; reuse workers for multiple pages.

For infinite scroll or "load more" buttons, click the control in a loop until it stops appearing (see the sketch after these tips).

What if timeout? Increase or add proxies as in Step 2.
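Building on the infinite-scroll tip above, a minimal helper that keeps clicking a "load more" control until it disappears; the #load-more selector, click cap, and pause length are assumptions, and page is the async Playwright page object from the script above:

# Assumes an async Playwright Page (e.g., inside fetch() above); selector is hypothetical
async def load_all(page, button_selector="#load-more", max_clicks=50):
    for _ in range(max_clicks):                 # hard cap to avoid endless loops
        button = page.locator(button_selector)
        if await button.count() == 0:           # button gone: everything is loaded
            break
        await button.first.click()
        await page.wait_for_timeout(1500)       # pause politely while new cards render

Call await load_all(page) after page.goto(...) and before collecting cards.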

Pro Extension: Integrate an LLM for selector suggestions—e.g., via the openai package: from openai import OpenAI; client = OpenAI(); response = client.chat.completions.create(model="gpt-4o-mini", messages=[{"role": "user", "content": f"Generate a CSS selector for the job title in this HTML: {html_snippet}"}])  

Step 4. Data Cleaning, Normalization & Storage

1. Parsing: salary & dates

Salary starter (handles formats like 60k-80k, $60,000 - $80,000, and plain 40000, with varied dash characters; it strips only $ and commas, so extend it for other currencies)

import re  

 

def parse_salary(raw):  

    if not raw or raw == 'N/A': return {'min': None, 'max': None}  

    s = re.sub(r'[$,]', '', raw.lower())  # Remove $ and ,  

    m = re.search(r'(?P<min>\d+(?:\.\d+)?)(?:k)?\s*[-–—\u2013\u2014]\s*(?P<max>\d+(?:\.\d+)?)(?:k)?', s)  

    if m:  

        def to_num(txt, has_k=False): v = float(txt); return int(v * 1000 if has_k else v)  

        return {'min': to_num(m.group('min'), 'k' in s), 'max': to_num(m.group('max'), 'k' in s)}  

    m2 = re.search(r'(\d{2,6})', s)  

    if m2: return {'min': int(m2.group(1)), 'max': None}  

    return {'min': None, 'max': None} 
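Quick sanity check using parse_salary above (expected results shown as comments):

print(parse_salary("$60,000 - $80,000"))  # {'min': 60000, 'max': 80000}
print(parse_salary("60k-80k"))            # {'min': 60000, 'max': 80000}
print(parse_salary("40000"))              # {'min': 40000, 'max': None}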

Relative date → ISO (simple)

from datetime import datetime, timedelta  

from dateparser import parse  

import re  

 

def relative_to_iso(text):  

    t = (text or "").lower()  

    parsed = parse(t)  # Use dateparser for robust handling  

    if parsed: return parsed.date().isoformat()  

    # Fallback regex  

    m = re.search(r'(\d+)\s+day', t)  

    if m: return (datetime.utcnow() - timedelta(days=int(m.group(1)))).date().isoformat()  

    if 'today' in t: return datetime.utcnow().date().isoformat()  

    if 'yesterday' in t: return (datetime.utcnow() - timedelta(days=1)).date().isoformat()  

    return None  

2. Deduplication & canonical id

import hashlib  

 

def canonical_job_id(site, title, company, location, posted_date):  

    s = f"{site}|{title}|{company}|{location}|{posted_date}"  

    return hashlib.sha1(s.strip().lower().encode()).hexdigest()

Use this as your unique key and store first_seen, last_seen, and raw_html_hash to detect updates.
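A minimal SQLite sketch of that update-tracking idea; the database file, table, and column names are assumptions:

# track_updates.py (sketch: stdlib sqlite3; schema and file name are assumptions)
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("jobs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY, title TEXT, company TEXT, location TEXT,
    raw_html_hash TEXT, first_seen TEXT, last_seen TEXT)""")

def upsert_job(row):
    now = datetime.now(timezone.utc).isoformat()
    existing = conn.execute("SELECT raw_html_hash FROM jobs WHERE job_id = ?",
                            (row["job_id"],)).fetchone()
    if existing is None:
        conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?, ?)",
                     (row["job_id"], row["title"], row["company"], row["location"],
                      row["raw_html_hash"], now, now))
    elif existing[0] != row["raw_html_hash"]:
        # content changed: store the new hash and bump last_seen
        conn.execute("UPDATE jobs SET raw_html_hash = ?, last_seen = ? WHERE job_id = ?",
                     (row["raw_html_hash"], now, row["job_id"]))
    else:
        # unchanged: just record that the posting is still live
        conn.execute("UPDATE jobs SET last_seen = ? WHERE job_id = ?", (now, row["job_id"]))
    conn.commit()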

3. Data Storage: CSV → SQLite (prototype)

CSV header example:

job_id,title,company,location,posted_date,salary_min,salary_max,description,job_url,raw_html_path,scraped_at

For pros: Use pandas to load/clean, then to_sql for SQLite. Add viz:

import matplotlib.pyplot as plt  

 

# After cleaning df  

df['salary_min'].hist(bins=20)  

plt.title('Salary Distribution')  

plt.xlabel('Min Salary')  

plt.ylabel('Count')  

plt.show()  
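For the SQLite prototype step, a minimal load-and-persist sketch with pandas to_sql; the file and table names are placeholders:

# csv_to_sqlite.py (sketch: "jobs.csv", "jobs_prototype.db", and the table name are placeholders)
import sqlite3
import pandas as pd

df = pd.read_csv("jobs.csv")
with sqlite3.connect("jobs_prototype.db") as conn:
    df.to_sql("jobs", conn, if_exists="append", index=False)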

What if parsing fails? Log and fallback to raw.

Step 5. Monitoring & Testing

Smoke tests

Fetch 1 canonical page per domain daily; alert if element count drops >50%.
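A minimal smoke-test sketch along those lines; the canonical URL, selector, and baseline count are placeholders, and the "alert" is just a log call you would wire to Slack or email:

# smoke_test.py (sketch: URL, selector, and baseline are placeholders)
import logging
import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)
CANONICAL_URL = "https://www.indeed.com/jobs?q=software+engineer&l=New+York"
EXPECTED_CARDS = 15  # baseline from a known-good run

def smoke_test(url=CANONICAL_URL, selector="div.job_seen_beacon", expected=EXPECTED_CARDS):
    r = requests.get(url, headers={"User-Agent": "job-scraper/1.0"}, timeout=15)
    r.raise_for_status()
    count = len(BeautifulSoup(r.text, "lxml").select(selector))
    if count < expected * 0.5:  # element count dropped by more than 50%
        logging.error("Smoke test FAILED: %d cards (expected ~%d) at %s", count, expected, url)
    else:
        logging.info("Smoke test OK: %d cards at %s", count, url)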

Metrics & thresholds

  • Parser success rate ≥ 95% (24h window).
  • 429 rate < 1% of requests; if >1% sustained, reduce concurrency.
  • Average latency for HTML pages < 2s (in normal conditions).

CI & parser unit tests (pytest example)

# test_parser.py  

import pytest  

from bs4 import BeautifulSoup  

from yourparser import parse_job_card  # Assume your function  

 

def test_parse_job_card_basic():  

    html = '<div class="job_seen_beacon"><h2>Dev</h2><span class="companyName">Acme</span></div>'  

    soup = BeautifulSoup(html, 'lxml')  

    card = soup.select_one('div.job_seen_beacon')  

    data = parse_job_card(card, 'https://example.com')  

    assert data['title'] == "Dev"  

    assert data['company'] == "Acme"  

Run tests in CI for each parser change.

Step 6. Scale & Production Design

Architecture: Postgres (canonical records) + S3 (raw HTML) + search index (OpenSearch) + queue (Redis/RabbitMQ) + worker pool + scheduler (Airflow/Prefect).  

Fetch Layer: Async fetching with aiohttp plus a semaphore to cap per-domain concurrency, and proxy rotation (see the sketch after this list).  

Parser Layer: Template groups + schema.org fallback + LLM auto-adaptation.  

Monitoring: Alert on metric drops via Slack.

Against ML-driven anti-bot systems, consider hybridizing your own scrapers with official APIs or services that provide managed proxies. 
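For the fetch layer, a minimal sketch of async fetching with aiohttp and an asyncio.Semaphore concurrency cap (a single global cap here; a production aggregator would track limits per domain and plug in proxy rotation); the URLs and limits are placeholders:

# async_fetch.py (sketch of the fetch layer; tune limits per domain and add proxies as needed)
import asyncio
import logging

import aiohttp

logging.basicConfig(level=logging.INFO)
HEADERS = {"User-Agent": "job-scraper/1.0"}

async def fetch_one(session, url, sem):
    async with sem:  # cap concurrent requests
        try:
            async with session.get(url, headers=HEADERS,
                                   timeout=aiohttp.ClientTimeout(total=20)) as resp:
                resp.raise_for_status()
                return url, await resp.text()
        except Exception as e:
            logging.error("Fetch failed %s: %s", url, e)
            return url, None

async def fetch_all(urls, concurrency=5):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_one(session, u, sem) for u in urls))

# Usage: pages = asyncio.run(fetch_all(["https://example.com/jobs?page=1", "https://example.com/jobs?page=2"]))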

Troubleshooting & Common Pitfalls

Selectors break → Add fixture tests and run daily smoke tests.  

CAPTCHA → Stop and evaluate: Do not bypass automatically. Ask for API/license if needed.  

High 429s → Back off: Reduce concurrency, add jitter, rotate proxies, and monitor.  

Bad dedupe → Include posted_date and canonical normalization in job_id.  

Parsing fails → Log errors; use fallbacks like raw fields.

Two Examples: LinkedIn & Indeed

Note: Only scrape publicly viewable pages and public JSON/JSON-LD. If an endpoint needs authentication, do not reuse private tokens — contact the platform for API access or licensing.

Example A. LinkedIn (public job pages & JSON where accessible)

What you often need: aggregated metadata from LinkedIn's public job listings (title, company, location, posted date, apply URL) for market insight. LinkedIn frequently protects internal APIs and may require authenticated requests for some JSON endpoints.

1. Inspect public job posting pages (single job pages) and look for structured data (application/ld+json) or meta tags. Many job posts include JobPosting JSON-LD that you can parse without calling internal APIs.

2. If an unauthenticated JSON search endpoint exists, you may use it — but check it does not require login. If it does, stop and use the public page approach or the official LinkedIn API.

3. Do not reuse CSRF tokens or cookies from your browser to impersonate a logged-in request.

Parsing JobPosting embedded JSON example (safe, public):

import requests, json

from bs4 import BeautifulSoup

 

url = "https://www.linkedin.com/jobs/view/123456789"  # public job page

r = requests.get(url, headers={"User-Agent":"job-scraper/1.0"}, timeout=15)

r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")

job_json = None

for tag in soup.select("script[type='application/ld+json']"):

    try:

        payload = json.loads(tag.string)

        if payload.get("@type") == "JobPosting":

            job_json = payload

            break

    except Exception:

        continue

 

if job_json:

    title = job_json.get("title")

    company = job_json.get("hiringOrganization", {}).get("name")

    location = job_json.get("jobLocation", {}).get("address", {}).get("addressLocality")

    description = job_json.get("description")

    print(title, company, location)

else:

    print("No JobPosting JSON found — consider parsing page HTML or using official APIs.")

If you find a JSON search endpoint that returns hits without auth (rare), replay the request with requests and parse. If it requires authentication, do not share or reuse login tokens. Ask for API access.

Tips for LinkedIn

Avoid high-frequency calls from a single IP; LinkedIn is aggressive about blocking, so use proxies and jitter.

Prefer sampling a small number of public pages daily or pursue lawful data licensing if you need volume.

If JobPosting JSON is absent, parse the job page HTML in a careful, respectful way (same rules: keep rate low, use session, log errors).

What if blocked? Switch to the official LinkedIn API with OAuth.

Example B. Indeed (listings + detail pages)

Indeed is commonly scraped for job market signals. Many listing pages are static enough for requests + BeautifulSoup, but they sometimes include dynamic components and anti-bot measures.

1. Search results page: fetch listing pages (with headers, session, retries).

2. Extract job cards: capture basic metadata (title, company, location, short summary, job URL).

3. Follow job URL: fetch the job details page to collect the full description and posted date.

4. Normalize salary & dates with small helper functions.

Example code (listing → details):

import requests, time

from bs4 import BeautifulSoup

from urllib.parse import urljoin

 

BASE_SEARCH = "https://www.indeed.com/jobs?q=software+engineer&l=New+York&start=0"

HEADERS = {"User-Agent":"Mozilla/5.0 (compatible; job-scraper/1.0)"}

session = requests.Session()

 

def parse_listing_page(html, base):

    soup = BeautifulSoup(html, "lxml")

    results = []

    for card in soup.select("div.job_seen_beacon"):

        title_el = card.select_one("h2")

        title = title_el.get_text(strip=True) if title_el else ""

        company = (card.select_one("span.companyName").get_text(strip=True) if card.select_one("span.companyName") else "")

        location = (card.select_one("div.companyLocation").get_text(strip=True) if card.select_one("div.companyLocation") else "")

        rel = card.select_one("a")

        job_url = urljoin(base, rel["href"]) if rel and rel.get("href") else ""

        results.append({"title": title, "company": company, "location": location, "job_url": job_url})

    return results

 

def fetch_job_description(job_url):

    r = session.get(job_url, headers=HEADERS, timeout=15)

    r.raise_for_status()

    s = BeautifulSoup(r.text, "lxml")

    desc = s.select_one("#jobDescriptionText")

    return desc.get_text(separator="\n", strip=True) if desc else ""

 

# Example usage: scrape first 2 pages

all_jobs = []

for page in range(0, 2):

    url = f"https://www.indeed.com/jobs?q=software+engineer&l=New+York&start={page*10}"

    r = session.get(url, headers=HEADERS, timeout=15)

    r.raise_for_status()

    listings = parse_listing_page(r.text, base=url)

    for job in listings:

        if job["job_url"]:

            time.sleep(1)  # polite delay before fetching detail

            job["description"] = fetch_job_description(job["job_url"])

    all_jobs.extend(listings)

    time.sleep(2 + (page % 2))

Indeed troubleshooting tips

If you see frequent 429s or CAPTCHAs, reduce concurrency and increase jitter.

Indeed sometimes changes CSS selectors — build easy unit tests (HTML fixture) to detect parser breakage.

Use total_results (if present) to compute pages: pages = math.ceil(total_results / page_size).

For production and high volume, consider official data products or licensing — they are often more sustainable.

Final Thoughts

Start with one site and expand ethically. Automate via cron for daily runs, or integrate into apps (e.g., email alerts for matching roles). If you prefer not to code, try a visual scraping tool or a web scraping service. If scaling, consider API partnerships.
