Step-by-step guide to scrape job postings: code examples (LinkedIn & Indeed), parsing helpers, anti-block tips, monitoring and scaling.
This comprehensive guide covers how to scrape job listings reliably and ethically using Python—from single static sites to thousands of heterogeneous pages. You'll get concrete workflows, code examples, troubleshooting tips, data schema design, and scaling strategies to build a full end-to-end pipeline, plus examples for LinkedIn and Indeed. Designed for pros with deeper production insights, while remaining accessible for beginners.
Important Note: This article is educational, not legal advice. Laws and site policies vary by jurisdiction and by site. If you plan large-scale or commercial scraping, consult legal counsel and the target site’s Terms of Service. Don’t attempt to access private or authenticated data without permission.

Common user questions this article solves
"How do I pull listings from one site (e.g., Indeed) quickly without coding errors?"
"How to handle JavaScript-loaded content or infinite scroll?"
"How to scale across hundreds of varied job pages?"
"How to avoid blocks, clean data, and ensure legality?"
We'll solve these with actionable steps, starting simple and scaling up.
Typical scenarios include:
Job Hunters: Aggregate postings from sites like Indeed or LinkedIn into a spreadsheet, filter for remote roles or keywords, and apply faster.
Market Researchers: Gather data on trends, salaries, or skills in sectors like tech or finance.
Career Coaches: Identify emerging titles or qualifications for client guidance.
Students/Developers: Build scraping skills for portfolios or projects.
With AI-driven hiring, scraping uncovers patterns like rising demand for AI ethics roles. For pros, it enables custom aggregators for competitive intelligence.
Check robots.txt and the site’s Terms of Service. If a site prohibits scraping, don’t scrape.
Avoid logging into other people’s accounts or using tokens you don’t own.
Don’t collect private/personal data you don’t need; comply with GDPR/CCPA if applicable.
Use polite request rates and per-domain concurrency limits; be prepared to stop if the site blocks or issues CAPTCHA.
If the site requires login or access control to view job data, prefer the official API or ask for permission/licensing.
If in doubt, slow down and ask: public APIs or dataset partnerships are common, and often cheaper and legally safer than trying to scrape at scale.
| Site type | Easiest approach | Difficulty | When to choose |
| --- | --- | --- | --- |
| Static HTML job board | requests + BeautifulSoup | Low | Single site, few pages |
| JSON/XHR endpoint | Call the JSON endpoint with requests | Low → Medium | Stable structured fields available |
| JS-rendered/infinite scroll | Playwright headless | Medium | No JSON endpoint; small scale |
| Many heterogeneous sites | Async fetch + template grouping + proxies | High | Production aggregator / large-scale research |
Design your output schema early for consistency:
Use job_id for deduplication. For pros, add source_site and confidence_score (e.g., for AI-parsed fields).
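As a concrete starting point, here is a minimal schema sketch using the fields discussed in this guide (field names follow the CSV header shown later; adjust types and defaults to your needs):
# job_record.py: minimal schema sketch, not a prescribed format
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobRecord:
    job_id: str                                # sha1 of normalized title|company|location (see make_job_id below)
    title: str
    company: str
    location: str
    posted_date: Optional[str] = None          # ISO date string, e.g. "2025-06-01"
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    description: str = ""
    job_url: str = ""
    source_site: str = ""                      # e.g. "indeed", "linkedin"
    confidence_score: Optional[float] = None   # useful when fields are AI-parsed
    scraped_at: str = ""                       # ISO timestamp of the scrape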
1. Install Python 3.10+ from python.org.
2. Libraries: Run pip install requests beautifulsoup4 lxml pandas playwright aiohttp dateparser in a terminal, then run playwright install chromium to download the headless browser (the examples below parse HTML with the lxml backend).
3. Environment: VS Code or Jupyter for testing.
4. Optional: Proxies for scaling (paid residential proxies are the most reliable); AI libraries such as openai for selector generation.
Pro Tip: Start small—scrape 10 jobs to test compliance. For pros, set up virtualenv and logging from the start.
Before coding:
1. Open DevTools (F12) → Network → filter by XHR/Fetch.
2. Perform a search or scroll “load more” on the job page.
3. Look for requests returning JSON; inspect hits / elements arrays.
4. Replay the JSON request with requests (copy safe headers only — do not reuse private auth tokens).
JSON endpoints usually include structured fields (title, company, location, postedDate, applyUrl) and are far more stable than scraping DOM strings.
Quick test: copy the request's full URL and replay it with curl or requests.get(url, headers=...) to confirm you can receive the data unauthenticated. What if it fails? Re-inspect the request in DevTools and re-check monthly; sites update their endpoints often. Beginner milestone: validate that one endpoint returns data.
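A minimal sketch of such a replay, assuming a hypothetical unauthenticated search endpoint and parameter names (adapt the URL and fields to whatever you see in DevTools):
# replay_endpoint.py: sketch of replaying a public JSON endpoint found in DevTools
# NOTE: the URL and field names below are placeholders, not a real documented API.
import requests

URL = "https://example-jobboard.com/api/search"            # hypothetical endpoint
PARAMS = {"q": "software engineer", "location": "New York", "page": 1}
HEADERS = {"User-Agent": "job-scraper/1.0"}                 # safe headers only, no auth cookies/tokens

resp = requests.get(URL, params=PARAMS, headers=HEADERS, timeout=15)
resp.raise_for_status()
data = resp.json()

# Inspect the structure first (keys vary by site), then map it onto your schema.
for hit in data.get("hits", []):
    print(hit.get("title"), "|", hit.get("company"), "|", hit.get("location"))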
Use a requests.Session() with retries and polite per-page delays. This example is production-safe for small projects.
# robust_requests.py
import requests
import time
import logging
import hashlib
import random
import math
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from requests.adapters import HTTPAdapter, Retry
import pandas as pd

logging.basicConfig(level=logging.INFO)

BASE = "https://www.indeed.com/jobs?q=software+engineer&l=New+York"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; job-scraper/1.0)"}

session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=(429, 500, 502, 503, 504))
session.mount("https://", HTTPAdapter(max_retries=retries))

def make_job_id(title, company, location):
    key = f"{title}|{company}|{location}".strip().lower()
    return hashlib.sha1(key.encode()).hexdigest()

def parse_job_card(card, base_url):
    title = card.select_one("h2").get_text(strip=True) if card.select_one("h2") else ""
    company = card.select_one("span.companyName").get_text(strip=True) if card.select_one("span.companyName") else ""
    location = card.select_one("div.companyLocation").get_text(strip=True) if card.select_one("div.companyLocation") else ""
    salary = card.select_one("div.salary-snippet").get_text(strip=True) if card.select_one("div.salary-snippet") else "N/A"
    summary = card.select_one("div.job-snippet").get_text(strip=True) if card.select_one("div.job-snippet") else ""
    a = card.select_one("a")
    url = urljoin(base_url, a["href"]) if a and a.get("href") else ""
    return {
        "job_id": make_job_id(title, company, location), "title": title, "company": company, "location": location,
        "salary_raw": salary, "summary": summary, "job_url": url
    }

def scrape_page(url):
    try:
        r = session.get(url, headers=HEADERS, timeout=15)
        r.raise_for_status()
    except Exception as e:
        logging.error("Request failed %s: %s", url, e)
        return []
    soup = BeautifulSoup(r.text, "lxml")
    cards = soup.select("div.job_seen_beacon")
    return [parse_job_card(c, base_url=url) for c in cards]

rows = []
pages = 5  # Or dynamic: total_jobs = int(soup.find('div', id='searchCountPages').text.split()[-3]); pages = math.ceil(total_jobs / 15)
for p in range(pages):
    url = f"{BASE}&start={p*10}"
    rows.extend(scrape_page(url))
    time.sleep(1 + random.random())  # polite jitter

pd.DataFrame(rows).to_csv("jobs.csv", index=False)
logging.info("Saved jobs.csv (%d rows)", len(rows))
Notes: Use session to reuse TCP connections. Retries + backoff handle transient failures. Keep per-domain load low and monitor responses.
To reduce blocking, rotate residential proxies:
def rotate_proxies(proxies_list):
    return random.choice(proxies_list)

# Each entry covers both http and https traffic
proxies = [
    {"http": "http://user:pass@ip1:port", "https": "http://user:pass@ip1:port"},
    {"http": "http://user:pass@ip2:port", "https": "http://user:pass@ip2:port"},
]  # Your list
proxy = rotate_proxies(proxies)
r = session.get(url, headers=HEADERS, proxies=proxy, timeout=15)
Only use Playwright when JSON endpoints aren’t available. It’s heavier, so prefer it for smaller, necessary tasks.
# playwright_scraper.py
import asyncio
import hashlib
import logging
from playwright.async_api import async_playwright
from urllib.parse import urljoin
import pandas as pd

logging.basicConfig(level=logging.INFO)

def make_job_id(title, company, location):
    return hashlib.sha1(f"{title}|{company}|{location}".encode()).hexdigest()

async def fetch(url, item_selector='div.job_seen_beacon'):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        try:
            await page.wait_for_selector(item_selector, timeout=10000)
        except Exception as e:
            logging.warning("No items found or timeout: %s", e)
            await browser.close()
            return []
        cards = await page.locator(item_selector).all()
        rows = []
        for c in cards:
            title = await c.locator('h2').first.inner_text() if await c.locator('h2').count() else ""
            company = await c.locator('span.companyName').first.inner_text() if await c.locator('span.companyName').count() else ""
            loc = await c.locator('div.companyLocation').first.inner_text() if await c.locator('div.companyLocation').count() else ""
            salary = await c.locator('div.salary-snippet').first.inner_text() if await c.locator('div.salary-snippet').count() else "N/A"
            summary = await c.locator('div.job-snippet').first.inner_text() if await c.locator('div.job-snippet').count() else ""
            href = await c.locator('a').first.get_attribute('href') if await c.locator('a').count() else ""
            rows.append({
                "job_id": make_job_id(title, company, loc), "title": title, "company": company, "location": loc,
                "salary_raw": salary, "summary": summary, "job_url": urljoin(url, href) if href else ""
            })
        await browser.close()
        return rows

# Run with: rows = asyncio.run(fetch("https://www.indeed.com/jobs?q=software+engineer&l=New+York"))
# Save:     pd.DataFrame(rows).to_csv('jobs_playwright.csv', index=False)
Tips
Use wait_for_selector and guard counts to avoid None failures.
Avoid launching too many browsers; reuse workers for multiple pages.
For infinite scroll, loop await page.click('#load-more') (or scroll) until no new results appear; see the sketch after these tips.
What if it times out? Increase the timeout or add proxies as in Step 2.
Pro Extension: Use an LLM to suggest selectors, e.g. via the openai package: from openai import OpenAI; client = OpenAI(); response = client.chat.completions.create(model="gpt-4o-mini", messages=[{"role": "user", "content": f"Generate a CSS selector for the job title in this HTML: {html_snippet}"}])
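A minimal sketch of that load-more loop, assuming a hypothetical #load-more button (the selector is illustrative; many sites require scrolling to the bottom instead):
# infinite_scroll.py: sketch, keep loading results until the count stops growing
async def load_all(page, item_selector='div.job_seen_beacon', max_rounds=20):
    previous = 0
    for _ in range(max_rounds):
        count = await page.locator(item_selector).count()
        if count == previous:            # nothing new appeared, stop
            break
        previous = count
        button = page.locator('#load-more')
        if await button.count():
            await button.first.click()
        else:
            # Fallback: scroll down to trigger lazy loading
            await page.mouse.wheel(0, 4000)
        await page.wait_for_timeout(1500)  # give the new batch time to render
    return previous
Call it right after page.goto(...) inside fetch, then collect the cards as before.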
Salary parsing starter (handles formats like 60k-80k, $60,000 - $80,000, and 40000, plus varied dash characters and comma/dollar formatting):
import re

def parse_salary(raw):
    if not raw or raw == 'N/A':
        return {'min': None, 'max': None}
    s = re.sub(r'[$,]', '', raw.lower())  # Remove $ and ,
    # Range such as "60k-80k" or "60000 - 80000" (hyphen, en dash, or em dash)
    m = re.search(r'(?P<min>\d+(?:\.\d+)?)(?P<min_k>k)?\s*[-\u2013\u2014]\s*(?P<max>\d+(?:\.\d+)?)(?P<max_k>k)?', s)
    if m:
        def to_num(txt, has_k):
            v = float(txt)
            return int(v * 1000) if has_k else int(v)
        return {'min': to_num(m.group('min'), bool(m.group('min_k'))),
                'max': to_num(m.group('max'), bool(m.group('max_k')))}
    # Single figure such as "40000"
    m2 = re.search(r'(\d{2,6})', s)
    if m2:
        return {'min': int(m2.group(1)), 'max': None}
    return {'min': None, 'max': None}
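Quick check with the formats mentioned above:
# Expected outputs follow directly from the parsing rules above
print(parse_salary("60k-80k"))            # {'min': 60000, 'max': 80000}
print(parse_salary("$60,000 - $80,000"))  # {'min': 60000, 'max': 80000}
print(parse_salary("40000"))              # {'min': 40000, 'max': None}
print(parse_salary("N/A"))                # {'min': None, 'max': None}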
Relative date → ISO (simple)
from datetime import datetime, timedelta
from dateparser import parse
import re

def relative_to_iso(text):
    t = (text or "").lower()
    parsed = parse(t)  # Use dateparser for robust handling of strings like "3 days ago"
    if parsed:
        return parsed.date().isoformat()
    # Fallback regex for simple patterns
    m = re.search(r'(\d+)\s+day', t)
    if m:
        return (datetime.utcnow() - timedelta(days=int(m.group(1)))).date().isoformat()
    if 'today' in t:
        return datetime.utcnow().date().isoformat()
    if 'yesterday' in t:
        return (datetime.utcnow() - timedelta(days=1)).date().isoformat()
    return None
import hashlib

def canonical_job_id(site, title, company, location, posted_date):
    s = f"{site}|{title}|{company}|{location}|{posted_date}"
    return hashlib.sha1(s.strip().lower().encode()).hexdigest()
Use this as your unique key and store first_seen, last_seen, and raw_html_hash to detect updates.
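A minimal sketch of that bookkeeping with SQLite (the table and column names are illustrative):
# job_store.py: sketch, upsert by job_id and track first_seen / last_seen / raw_html_hash
import hashlib
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("jobs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY, title TEXT, company TEXT,
    first_seen TEXT, last_seen TEXT, raw_html_hash TEXT)""")

def upsert_job(job_id, title, company, raw_html):
    now = datetime.now(timezone.utc).isoformat()
    html_hash = hashlib.sha1(raw_html.encode()).hexdigest()
    row = conn.execute("SELECT raw_html_hash FROM jobs WHERE job_id = ?", (job_id,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
                     (job_id, title, company, now, now, html_hash))
    else:
        conn.execute("UPDATE jobs SET last_seen = ?, raw_html_hash = ? WHERE job_id = ?",
                     (now, html_hash, job_id))
        if row[0] != html_hash:
            print(f"Job {job_id} changed since last scrape")  # or log / flag for re-parsing
    conn.commit()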
CSV header example:
job_id,title,company,location,posted_date,salary_min,salary_max,description,job_url,raw_html_path,scraped_at
For pros: use pandas to load and clean the CSV, then DataFrame.to_sql to store it in SQLite (see the sketch after the chart). Add a quick visualization:
import matplotlib.pyplot as plt
# After cleaning df
df['salary_min'].hist(bins=20)
plt.title('Salary Distribution')
plt.xlabel('Min Salary')
plt.ylabel('Count')
plt.show()
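For the load/clean/SQLite step mentioned above, a minimal sketch (column names follow the CSV header example; the cleaning rules are just examples):
# load_and_store.py: sketch, clean the scraped CSV and push it into SQLite
import sqlite3
import pandas as pd

df = pd.read_csv("jobs.csv")
df = df.drop_duplicates(subset="job_id")       # dedupe on the canonical key
df = df.dropna(subset=["title", "company"])    # drop rows missing core fields
if "salary_min" in df.columns:
    df["salary_min"] = pd.to_numeric(df["salary_min"], errors="coerce")

with sqlite3.connect("jobs.db") as conn:
    df.to_sql("jobs_clean", conn, if_exists="replace", index=False)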
What if parsing fails? Log the error and fall back to keeping the raw field.
Smoke tests
Fetch 1 canonical page per domain daily; alert if element count drops >50%.
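A minimal sketch of such a daily smoke test, using the Indeed card selector from earlier and a placeholder alert hook (swap in Slack, email, or a webhook):
# smoke_test.py: sketch, fetch one known page per domain and alert on a large drop in card count
import logging
import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)

CHECKS = [
    # (url, css selector for job cards, expected baseline count)
    ("https://www.indeed.com/jobs?q=software+engineer&l=New+York", "div.job_seen_beacon", 15),
]

def alert(message):
    # Placeholder: replace with your Slack/webhook/email notification
    logging.error("ALERT: %s", message)

def run_smoke_tests():
    for url, selector, baseline in CHECKS:
        try:
            r = requests.get(url, headers={"User-Agent": "job-scraper/1.0"}, timeout=15)
            r.raise_for_status()
        except Exception as e:
            alert(f"Smoke test fetch failed for {url}: {e}")
            continue
        count = len(BeautifulSoup(r.text, "lxml").select(selector))
        if count < baseline * 0.5:  # element count dropped by more than 50%
            alert(f"{url}: only {count} cards found (baseline {baseline})")
        else:
            logging.info("%s OK (%d cards)", url, count)

if __name__ == "__main__":
    run_smoke_tests()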
Metrics & thresholds
CI & parser unit tests (pytest example)
# test_parser.py
from bs4 import BeautifulSoup
from yourparser import parse_job_card  # Assume your parsing function

def test_parse_job_card_basic():
    html = '<div class="job_seen_beacon"><h2>Dev</h2><span class="companyName">Acme</span></div>'
    soup = BeautifulSoup(html, 'lxml')
    card = soup.select_one('div.job_seen_beacon')
    data = parse_job_card(card, 'https://example.com')
    assert data['title'] == "Dev"
    assert data['company'] == "Acme"
Run tests in CI for each parser change.
A typical production stack: Postgres (canonical records) + S3 (raw HTML) + a search index (OpenSearch) + a queue (Redis/RabbitMQ) + a worker pool + a scheduler (Airflow/Prefect).
Fetch layer: async fetching with aiohttp and per-domain semaphores (see the sketch after this list), plus proxy rotation.
Parser Layer: Template groups + schema.org fallback + LLM auto-adaptation.
Monitoring: Alert on metric drops via Slack.
As anti-bot systems increasingly use ML, consider a hybrid approach: official APIs or managed proxy/scraping services where scraping alone is not sustainable.
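A minimal sketch of that fetch layer, assuming a per-domain limit of 2 concurrent requests and polite jitter (proxy rotation and retries omitted for brevity):
# async_fetch.py: sketch, polite concurrent fetching with aiohttp and per-domain semaphores
import asyncio
import random
from urllib.parse import urlparse

import aiohttp

HEADERS = {"User-Agent": "job-scraper/1.0"}
semaphores = {}  # one semaphore per domain

def domain_semaphore(url, limit=2):
    host = urlparse(url).netloc
    if host not in semaphores:
        semaphores[host] = asyncio.Semaphore(limit)
    return semaphores[host]

async def fetch(session, url):
    async with domain_semaphore(url):
        await asyncio.sleep(1 + random.random())  # polite per-request jitter
        async with session.get(url, headers=HEADERS, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            resp.raise_for_status()
            return url, await resp.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls), return_exceptions=True)

# Usage: results = asyncio.run(fetch_all(list_of_urls))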
Selectors break → Add fixture tests and run daily smoke tests.
CAPTCHA → Stop and evaluate: Do not bypass automatically. Ask for API/license if needed.
High 429s → Back off: Reduce concurrency, add jitter, rotate proxies, and monitor (a simple backoff helper is sketched after this list).
Bad dedupe → Include posted_date and canonical normalization in job_id.
Parsing fails → Log errors; use fallbacks like raw fields.
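A minimal backoff helper, if you handle 429s manually rather than relying only on the Retry adapter shown earlier:
# backoff.py: sketch, exponential backoff with jitter for 429/5xx responses
import random
import time

def get_with_backoff(session, url, max_attempts=5, base_delay=2.0, **kwargs):
    for attempt in range(max_attempts):
        resp = session.get(url, **kwargs)
        if resp.status_code not in (429, 500, 502, 503, 504):
            return resp
        # Exponential backoff: 2s, 4s, 8s, ... plus random jitter
        delay = base_delay * (2 ** attempt) + random.random()
        time.sleep(delay)
    resp.raise_for_status()  # give up and surface the last error
    return resp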
Note: Only scrape publicly viewable pages and public JSON/JSON-LD. If an endpoint needs authentication, do not reuse private tokens — contact the platform for API access or licensing.
A common need: aggregating public LinkedIn job listing metadata (title, company, location, posted date, apply URL) for market insight. LinkedIn frequently protects its internal APIs and may require authenticated requests for some JSON endpoints.
1. Inspect public job posting pages (single job pages) and look for structured data (application/ld+json) or meta tags. Many job posts include JobPosting JSON-LD that you can parse without calling internal APIs.
2. If an unauthenticated JSON search endpoint exists, you may use it — but check it does not require login. If it does, stop and use the public page approach or the official LinkedIn API.
3. Do not reuse CSRF tokens or cookies from your browser to impersonate a logged-in request.
Example: parsing the embedded JobPosting JSON-LD from a public job page:
import requests, json
from bs4 import BeautifulSoup

url = "https://www.linkedin.com/jobs/view/123456789"  # public job page
r = requests.get(url, headers={"User-Agent": "job-scraper/1.0"}, timeout=15)
r.raise_for_status()
soup = BeautifulSoup(r.text, "lxml")

job_json = None
for tag in soup.select("script[type='application/ld+json']"):
    try:
        payload = json.loads(tag.string)
        if payload.get("@type") == "JobPosting":
            job_json = payload
            break
    except Exception:
        continue

if job_json:
    title = job_json.get("title")
    company = job_json.get("hiringOrganization", {}).get("name")
    location = job_json.get("jobLocation", {}).get("address", {}).get("addressLocality")
    description = job_json.get("description")
    print(title, company, location)
else:
    print("No JobPosting JSON found; consider parsing page HTML or using official APIs.")
If you find a JSON search endpoint that returns hits without auth (rare), replay the request with requests and parse. If it requires authentication, do not share or reuse login tokens. Ask for API access.
Tips for LinkedIn
Avoid high-frequency calls from a single IP; LinkedIn blocks aggressively. Use proxies and jitter.
Prefer sampling a small number of public pages daily or pursue lawful data licensing if you need volume.
If JobPosting JSON is absent, parse the job page HTML in a careful, respectful way (same rules: keep rate low, use session, log errors).
What if blocked? Switch to the official LinkedIn API with OAuth.
Indeed is commonly scraped for job market signals. Many listing pages are static enough for requests + BeautifulSoup, but they sometimes include dynamic components and anti-bot measures.
1. Search results page: fetch listing pages (with headers, session, retries).
2. Extract job cards: capture basic metadata (title, company, location, short summary, job URL).
3. Follow job URL: fetch the job details page to collect the full description and posted date.
4. Normalize salary & dates with small helper functions.
Example code (listing → details):
import requests, time
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_SEARCH = "https://www.indeed.com/jobs?q=software+engineer&l=New+York&start=0"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; job-scraper/1.0)"}
session = requests.Session()

def parse_listing_page(html, base):
    soup = BeautifulSoup(html, "lxml")
    results = []
    for card in soup.select("div.job_seen_beacon"):
        title_el = card.select_one("h2")
        title = title_el.get_text(strip=True) if title_el else ""
        company = (card.select_one("span.companyName").get_text(strip=True) if card.select_one("span.companyName") else "")
        location = (card.select_one("div.companyLocation").get_text(strip=True) if card.select_one("div.companyLocation") else "")
        rel = card.select_one("a")
        job_url = urljoin(base, rel["href"]) if rel and rel.get("href") else ""
        results.append({"title": title, "company": company, "location": location, "job_url": job_url})
    return results

def fetch_job_description(job_url):
    r = session.get(job_url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    s = BeautifulSoup(r.text, "lxml")
    desc = s.select_one("#jobDescriptionText")
    return desc.get_text(separator="\n", strip=True) if desc else ""

# Example usage: scrape the first 2 result pages
all_jobs = []
for page in range(0, 2):
    url = f"https://www.indeed.com/jobs?q=software+engineer&l=New+York&start={page*10}"
    r = session.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    listings = parse_listing_page(r.text, base=url)
    for job in listings:
        if job["job_url"]:
            time.sleep(1)  # polite delay before fetching the detail page
            job["description"] = fetch_job_description(job["job_url"])
    all_jobs.extend(listings)
    time.sleep(2 + (page % 2))
Indeed troubleshooting tips
If you see frequent 429s or CAPTCHAs, reduce concurrency and increase jitter.
Indeed sometimes changes CSS selectors — build easy unit tests (HTML fixture) to detect parser breakage.
Use total_results (if present) to compute pages: pages = math.ceil(total_results / page_size).
For production and high volume, consider official data products or licensing — they are often more sustainable.
Start with one site and expand ethically. Automate daily runs via cron, or integrate the pipeline into apps (e.g., email alerts for matches). If you prefer not to code, try a visual scraping tool or a web scraping service. If you need to scale, consider API partnerships.