Ethical guide to scraping public TikTok videos with step-by-step methods, code, and production best practices.
TikTok is one of the top platforms for viral content, trends, and user insights. Scraping TikTok videos can unlock valuable information, but with evolving anti-scraping measures—like AI-driven behavioral detection in 2026—doing it right requires knowledge, caution, and the right approach. This guide walks through step-by-step workflows, starter code (with session reuse and backoff), detection logging, testing guidance, and an ethics checklist so you can go from prototype to a maintainable pipeline responsibly.

Who this guide is for: researchers (public posts), marketers, engineers building data pipelines or ML datasets.
Do not scrape private accounts or bypass authentication. Check the Terms of Service and robots.txt as signposts (they are guidance, not legal immunity), and check local laws (GDPR, etc.). For research or commercial projects, document purpose, retention, and anonymization policies; consult legal counsel when in doubt.
Top scenarios include trend and hashtag research, marketing and competitor analytics, and building ML datasets from public posts. Each use case affects scale, frequency, and legal requirements — design accordingly.
Scraping isn't inherently illegal, but it must comply with laws and platform rules. TikTok's terms prohibit automated access that bypasses protections or harvests private data. Focus on public content only—never scrape protected accounts or personal info.
Checklist before you run:
✓ Confirm target content is public.
✓ Limit scope to non-sensitive metadata where possible.
✓ Store minimal identifiers and anonymize comments if used for research.
✓ Define a clear purpose, retention window, and access controls.
If uncertain, seek legal counsel.
Video metadata: id, caption, creation timestamp, stats (views/likes/comments), duration.
Author metadata: username, display name, followers (public counts), bio.
Hashtags, music IDs, mentions present in captions.
Public comments (paginated where available).
Video play URLs (may be watermarked or non-watermarked depending on flow).
Thumbnails and downloadable media (if accessible).
You cannot reliably get private messages, private-account content, or content behind login walls without explicit permission.
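For downstream processing, it helps to pin the fields you collect to a fixed record shape early. Below is a minimal sketch of one possible schema; the field names are illustrative assumptions, not an official TikTok structure, and should be mapped from whatever JSON your chosen method actually returns.

```python
# illustrative record shape for collected public-post data
# (field names are assumptions; map them from the JSON your scraper returns)
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VideoRecord:
    video_id: str
    caption: str
    created_at: int                     # Unix timestamp from the post metadata
    duration_s: Optional[int]           # seconds, if exposed
    views: Optional[int] = None
    likes: Optional[int] = None
    comments: Optional[int] = None
    author_username: str = ""
    author_followers: Optional[int] = None
    hashtags: List[str] = field(default_factory=list)
    music_id: Optional[str] = None
    play_url: Optional[str] = None      # may be watermarked
    thumbnail_url: Optional[str] = None
```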
Hydration / Embedded JSON: initial HTML may include a JSON blob (UNIVERSAL_DATA_FOR_REHYDRATION or similar) with first-page data. Easy to parse when present.
XHR / Fetch endpoints: client triggers API calls (e.g., item_list, comment/list) that return paginated JSON. Replicating them gives complete feeds.
Client token & fingerprinting: JS can generate signatures or device signals; advanced fingerprinting and CAPTCHAs may appear to deter scraping.
Use developer tools (Network → XHR) to inspect endpoints and payloads before coding.
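Before committing to a method, a quick probe can tell you whether a page ships embedded hydration JSON at all. A minimal sketch, assuming the marker strings below (names commonly seen in TikTok pages; verify against the actual page source in DevTools):

```python
# probe_page.py - heuristic check: does this page embed hydration JSON?
import sys
import requests

# marker strings are assumptions; confirm them in the page source
MARKERS = ["__UNIVERSAL_DATA_FOR_REHYDRATION__", "itemList", "aweme_list", "userInfo"]

def probe(url):
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=20)
    r.raise_for_status()
    found = [m for m in MARKERS if m in r.text]
    if found:
        print("Embedded JSON markers found:", ", ".join(found), "-> try hidden JSON parsing")
    else:
        print("No markers found -> inspect XHR calls or use a headless browser")

if __name__ == "__main__":
    probe(sys.argv[1] if len(sys.argv) > 1 else "https://www.tiktok.com/@somepublicprofile")
```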
| Method | Best for | Pros | Cons | Difficulty |
| --- | --- | --- | --- | --- |
| Hidden JSON parsing | Quick one-off tests | Fast, low compute | Fragile; first batch only | Low |
| XHR / API emulation | Full pagination, efficient | Reliable if endpoints stable | Requires reverse engineering | Medium |
| Headless browser capture | Client-token flows, dynamic loads | Works when JS required | Higher compute & detectability | High |
| Bulk/Async (Node.js) | High throughput & orchestration | Concurrency, streaming I/O | Needs orchestration & proxies | Medium–High |
Note: Method 4 is for scaling. Still use Hidden JSON / XHR / Headless under the hood—Node helps when you need many concurrent workers and non-blocking I/O.
Aim for reliability, not stealth; this guide does not cover, and you should not attempt, illegal evasion.
Respect rate limits: randomized delays, jitter and backoff.
Reputable proxies / geo-testing: for region-specific public data or higher request volumes, teams often rely on reputable proxy services that offer stable IP rotation and clear compliance policies.
Session reuse & cookie hygiene: use cookies from sessions you control. Never use stolen credentials.
Human-like headless behavior: varied scroll timing, occasional mouse moves.
CAPTCHA policy: do not bypass CAPTCHAs programmatically to evade protections — pause and escalate to manual resolution or authorized resolver services.
Minimal collection & anonymization: only store what you need. Hash or anonymize PII.
Logging & fail-safe: persist detection signals and pause jobs when thresholds are hit (see the sketch after this list).
Transparency for research: keep documentation, retention rules, and ethical review evidence.
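To make the logging and fail-safe item above concrete, here is a minimal sketch; the thresholds, signal names, and JSONL storage are assumptions to adapt to your own pipeline.

```python
# detection_log.py - persist detection signals and pause when a threshold is hit (sketch)
import json
import time

THRESHOLD = 5          # max detection hits per window (tune for your pipeline)
WINDOW_S = 600         # sliding window in seconds
LOG_PATH = "detections.jsonl"

def record_detection(signal, url):
    """Append a detection event (e.g. HTTP 403/429, CAPTCHA page) to a JSONL log."""
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), "signal": signal, "url": url}) + "\n")

def should_pause():
    """Return True when recent detection hits exceed the safety threshold."""
    cutoff = time.time() - WINDOW_S
    try:
        with open(LOG_PATH, encoding="utf-8") as f:
            recent = [1 for line in f if json.loads(line)["ts"] >= cutoff]
    except FileNotFoundError:
        return False
    return len(recent) >= THRESHOLD

# usage inside a worker loop:
#   if resp.status_code in (403, 429): record_detection(str(resp.status_code), url)
#   if should_pause(): stop the job and alert a human instead of pushing through
```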
Method 1: Hidden JSON parsing
Best for: quick tests, initial prototyping, or pages that embed hydration JSON.
Some pages embed initial state JSON in <script> tags. Parsing that can yield the first batch of videos and author metadata without executing JavaScript.
1. Inspect page source (Ctrl+U) for script tags that contain userInfo, aweme_list, itemList, or hydration names.
2. Write a robust finder that tries: (a) known script id, (b) script blocks containing known keys, (c) regex fallback across HTML.
3. Parse JSON and extract posts array.
4. For each post, save metadata and optionally download the play_addr MP4 (streaming with retries).
5. Run & test on a small public profile you control. Verify JSON saved and video playable.
This script demonstrates multiple fallback strategies, session reuse, and exponential backoff.
```python
#!/usr/bin/env python3
# robust_hidden_json.py
# Usage: python robust_hidden_json.py https://www.tiktok.com/@somepublicprofile
import sys, requests, re, json, time, random, os, hashlib
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/json;q=0.9,*/*;q=0.8",
    "Referer": "https://www.tiktok.com",
}

def fetch_with_backoff(session, url, max_retries=4):
    backoff = 1.0
    for attempt in range(max_retries):
        r = session.get(url, headers=HEADERS, timeout=20)
        if r.status_code == 200:
            return r
        if r.status_code in (429, 403, 503):
            time.sleep(backoff + random.random())
            backoff *= 2
            continue
        r.raise_for_status()
    r.raise_for_status()

def find_json(html):
    soup = BeautifulSoup(html, "html.parser")
    # 1) by well-known id
    script = soup.find('script', id='__UNIVERSAL_DATA_FOR_REHYDRATION__')
    if script and script.string:
        return json.loads(script.string)
    # 2) scan scripts for known keys
    keys = ['"userInfo"', '"aweme_list"', '"itemList"', 'UNIVERSAL_DATA_FOR_REHYDRATION']
    for s in soup.find_all('script'):
        text = s.string or ""
        if any(k in text for k in keys):
            m = re.search(r'(\{.*\})', text, flags=re.DOTALL)
            if m:
                try:
                    return json.loads(m.group(1))
                except Exception:
                    continue
    # 3) fallback regex across HTML
    m = re.search(r'(?s)(\{.*"userInfo".*?\})', html)
    if m:
        try:
            return json.loads(m.group(1))
        except Exception:
            pass
    return None

def download_stream(session, url, out_path):
    with session.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)

def scrape_profile(url, out_dir="data"):
    os.makedirs(out_dir, exist_ok=True)
    with requests.Session() as s:
        r = fetch_with_backoff(s, url)
        data = find_json(r.text)
        if not data:
            print("No embedded JSON found — try XHR or headless methods.")
            return
        # adapt to the JSON structure observed; below are common keys
        posts = data.get('aweme_list') or data.get('itemList') or []
        for p in posts:
            vid = p.get('aweme_id') or p.get('id')
            caption = p.get('desc') or p.get('title')
            # common path to video URL
            video_url = (p.get('video') or {}).get('play_addr', {}).get('url_list', [None])[0]
            meta = {"id": vid, "caption": caption, "video_url": video_url}
            # save metadata
            meta_path = os.path.join(out_dir, f"{vid}.json")
            with open(meta_path, "w", encoding="utf-8") as f:
                json.dump(meta, f, ensure_ascii=False, indent=2)
            print("Saved metadata:", meta_path)
            # optionally download video
            if video_url:
                try:
                    out_file = os.path.join(out_dir, f"{vid}.mp4")
                    download_stream(s, video_url, out_file)
                    # compute checksum
                    h = hashlib.sha256()
                    with open(out_file, "rb") as fh:
                        for chunk in iter(lambda: fh.read(8192), b""):
                            h.update(chunk)
                    print("Downloaded:", out_file, "sha256:", h.hexdigest())
                except Exception as e:
                    print("Download failed for", vid, e)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python robust_hidden_json.py <profile_url>")
        sys.exit(1)
    scrape_profile(sys.argv[1])
```
When this method fails:
Page lacks hydration JSON (fully client-rendered).
Key names change — update finder.
Pagination not available — only initial batch.
Method 2: XHR / API emulation
Best for: complete pagination (many posts) and efficient runs (profiles, hashtags).
The web client calls JSON endpoints as you scroll. Emulating those calls often gives full lists more reliably and with less compute than headless.
1. Open DevTools → Network → XHR/Fetch. Load profile and scroll to capture item_list, comment/list calls.
2. Capture required query params (cursor, count, secUid) and headers.
3. Replicate the calls in code with a session; page until hasMore is false.
4. Save metadata and download media URLs. (Extension for comments: use /api/comment/list/ with aweme_id from video metadata and paginate via cursor; a sketch follows the main script below.)
5. Run and test one page cycle; check the cursor and hasMore fields, and verify the number of items matches the UI.
```python
#!/usr/bin/env python3
# xhr_pagination.py
# Usage: python xhr_pagination.py
import requests, time, random, json

# NOTE: Verify the actual endpoint and required params in DevTools -> Network
BASE = "https://www.tiktok.com/api/post/item_list"  # example: verify before use
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.tiktok.com",
    "Accept": "application/json, text/javascript, */*; q=0.01"
}

def fetch_pages(secUid, max_pages=50, out_file="posts.json"):
    session = requests.Session()
    params = {
        "secUid": secUid,  # glean from profile or initial JSON
        "count": "30",
        "cursor": "0",
    }
    all_items = []
    for page in range(max_pages):
        r = session.get(BASE, params=params, headers=HEADERS, timeout=15)
        try:
            r.raise_for_status()
        except Exception as e:
            print("HTTP error:", e, "Status:", getattr(r, "status_code", None))
            break
        data = r.json()
        items = data.get("itemList") or data.get("aweme_list") or []
        all_items.extend(items)
        print(f"Page {page+1}: got {len(items)} items (total {len(all_items)})")
        if not data.get("hasMore"):
            print("No more pages")
            break
        params["cursor"] = data.get("cursor", params["cursor"])
        time.sleep(random.uniform(1.5, 4.5))
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(all_items, f, ensure_ascii=False)
    return all_items

if __name__ == "__main__":
    # set secUid to the profile's secUid (inspect page or initial JSON)
    SECUID = "USER_SEC_UID_HERE"
    fetch_pages(SECUID)
```
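Step 4 above mentions extending the same pattern to public comments. A minimal sketch of that extension, assuming a /api/comment/list/ style endpoint paginated by cursor; verify the real endpoint, parameter names, response keys, and any required headers in DevTools before relying on it.

```python
# comments_pagination.py - paginate public comments for one video (sketch)
import requests, time, random

# assumed endpoint and parameter names; confirm them in the Network tab
COMMENTS_URL = "https://www.tiktok.com/api/comment/list/"
HEADERS = {"User-Agent": "Mozilla/5.0", "Referer": "https://www.tiktok.com"}

def fetch_comments(aweme_id, max_pages=20):
    session = requests.Session()
    cursor, comments = 0, []
    for _ in range(max_pages):
        params = {"aweme_id": aweme_id, "count": 20, "cursor": cursor}
        r = session.get(COMMENTS_URL, params=params, headers=HEADERS, timeout=15)
        r.raise_for_status()
        data = r.json()
        # response keys ("comments", "has_more", "cursor") are assumptions to verify
        comments.extend(data.get("comments") or [])
        if not data.get("has_more"):
            break
        cursor = data.get("cursor", cursor)
        time.sleep(random.uniform(1.5, 4.0))  # polite, jittered delay between pages
    return comments
```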
When this method fails:
Endpoint signatures change or require server-side signatures.
Token or signature generation moves behind client JS.
Method 3: Headless browser capture
Best for: pages that generate tokens, or content only available after real browser interactions.
You use an actual browser engine to run the site’s JS, capture network responses, and simulate human actions.
1. Launch browser automation tool and intercept network responses.
2. Simulate human behavior: variable scrolls, random pauses, mouse movements.
3. Capture item_list and related responses, parse JSON, and save.
4. Respect CAPTCHAs and detection — pause and manual-resolve if seen.
Run and test 5–10 scroll iterations on a known profile; confirm that the captured JSON covers the same items as the UI. Compare headless and headed modes to detect fingerprinting issues.
```python
#!/usr/bin/env python3
# playwright_capture.py
# Usage: python playwright_capture.py
from playwright.sync_api import sync_playwright
import json, time, random

def run(profile_url, scroll_iterations=8, out_file="captured_data.json"):
    captured = []
    with sync_playwright() as p:
        # switch to headless=False for debugging if you appear to be blocked
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def on_response(response):
            try:
                url = response.url
                # tweak pattern based on what you see in DevTools
                if "item_list" in url and response.status == 200:
                    try:
                        data = response.json()
                        captured.append(data)
                        print("Captured item_list response, len:", len(data.get("itemList") or []))
                    except Exception:
                        pass
            except Exception:
                pass

        # register the listener before navigation so the first batch is captured too
        page.on("response", on_response)
        page.goto(profile_url, timeout=60000)

        # human-like scrolling
        for i in range(scroll_iterations):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            # small random mouse move
            page.mouse.move(random.randint(0, 800), random.randint(0, 600))
            time.sleep(random.uniform(2, 5))
        browser.close()

    # write captured responses (may be nested arrays)
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(captured, f, ensure_ascii=False, indent=2)
    print("Saved captured responses:", out_file)

if __name__ == "__main__":
    PROFILE = "https://www.tiktok.com/@somepublicprofile"
    run(PROFILE)
```
When this method fails:
Advanced fingerprinting detects headless patterns; tokens move to different generation mechanisms.
Large scale runs get blocked if no proxies or sessions used.
Method 4: Bulk / async scraping (Node.js)
Best for: scaling up to many concurrent jobs, high-throughput pipelines, or JS-first stacks.
Not a different data-source technique — Node.js is an implementation choice for scale. You still use hidden JSON/XHR/Headless extraction; Node helps with concurrency and non-blocking streaming. Typical architecture includes a job queue, worker pool, proxy manager, progress DB, and object storage.
Job queue (Redis, RabbitMQ) → worker pool (Node/Python) → progress DB (SQLite/Postgres) → object storage (S3 / local) → monitoring/alerts.
Add a proxy manager, session store, and retry/backoff orchestrator. At this stage, most production pipelines integrate a managed rotating proxy service alongside the job queue and session store to avoid single-IP bottlenecks.
```js
// bulk_async.js
// Usage: node bulk_async.js
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs-extra');
const pLimit = require('p-limit'); // note: p-limit v4+ is ESM-only; use v3 with require()

const HEADERS = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Accept-Language": "en-US,en;q=0.9",
  "Referer": "https://www.tiktok.com"
};

async function tryHiddenJson(url) {
  const r = await axios.get(url, { headers: HEADERS, timeout: 20000 });
  const $ = cheerio.load(r.data);
  const script = $('#__UNIVERSAL_DATA_FOR_REHYDRATION__').html();
  if (script) {
    try {
      const data = JSON.parse(script);
      return data;
    } catch (e) {
      // fallback: search for JSON-like blob inside scripts
      const scripts = $('script').map((i, s) => $(s).html()).get();
      for (const txt of scripts) {
        if (txt && txt.includes('aweme_list')) {
          const m = txt.match(/\{[\s\S]*"aweme_list"[\s\S]*\}/);
          if (m) {
            try { return JSON.parse(m[0]); } catch (e) { continue; }
          }
        }
      }
    }
  }
  return null;
}

async function processProfile(url, outDir = './out') {
  await fs.ensureDir(outDir);
  try {
    const data = await tryHiddenJson(url);
    if (data) {
      const outfile = `${outDir}/${encodeURIComponent(url)}.json`;
      await fs.writeJson(outfile, data, { spaces: 2 });
      console.log('Saved', outfile);
      return;
    }
    // Fallbacks: XHR emulation or headless. Placeholder below.
    console.log('Hidden JSON not found for', url, '- fallback required (XHR/headless).');
  } catch (err) {
    console.error('Error processing', url, err.message);
  }
}

// simple concurrency-limited runner
async function main() {
  const profiles = [
    'https://www.tiktok.com/@somepublicprofile',
    // add more profile URLs
  ];
  const limit = pLimit(5); // concurrency = 5
  const tasks = profiles.map(url => limit(() => processProfile(url, './out')));
  await Promise.all(tasks);
  console.log('All done');
}

main().catch(console.error);
```
Start with 1 worker, run smoke tests (100 videos).
Increase concurrency gradually and monitor detection_hits, latency, and error trends.
Implement autoscale-down or pause when detection thresholds exceed safety margins.
Common pitfalls:
Unbounded concurrency without proxies leads to rapid detection.
Orchestration errors (job leaks, duplicate work) without robust progress DB logic.
Session & headers
Use persistent HTTP sessions (requests.Session() / cookie jar).
Use a realistic header set (User-Agent, Accept-Language, Accept, Referer, Connection).
Rate limiting, backoff & detection
Randomized delays: time.sleep(random.uniform(a,b)).
Exponential backoff for transient 403/429/503.
Correlate detection hits and back off earlier when they accumulate.
Proxies & geo-targeting
Use reputable residential or mobile proxy services if regional access is required. Rotate IPs and use session affinity sparingly.
Tip: If proxies are required, use a single, well-documented proxy service (e.g., GoProxy) rather than mixing sources, which can introduce inconsistent behavior and detection patterns.
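If you do route traffic through a proxy, requests supports it directly on the session. A minimal sketch; the proxy URL and credentials are placeholders for whatever your provider issues.

```python
# attach a proxy to a persistent session (placeholder credentials; use your provider's values)
import requests

PROXY_URL = "http://USERNAME:PASSWORD@proxy.example.com:8000"

session = requests.Session()
session.proxies = {"http": PROXY_URL, "https": PROXY_URL}
session.headers.update({"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"})

resp = session.get("https://www.tiktok.com/@somepublicprofile", timeout=20)
print(resp.status_code)
```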
Progress & resume
Use a progress DB (SQLite/Postgres) to track video_id, status, attempts, and last_error.
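A minimal SQLite sketch of that progress table; the column names mirror the fields mentioned above and can be adjusted to your schema.

```python
# progress_db.py - track per-video scrape state so interrupted runs can resume (sketch)
import sqlite3, time

def init_db(path="progress.db"):
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS progress (
        video_id   TEXT PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'pending',   -- pending | done | failed
        attempts   INTEGER NOT NULL DEFAULT 0,
        last_error TEXT,
        updated_at REAL
    )""")
    conn.commit()
    return conn

def mark(conn, video_id, status, error=None):
    # upsert: create the row on first sight, bump attempts on later updates
    conn.execute("""INSERT INTO progress (video_id, status, attempts, last_error, updated_at)
                    VALUES (?, ?, 1, ?, ?)
                    ON CONFLICT(video_id) DO UPDATE SET
                      status = excluded.status,
                      attempts = progress.attempts + 1,
                      last_error = excluded.last_error,
                      updated_at = excluded.updated_at""",
                 (video_id, status, error, time.time()))
    conn.commit()

def pending(conn):
    return [row[0] for row in conn.execute(
        "SELECT video_id FROM progress WHERE status != 'done'")]
```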
Data integrity
Stream downloads to disk to avoid high memory use.
Compute SHA256 per file and store in metadata.
Monitoring & CI
Nightly smoke test against canonical profile(s).
Unit tests for parser functions with canned HTML/JSON.
Alerts on rising error rates or CAPTCHAs encountered.
Smoke tests
Nightly job that scrapes 5–10 known public posts and validates key fields (ids, timestamps).
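A minimal sketch of such a smoke test, reading the posts.json produced by the XHR script above; the required field names are assumptions to match whatever your pipeline stores.

```python
# smoke_test.py - validate key fields on a small sample of scraped posts (sketch)
import json, sys

# field names are assumptions; match whatever your pipeline actually stores
REQUIRED_FIELDS = ("id", "createTime")

def smoke_test(path="posts.json", min_items=5):
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    assert len(items) >= min_items, f"expected at least {min_items} items, got {len(items)}"
    for item in items[:min_items]:
        for fld in REQUIRED_FIELDS:
            assert item.get(fld), f"missing or empty field {fld!r}"
    print(f"Smoke test passed: {len(items)} items, first {min_items} validated")

if __name__ == "__main__":
    smoke_test(sys.argv[1] if len(sys.argv) > 1 else "posts.json")
```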
Unit tests
Keep parser fixtures (tests/fixtures/profile_page.html) and write pytest tests asserting find_json(html) returns expected keys.
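A matching pytest sketch for the parser, using a canned HTML fixture; the fixture path and expected keys are examples, and the import assumes the Method 1 script is available as a module on the test path.

```python
# tests/test_find_json.py - unit test for the embedded-JSON finder (sketch)
from pathlib import Path

from robust_hidden_json import find_json  # the Method 1 script, importable as a module

FIXTURE = Path(__file__).parent / "fixtures" / "profile_page.html"

def test_find_json_returns_expected_keys():
    html = FIXTURE.read_text(encoding="utf-8")
    data = find_json(html)
    assert data is not None, "parser found no embedded JSON in the fixture"
    # expected keys depend on the fixture you captured; adjust as needed
    assert any(k in data for k in ("itemList", "aweme_list", "userInfo"))

def test_find_json_handles_pages_without_json():
    assert find_json("<html><body>no scripts here</body></html>") is None
```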
Maintenance
Monthly endpoint verification: check DevTools Network tab for changed endpoints. Keep a “Last checked” log.
No JSON found: pivot to XHR or headless.
403/429 spikes: add more delay, reuse sessions, rotate proxies, persist detection state.
Incomplete downloads: stream, retry, compute checksum.
Region-specific content: test through geo-targeted proxies.
Frequent breakage: add parser unit tests and nightly smoke tests.
Q: Can I get non-watermarked MP4s?
A: Sometimes — some flows expose non-watermarked URLs; it’s not guaranteed. Respect copyrights.
Q: How often will scrapers break?
A: Often enough that you should run daily/weekly smoke tests. UI or endpoint changes can break parsers immediately.
Q: Should I use managed extraction services?
A: For scale, compliance, and lower maintenance burden, yes—at a cost. DIY is cheaper but requires ongoing maintenance.
A dependable TikTok video scraper is achievable if you pick the right method for scale, harden your pipeline (sessions, backoff, progress DB), monitor and test continuously, and prioritize legal/ethical compliance. Start small, validate, and iterate. That approach turns a fragile hack into a maintainable data pipeline.