A Step-by-Step Guide to Scraping X (Twitter) (Beginner → Production)
Hands-on guide to scraping X (Twitter) search results, timelines, followers, and media, with code, anti-block tactics, and GoProxy proxy setups.
Aug 15, 2025
Step-by-step 2025 guide: twscrape, Playwright XHR capture, account pools, and scaling with GoProxy proxies—by data type.
Twitter, now X, remains a key data source for insights, but since the platform's API restrictions tightened, many users have turned to web scraping. This guide walks a workflow from Plan → Prototype → Robust capture → Authenticated scraping → Production. At each step you’ll find Code, No-code, and Hybrid options so beginners and professionals can follow exact commands or platform instructions. It covers twscrape (prototype), Playwright (robust capture), GoProxy proxy guidance (residential vs datacenter, sticky sessions), auth persistence, proxy tests, orchestration hints (Airflow/Kubernetes), KPIs, and troubleshooting.
Who this helps
Analysts & beginners: want quick datasets or dashboards (no-code / CLI).
Non-developers: Octoparse, PhantomBuster, Apify fit well.
Engineers & data teams: Playwright, twscrape + token pools, orchestration & monitoring.
Market research & sentiment: track public reaction to product launches, events.
Journalism & academia: gather historical samples for studies (misinformation, elections).
Business intelligence: monitor competitors, hashtags, influencer activity.
Training data & scale: if the official API is restricted or costly, scraping public content is a fallback — but it requires engineering and risk management.
Law vs Terms: Some courts have allowed scraping of publicly accessible data in certain contexts (e.g., hiQ), but outcomes depend on jurisdiction and specific facts. Platform Terms of Service may still forbid scraping — that’s a separate risk. Consult counsel for commercial or high-risk projects.
Privacy & ethics: Limit to public tweets, avoid private messages and sensitive personal data. Anonymize PII where possible and document retention and deletion policies.
Enterprise caution: For large commercial usage consider licensing options or enterprise APIs where available.
Quick small dataset: No-code (Octoparse / PhantomBuster / Apify) or twscrape CLI (code).
JS-heavy pages / robust capture: Playwright (code).
Authenticated & scale: twscrape + account pools and sticky GoProxy residential IPs.
Production: orchestration + rotating residential proxies + monitoring.
Public search / hashtags: Step 2 (twscrape) or Step 3 (Playwright) if dynamic. No-code often suffices.
Single public timeline: Step 2 or 3.
Followers / following: Step 4 (authenticated) — account pools + sticky IPs.
Likes / bookmarks: Step 4 (authenticated) or Playwright with storage_state.
Media: Step 3 to capture media URLs; download via separate worker (datacenter proxies OK for downloads with rate limits).
Explore / regional trends: Step 4 with geotargeted residential IPs.
Python 3.10+ (download from python.org).
Terminal/CLI for code path.
A GoProxy account and at least 1 test proxy (IP:PORT + credentials).
An editor (VSCode) and optionally Git.
python -m pip install --upgrade pip
python -m pip install twscrape playwright jmespath pandas requests
python -m playwright install
# Quick verify
python -m pip show twscrape playwright
playwright --version
python -c "import sys; print(sys.version)"
twscrape --help
If installs fail, re-run pip and check firewall/network settings.
Step 1: Plan
Define the data model, volume target, compliance scope, and pilot queries.
Data model: tweet_id, created_at, author_id, author_username, text, metrics, media_urls, lang, raw_payload (see the schema sketch at the end of this step).
Volume & escalation: if >10k tweets/day, plan orchestration + proxy pools.
Legal check: limit to public tweets; avoid private/sensitive data; log retention policy.
Manual test: run the target query in a browser to confirm whether login or geo affects results.
If the scope raises compliance concerns, narrow the fields you collect or consult legal/policy.
Deliverable: project spec + pilot queries.
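A minimal sketch of the record schema above as a Python dataclass. The field names follow the list in this step; the types and the shape of metrics/raw_payload are assumptions to adjust once you see real payloads.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TweetRecord:
    tweet_id: str
    created_at: str                                   # ISO-8601 timestamp
    author_id: str
    author_username: str
    text: str
    metrics: dict = field(default_factory=dict)       # like/retweet/reply counts
    media_urls: list = field(default_factory=list)
    lang: Optional[str] = None
    raw_payload: dict = field(default_factory=dict)   # keep the original JSON for re-parsing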
Step 2: Prototype
Get a working dataset quickly with a tool that supports auth and modern site behavior.
twscrape is the preferred 2025 CLI prototype tool (supports GraphQL, search, and auth options).
pip install twscrape
# Example: search and save JSONL
twscrape search "climate since:2023-01-01" --jsonl > tweets.jsonl
Notes: twscrape often requires account cookies/tokens for consistent results. Check the docs for login options.
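If the CLI flags differ in your installed version, the Python API is an alternative. A minimal sketch based on twscrape's documented async workflow; the credentials are placeholders, and method/attribute names should be verified against your version:

import asyncio
from twscrape import API

async def main():
    api = API()  # account pool stored in accounts.db by default
    # one-time: register an account and log in (placeholder credentials)
    await api.pool.add_account("user1", "pass1", "user1@example.com", "mail_pass1")
    await api.pool.login_all()
    async for tweet in api.search("climate since:2023-01-01", limit=100):
        print(tweet.id, tweet.date, tweet.rawContent[:80])

asyncio.run(main())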
Tools: Octoparse, PhantomBuster, Apify.
Steps: Create task → start at https://twitter.com/search?q=yourquery → map tweet text, username, date, likes → add infinite scroll loop (e.g., 5–20 scrolls) → set delays 2–5s → enter GoProxy credentials in proxy settings → run & export CSV/JSON.
Validate queries with no-code then scale with twscrape.
twscrape returns nothing → check whether the search requires login; add credentials.
No-code problems → verify cookie import and proxy settings, reduce scroll speed.
Deliverable: tweets.jsonl (500–10k rows, depending on query).
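A quick sanity check of the deliverable (column names depend on which tool produced the file):

import pandas as pd

df = pd.read_json("tweets.jsonl", lines=True)
print(len(df), "rows")
print(df.columns.tolist())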
Step 3: Robust capture
Capture GraphQL/XHR JSON payloads using Playwright (resilient to UI changes).
pip install playwright jmespath requests pandas
python -m playwright install
# play_intercept.py
import json, time, logging
from playwright.sync_api import sync_playwright

logging.basicConfig(level=logging.INFO)

PROXY = {"server": "http://goproxy-host:8000", "username": "goproxy_user", "password": "goproxy_pass"}

def save_raw(url, data):
    with open("raw_responses.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps({"url": url, "data": data}, ensure_ascii=False) + "\n")

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, proxy=PROXY)
    page = browser.new_page()

    def on_response(response):
        try:
            url = response.url
            if "graphql" in url or "TweetResultByRestId" in url:
                text = response.text()
                data = json.loads(text)
                save_raw(url, data)
        except Exception as e:
            logging.warning("Error parsing %s : %s", getattr(response, "url", "unknown"), str(e))

    page.on("response", on_response)
    page.goto("https://twitter.com/search?q=climate", timeout=60000)
    time.sleep(5)
    browser.close()
# Save storage state after manual login (one-time)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, proxy=PROXY)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://twitter.com/login")
    input("Log in manually then press Enter...")
    context.storage_state(path="auth_state.json")
    browser.close()
# Reuse storage_state in later runs
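A minimal sketch of reusing the saved session in later runs; it assumes the same PROXY dict defined in play_intercept.py:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, proxy=PROXY)
    context = browser.new_context(storage_state="auth_state.json")  # restores cookies/local storage
    page = context.new_page()
    page.goto("https://twitter.com/home", timeout=60000)
    # ...register the same response handler as in play_intercept.py...
    browser.close()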
Use cloud actors (Apify, PhantomBuster) that expose Playwright-like execution for non-devs.
response.text() non-JSON → save response.body() and inspect (may be an HTML error page).
Headless detection → set headless=False for debugging and add human interactions.
403/429/CAPTCHA → reduce concurrency and rotate to residential proxies.
Deliverable: raw_responses.jsonl with GraphQL payloads.
Step 4: Authenticated scraping
Collect follower lists, likes, and personalized Explore results; manage tokens and per-account throttles.
Use twscrape with saved tokens or build a token/cookie pool (SQLite/Redis).
For each account, assign a sticky GoProxy residential IP for session stability.
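A minimal sketch of pinning each account to one sticky residential session. The session-id-in-username convention, host, and port are assumptions; check your GoProxy dashboard for the exact sticky-session format:

def sticky_proxy_url(account: str) -> str:
    # Hypothetical convention: reusing the same session id in the proxy username
    # keeps the same exit IP for that account across requests.
    return f"http://goproxy_user-session-{account}:goproxy_pass@goproxy-host:8000"

ACCOUNT_PROXIES = {acct: sticky_proxy_url(acct) for acct in ["acct_a", "acct_b", "acct_c"]}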
Platforms like PhantomBuster or Apify support session cookie injection & proxy per actor — useful for small pilots.
Repeated login fails → manually create auth_state.json in Playwright and reuse.
Truncated follower lists → inspect the GraphQL response for cursors and pagination fields (see the cursor-loop sketch below).
Deliverable: authenticated payloads (followers, likes, personalized results).
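The cursor loop itself is simple; the hard part is locating the cursor key in each payload. A generic sketch, where fetch_page is a hypothetical helper wrapping one authenticated request and the key names vary by endpoint:

def fetch_all(fetch_page):
    # fetch_page(cursor) -> (items, next_cursor); next_cursor is None when exhausted
    items, cursor = [], None
    while True:
        batch, cursor = fetch_page(cursor)
        items.extend(batch)
        if not batch or not cursor:
            break
    return items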
Step 5: Production
Continuous ingestion at scale with observability and cost control.
Account manager: secure token/cookie storage.
Proxy manager: GoProxy pool with health checks & retirement.
Worker fleet: Kubernetes pods (each pod uses 1 proxy & optionally 1 account), job queue (Redis/Celery/Kafka).
Storage: raw JSON → Parquet → data warehouse (BigQuery / Snowflake).
Monitoring: Grafana/Prometheus (success rate, captcha frequency, latency).
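A minimal sketch of exposing the core counters with prometheus_client; metric names are placeholders, and the .inc() calls belong inside your worker's request loop:

from prometheus_client import Counter, start_http_server

REQUESTS = Counter("scrape_requests_total", "Requests sent", ["proxy", "status"])
CAPTCHAS = Counter("scrape_captchas_total", "CAPTCHA challenges seen", ["proxy"])

start_http_server(9100)  # Prometheus scrapes metrics from this port

# inside the worker loop, after each request:
# REQUESTS.labels(proxy=proxy_id, status=str(resp.status_code)).inc()
# if looks_like_captcha(resp): CAPTCHAS.labels(proxy=proxy_id).inc()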
Simple proxy rotation example
import requests, random

PROXIES = [
    {"http": "http://user:pass@ip1:port", "https": "http://user:pass@ip1:port"},
    {"http": "http://user:pass@ip2:port", "https": "http://user:pass@ip2:port"},
]

def fetch_with_proxy(url):
    p = random.choice(PROXIES)
    return requests.get(url, proxies=p, timeout=20)
High error rates → scale back concurrency, remove failing proxies/accounts, add circuit breakers and backoff.
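A minimal backoff sketch around the fetch helper above; the retry count and the 403/429 trigger are illustrative:

import time, random

def fetch_with_backoff(url, max_attempts=5):
    for attempt in range(max_attempts):
        resp = fetch_with_proxy(url)
        if resp.status_code not in (403, 429):
            return resp
        # exponential backoff with jitter before retrying (ideally via a different proxy)
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")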
Success rate per proxy/account: >95%
CAPTCHA frequency: <1 per 1,000 requests (target)
Monitor: percent 429/403/timeouts and throughput per proxy.
Use realistic interactions (scrolls, clicks, variable delays).
Rotate user-agents and manage browser fingerprints (profiles/stealth); see the sketch after this list.
Prefer residential + sticky IPs for authenticated flows.
Monitor signals and pause scaling if captcha/throttling spikes.
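A minimal sketch of per-context user-agent rotation in Playwright; it assumes the browser object from the Step 3 snippet, and the UA strings are examples that should be kept current and paired with matching viewports:

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
]

context = browser.new_context(
    user_agent=random.choice(USER_AGENTS),
    viewport={"width": 1366, "height": 768},
    locale="en-US",
)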
twscrape returns no results → verify query in logged-in browser; add credentials/cookies.
Playwright captures non-JSON or errors → inspect raw_responses.jsonl, identify payload keys, update parsing.
High 403/429 → rotate proxies, reduce rate, use residential IPs.
CAPTCHA spike → reduce concurrency, switch accounts/IPs, humanize interactions.
import requests
proxy = {"http":"http://user:pass@ip:port","https":"http://user:pass@ip:port"}
r = requests.get("https://ifconfig.me/ip", proxies=proxy, timeout=15)
print("Public IP via proxy:", r.text)
Or curl:
curl -x "http://user:pass@ip:port" https://ifconfig.me/ip
import jmespath
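# 'data' below is one parsed GraphQL payload (a line from raw_responses.jsonl);
# the paths in the expression are illustrative and must match the actual payload shape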
expr = "data.search.results[*].tweet | [].{id:id, text:legacy.full_text, author:user.screen_name, likes:metrics.like_count}"
rows = jmespath.search(expr, data)
import pandas as pd
df = pd.read_json("raw_responses.jsonl", lines=True)
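# if nested columns (e.g., 'data') fail to convert to Parquet, flatten them first (e.g., with jmespath as above)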
df.to_parquet("tweets.parquet")
Q: Is scraping public tweets legal?
A: Legal status depends on jurisdiction and use. Courts have at times allowed scraping public data, but platform Terms of Service can still forbid it. For commercial projects, consult legal counsel.
Q: What proxy type should I use first?
A: Start with a small residential pilot (5–20 IPs). For bulk downloads (images/videos), datacenter proxies may be OK, but rate-limit them; for scaling, consider our unlimited-traffic rotating residential plans.
Q: Playwright headless detection issues?
A: Debug in headful mode, save storage_state, add interactions, rotate fingerprints, and use residential IPs.
This integrated Steps + Pick-Your-Path structure helps beginners learn in order while letting pros choose the most robust tools. Start small, measure block/captcha metrics, and scale only after your proxy & account health metrics are stable. Prioritize legal review and respectful rate limits — ethical scraping protects your project long term.
Ready to try? Create a GoProxy account, start with a small 5-IP pilot, run the Prototype step, and bring logs or errors back to the discussion. We’ll help you troubleshoot, and our 24/7 technical support team is always ready to assist.