
How to Scrape Twitter (X): 2025 Methods + Steps

Post Time: 2025-08-19 Update Time: 2025-08-19

Twitter, now X, remains a key data source for insights, but since the platform tightened its API restrictions, many users have turned to web scraping techniques. This guide is a workflow that goes from Plan → Prototype → Robust capture → Authenticated scraping → Production. At each step you'll find Code, No-code, and Hybrid options so beginners and professionals can follow exact commands or platform instructions. It includes twscrape (prototype), Playwright (robust capture), GoProxy proxy guidance (residential vs datacenter, sticky sessions), auth persistence, proxy tests, orchestration hints (Airflow/Kubernetes), KPIs, and troubleshooting.

Who this helps

Analysts & beginners: want quick datasets or dashboards (no-code / CLI).

Non-developers: Octoparse, PhantomBuster, Apify fit well.

Engineers & data teams: Playwright, twscrape + token pools, orchestration & monitoring.

Why Scrape Twitter?

Market research & sentiment: track public reaction to product launches, events.

Journalism & academia: gather historical samples for studies (misinformation, elections).

Business intelligence: monitor competitors, hashtags, influencer activity.

Training data & scale: if the official API is restricted or costly, scraping public content is a fallback — but it requires engineering and risk management.

Safety & Legal (Read Before Scraping)

Law vs Terms: Some courts have allowed scraping of publicly accessible data in certain contexts (e.g., hiQ), but outcomes depend on jurisdiction and specific facts. Platform Terms of Service may still forbid scraping — that’s a separate risk. Consult counsel for commercial or high-risk projects.

Privacy & ethics: Limit to public tweets, avoid private messages and sensitive personal data. Anonymize PII where possible and document retention and deletion policies.

Enterprise caution: For large commercial usage consider licensing options or enterprise APIs where available.

Pick Your Path


By goal

Quick small dataset: No-code (Octoparse / PhantomBuster / Apify) or twscrape CLI (code).

JS-heavy pages / robust capture: Playwright (code).

Authenticated & scale: twscrape + account pools and sticky GoProxy residential IPs.

Production: orchestration + rotating residential proxies + monitoring.

By Twitter/X Data types

Public search / hashtags: Step 2 (twscrape) or Step 3 (Playwright) if dynamic. No-code often suffices.

Single public timeline: Step 2 or 3.

Followers / following: Step 4 (authenticated) — account pools + sticky IPs.

Likes / bookmarks: Step 4 (authenticated) or Playwright with storage_state.

Media: Step 3 to capture media URLs; download via separate worker (datacenter proxies OK for downloads with rate limits).

Explore / regional trends: Step 4 with geotargeted residential IPs.

Prerequisites & Test Your Setup (5–30 minutes)

What you need

Python 3.10+ (download from python.org).

Terminal/CLI for code path.

A GoProxy account and at least 1 test proxy (IP:PORT + credentials).

An editor (VSCode) and optionally Git.

Install

python -m pip install --upgrade pip

python -m pip install twscrape playwright jmespath pandas requests

python -m playwright install

# Quick verify

python -m pip show twscrape playwright

playwright --version

Quick sandbox tests

python -c "import sys; print(sys.version)"

twscrape --help

If installs fail, re-run pip and check firewall/network settings.

Step 1. Plan (10–30 minutes)

Goal

Define the data model, volume target, compliance scope, and pilot queries.

Do now

Data model: tweet_id, created_at, author_id, author_username, text, metrics, media_urls, lang, raw_payload (a sample record follows this list).

Volume & escalation: if >10k tweets/day, plan orchestration + proxy pools.

Legal check: limit to public tweets; avoid private/sensitive data; log retention policy.

Manual test: run the target query in a browser to confirm whether login or geo affects results.
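
For reference, a minimal sketch of one record in this data model; field names follow the list above and all values are illustrative:

# Illustrative record for the data model above (all values are made up)
tweet_record = {
    "tweet_id": "1750000000000000000",
    "created_at": "2025-01-15T12:34:56Z",
    "author_id": "123456789",
    "author_username": "example_user",
    "text": "Example tweet text",
    "metrics": {"like_count": 12, "retweet_count": 3, "reply_count": 1},
    "media_urls": [],
    "lang": "en",
    "raw_payload": {},  # keep the original JSON so you can re-parse later
}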

If this fails

Narrow fields or consult legal/policy.

Expected output

Project spec + pilot queries.

Step 2. Prototype (Beginner & No-Code, 15–60 mins)

Goal

Get a working dataset quickly with a tool that supports auth and modern site behavior.

Code path (recommended 2025): twscrape

twscrape is the preferred 2025 CLI prototype tool (supports GraphQL, search, and auth options).

pip install twscrape

# Example: search and save JSONL

twscrape search "climate since:2023-01-01" --jsonl > tweets.jsonl

Notes: twscrape often requires account cookies/tokens for consistent results. Check the docs for login options.
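
Beyond the CLI, twscrape also exposes a Python API for adding accounts and running searches. A minimal sketch based on its documented usage (credentials are placeholders; verify names against the current twscrape docs):

# Minimal twscrape Python API sketch (async); credentials below are placeholders
import asyncio
from twscrape import API, gather

async def main():
    api = API()  # account pool is stored in a local accounts.db by default
    await api.pool.add_account("user1", "pass1", "mail1@example.com", "mail_pass1")
    await api.pool.login_all()
    tweets = await gather(api.search("climate since:2023-01-01", limit=200))
    for t in tweets:
        print(t.id, t.date, t.rawContent[:80])

asyncio.run(main())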

No-code path

Tools: Octoparse, PhantomBuster, Apify.

Steps: Create task → start at https://twitter.com/search?q=yourquery → map tweet text, username, date, likes → add infinite scroll loop (e.g., 5–20 scrolls) → set delays 2–5s → enter GoProxy credentials in proxy settings → run & export CSV/JSON.

Hybrid

Validate queries with no-code then scale with twscrape.

If this fails

twscrape returns nothing → check whether the search requires login; add credentials.

No-code problems → verify cookie import and proxy settings, reduce scroll speed.

Expected output

tweets.jsonl (500–10k rows, depending on query). 

Step 3. Robust Scraping (Intermediate, 1–2 hours)

Goal

Capture GraphQL/XHR JSON payloads using Playwright (resilient vs UI changes).

Install

pip install playwright jmespath requests pandas

python -m playwright install

Playwright capture (code path)

# play_intercept.py

import json, time, logging

from playwright.sync_api import sync_playwright

 

logging.basicConfig(level=logging.INFO)

PROXY = {"server":"http://goproxy-host:8000","username":"goproxy_user","password":"goproxy_pass"}

 

def save_raw(url, data):

    with open("raw_responses.jsonl","a",encoding="utf-8") as f:

        f.write(json.dumps({"url":url,"data":data}, ensure_ascii=False)+"\n")

 

with sync_playwright() as p:

    browser = p.chromium.launch(headless=True, proxy=PROXY)

    page = browser.new_page()

    def on_response(response):

        try:

            url = response.url

            if "graphql" in url or "TweetResultByRestId" in url:

                text = response.text()

                data = json.loads(text)

                save_raw(url, data)

        except Exception as e:

            logging.warning("Error parsing %s : %s", getattr(response,"url", "unknown"), str(e))

    page.on("response", on_response)

    page.goto("https://twitter.com/search?q=climate", timeout=60000)

    time.sleep(5)

    browser.close()

Persist auth (one-time manual login)

# Save storage state after manual login (one-time)

from playwright.sync_api import sync_playwright

PROXY = {"server":"http://goproxy-host:8000","username":"goproxy_user","password":"goproxy_pass"}  # same GoProxy credentials as above

with sync_playwright() as p:

    browser = p.chromium.launch(headless=False, proxy=PROXY)

    context = browser.new_context()

    page = context.new_page()

    page.goto("https://twitter.com/login")

    input("Log in manually then press Enter...")

    context.storage_state(path="auth_state.json")

    browser.close()

# Reuse storage_state in later runs
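
Later runs can reuse the saved state instead of logging in again; a minimal sketch:

# Reuse the saved auth state in headless runs
from playwright.sync_api import sync_playwright

PROXY = {"server":"http://goproxy-host:8000","username":"goproxy_user","password":"goproxy_pass"}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, proxy=PROXY)
    context = browser.new_context(storage_state="auth_state.json")  # restores cookies/local storage
    page = context.new_page()
    page.goto("https://twitter.com/home", timeout=60000)
    # attach the same page.on("response", ...) capture handler as in the script above
    browser.close()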

No-code path

Use cloud actors (Apify, PhantomBuster) that expose Playwright-like execution for non-devs.

If this fails

response.text() non-JSON → save response.body() and inspect (may be an HTML error page).

Headless detection → set headless=False for debugging and add human interactions.

403/429/CAPTCHA → reduce concurrency and rotate to residential proxies.

Expected output

raw_responses.jsonl with GraphQL payloads.

Step 4. Authenticated & Account-managed Scraping (Upper-Intermediate, 1–3 days)

Goal

Collect follower lists, likes, personalized Explore results; manage tokens & per-account throttles.

Code path

Use twscrape with saved tokens or build a token/cookie pool (SQLite/Redis).

For each account, assign a sticky GoProxy residential IP for session stability.
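
A minimal sketch of pinning each account to its own sticky residential IP; the session-style usernames are a hypothetical illustration, so use the sticky-session format from your GoProxy dashboard:

# Hypothetical account-to-sticky-proxy mapping (adjust to your GoProxy sticky-session format)
ACCOUNT_PROXIES = {
    "account_a": "http://goproxy_user-session-a:pass@resi-gateway:8000",
    "account_b": "http://goproxy_user-session-b:pass@resi-gateway:8000",
}

def proxies_for(account: str) -> dict:
    url = ACCOUNT_PROXIES[account]
    return {"http": url, "https": url}  # requests-style proxies dict, one fixed IP per account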

No-code path

Platforms like PhantomBuster or Apify support session cookie injection & proxy per actor — useful for small pilots.

If this fails

Repeated login fails → manually create auth_state.json in Playwright and reuse.

Truncated follower lists → inspect GraphQL response for cursors & pagination fields.
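
Pagination usually means pulling a cursor out of each response and feeding it into the next request. A hedged sketch, assuming hypothetical JMESPath paths; confirm the real key names against the payloads in raw_responses.jsonl:

import jmespath

# Hypothetical paths; the real ones depend on the GraphQL endpoint you captured
ENTRIES_EXPR = "data.user.followers.entries"
CURSOR_EXPR = "data.user.followers.next_cursor"

def paginate(fetch_page, max_pages=50):
    """fetch_page(cursor) is your function that performs one authenticated request."""
    cursor = None
    for _ in range(max_pages):
        payload = fetch_page(cursor)
        yield from (jmespath.search(ENTRIES_EXPR, payload) or [])
        cursor = jmespath.search(CURSOR_EXPR, payload)
        if not cursor:  # no cursor returned means we reached the end
            break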

Expected output

Authenticated payloads (followers, likes, personalized results).

Step 5. Production & Scaling (Advanced, weeks)

Goal

Continuous ingestion at scale with observability and cost control.

Architecture & components

Account manager: secure token/cookie storage.

Proxy manager: GoProxy pool with health checks & retirement.

Worker fleet: Kubernetes pods (each pod uses 1 proxy & optionally 1 account), job queue (Redis/Celery/Kafka).

Storage: raw JSON → Parquet → data warehouse (BigQuery / Snowflake).

Monitoring: Grafana/Prometheus (success rate, captcha frequency, latency).
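
For the monitoring component, a minimal sketch using prometheus_client (an assumption, not a required part of the stack) to expose counters that Grafana can chart:

# Minimal metrics sketch; assumes `pip install prometheus_client`
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("scrape_requests_total", "Scrape requests", ["proxy", "status"])
CAPTCHAS = Counter("scrape_captchas_total", "CAPTCHA challenges seen", ["proxy"])

start_http_server(9100)  # metrics served at :9100/metrics for Prometheus to scrape

def record(proxy: str, status_code: int, captcha: bool = False):
    REQUESTS.labels(proxy=proxy, status=str(status_code)).inc()
    if captcha:
        CAPTCHAS.labels(proxy=proxy).inc()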

Simple proxy rotation example

import requests, random

PROXIES = [

    {"http":"http://user:pass@ip1:port","https":"http://user:pass@ip1:port"},

    {"http":"http://user:pass@ip2:port","https":"http://user:pass@ip2:port"},

]

def fetch_with_proxy(url):

    p = random.choice(PROXIES)

    return requests.get(url, proxies=p, timeout=20)

If this fails

High error rates → scale back concurrency, remove failing proxies/accounts, add circuit breakers and backoff.
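
A minimal backoff sketch for requests-based workers; the status codes and delays are reasonable defaults, so tune them to what you actually observe:

import random, time
import requests

# Retry with exponential backoff plus jitter on throttling/blocking responses
def fetch_with_backoff(url, proxies, max_retries=5):
    delay = 2.0
    for attempt in range(max_retries):
        r = requests.get(url, proxies=proxies, timeout=20)
        if r.status_code not in (403, 429):
            return r
        time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronized retries
        delay *= 2  # exponential backoff
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")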

KPI targets

Success rate per proxy/account: >95%

CAPTCHA frequency: <1 per 1,000 requests (target)

Monitor: percent 429/403/timeouts and throughput per proxy.

AI-detection & Anti-bot (2025 Trends)

Use realistic interactions (scrolls, clicks, variable delays).

Rotate user-agents and manage browser fingerprints (profiles/stealth).

Prefer residential + sticky IPs for authenticated flows.

Monitor signals and pause scaling if captcha/throttling spikes.
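
A minimal Playwright sketch of rotating user-agents and adding human-like scrolls with variable delays; the user-agent strings are placeholders, so keep them current and consistent with your browser fingerprint:

import random, time
from playwright.sync_api import sync_playwright

# Placeholder UA strings; use real, up-to-date values in practice
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0 Safari/537.36",
]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent=random.choice(USER_AGENTS))
    page = context.new_page()
    page.goto("https://twitter.com/search?q=climate", timeout=60000)
    for _ in range(random.randint(3, 8)):               # variable number of scrolls
        page.mouse.wheel(0, random.randint(600, 1200))  # scroll a random distance
        time.sleep(random.uniform(1.5, 4.0))            # variable delay between actions
    browser.close()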

Troubleshooting & Helpers

Common quick fixes

twscrape returns no results → verify query in logged-in browser; add credentials/cookies.

Playwright captures non-JSON or errors → inspect raw_responses.jsonl, identify payload keys, update parsing.

High 403/429 → rotate proxies, reduce rate, use residential IPs.

CAPTCHA spike → reduce concurrency, switch accounts/IPs, humanize interactions.

Proxy test (verify GoProxy credentials & IP)

import requests

proxy = {"http":"http://user:pass@ip:port","https":"http://user:pass@ip:port"}

r = requests.get("https://ifconfig.me/ip", proxies=proxy, timeout=15)

print("Public IP via proxy:", r.text)

Or curl:

curl -x "http://user:pass@ip:port" https://ifconfig.me/ip

jmespath flatten (pseudo)

import jmespath

# `data` is one parsed payload from raw_responses.jsonl; the expression below is
# illustrative, so adjust the paths to match the keys you actually captured
expr = "data.search.results[*].tweet | [].{id:id, text:legacy.full_text, author:user.screen_name, likes:metrics.like_count}"

rows = jmespath.search(expr, data)

pandas JSONL → parquet

import pandas as pd

df = pd.read_json("raw_responses.jsonl", lines=True)

df.to_parquet("tweets.parquet")  # requires pyarrow or fastparquet installed

FAQs

Q: Is scraping public tweets legal?

A: Legal status depends on jurisdiction and use. Courts have at times allowed scraping public data, but platform Terms of Service can still forbid it. For commercial projects, consult legal counsel.

Q: What proxy type should I use first?

A: Start with a small residential pilot (5–20 IPs). For bulk downloads (images/videos), datacenter proxies may be OK, but rate-limit them; consider our unlimited-traffic rotating residential plans for scaling.

Q: Playwright headless detection issues?

A: Debug in headful mode, save storage_state, add interactions, rotate fingerprints, and use residential IPs.

Final Thoughts

This integrated Steps + Pick-Your-Path structure helps beginners learn in order while letting pros choose the most robust tools. Start small, measure block/captcha metrics, and scale only after your proxy & account health metrics are stable. Prioritize legal review and respectful rate limits — ethical scraping protects your project long term.

Ready to try? Create a GoProxy account, start with a small 5-IP pilot, run the Prototype step, and bring logs or errors back to the discussion — we'll help you troubleshoot, and our 24/7 technical support team is always ready to help.
