How to Scrape YouTube Comments: 2025 Methods + Steps
Aug 19, 2025
Step-by-step 2025 guide to scrape YouTube comments using API, Selenium, or no-code tools, with GoProxy tips, ethics, and scaling.
Scraping YouTube comments is one of the fastest ways to analyze audience sentiment, build datasets for machine learning, and track engagement trends. But how do you do it effectively without running afoul of rules or getting your IP blocked? This comprehensive guide covers everything from beginner-friendly no-code methods to advanced Python scripts, with a focus on practical steps and common pitfalls.
Quick recommendations:
1. Start with the YouTube Data API for structured, repeatable, low-risk pulls and smaller jobs.
2. Use Selenium/Playwright only for UI-only data (pinned comments, some Shorts, live chat).
3. Use no-code scrapers for fast, non-technical exports.
4. Use GoProxy responsibly for geo-targeting and distributing UI requests—do not use proxies to bypass API quotas or platform policies.
YouTube boasts over 2 billion monthly users, generating millions of comments daily.
Common uses include:
Sentiment analysis around product releases or events (e.g., gauging reactions to a new tech gadget).
Market research: Discover complaints, feature requests, or emerging trends in viewer feedback.
Content optimization: Mine competitor comments to identify topic gaps or high-engagement ideas.
Training datasets for AI/ML models (label carefully; respect privacy and avoid biases).
Prefer the API where possible—it’s the supported, lower-risk route.
Do NOT use proxies to evade API quotas or bypass platform enforcement—that's both unethical and risky. Proxies are for geo-targeting, distributing legitimate requests, and avoiding local throttles.
Don’t collect sensitive personal data. If you publish results, anonymize authors or aggregate.
Commercial use: get legal review to ensure compliance with YouTube’s Terms and local law.
This short guide helps you choose first, then dive into the detailed implementations below.
| Method | Use when | Pros | Cons | Difficulty | Setup Time Estimate |
| --- | --- | --- | --- | --- | --- |
| YouTube Data API | Structured, repeatable pulls of public/top-level comments; large-scale | Official, stable, low ban risk | Requires API key, quotas, extra calls for replies | Medium | 10-20 mins |
| Browser Automation (Selenium/Playwright) | UI-only fields (e.g., pinned comments, Shorts, live chat) or logged-in simulation | Captures exact UI; handles lazy-loading | Fragile to UI changes; heavier resource use; requires proxy/fingerprint strategy | Medium-High | 15-30 mins |
| No-code / SaaS Scrapers | Quick exports, or non-technical teams needing scheduled jobs | Fast, visual, low setup | Less customizable; potential costs | Easy | 5-10 mins |
Validate with the API first (one video). If missing required fields, add a short Selenium test to confirm UI needs, then decide hybrid pipeline.
It's structured, reliable, and official—ideal for most pipelines.
Key fields per top-level comment: authorDisplayName, publishedAt, textDisplay/textOriginal, likeCount, replyCount, isPublic. Fetching replies requires extra calls.
Create Google Cloud project → enable YouTube Data API v3.
Create an API key (or OAuth if accessing private resources).
Monitor quota in Cloud Console (commentThreads and comments endpoints use quota units).
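As a rough sanity check on quota budgeting—assuming the default 10,000-unit daily quota and that each commentThreads.list page costs 1 unit (verify both in your Cloud Console, since quotas vary by project):

```python
# Back-of-envelope quota math (assumed defaults; check your own project).
daily_quota = 10_000        # default daily quota units
cost_per_page = 1           # assumed cost of one commentThreads.list call
comments_per_page = 100     # maxResults per page
max_comments_per_day = (daily_quota // cost_per_page) * comments_per_page
print(max_comments_per_day)  # roughly 1,000,000 top-level comments/day
```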
Run in terminal:

```shell
pip install google-api-python-client pandas
```
```python
# api_comments.py
import re
import time
import pandas as pd
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)

def extract_video_id(url):
    m = re.search(r'(?:v=|youtu\.be/|/v/)([A-Za-z0-9_-]{11})', url)
    return m.group(1) if m else None

def get_comments_api(video_id):
    comments = []
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=100,
        textFormat="plainText"
    )
    while request:
        response = request.execute()
        for item in response.get('items', []):
            s = item['snippet']['topLevelComment']['snippet']
            comments.append({
                'comment_id': item['id'],
                'author': s.get('authorDisplayName'),
                'text': s.get('textDisplay'),
                'likes': int(s.get('likeCount') or 0),
                'publishedAt': s.get('publishedAt'),
                # reply counts live on the thread snippet, not the comment
                'replyCount': int(item['snippet'].get('totalReplyCount') or 0)
            })
        request = youtube.commentThreads().list_next(request, response)
        time.sleep(0.1)  # small pause between pages
    return comments

if __name__ == "__main__":
    url = "https://www.youtube.com/watch?v=VIDEO_ID"
    vid = extract_video_id(url)
    df = pd.DataFrame(get_comments_api(vid))
    df.to_csv("youtube_comments_api.csv", index=False)
    print("Saved", len(df), "comments")
```
Extend for threaded data (replies cost extra quota):

```python
def get_replies(parent_id):
    replies = []
    req = youtube.comments().list(part="snippet", parentId=parent_id, maxResults=100)
    while req:
        resp = req.execute()
        for it in resp.get('items', []):
            s = it['snippet']
            replies.append({
                'reply_id': it['id'],
                'parent_id': parent_id,
                'author': s.get('authorDisplayName'),
                'text': s.get('textDisplay'),
                'likes': int(s.get('likeCount') or 0),
                'publishedAt': s.get('publishedAt')
            })
        req = youtube.comments().list_next(req, resp)
    return replies
```
For robust error handling, wrap calls in exponential backoff with jitter:

```python
import random
import time

def with_backoff(func, max_retries=5, base=0.5):
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            sleep = base * (2 ** (attempt - 1)) + random.uniform(0, base)
            time.sleep(sleep)
```
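To see the helper in action, here is a self-contained sketch with a deliberately flaky function that fails twice before succeeding (the function and failure count are illustrative only; in the pipeline above you would wrap `request.execute` instead):

```python
import random
import time

def with_backoff(func, max_retries=5, base=0.5):
    # Retry func with exponential backoff plus jitter.
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base * (2 ** (attempt - 1)) + random.uniform(0, base))

# Illustrative flaky call: raises twice, then returns a value.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

result = with_backoff(flaky, base=0.01)  # tiny base so the demo runs fast
```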
Monitor quota in Google Cloud Console.
Follow pagination via nextPageToken until all pages are fetched.
Don't use proxies to bypass API quota enforcement.
For pros: batch multiple videos in a loop.
403 Forbidden — check that the API key is valid, the API is enabled, and billing is active if required.
429 Too Many Requests — slow down, implement backoff, check quotas. Don’t try to bypass quotas with proxying.
Only when the API cannot provide the data you need (pinned comments, some Shorts, embedded UIs, or simulation of logged-in users).
Python 3.x; pip install selenium pandas selenium-wire (selenium-wire only if you need authenticated proxies).
Download a ChromeDriver matching your Chrome version (recent Selenium releases can manage the driver automatically).
Set up the environment:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
import pandas as pd

driver = webdriver.Chrome()  # pass a Service with a driver path if needed
driver.maximize_window()
```
Open the video URL and scroll to load comments:

```python
url = "https://www.youtube.com/watch?v=VIDEO_ID"
driver.get(url)
sleep(5)  # wait for initial load

for _ in range(200):  # increase for more comments
    driver.execute_script("window.scrollBy(0, 700)")
    sleep(2)  # pause so lazy-loaded comments can render
```
Use XPath; test selectors in browser DevTools (right-click element > Inspect > Copy XPath):

```python
comments = []
comment_elements = driver.find_elements(By.XPATH, '//*[@id="content-text"]')
for elem in comment_elements:
    comments.append(elem.text)

df = pd.DataFrame(comments, columns=['Comment'])
df.to_csv('youtube_comments.csv', index=False)
driver.quit()
```
To route traffic through an authenticated proxy, use selenium-wire:

```python
from seleniumwire import webdriver

proxy_opts = {
    'proxy': {
        'http': 'http://username:password@proxy-host:port',
        'https': 'http://username:password@proxy-host:port'
    }
}

options = webdriver.ChromeOptions()
options.add_argument('--start-maximized')
driver = webdriver.Chrome(seleniumwire_options=proxy_opts, options=options)
```
If your proxy uses IP allowlisting, plain Selenium suffices: `options.add_argument('--proxy-server=http://host:port')`.
Use WebDriverWait for expected elements instead of blind sleeps.
Randomize intervals between scrolls and actions.
Persist browser profiles (cookies) for logged-in scrapes (sticky sessions).
Use headful mode for reliability; headless is detectable and brittle for long runs.
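The randomized-timing tip can be a tiny helper that replaces fixed sleeps between scrolls and clicks (default values below are illustrative, not tuned):

```python
import random

def humanized_delay(base=2.0, jitter=1.5):
    # Return a randomized pause length in seconds: base plus up to `jitter`
    # extra, so consecutive actions never fire at identical intervals.
    return base + random.uniform(0, jitter)

d = humanized_delay()
```

In the scroll loop above you would call `sleep(humanized_delay())` instead of `sleep(2)`.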
If Selenium returns zero comments: validate DOM selectors in DevTools (right-click → Inspect → Copy → selector/XPath), increase wait times.
1. Choose a reputable scraper that supports JS rendering and proxies (popular options include Apify).
2. Paste video URLs or upload a CSV of video IDs.
3. Configure fields (comment text, author, date, likes, replies).
4. Test on a short video (100 comments). Verify output (e.g., encoding for emojis).
5. If bulk scraping, configure a proxy pool (GoProxy) on the tool if supported and rotate every 10–50 requests depending on risk.
Note: No-code tools are convenient, but verify field coverage (replies/pinned comments) before scaling.
GoProxy excels at geo-targeting, session stickiness, and request distribution—never use it to bypass API quotas or platform rules. Prefer mobile or residential proxies (they mimic real users) over datacenter proxies for YouTube.
Auth mode: username:password or IP allowlist (use exactly what GoProxy provides).
Rotation strategy: rotate every N requests or on errors (suggested: 5–50).
Concurrency per IP: start at 1–5 parallel jobs per residential IP; increase slowly.
Session stickiness: use sticky sessions (up to 60 minutes) for logged-in flows when needed.
Geo-targeting: select country-specific endpoints for regional views.
Monitoring: log proxy id, latency, status codes, errors; alert on >5% error rate.
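The "rotate every N requests" strategy can be as simple as cycling a pool by request index (hostnames below are placeholders, not real GoProxy endpoints):

```python
# Illustrative proxy rotation: cycle through a small pool every N requests.
pool = [
    "http://user:pass@gw1.example:8000",
    "http://user:pass@gw2.example:8000",
    "http://user:pass@gw3.example:8000",
]

def proxy_for(request_index, rotate_every=10):
    # Switch to the next proxy in the pool every `rotate_every` requests,
    # wrapping around when the pool is exhausted.
    return pool[(request_index // rotate_every) % len(pool)]
```

Rotating on errors as well (swap proxies whenever a request fails) layers naturally on top of this index-based scheme.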
Implementation note: googleapiclient has limited built-in proxy support—set the HTTP_PROXY/HTTPS_PROXY environment variables, or use a requests-based wrapper for explicit proxy control on non-API UI calls.
Recommended schema: comment_id, video_id, parent_id, author, author_channel_id, text, likes, reply_count, published_at, scraped_at, source_method.
Convert publishedAt to UTC datetimes: `df['publishedAt'] = pd.to_datetime(df['publishedAt'], utc=True)`.
Numerics to int: `df['likes'] = df['likes'].astype(int)`.
Deduplicate: `df = df.drop_duplicates(subset=['comment_id'])`.
Handle emojis: Preserve or normalize as needed.
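The normalization steps above can be sketched end to end with pandas (the toy rows, including a deliberate duplicate, are illustrative):

```python
import pandas as pd

# Toy rows mimicking scraped output; note the duplicate comment_id "c1".
df = pd.DataFrame([
    {"comment_id": "c1", "publishedAt": "2025-01-02T03:04:05Z", "likes": "7", "text": "Great video 👍"},
    {"comment_id": "c1", "publishedAt": "2025-01-02T03:04:05Z", "likes": "7", "text": "Great video 👍"},
    {"comment_id": "c2", "publishedAt": "2025-01-03T10:00:00Z", "likes": "0", "text": "First!"},
])

df["publishedAt"] = pd.to_datetime(df["publishedAt"], utc=True)  # ISO → UTC datetimes
df["likes"] = df["likes"].astype(int)                            # strings → ints
df = df.drop_duplicates(subset=["comment_id"]).reset_index(drop=True)
```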
```python
from textblob import TextBlob

# str() guards against NaN/None text values.
df['polarity'] = df['text'].apply(lambda t: TextBlob(str(t)).sentiment.polarity)
```
Pro Tip: Upgrade to transformer models (e.g., Hugging Face) for accuracy.
Small: CSV.
Medium: Parquet + S3 (partition by video_id/date).
Large: Columnar DB (BigQuery, ClickHouse).
API workers for bulk top-level comments.
Browser workers for UI-only content.
Orchestrator (queue) to dispatch video IDs to workers; proxy manager to assign IPs.
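A minimal sketch of the orchestrator idea—routing video IDs to API or browser workers depending on whether UI-only fields are needed (the routing rule and IDs are illustrative):

```python
import queue

# Dispatch queue: each job is (video_id, needs_ui_only_fields).
jobs = queue.Queue()
for vid, needs_ui in [("abc123", False), ("def456", True)]:
    # UI-only needs (pinned comments, Shorts) go to browser workers;
    # everything else goes to cheaper, more stable API workers.
    jobs.put((vid, "browser" if needs_ui else "api"))

routed = []
while not jobs.empty():
    vid, worker = jobs.get()
    routed.append((vid, worker))
```

In production the queue would be a real broker (Redis, SQS, etc.) and a proxy manager would assign an IP to each browser job.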
Track: requests/sec, success rate, avg latency, proxy failure rate.
Log: video_id, proxy_id, status_code, parse_errors.
Alerts: >5% error rate; CAPTCHAs spike; repeated 429/403.
Randomize timing; rotate user-agents; use real browser profiles; moderate concurrency.
Use residential proxies for human-like IP behavior.
Avoid headless mode for long runs.
403 (API) — Check key, enable API, activate billing.
429 — Backoff and reduce request rate; check quota. Do not bypass with proxies.
Empty Selenium results — Update selectors in DevTools; increase waits.
CAPTCHA frequency — Reduce concurrency; switch to residential proxies; add randomness.
Can I scrape private videos? Only with proper OAuth scopes and explicit permission.
Does the API return replies? Not by default—fetch them via comments.list with parentId.
Can I scrape live chat? Use liveChatMessages API for authorized streams; UI scraping for other cases (complex).
How many requests/month? Depends on API quota and proxy capacity — start small and monitor.
Scraping YouTube comments gives fast, actionable insights for product, marketing, and moderation workflows — but prioritize ethics, quotas, and data privacy. Start small, build observability, and expand with hybrid pipelines when necessary.
If you want reliable IP distribution and geo-targeting for your scraping pipeline, try GoProxy’s geo-targeted rotating residential proxies — sign up for a free trial and test proxy health with a pilot today.