Step-by-Step Guide to Scraping Bing Search Results with GoProxy
Aug 12, 2025
Hands-on guide to scrape Bing SERPs, PAA, Maps, images, and chat outputs with code, anti-block tactics, and GoProxy proxy setups.
Scraping search engines like Bing can unlock valuable insights for SEO, market research, and competitive analysis. While Google dominates the search landscape, Bing offers unique advantages—such as detailed image results, local business data, and less aggressive anti-scraping measures—making it an appealing target for web scrapers.
This guide explains how to scrape Bing SERPs, People Also Ask (PAA), rich snippets, images, Maps, and (carefully) chat outputs—with copy-paste scripts, proxy patterns, troubleshooting, and production guidance. First, run the organic SERP minimal example; then follow the relevant data-type section for more advanced needs.
Quick Legal Note: Always review Bing’s Terms of Service and local laws before scraping. Use scrapers responsibly, rate-limit requests, and avoid private/personal data.
Beginner quick start: run the minimal examples under each data type (follow the numbered steps).
Intermediate: add headless rendering (Playwright) and GoProxy rotation for reliability.
Production: follow the production tweaks, monitoring thresholds, and bulk architecture notes.
Bing complements Google: it often surfaces different images and local signals, returns alternative SERP features, and in many cases applies less aggressive blocking, which makes it useful for SEO, market research, and competitive analysis.
If you only need titles/URLs → use simple requests. If you need PAA, Maps, or chat outputs → use headless browsers + geo-aware proxies.
Bing's structure is scraper-friendly with consistent HTML classes like .b_algo for results. But challenges persist: IP blocking/rate limits, JS-rendered widgets, localized results, and CAPTCHAs. In 2025, AI-driven bot detection analyzes behavior beyond IPs.
Proxies mitigate these by:
Rotating IPs to spread requests and avoid per-IP throttles.
Geo-targeted IPs to retrieve local SERP/Maps variants.
Sticky sessions to preserve state for multi-step flows.
Residential pools to reduce detection risk (trade: cost and speed).
A managed rotating + geo-sticky proxy service (e.g., GoProxy) gives the reliability and geographic coverage most projects need.
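For illustration, here is a hypothetical sketch of how those modes could be expressed as proxy URLs in Python. The `?session=` and `country` parameters follow the hints later in this guide and are assumptions; confirm the exact syntax in your GoProxy dashboard.

```python
# Hypothetical proxy-URL builder; the session/country parameter syntax is an
# assumption based on this guide -- check the GoProxy dashboard for the real format.
GP_USER, GP_PASS = "GP_USER", "GP_PASS"
GP_HOST = "proxy.goproxy.example.com:8000"

def proxy_url(mode="rotate", session=None, geo=None):
    # rotate: every request may get a new exit IP (no extra params needed)
    url = f"http://{GP_USER}:{GP_PASS}@{GP_HOST}"
    params = []
    if mode in ("sticky", "geo_sticky") and session:
        params.append(f"session={session}")   # pin one exit IP across requests
    if mode == "geo_sticky" and geo:
        params.append(f"country={geo}")       # request an IP in the target geo
    return url + ("?" + "&".join(params) if params else "")
```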
| Data type | Simple requests | Headless (Playwright/Puppeteer) | GoProxy mode | Difficulty |
|---|---|---|---|---|
| Organic SERP (titles/snippets) | ✓ | ◼ if dynamic | rotate / geo_sticky | Low → Medium |
| Related queries / PAA | ◼ sometimes | ✓ | geo_sticky | Medium |
| Rich snippets / JSON-LD | ✓ (if present) | ✓ if injected | rotate | Low → Medium |
| Images / Video | ◼ thumbnails | ✓ for lazy-load | rotate (residential recommended) | Medium |
| Bing Maps / Local data | ◼ endpoint-based | ✓ for UI flows | geo_sticky (local IPs) | Medium → Hard |
| Bing AI / Chat | ✗ | ✓ (sticky sessions) | sticky | Hard |
| Bulk rank tracking | ✓ (fast) | ◼ only for features | rotate (large pool) | Medium → Hard |

Legend: ✓ recommended; ◼ conditional; ✗ not suitable.
Apply these as your baseline. Per-data-type sections override specifics.
User-Agent rotation (5–20 realistic UAs).
Accept-Language header matching target locale.
Proxy: use GoProxy — modes: rotate, sticky, geo_sticky.
Timeouts: connect 8–12s; read 15–25s.
Retries & backoff: 3 attempts; backoff [1s, 3s, 9s].
Rate limiting: jittered delays (random between min & max).
Cookie jar per session for stateful flows.
render_js flag: false by default; true when content depends on client JS.
CAPTCHA detection: search response for captcha|verify|are you human|recaptcha (case-insensitive).
Logging: record request URL, proxy IP, UA, response code; save raw HTML on error.
Proxy health: blacklist after 3 consecutive failures.
Geo-targeting: use proxies from the country/city for local SERPs/Maps.
Storage fields: query, page, position, title, url, snippet, raw_html_snapshot, fetched_at.
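Several of these baseline items (jittered delays, retry backoff) take only a few lines; a minimal sketch using the values from the config that follows:

```python
import random
import time

DELAY_MIN, DELAY_MAX = 1.2, 3.5   # matches the delay block in the config below
BACKOFF = [1, 3, 9]               # seconds between retries, per the baseline

def polite_sleep():
    # Jittered delay between requests so traffic doesn't look machine-timed.
    time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))

def with_backoff(fn, attempts=3):
    # Retry fn up to `attempts` times, sleeping 1s/3s/9s between failures.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(BACKOFF[i])
```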
```yaml
# configs/common.yaml
common:
  user_agents: ["UA1", "UA2", "..."]
  accept_language: "en-US,en;q=0.9"
  go_proxy:
    mode: "rotate"      # rotate | sticky | geo_sticky
    auth: "user:pass"   # or token
    geo: null           # e.g. "US"
  timeouts:
    connect: 10
    read: 20
  retries:
    attempts: 3
    backoff: [1, 3, 9]
  concurrency:
    per_proxy: 1
    workers: 4
  delay:
    min: 1.2
    max: 3.5
  render_js: false
  captcha_detection: true
  logging:
    save_raw_on_error: true
```
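If you keep these settings in configs/common.yaml, a minimal loader sketch (assumes PyYAML, `pip install pyyaml`; the proxy host is this guide's placeholder):

```python
# Minimal sketch of loading configs/common.yaml; assumes PyYAML is installed.
import yaml

with open("configs/common.yaml") as f:
    cfg = yaml.safe_load(f)["common"]

# proxy.goproxy.example.com:8000 is this guide's placeholder host
PROXY = f"http://{cfg['go_proxy']['auth']}@proxy.goproxy.example.com:8000"
TIMEOUT = (cfg["timeouts"]["connect"], cfg["timeouts"]["read"])
DELAY = (cfg["delay"]["min"], cfg["delay"]["max"])
```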
Verify proxy connectivity:

```bash
curl -x http://GP_USER:[email protected]:8000 'https://httpbin.org/ip'
```

Expected: a JSON body showing the proxy IP (HTTP 200).
1. Install Python 3.10+ and pip.
2. Create a virtualenv:

```bash
python -m venv venv
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows
```

3. Install core packages:

```bash
pip install requests beautifulsoup4 playwright
playwright install
```
4. Prepare GoProxy credentials: GP_USER, GP_PASS, GP_HOST. Choose rotate, sticky (e.g., append ?session=abc), or geo_sticky. Register, choose a rotating residential plan, and copy the credentials from your dashboard.
5. Create USER_AGENTS pool (file or array).
6. Place configs/common.yaml and update go_proxy.auth with your credentials; beginners can simply paste the PROXY string directly into the scripts instead.
Save as utils/fetcher.py. It sets up a requests.Session with retries and basic block detection. For no-proxy testing, set PROXY = None.
```python
# utils/fetcher.py
import logging
import random

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
]

# Set PROXY to None to test without proxy
PROXY = "http://GP_USER:[email protected]:8000"

def make_session(proxy_url=None):
    s = requests.Session()
    retries = Retry(total=3, backoff_factor=1,
                    status_forcelist=(429, 500, 502, 503, 504))
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.mount("http://", HTTPAdapter(max_retries=retries))
    if proxy_url:
        s.proxies.update({"http": proxy_url, "https": proxy_url})
    return s

def is_blocked(resp):
    if resp.status_code in (403, 429):
        return True
    txt = resp.text.lower() if resp.text else ""
    return any(tok in txt for tok in ("captcha", "are you human", "verify", "recaptcha"))

def fetch_serp(session, query, first=1, timeout=(8, 20)):
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.bing.com",
    }
    url = f"https://www.bing.com/search?q={requests.utils.quote(query)}&first={first}"
    resp = session.get(url, headers=headers, timeout=timeout)
    if is_blocked(resp):
        logging.warning("Blocked or rate-limited: %s", resp.status_code)
        raise Exception("Blocked or rate-limited")
    resp.raise_for_status()
    return resp.text
```
Useful for rank tracking and competitor monitoring. Sample output:
```json
[
  {"position": 1, "title": "Best Wireless Earbuds 2025 - Review", "url": "https://example.com/best-earbuds", "snippet": "A quick review of top earbuds."},
  {"position": 2, "title": "Top Earbuds for Running", "url": "https://example.com/running-earbuds", "snippet": "Lightweight earbuds for runners."}
]
```
1. Create scripts/serp_requests.py.
2. Paste:
```python
# scripts/serp_requests.py
from bs4 import BeautifulSoup

from utils.fetcher import make_session, fetch_serp, PROXY

session = make_session(PROXY)
html = fetch_serp(session, "best wireless earbuds 2025", first=1)
soup = BeautifulSoup(html, "html.parser")

results = []
for idx, li in enumerate(soup.select("li.b_algo"), start=1):
    a = li.select_one("h2 a")
    snippet = li.select_one("div.b_caption, p")
    results.append({
        "position": idx,
        "title": a.get_text(strip=True) if a else "",
        "url": a["href"] if a and a.has_attr("href") else "",
        "snippet": snippet.get_text(" ", strip=True) if snippet else "",
    })
print(results)
```
3. Run:
python scripts/serp_requests.py
4. Validate that the output matches the sample JSON above.
Use per-proxy concurrency = 1; queue jobs (Redis/Celery).
Save raw HTML snapshots for failed cases (≥1% sampling).
Track metrics: 403_ratio, avg_latency_ms, proxy_failure_rate.
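A minimal sketch of the proxy-health rule from the baseline (blacklist after 3 consecutive failures):

```python
from collections import defaultdict

class ProxyHealth:
    """Blacklist a proxy after 3 consecutive failures, per the baseline above."""

    def __init__(self, max_failures=3):
        self.failures = defaultdict(int)
        self.blacklist = set()
        self.max_failures = max_failures

    def report(self, proxy, ok):
        if ok:
            self.failures[proxy] = 0          # success resets the streak
        else:
            self.failures[proxy] += 1
            if self.failures[proxy] >= self.max_failures:
                self.blacklist.add(proxy)

    def usable(self, proxy):
        return proxy not in self.blacklist
```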
1. 403/429: Rotate proxy, change UA, add 5–15s sleep.
2. No li.b_algo: Set render_js=true and try Playwright.
3. Locale differences: Use geo-located proxies.
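A sketch combining the fixes from item 1, rotating proxies and backing off 5–15s between attempts, built on utils/fetcher.py above:

```python
import random
import time

from utils.fetcher import make_session, fetch_serp

PROXIES = ["http://GP_USER:[email protected]:8000"]  # extend with your pool

def fetch_with_rotation(query, attempts=3):
    for _ in range(attempts):
        session = make_session(random.choice(PROXIES))  # fix 1: rotate proxy
        try:
            return fetch_serp(session, query)   # fetch_serp picks a fresh UA each call
        except Exception:
            time.sleep(random.uniform(5, 15))   # fix 1: 5-15s sleep before retrying
    raise RuntimeError("All attempts blocked")
```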
PAA gives keyword ideas and user intent signals. Output is a list of Q&A snippets.
1. Create scripts/paa_playwright.py.
2. Paste:
```python
# scripts/paa_playwright.py
from urllib.parse import quote_plus

from playwright.sync_api import sync_playwright

PROXY = {"server": "http://proxy.goproxy.example.com:8000",
         "username": "GP_USER", "password": "GP_PASS"}

def extract_paa(query):
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=PROXY, headless=True)
        page = browser.new_page()
        page.goto(f"https://www.bing.com/search?q={quote_plus(query)}", timeout=30000)
        page.wait_for_timeout(2000)  # allow PAA widgets to render
        nodes = page.query_selector_all("div.rwrl div, div.b_accordion div")
        items = [n.inner_text().strip() for n in nodes if n.inner_text().strip()]
        browser.close()
        return items

print(extract_paa("best coffee shops near me"))
```
3. Run:
python scripts/paa_playwright.py
Use geo-sticky proxies to get localized suggestions.
Cache related queries per locale; de-duplicate.
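A tiny sketch of that per-locale cache and de-duplication:

```python
from collections import defaultdict

paa_cache = defaultdict(set)   # locale -> questions already seen

def add_paa(locale, questions):
    # Return only the questions not seen before for this locale.
    fresh = [q for q in questions if q not in paa_cache[locale]]
    paa_cache[locale].update(fresh)
    return fresh
```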
1. Empty results: Run headful mode (headless=False) to visually inspect and increase wait time.
2. If blocked: reduce rate and rotate proxies; add small random mouse movements.
Structured data (JSON-LD) gives product, review, FAQ details directly.
1. Try requests first and extract <script type="application/ld+json">.
2. If empty but the snippet appears visually, use Playwright to render.
Beginner snippet
```python
from bs4 import BeautifulSoup
import requests, json

PROXY = "http://GP_USER:[email protected]:8000"

r = requests.get("https://www.bing.com/search?q=site:example.com",
                 proxies={"http": PROXY, "https": PROXY}, timeout=15)
soup = BeautifulSoup(r.text, "html.parser")
for s in soup.find_all("script", {"type": "application/ld+json"}):
    try:
        print(json.loads(s.string))
    except Exception:
        continue  # invalid JSON: log s.string for manual inspection
```
1. If invalid JSON: log the string for manual inspection.
2. If missing: render with Playwright.
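A sketch of the fallback in item 2, rendering with Playwright and reading the injected JSON-LD:

```python
# Fallback for item 2: render the SERP, then read the injected JSON-LD blocks.
import json

from playwright.sync_api import sync_playwright

PROXY = {"server": "http://proxy.goproxy.example.com:8000",
         "username": "GP_USER", "password": "GP_PASS"}

def jsonld_rendered(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=PROXY, headless=True)
        page = browser.new_page()
        page.goto(url, timeout=30000)
        raw_blocks = page.eval_on_selector_all(
            'script[type="application/ld+json"]',
            "nodes => nodes.map(n => n.textContent)")
        browser.close()
    out = []
    for raw in raw_blocks:
        try:
            out.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # item 1: log the raw string for manual inspection
    return out
```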
Lists of image thumbnails, alt text, and source pages — useful for visual competitive research.
```python
# scripts/images_playwright.py
from playwright.sync_api import sync_playwright

PROXY = {"server": "http://proxy.goproxy.example.com:8000",
         "username": "GP_USER", "password": "GP_PASS"}

with sync_playwright() as p:
    browser = p.chromium.launch(proxy=PROXY, headless=True)
    page = browser.new_page()
    page.goto("https://www.bing.com/images/search?q=coffee", timeout=30000)
    page.wait_for_selector("img.mimg", timeout=10000)
    imgs = page.query_selector_all("img.mimg")
    for im in imgs[:20]:
        print(im.get_attribute("src"), im.get_attribute("alt"))
    browser.close()
```
Image downloads are bandwidth-heavy — set per-proxy bandwidth caps.
Prefer collecting metadata vs. bulk downloading full images (copyright).
If direct image fetch returns 403, request the image via the page (Playwright) so the Referer is present.
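A lighter alternative to the Playwright route in the last tip is to send the Referer yourself with requests; whether Bing's CDN accepts this for every image is not guaranteed:

```python
# Sketch: fetch a thumbnail with an explicit Bing Referer. Not guaranteed to
# satisfy every CDN check -- fall back to fetching via Playwright if it 403s.
import requests

PROXY = "http://GP_USER:[email protected]:8000"

def fetch_image(src):
    headers = {
        "Referer": "https://www.bing.com/images/search",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    }
    r = requests.get(src, headers=headers,
                     proxies={"http": PROXY, "https": PROXY}, timeout=15)
    r.raise_for_status()
    return r.content
```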
Place cards with name, address, phone, hours, rating.
1. Open bing.com/maps and search for a query + city.
2. DevTools → Network → filter XHR → find endpoints like /maps/overlaybfpr.
3. Copy the request URL and params (q, count, first, cp=lat~lon).
```python
import requests

PROXY = "http://GP_USER:[email protected]:8000"

url = "https://www.bing.com/maps/overlaybfpr?q=coffee&count=18&first=0&cp=40.7128~-74.0060"
r = requests.get(url, proxies={"http": PROXY, "https": PROXY}, timeout=15)
print(r.status_code)
print(r.text[:1000])
```
Parse returned HTML fragment/JSON for place cards.
Use geo-sticky proxies (IPs in the target city) for accurate local results.
Keep rate conservative (≤ 20 requests/IP/hour).
If empty: update cp coordinates or test from a local proxy in the target city.
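A rough parsing sketch for the returned fragment; the class names here are assumptions, so inspect the actual response in DevTools before relying on these selectors:

```python
# Rough parse of the overlaybfpr HTML fragment; the card classes below are
# hypothetical -- verify them against the real response first.
from bs4 import BeautifulSoup

def parse_places(fragment_html):
    soup = BeautifulSoup(fragment_html, "html.parser")
    places = []
    for card in soup.select("div.b_factrow, div.lMCard"):  # hypothetical classes
        text = card.get_text(" ", strip=True)
        if text:
            places.append(text)   # raw card text: name, address, phone, hours, rating
    return places
```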
Important: Only for internal research or if you have explicit permission. Chat interfaces are stateful and can trigger detection quickly.
1. Use Playwright with sticky session proxies.
2. Manually log in once with a persistent browser profile and save cookies.
3. Script must load cookies, keep same IP, submit prompt, wait for response, and extract results.
If session breaks: ensure sticky session + persisted cookies.
Add human-like delays and limited mouse movements to reduce detection.
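A hedged sketch of steps 1–3 using a persistent Playwright profile; the chat URL is a placeholder and the proxy should be a sticky session:

```python
# Sketch: persistent profile + sticky proxy so cookies and IP stay aligned.
# The /chat URL is a placeholder -- use the actual chat entry point you see.
from playwright.sync_api import sync_playwright

PROXY = {"server": "http://proxy.goproxy.example.com:8000",
         "username": "GP_USER", "password": "GP_PASS"}  # configure as sticky

with sync_playwright() as p:
    ctx = p.chromium.launch_persistent_context(
        user_data_dir="profiles/bing_chat",   # holds the logged-in cookies
        proxy=PROXY, headless=True)
    page = ctx.new_page()
    page.goto("https://www.bing.com/chat", timeout=30000)
    page.wait_for_timeout(3000)   # human-like pause before interacting
    ctx.close()
```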
1. Acquire proxy token/endpoint.
2. Create session or Playwright context with proxy.
3. Fetch page(s) → parse → save.
4. On 403/429 → mark proxy unhealthy, requeue job with exponential backoff.
Per-proxy concurrency = 1. Start ≤ 30 requests/IP/day for residential proxies; tune from monitoring.
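A sketch of step 4's requeue-with-backoff, with the fetch and health-report steps injected as callables:

```python
import random
import time

BACKOFF = [1, 3, 9]   # exponential backoff between requeues

def process(job, proxies, fetch_fn, report_fn):
    """Step 4 above: on a block, mark the proxy unhealthy and retry with backoff.

    fetch_fn(job, proxy) should raise on 403/429; report_fn(proxy, ok) feeds
    a proxy-health tracker like the one sketched earlier.
    """
    for delay in BACKOFF:
        proxy = random.choice(proxies)
        try:
            result = fetch_fn(job, proxy)
            report_fn(proxy, True)
            return result
        except Exception:
            report_fn(proxy, False)
            time.sleep(delay)
    raise RuntimeError(f"Job {job!r} failed after {len(BACKOFF)} attempts")
```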
Track: 403_ratio, avg_latency_ms, proxy_failure_rate, and cost_per_1k_requests.
1. Proxy connectivity:
curl -x http://GP_USER:[email protected]:8000 'https://httpbin.org/ip'
2. SERP smoke: python scripts/serp_requests.py (compare to sample JSON).
3. Playwright smoke: python scripts/paa_playwright.py (expect PAA items).
403_ratio > 0.02 → throttle and investigate
proxy_failure_rate > 0.05 → remove proxy from pool
avg_latency_ms above baseline → check network/proxy health
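These thresholds reduce to a few comparisons; a sketch over simple counters:

```python
# Sketch: evaluate the alert thresholds above from simple rolling counters.
def check_thresholds(stats):
    # stats: {"requests": int, "n403": int, "proxy_fail": int,
    #         "latency_ms": list, "baseline_ms": float}
    alerts = []
    if stats["n403"] / max(stats["requests"], 1) > 0.02:
        alerts.append("403_ratio > 2% -- throttle and investigate")
    if stats["proxy_fail"] / max(stats["requests"], 1) > 0.05:
        alerts.append("proxy_failure_rate > 5% -- remove proxy from pool")
    avg = sum(stats["latency_ms"]) / max(len(stats["latency_ms"]), 1)
    if avg > stats["baseline_ms"]:
        alerts.append("avg_latency_ms above baseline -- check network/proxy health")
    return alerts
```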
You can use an LLM to propose selector fixes, but always validate generated selectors on multiple pages before auto-applying. Example LLM prompt:
“Given this HTML fragment: <paste HTML>, suggest 2 robust CSS selectors (with confidence) to extract the result title and URL.”
Proxy pool reachable and credentials validated.
UA pool rotates and Accept-Language set.
Cookie handling / sticky session for stateful flows.
Raw HTML snapshot & logging enabled.
Proxy health-check + auto-blacklist implemented.
Monitoring & alerting for block spikes in place.
Beginners get copy-paste scripts and validation tests; professionals get production thresholds, architecture, and monitoring advice. Start small, verify selectors and proxy health, then scale carefully.
Ready to try? Sign up and get a free trial of rotating residential proxies. Start your first Bing scraping today!