A step-by-step guide to scraping images from Google Images with Python safely and efficiently: three proven methods, anti-blocking strategies, and quality-control tips.
This guide explains how to scrape Google Images safely at small and large scale, providing a beginner's Quick Start (Playwright + aiohttp), two lightweight methods, and a full cross-cutting section covering proxies (GoProxy), pagination, JSON extraction, CAPTCHA handling, async downloads, deduplication, metadata, monitoring, and legal best practices.
This guide is for anyone who wants to collect images from Google Images: beginners who want a working script fast, intermediate builders who want higher-quality images without a browser, and engineers who need a production-ready, scalable pipeline.
If you’re a total beginner: follow Before You Run Anything → Quick Start → Quick Check, then come back for the other sections.
Google Images is one of the largest public repositories of visual content, spanning stock-like photos, memes, product shots, and scientific diagrams. Scraping these images can support a wide range of projects, and each purpose affects your choices around scale, quality, geo-targeting, and legal caution.
Common challenges you will run into, and how this guide addresses them:
Dynamic loading (infinite scroll) → use Playwright/Selenium for JS rendering.
Obfuscated/full-size URLs inside inline JSON → we show how to extract AF_initDataCallback blobs.
Rate limits and CAPTCHAs → use proxies + detection & rotation strategies.
Duplicate/low-quality images → use hashing & size checks.
This guide is educational. Scraping Google Images may violate Google’s Terms of Service.
Do not republish copyrighted images without permission or license.
For face/personal data, consult privacy laws (GDPR/CCPA) and get legal advice.
Prefer licensed image sources or APIs for commercial work.
Do this first. Without proxies, you’ll likely hit CAPTCHAs quickly, even for small scrapes. GoProxy offers residential IPs with rotation, geo-targeting (country/state/city), and session support.
Sign up for credentials. Minimal setup:
python
# proxy_config.py
proxy_host = "proxy.goproxy.com"
proxy_port = "8000"
proxy_user = "your_user"
proxy_pass = "your_pass"

PROXIES = {
    "http": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
    "https": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
}

# For a Playwright context:
# context = browser.new_context(proxy={"server": f"http://{proxy_host}:{proxy_port}",
#                                      "username": proxy_user, "password": proxy_pass})
Rotate IPs every 1–5 requests for safety.
Choose geo-targeted IPs if you need region-specific results.
Monitor proxy error rates and replace dead endpoints automatically (a rotation sketch follows below).
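If you manage rotation yourself instead of relying on a rotating gateway, a minimal sketch like the one below cycles through a pool of endpoints every few requests and switches early on block responses. The PROXY_POOL entries and the get_with_rotation helper are hypothetical placeholders; adapt them to the session format your GoProxy plan actually exposes.
python
# rotate_proxies.py -- illustrative sketch only; endpoint strings are placeholders
import itertools
import requests

# Hypothetical pool: in practice these could be distinct gateway ports or
# session-specific usernames provided by your proxy plan.
PROXY_POOL = [
    "http://your_user-session1:your_pass@proxy.goproxy.com:8000",
    "http://your_user-session2:your_pass@proxy.goproxy.com:8000",
    "http://your_user-session3:your_pass@proxy.goproxy.com:8000",
]
ROTATE_EVERY = 3  # switch endpoints every few requests (1-5 is a sane range)

_cycle = itertools.cycle(PROXY_POOL)
_current, _count = next(_cycle), 0

def get_with_rotation(url, **kwargs):
    """GET through the current proxy, rotating every ROTATE_EVERY requests
    and immediately after a block-looking response (403/429)."""
    global _current, _count
    if _count >= ROTATE_EVERY:
        _current, _count = next(_cycle), 0
    _count += 1
    proxies = {"http": _current, "https": _current}
    kwargs.setdefault("timeout", 20)
    r = requests.get(url, proxies=proxies, **kwargs)
    if r.status_code in (403, 429):
        _current, _count = next(_cycle), 0  # retire this endpoint for now
    return r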
This is the beginner path: copy/paste and run. It discovers image URLs (handles JS) and downloads them with aiohttp via your proxy.
bash
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install playwright aiohttp pillow requests
playwright install chromium
python
# discover_playwright.py
from playwright.sync_api import sync_playwright
import time, json
from urllib.parse import quote_plus
from proxy_config import proxy_host, proxy_port, proxy_user, proxy_pass

PLAY_PROXY = {"server": f"http://{proxy_host}:{proxy_port}",
              "username": proxy_user, "password": proxy_pass}
OUT = "images.jsonl"

def detect_captcha(page):
    if page.query_selector('iframe[src*="recaptcha"]'):
        return True
    if "unusual traffic" in page.content().lower():
        return True
    return False

def run(query, max_images=300, scrolls=8):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        ctx = browser.new_context(proxy=PLAY_PROXY,
                                  user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
                                  locale="en-US", timezone_id="America/Los_Angeles")
        page = ctx.new_page()
        page.goto(f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch", timeout=30000)
        time.sleep(1)
        if detect_captcha(page):
            browser.close()
            raise RuntimeError("CAPTCHA detected: rotate proxy or slow down.")
        for _ in range(scrolls):
            page.evaluate("window.scrollBy(0, document.body.scrollHeight);")
            time.sleep(1)
        imgs = page.query_selector_all("img")
        urls = []
        for img in imgs:
            src = img.get_attribute("src")
            if src and src.startswith("http"):
                urls.append(src)
        browser.close()
    uniq = list(dict.fromkeys(urls))[:max_images]
    with open(OUT, "w") as f:
        for u in uniq:
            f.write(json.dumps({"url": u}) + "\n")
    print(f"Saved {len(uniq)} URLs to {OUT}")

if __name__ == "__main__":
    run("red running shoes", max_images=200, scrolls=10)
python
# download_aiohttp.py
import aiohttp, asyncio, hashlib, os, json, time
from proxy_config import proxy_user, proxy_pass, proxy_host, proxy_port

PROXY_URL = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
OUT_DIR = "images_out"
os.makedirs(OUT_DIR, exist_ok=True)
CONCURRENCY = 6
seen = set()

async def fetch_and_save(session, sem, url):
    async with sem:
        tries, backoff = 3, 1.0
        for _ in range(tries):
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30),
                                       proxy=PROXY_URL) as r:
                    if r.status == 200 and 'image' in r.headers.get('Content-Type', ''):
                        data = await r.read()
                        sha1 = hashlib.sha1(data).hexdigest()
                        if sha1 in seen:
                            return {"url": url, "status": "duplicate"}
                        seen.add(sha1)
                        fn = sha1[:16] + ".jpg"
                        with open(os.path.join(OUT_DIR, fn), "wb") as f:
                            f.write(data)
                        prov = {"url": url, "filename": fn, "sha1": sha1,
                                "downloaded_at": time.strftime("%Y-%m-%dT%H:%M:%S")}
                        with open(os.path.join(OUT_DIR, "provenance.jsonl"), "a") as pf:
                            pf.write(json.dumps(prov) + "\n")
                        return {"url": url, "status": "ok"}
                    else:
                        return {"url": url, "status": "bad_response", "code": r.status}
            except Exception:
                await asyncio.sleep(backoff)
                backoff *= 2
        return {"url": url, "status": "failed"}

async def download_list(urls):
    # One shared semaphore caps concurrency across all download tasks.
    sem = asyncio.Semaphore(CONCURRENCY)
    conn = aiohttp.TCPConnector(ssl=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        return await asyncio.gather(*(fetch_and_save(session, sem, u) for u in urls))

if __name__ == "__main__":
    with open("images.jsonl") as f:
        urls = [json.loads(line)["url"] for line in f]
    results = asyncio.run(download_list(urls[:500]))
    print("Done.", results[:5])
1. Does images_out/ contain files? If so, the run worked.
2. If more than 5% of downloads failed, reduce CONCURRENCY, increase timeouts, or change proxy.
3. If a CAPTCHA appears, rotate the proxy and re-run from the last good URL.
Open a few images, check resolution and variety. Inspect provenance.jsonl to confirm source URLs and timestamps. If images are tiny thumbnails or many duplicates, proceed to the JSON parsing or Selenium methods below.
Short decision guide:
Under 100 images / quick test: Method A (Requests + HTML) — easiest.
100–500 images, want higher-res: Method B (Requests + JSON parsing) — faster than browser.
500+ images / infinite scroll / personalized results: Method C (Playwright/Selenium + async downloads).
A good first step, but it often returns only thumbnails.
python
import requests, re
from urllib.parse import quote_plus
from proxy_config import PROXIES

headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9",
           "Referer": "https://www.google.com"}
q = "cats"
r = requests.get(f"https://www.google.com/search?q={quote_plus(q)}&tbm=isch",
                 headers=headers, proxies=PROXIES, timeout=20)

# Try to extract full-size image URLs first; fall back to <img src> thumbnails.
urls = re.findall(r'"ou":"(https?://[^"]+)"', r.text)
if not urls:
    urls = re.findall(r'<img[^>]+src="([^"]+)"', r.text)
Tips
Add &ijn=1 (see Pagination) to get deeper results.
Rotate User-Agent strings and add small random delays to reduce blocking, as sketched below.
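The following sketch illustrates both tips: it picks a random User-Agent and sleeps a jittered interval before each request. The fetch_page helper and the short UA list are illustrative examples, not a vetted fingerprint set.
python
# Illustrative: rotate User-Agent strings and add jittered delays per request.
import random, time
import requests
from urllib.parse import quote_plus
from proxy_config import PROXIES

USER_AGENTS = [  # sample desktop UA strings; extend with your own list
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def fetch_page(query, ijn=0):
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com",
    }
    url = f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch&ijn={ijn}"
    time.sleep(random.uniform(1.0, 3.0))  # random small delay instead of a fixed sleep
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=20)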
Google embeds useful JSON; extracting it yields full-size ou/murl URLs.
Robust extraction pattern:
python
import re, json, requests
from proxy_config import PROXIES

# proxy_config.py only defines PROXIES, so declare request headers here.
HEADERS = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9",
           "Referer": "https://www.google.com"}

r = requests.get("https://www.google.com/search?q=sunset&tbm=isch",
                 headers=HEADERS, proxies=PROXIES, timeout=20)
matches = re.findall(r"AF_initDataCallback\(([^<]+)\);", r.text)
images = []

def find_urls(x):
    # Recursively search nested dicts/lists for 'ou' or 'murl' URLs.
    if isinstance(x, dict):
        for k, v in x.items():
            if k in ("ou", "murl") and isinstance(v, str) and v.startswith("http"):
                images.append(v)
            else:
                find_urls(v)
    elif isinstance(x, list):
        for item in x:
            find_urls(item)

for m in matches:
    jstart, jend = m.find('{'), m.rfind('}')
    if jstart == -1 or jend == -1:
        continue
    try:
        j = json.loads(m[jstart:jend + 1])
    except Exception:
        continue
    find_urls(j)

print(len(images), "images found")
Notes
The JSON shape changes; always inspect with DevTools.
Use &tbs=isz:l to favor large images (see Troubleshooting).
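If you prefer not to hand-build the URL, these flags can be passed as query parameters instead; a minimal sketch, reusing the PROXIES config from earlier:
python
# Illustrative: request large images explicitly via query parameters.
import requests
from proxy_config import PROXIES

HEADERS = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"}
params = {
    "q": "sunset",
    "tbm": "isch",   # image search
    "tbs": "isz:l",  # favor large images
    "ijn": 0,        # page index (see Pagination)
}
r = requests.get("https://www.google.com/search", params=params,
                 headers=HEADERS, proxies=PROXIES, timeout=20)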
Use Playwright or Selenium to perform clicks, reveal ou links, and handle infinite scroll. The Quick Start shows Playwright discovery; you can extend it to click each thumbnail to reveal the original src and collect it (see the sketch after the best practices below).
Best practices:
Use headful or stealth approaches if headless triggers fingerprinting.
Keep a cookie session if you need personalized results.
Capture screenshots on block events for debugging.
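As a rough sketch of the thumbnail-click extension mentioned above, the helper below clicks each thumbnail and collects any external image URLs that appear in the preview panel. The collect_fullsize function and its selectors are illustrative assumptions: Google's markup changes often, so inspect the current DOM in DevTools and tighten the selectors before relying on this.
python
# Illustrative extension of the Quick Start discovery script.
# Selectors are placeholders -- verify them against the live page first.
def collect_fullsize(page, max_images=50):
    urls = []
    thumbs = page.query_selector_all("img")       # or a tighter thumbnail selector
    for thumb in thumbs[:max_images]:
        try:
            thumb.click()
            page.wait_for_timeout(800)            # let the preview panel load
            for img in page.query_selector_all("img"):
                src = img.get_attribute("src") or ""
                # full-size images are external http(s) URLs, not base64 thumbnails
                if src.startswith("http") and "gstatic.com" not in src:
                    urls.append(src)
        except Exception:
            continue
    return list(dict.fromkeys(urls))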
Immediate CAPTCHA → you tried scraping without proxies or with too many requests. Fix: add proxies, slow down, rotate UA.
Only base64 thumbnails → use Playwright to click thumbnails or use JSON parsing to find ou.
No new images on scroll → use ijn paging and verify your scrolls and delays.
A short checklist you can apply to any method:
Rotate realistic User-Agent strings.
Set Accept-Language, Referer and match timezone/locale to proxy geolocation.
Use random small delays instead of fixed sleeps.
Keep cookies between requests for session affinity (sticky IP).
Detect CAPTCHAs early and rotate proxies automatically.
CAPTCHA detection snippet (requests):
py
if r.status_code in (429, 403) or "unusual traffic" in r.text.lower():
    pass  # rotate proxy and retry this request
CAPTCHA detection snippet (Playwright):
py
if page.query_selector('iframe[src*="recaptcha"]') or "unusual traffic" in page.content().lower():
    page.screenshot(path="captcha.png")  # capture evidence of the block
    # rotate proxy/session and retry
Use aiohttp for concurrent downloads (sample in Quick Start).
Deduplicate by hashing bytes (SHA-1/256).
Save provenance for each image (original_url, filename, sha1, downloaded_at, proxy_ip) into a provenance.jsonl.
Small dedupe snippet:
py
import hashlib

sha1 = hashlib.sha1(image_bytes).hexdigest()
if sha1 not in seen_hashes:
    seen_hashes.add(sha1)
    # save the image and record provenance
# otherwise skip: it is a duplicate
Google Images results accept an ijn parameter (page index). Loop ijn=0..N to fetch more.
py
for ijn in range(0, 5):
    url = f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch&ijn={ijn}"
    r = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=20)
    # extract images as above
Stop when new pages return no new URLs or content repeats.
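A minimal sketch of that stop rule, reusing the request pattern from Method A (the HEADERS dict and query value here are example placeholders):
py
# Illustrative stop rule: end paging as soon as a page adds no new URLs.
import re, requests
from urllib.parse import quote_plus
from proxy_config import PROXIES

HEADERS = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"}
query = "sunset"
seen_urls = set()

for ijn in range(0, 20):
    url = f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch&ijn={ijn}"
    r = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=20)
    page_urls = set(re.findall(r'"ou":"(https?://[^"]+)"', r.text))
    new = page_urls - seen_urls
    if not new:
        break  # content repeats or the page came back empty
    seen_urls |= new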
ini
estimated_IPs = ceil((target_images * avg_req_per_image) / (requests_per_IP_per_min * run_minutes))
Defaults to try:
Example: 2,000 images in 60 minutes at 4 requests per IP per minute → ceil(2,000 / 240) = 9 IPs; provision ~10 to leave headroom for retries (start with a pilot run and scale up).
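The same arithmetic as a runnable check (the per-image request count is an assumption; raise it if discovery pages and retries add overhead):
py
# Worked example of the sizing formula above (values are illustrative).
from math import ceil

target_images = 2000
avg_req_per_image = 1          # assumed: one request per downloaded image
requests_per_ip_per_min = 4
run_minutes = 60

estimated_ips = ceil((target_images * avg_req_per_image) /
                     (requests_per_ip_per_min * run_minutes))
print(estimated_ips)           # 9 -- round up to ~10 for retry headroom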
1. Run Quick Start for 50–200 images.
2. Check success rate (>95%) and captcha rate (<0.5%).
3. Verify provenance metadata and no duplicates.
4. Inspect a sample of images manually for quality.
5. Calculate transfer (GB) and proxy cost estimate.
6. Adjust concurrency and proxies based on results.
If your dataset needs faces:
Detect faces (face_recognition or OpenCV DNN).
Keep single-face images for identity datasets.
Crop & resize (e.g., 512×512) consistently.
Legal: Always verify consent/rights for face images before commercial use.
Snippet:
py
import face_recognition
from PIL import Image

img = face_recognition.load_image_file("img.jpg")
faces = face_recognition.face_locations(img)
if len(faces) == 1:
    top, right, bottom, left = faces[0]
    Image.fromarray(img).crop((left, top, right, bottom)).resize((512, 512)).save("face_512.jpg")
Download success rate (target >95%).
CAPTCHA rate (target <0.5% of requests).
Error rate (429/403/timeouts): if it exceeds 5%, reduce load.
Throughput (images/min per IP) — use to estimate runtime/cost.
Log every event to an operations log (query, URL, status, proxy_ip, sha1).
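A minimal sketch of a post-run check, assuming the list of status dicts returned by download_list() in the Quick Start (the summarize helper itself is illustrative):
py
# Illustrative post-run check: summarize download results and compute key rates.
from collections import Counter

def summarize(results):
    # 'results' is the list returned by download_list(); it could also be
    # rebuilt from provenance.jsonl or your operations log.
    counts = Counter(r["status"] for r in results)
    total = sum(counts.values()) or 1
    success_rate = counts.get("ok", 0) / total * 100
    error_rate = (counts.get("bad_response", 0) + counts.get("failed", 0)) / total * 100
    print(f"success: {success_rate:.1f}%  errors: {error_rate:.1f}%  breakdown: {dict(counts)}")
    if success_rate < 95:
        print("Below target: reduce concurrency, increase timeouts, or rotate proxies.")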
Start small, get a working Quick Start run, then iterate: add dedupe, improve extraction, set up monitoring, and scale proxies carefully. Keep provenance and respect legal limits — that will save you headaches later.
Try one method above with your own query — and integrate GoProxy rotating residential IPs to scrape without interruptions. Sign up and get a free trial today!