
Step-by-Step Guide to Scrape Images from Google Images with GoProxy

Post Time: 2025-08-12 Update Time: 2025-08-12

This guide explains how to scrape Google Images safely at small and large scale. It provides a beginner's Quick Start (Playwright + aiohttp), two lightweight requests-based methods, a browser-automation method for large jobs, and cross-cutting sections covering proxies (GoProxy), pagination, JSON extraction, CAPTCHA handling, async downloads, deduplication, metadata, monitoring, and legal best practices.

Read This First

This guide is for anyone who wants to collect images from Google Images: beginners who want a working script fast, intermediate builders who want higher-quality images without a browser, and engineers who need a production-ready, scalable pipeline.

If you’re a total beginner: follow Before You Run Anything → Quick Start → Quick Check, then come back for the other sections.

Why Scrape Google Images?


Google Images is one of the largest public repositories of visual content—from stock-like photos to memes, product shots, and scientific diagrams. Scraping these images can help you:

  • Build visual datasets for ML (classification, fine-tuning, DreamBooth).
  • Collect product images or competitor creative for market research.
  • Gather reference imagery for design, editorial or SEO audits.
  • Track visual trends or image usage across the web.

Each purpose shapes your choices: scale, image quality, geo-targeting, and how much legal caution you need.

Challenges You’ll Face & How to Handle Them

Dynamic loading (infinite scroll) → use Playwright/Selenium for JS rendering.

Obfuscated/full-size URLs inside inline JSON → we show how to extract AF_initDataCallback blobs.

Rate limits and CAPTCHAs → use proxies + detection & rotation strategies.

Duplicate/low-quality images → use hashing & size checks.

Legal Checklist — What You Must Consider

This guide is educational. Scraping Google Images may violate Google’s Terms of Service.

Do not republish copyrighted images without permission or license.

For face/personal data, consult privacy laws (GDPR/CCPA) and get legal advice.

Prefer licensed image sources or APIs for commercial work.

Before You Run Anything — Set Up Proxies (So You Don’t Get Blocked)

Do this first. Without proxies, you’ll likely hit CAPTCHAs quickly, even for small scrapes. GoProxy offers residential IPs with rotation, geo-targeting (country/state/city), and session support.

Sign up for credentials. Minimal setup:

python

# proxy_config.py
proxy_host = "proxy.goproxy.com"
proxy_port = "8000"
proxy_user = "your_user"
proxy_pass = "your_pass"

PROXIES = {
    "http": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
    "https": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
}

# Shared request headers, imported by the later requests-based snippets
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com",
}

# For a Playwright context:
# context = browser.new_context(proxy={"server": f"http://{proxy_host}:{proxy_port}",
#                                      "username": proxy_user, "password": proxy_pass})

When scraping at scale (a minimal rotation sketch follows this list):

Rotate every 1–5 requests for safety

Choose geo-targeted IPs if you need region-specific results

Monitor proxy error rates & replace dead endpoints automatically
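
For per-request rotation, here is a minimal client-side sketch. The endpoint URLs and ports are placeholders rather than real GoProxy values (check your dashboard for the actual gateway format), and rotating residential plans may already rotate IPs for you behind a single endpoint:

python

# rotation_sketch.py -- illustrative only; the endpoints below are placeholders
import itertools
import requests

PROXY_POOL = [
    "http://your_user:your_pass@proxy.goproxy.com:8000",
    "http://your_user:your_pass@proxy.goproxy.com:8001",
]
_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Advance to the next endpoint and return a requests-style proxy dict."""
    endpoint = next(_cycle)
    return {"http": endpoint, "https": endpoint}

def fetch(url, headers=None):
    """Fetch one URL through the next proxy; retry once on a dead endpoint."""
    try:
        return requests.get(url, headers=headers, proxies=next_proxies(), timeout=20)
    except requests.RequestException:
        # Dead or blocked endpoint: try again on the next proxy in the pool.
        return requests.get(url, headers=headers, proxies=next_proxies(), timeout=20)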

Quick Start — Run This First (Playwright discovery + async downloader)

This is the beginner path: copy/paste and run. It discovers image URLs (handles JS) and downloads them with aiohttp via your proxy.

Setup

bash

python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install playwright aiohttp pillow requests
playwright install chromium

discover_playwright.py (find image URLs)

python

# discover_playwright.py
from playwright.sync_api import sync_playwright
import time, json
from urllib.parse import quote_plus
from proxy_config import proxy_host, proxy_port, proxy_user, proxy_pass

PLAY_PROXY = {"server": f"http://{proxy_host}:{proxy_port}",
              "username": proxy_user, "password": proxy_pass}
OUT = "images.jsonl"

def detect_captcha(page):
    if page.query_selector('iframe[src*="recaptcha"]'): return True
    if "unusual traffic" in page.content().lower(): return True
    return False

def run(query, max_images=300, scrolls=8):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        ctx = browser.new_context(proxy=PLAY_PROXY,
                                  user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
                                  locale="en-US", timezone_id="America/Los_Angeles")
        page = ctx.new_page()
        page.goto(f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch", timeout=30000)
        time.sleep(1)
        if detect_captcha(page):
            browser.close(); raise RuntimeError("CAPTCHA detected — rotate proxy or slow down.")
        for _ in range(scrolls):
            page.evaluate("window.scrollBy(0, document.body.scrollHeight);")
            time.sleep(1)
        imgs = page.query_selector_all("img")
        urls = []
        for img in imgs:
            src = img.get_attribute("src")
            if src and src.startswith("http"): urls.append(src)
        browser.close()

    uniq = list(dict.fromkeys(urls))[:max_images]
    with open(OUT, "w") as f:
        for u in uniq: f.write(json.dumps({"url": u}) + "\n")
    print(f"Saved {len(uniq)} URLs to {OUT}")

if __name__ == "__main__":
    run("red running shoes", max_images=200, scrolls=10)

download_aiohttp.py (async downloader + dedupe + provenance)

python

# download_aiohttp.py
import aiohttp, asyncio, hashlib, os, json, time
from proxy_config import proxy_user, proxy_pass, proxy_host, proxy_port

PROXY_URL = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
OUT_DIR = "images_out"
os.makedirs(OUT_DIR, exist_ok=True)
CONCURRENCY = 6
seen = set()

async def fetch_and_save(session, sem, url):
    # The semaphore is shared by all tasks so only CONCURRENCY downloads run at once.
    async with sem:
        tries, backoff = 3, 1.0
        for _ in range(tries):
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30), proxy=PROXY_URL) as r:
                    if r.status == 200 and 'image' in r.headers.get('Content-Type', ''):
                        data = await r.read()
                        sha1 = hashlib.sha1(data).hexdigest()
                        if sha1 in seen: return {"url": url, "status": "duplicate"}
                        seen.add(sha1)
                        fn = sha1[:16] + ".jpg"
                        with open(os.path.join(OUT_DIR, fn), "wb") as f: f.write(data)
                        prov = {"url": url, "filename": fn, "sha1": sha1,
                                "downloaded_at": time.strftime("%Y-%m-%dT%H:%M:%S")}
                        with open(os.path.join(OUT_DIR, "provenance.jsonl"), "a") as pf:
                            pf.write(json.dumps(prov) + "\n")
                        return {"url": url, "status": "ok"}
                    else:
                        return {"url": url, "status": "bad_response", "code": r.status}
            except Exception:
                await asyncio.sleep(backoff); backoff *= 2
        return {"url": url, "status": "failed"}

async def download_list(urls):
    sem = asyncio.Semaphore(CONCURRENCY)   # one shared semaphore, not one per task
    conn = aiohttp.TCPConnector(ssl=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        return await asyncio.gather(*(fetch_and_save(session, sem, u) for u in urls))

if __name__ == "__main__":
    with open("images.jsonl") as f: urls = [json.loads(l)["url"] for l in f]
    results = asyncio.run(download_list(urls[:500]))
    print("Done.", results[:5])

Beginner checklist after quick start

1. Does images_out/ contain files? Good.

2. If >5% failures, reduce CONCURRENCY, increase timeouts, or change proxy.

3. If CAPTCHA appears, rotate proxy and re-run from last good URL.

If That Worked — Quick Check: Do Your Images Look Right?

Open a few images and check resolution and variety. Inspect provenance.jsonl to confirm source URLs and timestamps. If the images are tiny thumbnails or contain many duplicates, proceed to Method 2 (JSON parsing) or Method 3 (browser automation) below.

Pick One Method Fast

Short decision guide:

Under 100 images / quick test: Method 1 (Requests + HTML) — easiest.

100–500 images, want higher-res: Method 2 (Requests + JSON parsing) — faster than a browser.

500+ images / infinite scroll / personalized results: Method 3 (Playwright/Selenium + async downloads).

Method 1. Super Simple: Requests + HTML (Under ~100 images)

A good first step, but it often returns only thumbnails.

python

import requests, re
from urllib.parse import quote_plus
from proxy_config import PROXIES

headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9", "Referer": "https://www.google.com"}
q = "cats"
r = requests.get(f"https://www.google.com/search?q={quote_plus(q)}&tbm=isch", headers=headers, proxies=PROXIES, timeout=20)
# Try to get full-size image URLs from the inline JSON first
urls = re.findall(r'"ou":"(https?://[^"]+)"', r.text)
if not urls:
    # Fall back to whatever <img> tags are present (usually thumbnails)
    urls = re.findall(r'<img[^>]+src="([^"]+)"', r.text)
print(len(urls), "URLs found")

Tips

Add &ijn=1 (see Pagination) to get deeper results.

Rotate the User-Agent and add small random delays to reduce blocking (see the sketch below).
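
A minimal sketch of both tips applied to the Method 1 request loop; the User-Agent strings are arbitrary examples, and PROXIES comes from proxy_config.py:

python

# ua_rotation_sketch.py -- rotating User-Agents plus jittered delays (illustrative only)
import random, time, requests
from urllib.parse import quote_plus
from proxy_config import PROXIES

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

for ijn in range(3):  # deeper result pages via ijn (see Pagination)
    headers = {"User-Agent": random.choice(USER_AGENTS),
               "Accept-Language": "en-US,en;q=0.9",
               "Referer": "https://www.google.com"}
    url = f"https://www.google.com/search?q={quote_plus('cats')}&tbm=isch&ijn={ijn}"
    r = requests.get(url, headers=headers, proxies=PROXIES, timeout=20)
    time.sleep(random.uniform(1.0, 3.0))  # random delay instead of a fixed sleep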

Method 2. Fast & Clean: Parse Google’s Inline JSON (100–500 images)

Google embeds useful JSON; extracting it yields full-size ou/murl URLs.

Robust extraction pattern:

python

import re, json, requests
from proxy_config import PROXIES, HEADERS

r = requests.get("https://www.google.com/search?q=sunset&tbm=isch", headers=HEADERS, proxies=PROXIES, timeout=20)
matches = re.findall(r"AF_initDataCallback\(([^<]+)\);", r.text)
images = []

def find_urls(x):
    """Recursively search parsed JSON for 'ou'/'murl' values that look like full-size URLs."""
    if isinstance(x, dict):
        for k, v in x.items():
            if k in ("ou", "murl") and isinstance(v, str) and v.startswith("http"):
                images.append(v)
            else:
                find_urls(v)
    elif isinstance(x, list):
        for item in x:
            find_urls(item)

for m in matches:
    # Each callback argument is a JS object literal; grab the outermost {...} and parse it
    jstart, jend = m.find('{'), m.rfind('}')
    if jstart == -1 or jend == -1:
        continue
    try:
        j = json.loads(m[jstart:jend + 1])
    except Exception:
        continue
    find_urls(j)

print(len(images), "images found")

Notes

The JSON shape changes; always inspect with DevTools.

Append &tbs=isz:l to the search URL to favor large images.

Method 3. Heavy-Duty: Browser Automation (500+ images)

Use Playwright or Selenium to click thumbnails, reveal full-size ou links, and handle infinite scroll. The Quick Start shows Playwright discovery; you can extend it to click each thumbnail to reveal the original src and collect it (a sketch follows the best practices below).

Best practices:

Use headful or stealth approaches if headless triggers fingerprinting.

Keep a cookie session if you need personalized results.

Capture screenshots on block events for debugging.
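
As a sketch of the thumbnail-click approach, the snippet below extends the Quick Start discovery script. The selectors and the encrypted-tbn filter are assumptions about how the results page commonly behaves; Google's markup changes often, so verify them in DevTools (and add your proxy context as in the Quick Start) before relying on this:

python

# click_thumbnails_sketch.py -- illustrative extension of the Quick Start discovery script
from playwright.sync_api import sync_playwright
import time

def collect_fullsize(query="sunset", max_clicks=20):
    urls = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(f"https://www.google.com/search?q={query}&tbm=isch", timeout=30000)
        thumbs = page.query_selector_all("img")[:max_clicks]
        for t in thumbs:
            try:
                t.click()
                time.sleep(1)  # let the preview panel load the original image
                # Heuristic: the preview pane shows a large <img> whose src is an external
                # http(s) URL rather than a base64 or encrypted-tbn thumbnail.
                for img in page.query_selector_all("img"):
                    src = img.get_attribute("src") or ""
                    if src.startswith("http") and "encrypted-tbn" not in src:
                        urls.append(src)
            except Exception:
                continue
        browser.close()
    return list(dict.fromkeys(urls))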

Common Roadblocks & How to Fix Them

Immediate CAPTCHA → you scraped without proxies or sent too many requests. Fix: add proxies, slow down, rotate the UA.

Only base64 thumbnails → use Playwright to click thumbnails or use JSON parsing to find ou.

No new images on scroll → use ijn paging and verify your scrolls and delays.

Make It Reliable — Anti-blocking & Fingerprint Tips

Small list you can apply to any method:

Rotate realistic User-Agent strings.

Set Accept-Language, Referer and match timezone/locale to proxy geolocation.

Use random small delays instead of fixed sleeps.

Keep cookies between requests for session affinity (sticky IP); a sticky-session sketch follows this list.

Detect CAPTCHAs early and rotate proxies automatically.
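
For the cookie/session-affinity item, a minimal requests sketch, assuming the PROXIES and HEADERS definitions from proxy_config.py. Whether the exit IP actually stays sticky depends on your proxy plan's session settings, so this only shows the client side:

python

# sticky_session_sketch.py -- keep cookies and one proxy endpoint together for a session
import requests
from proxy_config import PROXIES, HEADERS

session = requests.Session()
session.headers.update(HEADERS)   # realistic UA, Accept-Language, Referer
session.proxies.update(PROXIES)   # reuse the same endpoint for the whole session

# Cookies set by the first response are replayed automatically on later requests.
r1 = session.get("https://www.google.com/search?q=cats&tbm=isch", timeout=20)
r2 = session.get("https://www.google.com/search?q=cats&tbm=isch&ijn=1", timeout=20)
print(r1.status_code, r2.status_code, len(session.cookies), "cookies held")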

CAPTCHA detection snippet (requests):

py

if r.status_code in (429, 403) or "unusual traffic" in r.text.lower():
    # Blocked or rate-limited: rotate the proxy and retry the request
    raise RuntimeError("CAPTCHA or rate limit detected")

CAPTCHA detection snippet (Playwright):

py

if page.query_selector('iframe[src*="recaptcha"]') or "unusual traffic" in page.content().lower():
    page.screenshot(path="captcha.png")   # capture the block page for debugging
    # then rotate the proxy/session and retry

Get Faster — Async Downloads, Dedupe & Save Provenance

Use aiohttp for concurrent downloads (sample in Quick Start).

Deduplicate by hashing bytes (SHA-1/256).

Save provenance for each image (original_url, filename, sha1, downloaded_at, proxy_ip) into a provenance.jsonl.

Small dedupe snippet:

py

import hashlib

sha1 = hashlib.sha1(image_bytes).hexdigest()   # image_bytes: raw bytes of the downloaded file
if sha1 in seen_hashes:                        # seen_hashes: a set shared across downloads
    pass                                       # duplicate: skip saving this image
else:
    seen_hashes.add(sha1)

Pagination & Finding More Images (the ijn trick)

Google Image results have an ijn parameter (page index). Loop ijn=0..N to fetch more.

py

for ijn in range(0, 5):
    url = f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch&ijn={ijn}"
    r = requests.get(url, headers=HEADERS, proxies=PROXIES)
    # extract images as above

Stop when new pages return no new URLs or content repeats.

How Many Proxies Do I Need? Quick Sizing Formula

ini

estimated_IPs = ceil((target_images * avg_req_per_image) / (requests_per_IP_per_min * run_minutes))

Defaults to try:

  • avg_req_per_image = 1.2 (discover + download + small retries)
  • requests_per_IP_per_min = 3–6

Example: 2,000 images in 60 minutes at 4 req/IP/min → ~10 IPs (start with a pilot run and add IPs for retries; the helper below reproduces this calculation).
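
A tiny helper that reproduces the formula and the worked example above (pure arithmetic, no external dependencies):

python

# proxy_sizing.py -- reproduces the sizing formula above
import math

def estimate_ips(target_images, run_minutes, avg_req_per_image=1.2, requests_per_ip_per_min=4):
    """Estimate how many proxy IPs a job needs for the given volume and pacing."""
    total_requests = target_images * avg_req_per_image
    capacity_per_ip = requests_per_ip_per_min * run_minutes
    return math.ceil(total_requests / capacity_per_ip)

print(estimate_ips(2000, 60))   # -> 10, matching the worked example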

Test Checklist Before You Scale

1. Run Quick Start for 50–200 images.

2. Check the success rate (>95%) and CAPTCHA rate (<0.5%); see the tally sketch after this checklist.

3. Verify provenance metadata and no duplicates.

4. Inspect a sample of images manually for quality.

5. Calculate transfer (GB) and proxy cost estimate.

6. Adjust concurrency and proxies based on results.
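
For item 2, one quick way to compute the success rate is to tally the status values that download_list() in the Quick Start already returns; a minimal sketch (CAPTCHA events would come from your own logs, not from these results):

python

# success_rate_sketch.py -- tally the status values returned by download_list()
from collections import Counter

def summarize(results):
    """results: the list of dicts returned by download_list in the Quick Start."""
    counts = Counter(r["status"] for r in results)
    total = sum(counts.values()) or 1
    ok = counts.get("ok", 0) + counts.get("duplicate", 0)   # duplicates still downloaded fine
    print(counts)
    print(f"success rate: {ok / total:.1%}")

# Example: summarize(asyncio.run(download_list(urls)))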

Optional: Prepare Images for ML (Face Detection & Cropping)

If your dataset needs faces:

Detect faces (face_recognition or OpenCV DNN).

Keep single-face images for identity datasets.

Crop & resize (e.g., 512×512) consistently.

Legal: Always verify consent/rights for face images before commercial use.

Snippet:

py

import face_recognition
from PIL import Image

img = face_recognition.load_image_file("img.jpg")
faces = face_recognition.face_locations(img)
if len(faces) == 1:
    top, right, bottom, left = faces[0]
    Image.fromarray(img).crop((left, top, right, bottom)).resize((512, 512)).save("face_512.jpg")

Monitoring & Key Metrics You Should Track

Download success rate (target >95%).

CAPTCHA events per minute (target <0.5%).

Error rate (429/403/timeouts) — if >5% reduce load.

Throughput (images/min per IP) — use to estimate runtime/cost.

Log every event to an operations log (query, URL, status, proxy_ip, sha1); a minimal logging sketch follows.
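
A minimal logging sketch using the fields listed above; the file name and function name are arbitrary:

python

# ops_log_sketch.py -- append one JSON line per event; field names follow the list above
import json, time

def log_event(query, url, status, proxy_ip=None, sha1=None, path="ops_log.jsonl"):
    """Append a single operations-log record as a JSON line."""
    record = {"ts": time.strftime("%Y-%m-%dT%H:%M:%S"), "query": query, "url": url,
              "status": status, "proxy_ip": proxy_ip, "sha1": sha1}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example:
# log_event("red running shoes", "https://example.com/img.jpg", "ok", proxy_ip="203.0.113.7", sha1="ab12...")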

Final Thoughts

Start small, get a working Quick Start run, then iterate: add dedupe, improve extraction, set up monitoring, and scale proxies carefully. Keep provenance and respect legal limits — that will save you headaches later.

Try one method above with your own query — and integrate GoProxy rotating residential IPs to scrape without interruptions. Sign up and get a free trial today!
