Jan 30, 2026
Step-by-step Ruby web scraping guide: HTTParty/Nokogiri, headless tools, and proxies for reliable data extraction and scaling.
Ruby's elegant syntax and powerful libraries make it a fantastic choice for building scrapers that are both efficient and maintainable. In this comprehensive guide, we'll walk you through everything from setup to production, testing along the way on a public site like quotes.toscrape.com for quick wins, such as scraping quotes and their authors.
You can quickly build a working Ruby scraper. Use HTTParty and Nokogiri for static pages, saving results to CSV. For JavaScript sites, switch to Capybara + Selenium. To handle IP blocks or scale, integrate rotating proxies.

Ruby stands out due to its readability and rich ecosystem of gems.
Compared to Python, Ruby offers concise data handling, making it perfect for prototypes or Rails-integrated scrapers. For extremely high-concurrency needs, consider concurrency-optimized languages, but Ruby handles most business cases well.
Respect robots.txt and the site's Terms of Service.
Scraping public data is often legal, but check local regulations (e.g., GDPR in the EU—anonymize data immediately; CCPA in the US for data privacy).
Avoid CAPTCHA: Slow down, use headless browsers, or request official data access.
Detect blocks: Look for 429/403 status codes or pages containing "captcha" or "access denied" text; back off, rotate IPs, or review manually (a detection sketch follows this list).
Store credentials in environment variables; don't scrape sensitive data without consent.
Note: This guide does not instruct on evading paywalls, login mechanisms, or bypassing access controls.
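For reference, here is a minimal block-detection helper for HTTParty responses; the string markers are illustrative, so adapt them to your target:
def blocked?(resp)
  # Rate-limit and forbidden statuses usually signal a block
  return true if [403, 429].include?(resp.code)
  body = resp.body.to_s.downcase
  # Common anti-bot markers; extend per target
  body.include?("captcha") || body.include?("access denied")
end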
1. Install Ruby: Use 3.0+ (tested on 3.2).
macOS/Linux: rbenv/rvm; Windows: RubyInstaller.
Verify: ruby -v. (For Windows, ensure chromedriver is in PATH—download from chromedriver.chromium.org.)
2. Install Bundler: gem install bundler.
3. GoProxy Account: Sign up for credentials (host/port/username/password). Start with a free trial to test latency.
4. Dev Tools: Editor (VS Code), IRB/Pry for testing, Git.
ruby-scraper/
├── Gemfile
├── scrape.rb
├── README.md
└── Dockerfile # Optional for deployment
Gemfile:
source "https://rubygems.org"
gem "httparty"
gem "nokogiri"
gem "faraday"
gem "selenium-webdriver"
gem "capybara"
gem "parallel"
gem "logger"
Run bundle install.
Next Steps: Test setup with irb and require 'nokogiri'.
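For example, a quick sanity check in irb (the inline HTML is just a stand-in):
require "nokogiri"
doc = Nokogiri::HTML("<html><body><h1>Hello</h1></body></html>")
puts doc.at_css("h1").text # => Hello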
| Library | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| HTTParty | Simple HTTP requests | Easy, quick for basics | Less flexible for middleware |
| Faraday | Advanced HTTP (proxies, retries) | Modular, proxy-friendly | Slight learning curve |
| Nokogiri | HTML parsing | Fast, CSS/XPath support | Static only |
| Capybara + Selenium | Dynamic/JS sites | Browser simulation | Resource-heavy |
| Parallel | Concurrency | Speeds up large tasks | Risk of detection if overused |
Page HTML contains target data on initial load → Static: HTTParty / Faraday + Nokogiri.
Page HTML is empty or data appears only after XHR requests → Examine Network/XHR; if API endpoints exist, call them (see the sketch after this list); otherwise use Capybara + Selenium or Ferrum.
You see many 403/429 errors or geo-blocked content → Use rotating proxies for scale (sticky for sessions).
Need high throughput + low latency with stateful flows → Combine sticky proxies per worker + limited concurrency and robust retries.
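When the Network tab reveals a usable JSON endpoint, calling it directly is far cheaper than rendering the page. A minimal sketch, where the endpoint URL is hypothetical and HEADERS is the constant defined in the next section:
require "httparty"
require "json"
url = "https://target.example/api/items?page=1" # hypothetical endpoint from the Network tab
resp = HTTParty.get(url, headers: HEADERS.merge("Accept" => "application/json"), timeout: 10)
items = JSON.parse(resp.body) if resp.code == 200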
HEADERS = {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
"Accept" => "text/html",
"Referer" => "https://www.google.com"
}.freeze
Tip: In browser dev tools (Right-click > Inspect > Network tab), copy requests to refine headers.
When to use: Static HTML (server-side rendered) or when the target exposes usable API endpoints.
HTTParty example:
require 'httparty'
resp = HTTParty.get(url, headers: HEADERS, timeout: 10)
Faraday (preferred for production):
require "faraday"
conn = Faraday.new(url: base_url) do |f|
f.adapter Faraday.default_adapter
end
resp = conn.get(path) { |req| req.headers.merge!(HEADERS) }
require 'nokogiri'
doc = Nokogiri::HTML(resp.body)
items = doc.css(".product").map do |p|
{
title: p.at_css(".title")&.text&.strip&.scrub, # scrub replaces invalid UTF-8 bytes
price: p.at_css(".price")&.text&.gsub(/[^\d\.]/, "")
}
end
Use safe navigation (&.) and .strip to avoid crashes and normalize text.
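To save results to CSV as mentioned in the quickstart, Ruby's standard csv library is enough; a minimal sketch writing the items array from above:
require "csv"
CSV.open("products.csv", "w") do |csv|
  csv << %w[title price] # header row
  items.each { |i| csv << [i[:title], i[:price]] }
end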
File.write("sample.html", resp.body)
Open sample.html in a browser to confirm selectors.
Next Steps: Test on quotes.toscrape.com and adapt for scraping quotes and authors, as sketched below.
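As a concrete adaptation, quotes.toscrape.com marks up each quote with .quote, .text, and .author classes:
resp = HTTParty.get("https://quotes.toscrape.com/", headers: HEADERS, timeout: 10)
doc = Nokogiri::HTML(resp.body)
quotes = doc.css(".quote").map do |q|
  {
    text: q.at_css(".text")&.text&.strip,
    author: q.at_css(".author")&.text&.strip
  }
end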
Building on static fetches, use a simple loop first, then queues for advanced crawling to avoid duplicates/infinites.
base_url = 'https://quotes.toscrape.com/page/'
1.upto(5) do |page|
resp = HTTParty.get("#{base_url}#{page}", headers: HEADERS, timeout: 10)
# Parse and extract
sleep rand(1.0..3.0)
end
require "set"
to_visit = [start_url]
visited = Set.new
results = []
MAX_PAGES = 500
while (url = to_visit.shift) && results.size < 50_000
next if visited.include?(url)
visited << url
resp = HTTParty.get(url, headers: HEADERS, timeout: 10)
next unless resp.code == 200
doc = Nokogiri::HTML(resp.body)
# Extract...
next_link = doc.at_css("a.next")
to_visit << next_link["href"] if next_link && to_visit.size < MAX_PAGES
sleep rand(1.0..3.0) # Polite delay with jitter
end
Max pages or a time budget (e.g., 2 hours; see the sketch after this list).
Per-domain politeness: 1–3s delay; reduce concurrency if errors increase.
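A time budget bolts onto the crawl loop above with one extra condition; a sketch assuming a two-hour cap:
deadline = Time.now + 2 * 60 * 60 # 2-hour budget
while (url = to_visit.shift) && Time.now < deadline
  # fetch, parse, enqueue as in the loop above
  sleep rand(1.0..3.0) # per-domain politeness with jitter
end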
When to use: Content rendered by JavaScript and no helpful XHR API endpoint exists. As in static, wrap in backoff for reliability.
require "capybara"
require "capybara/dsl"
require "selenium-webdriver"
Capybara.register_driver :headless_chrome do |app|
opts = Selenium::WebDriver::Chrome::Options.new
opts.add_argument("--headless=new")
opts.add_argument("--disable-gpu")
opts.add_argument("--no-sandbox")
Capybara::Selenium::Driver.new(app, browser: :chrome, options: opts)
end
session = Capybara::Session.new(:headless_chrome)
session.visit("https://dynamic.example.com")
session.has_css?(".loaded-item", wait: 10)
items = session.all(".loaded-item").map(&:text)
session.quit
Notes: Headless browsers are resource-heavy—reuse sessions and limit concurrent instances.
Ferrum controls Chrome without Selenium—add gem 'ferrum'. Generally lighter for simple tasks.
require 'ferrum'
browser = Ferrum::Browser.new(headless: true)
browser.go_to("https://dynamic.example.com")
doc = Nokogiri::HTML(browser.body) # parse the rendered HTML with Nokogiri
items = doc.css(".loaded-item").map(&:text)
browser.quit
For testing selectors: use RSpec (gem 'rspec'); a minimal spec is sketched below.
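A minimal spec, assuming sample.html was saved as shown earlier (run with bundle exec rspec):
require "nokogiri"
RSpec.describe "selectors" do
  it "extracts titles" do
    doc = Nokogiri::HTML(File.read("sample.html"))
    expect(doc.css(".title")).not_to be_empty
  end
end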
Proxies are key for avoiding IP bans, accessing geo-targeted content, and distributing requests. Building on the fetch patterns above, add them when blocks occur.
Rotating: New IP per request; excellent for stateless, high-volume scraping.
Sticky (session): Same IP across multiple requests; required for login flows or multi-step transactions.
Datacenter vs Residential: Residential is more natural-looking but costlier.
Use environment variables:
export GOPROXY_HOST="proxy.goproxy.example"
export GOPROXY_PORT="8000"
export GOPROXY_USER="USERNAME"
export GOPROXY_PASS="PASSWORD"
proxy_uri = "http://#{ENV['GOPROXY_USER']}:#{ENV['GOPROXY_PASS']}@#{ENV['GOPROXY_HOST']}:#{ENV['GOPROXY_PORT']}"
conn = Faraday.new(url: "https://target.example", proxy: proxy_uri) do |f|
f.adapter Faraday.default_adapter
end
resp = conn.get("/path") { |req| req.headers.merge!(HEADERS) }
Session affinity: Keep one Faraday connection object per worker to preserve cookies.
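A per-worker connection sketch; cookie persistence here assumes the faraday-cookie_jar gem (add gem "faraday-cookie_jar" to the Gemfile):
require "faraday"
require "faraday-cookie_jar"
def build_conn(proxy_uri)
  Faraday.new(url: "https://target.example", proxy: proxy_uri) do |f|
    f.use :cookie_jar # persist cookies across requests on this connection
    f.adapter Faraday.default_adapter
  end
end
conn = build_conn(proxy_uri) # one connection per worker keeps session and sticky IP stable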
HTTParty.get(url,
http_proxyaddr: ENV["GOPROXY_HOST"],
http_proxyport: ENV["GOPROXY_PORT"].to_i,
http_proxyuser: ENV["GOPROXY_USER"],
http_proxypass: ENV["GOPROXY_PASS"],
headers: HEADERS
)
opts.add_argument("--proxy-server=http://#{ENV['GOPROXY_HOST']}:#{ENV['GOPROXY_PORT']}")
Important: Chrome's --proxy-server flag ignores username/password credentials, so authenticated proxies are tricky with headless Chrome. Options: whitelist your machine's IP in the provider dashboard, or route Chrome through a local forwarding proxy that injects the credentials. First verify the credentials themselves with curl:
curl -x "http://$GOPROXY_USER:$GOPROXY_PASS@$GOPROXY_HOST:$GOPROXY_PORT" -I https://example.com
Next Steps: Test latency with GoProxy trial on a blocked site.
def with_backoff(max_attempts = 5)
attempt = 0
begin
attempt += 1
yield
rescue => e
raise if attempt >= max_attempts
sleep_time = (2 ** attempt) + rand(0.0..0.5) # Exponential + jitter
sleep(sleep_time)
retry
end
end
require "thread"
queue = Queue.new
urls.each { |u| queue << u }
workers = Array.new(3) do
Thread.new do
while (url = queue.pop(true) rescue nil)
with_backoff { process_url(url) } # process_url does fetch/parse/save
sleep rand(0.5..2.0)
end
end
end
workers.each(&:join)
Tip: Start 1–3 threads per proxy; monitor error rate and scale cautiously.
Use Parallel.map(urls, in_threads: 5) { |u| ... } for quick parallelism, but be mindful of detection risk; one way to pin workers to proxies is sketched below.
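A round-robin proxy assignment with Parallel; the proxy URIs and fetch_with_proxy helper are placeholders:
require "parallel"
PROXIES = [proxy_uri_a, proxy_uri_b, proxy_uri_c] # placeholder proxy URIs
results = Parallel.map(urls.each_with_index, in_threads: PROXIES.size) do |url, i|
  proxy = PROXIES[i % PROXIES.size] # round-robin proxy assignment
  with_backoff { fetch_with_proxy(url, proxy) } # hypothetical fetch helper
end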
Log these fields for every request: timestamp, URL, HTTP status, latency (ms), proxy used, worker id, error message, captcha_detected (boolean).
Example:
require "logger"
LOGGER = Logger.new("scraper.log")
LOGGER.info({ts: Time.now.iso8601, url: url, status: resp.code, ms: ms, proxy: proxy}.to_json)
require "json"
File.write("visited.json", JSON.pretty_generate(visited.to_a))
# On startup: visited = Set.new(JSON.parse(File.read("visited.json")))
Alert example: if CAPTCHA rate > 2% over 10 minutes → pause jobs and investigate.
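A minimal in-process version of that alert; events and pause_jobs! are illustrative stand-ins for your own metrics and job control:
WINDOW = 10 * 60 # seconds
events = [] # append [Time.now, captcha_detected] after each request
def captcha_rate(events, window)
  cutoff = Time.now - window
  recent = events.select { |ts, _| ts >= cutoff }
  return 0.0 if recent.empty?
  recent.count { |_, captcha| captcha }.fdiv(recent.size)
end
pause_jobs! if captcha_rate(events, WINDOW) > 0.02 # hypothetical pause hook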
| Problem | Likely Cause | Quick Fix |
| --- | --- | --- |
| 403 / 429 | Rate limiting or IP block | Add jitter, reduce concurrency, rotate proxies, change UA/Referer |
| Missing data after fetch | JS-rendered content | Check Network/XHR in browser; use Selenium/Ferrum or call API endpoints |
| Pagination loops | No visited set, or an unstable next link | Use a visited set; set max pages |
| Proxy auth fails | Bad creds / missing IP whitelist | Test via curl; check provider dashboard; ensure IP whitelist |
| Headless too slow / high memory | Too many browser instances | Reduce concurrent browsers; reuse sessions; try Ferrum |
Save HTML to sample.html and open locally.
Use Pry to test selectors: doc = Nokogiri::HTML(File.read('sample.html')) and doc.css(...).
Record failing requests via logs and replay with curl to isolate issues.
Windows: driver path issues; make sure chromedriver's folder is on PATH (or append it via ENV['PATH'] in your script).
UTF-8 errors: Use scrub as shown.
Selectors break: Use RSpec for tests.
Anti-bot techniques are evolving (fingerprinting, behavioral detection). Use mixed strategies: lightweight static fetches when possible and targeted headless sessions for critical pages.
Proxy providers that support session tokens and geo targeting, such as GoProxy, simplify production scraping.
Observability (error/captcha rates) will become the dominant signal for tuning scrapers automatically. Expect more AI-driven selectors by late 2026; prototype with gems like 'openai-ruby' for dynamic adaptation.
For production: deploy with Sidekiq on Heroku for scheduled jobs (add gem 'sidekiq'); a minimal job sketch follows.
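A minimal Sidekiq job sketch (Sidekiq 7 style; pair it with a scheduler such as sidekiq-cron for recurring runs):
require "sidekiq"
class ScrapeJob
  include Sidekiq::Job
  def perform(url)
    with_backoff { process_url(url) } # helpers from the earlier sections
  end
end
ScrapeJob.perform_async("https://quotes.toscrape.com/") # enqueue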
Ruby is an excellent choice for building readable, maintainable scrapers. Start with the basics, add proxies for reliability, and scale responsibly. By following these steps, from basic requests to proxy-integrated crawlers, you'll build scrapers that hold up in production. Practice on public sites like quotes.toscrape.com first, and always scrape responsibly.