Ruby's elegant syntax and powerful libraries make it a fantastic choice for building scrapers that are both efficient and maintainable. In this comprehensive guide, we'll walk you through everything from setup to production, testing along the way on a public practice site like quotes.toscrape.com for quick wins such as scraping quotes and authors.
Short Summary
You can quickly build a working Ruby scraper. Use HTTParty and Nokogiri for static pages, saving results to CSV. For JavaScript sites, switch to Capybara + Selenium. To handle IP blocks or scale, integrate rotating proxies.
Why Choose Ruby for Web Scraping?

Ruby stands out due to its readability and rich ecosystem of gems.
- Readable syntax: Great for parsing-heavy tasks and quick prototyping.
- Mature gems: Nokogiri for fast HTML/XML parsing, Faraday/HTTParty for HTTP requests.
- Rails integration: Ideal for seeding databases in web apps.
Compared to Python, Ruby offers concise data handling, making it perfect for prototypes or Rails-integrated scrapers. For extremely high-concurrency needs, consider concurrency-optimized languages, but Ruby handles most business cases well.
Ethical & Legal Note
Respect robots.txt and the site's Terms of Service.
Scraping public data is often legal, but check local regulations (e.g., GDPR in the EU—anonymize data immediately; CCPA in the US for data privacy).
Avoid CAPTCHA: Slow down, use headless browsers, or request official data access.
Detect blocks: Look for 429/403 status codes or pages containing "captcha" or "access denied". Back off, rotate IPs, or review manually.
Store credentials in environment variables; don't scrape sensitive data without consent.
Note: This guide does not instruct on evading paywalls, login mechanisms, or bypassing access controls.
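A quick way to honor robots.txt before crawling is to fetch and check it with the standard library. This is a minimal sketch: it only handles plain Disallow prefixes (no Allow rules, wildcards, or per-agent groups), so prefer a dedicated parser gem in production; the host and path in the usage note are examples.

```ruby
require "net/http"
require "uri"

# Minimal robots.txt check: returns true if `path` is not matched by any
# Disallow prefix. Simplified on purpose -- ignores Allow rules, wildcard
# patterns, and per-user-agent groups.
def path_allowed?(robots_body, path)
  disallowed = robots_body.each_line
                          .map(&:strip)
                          .select { |l| l.downcase.start_with?("disallow:") }
                          .map { |l| l.split(":", 2).last.strip }
                          .reject(&:empty?)
  disallowed.none? { |rule| path.start_with?(rule) }
end

# Fetch a site's robots.txt (assumes HTTPS and a reachable host)
def fetch_robots(host)
  Net::HTTP.get(URI("https://#{host}/robots.txt"))
rescue StandardError
  "" # treat fetch failure as "no rules"; you may prefer to abort instead
end
```

For example, `path_allowed?(fetch_robots("quotes.toscrape.com"), "/page/1/")` tells you whether that path is disallowed for the wildcard agent.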
Prerequisites: Quick Setup
1. Install Ruby: Use 3.0+ (tested on 3.2).
macOS/Linux: rbenv/rvm; Windows: RubyInstaller.
Verify: ruby -v. (For Windows, ensure chromedriver is in PATH—download from chromedriver.chromium.org.)
2. Install Bundler: gem install bundler.
3. GoProxy Account: Sign up for credentials (host/port/username/password). Start with a free trial to test latency.
4. Dev Tools: Editor (VS Code), IRB/Pry for testing, Git.
Project Skeleton
ruby-scraper/
├── Gemfile
├── scrape.rb
├── README.md
└── Dockerfile # Optional for deployment
Gemfile:
source "https://rubygems.org"
gem "httparty"
gem "nokogiri"
gem "faraday"
gem "selenium-webdriver"
gem "capybara"
gem "parallel"
gem "logger"
Run bundle install.
Next Steps: Test setup with irb and require 'nokogiri'.
Common Libraries Overview
| Library | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| HTTParty | Simple HTTP requests | Easy, quick for basics | Less flexible for middleware |
| Faraday | Advanced HTTP (proxies, retries) | Modular, proxy-friendly | Slight learning curve |
| Nokogiri | HTML parsing | Fast, CSS/XPath support | Static only |
| Capybara + Selenium | Dynamic/JS sites | Browser simulation | Resource-heavy |
| Parallel | Concurrency | Speeds up large tasks | Risk of detection if overused |
Quick Decision
Page HTML contains target data on initial load → Static: HTTParty / Faraday + Nokogiri.
Page HTML is empty or data appears only after XHR requests → Examine Network/XHR; if API endpoints exist, call them; otherwise use Capybara + Selenium or Ferrum.
You see many 403/429 errors or geo-blocked content → Use rotating proxies for scale (sticky for sessions).
Need high throughput + low latency with stateful flows → Combine sticky proxies per worker + limited concurrency and robust retries.
Reusable Helpers
HEADERS = {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
"Accept" => "text/html",
"Referer" => "https://www.google.com"
}.freeze
Tip: In browser dev tools (Right-click > Inspect > Network tab), copy requests to refine headers.
Static Scraping: HTTParty/Faraday + Nokogiri
When to use: Static HTML (server-side rendered) or when the target exposes usable API endpoints.
1. Fetch page with reusable headers & timeouts
HTTParty example:
require 'httparty'
resp = HTTParty.get(url, headers: HEADERS, timeout: 10)
Faraday (preferred for production):
require "faraday"
conn = Faraday.new(url: base_url) do |f|
f.adapter Faraday.default_adapter
end
resp = conn.get(path) { |req| req.headers.merge!(HEADERS) }
2. Parse with Nokogiri
require 'nokogiri'
doc = Nokogiri::HTML(resp.body)
items = doc.css(".product").map do |p|
  {
    title: p.at_css(".title")&.text&.strip&.scrub, # scrub replaces invalid UTF-8 bytes
    price: p.at_css(".price")&.text&.gsub(/[^\d\.]/, "")
  }
end
Use safe navigation (&.) and .strip to avoid crashes and normalize text.
3. Save raw HTML for debugging
File.write("sample.html", resp.body)
Open sample.html in a browser to confirm selectors.
Next Steps: Test on quotes.toscrape.com—adapt for scraping quotes and authors.
Pagination & Crawler Logic
Building on static fetches, start with a simple loop, then move to a queue with a visited set for larger crawls to avoid duplicate visits and infinite loops.
Simple Loop
base_url = 'https://quotes.toscrape.com/page/'
1.upto(5) do |page|
resp = HTTParty.get("#{base_url}#{page}", headers: HEADERS, timeout: 10)
# Parse and extract
sleep rand(1.0..3.0)
end
Advanced Queue
require "set"
require "uri"
to_visit = [start_url]
visited = Set.new
results = []
MAX_PAGES = 500
while (url = to_visit.shift) && visited.size < MAX_PAGES && results.size < 50_000
  next if visited.include?(url)
  visited << url
  resp = HTTParty.get(url, headers: HEADERS, timeout: 10)
  next unless resp.code == 200
  doc = Nokogiri::HTML(resp.body)
  # Extract into results...
  next_link = doc.at_css("a.next")
  to_visit << URI.join(url, next_link["href"]).to_s if next_link # resolve relative hrefs
  sleep rand(1.0..3.0) # Polite delay with jitter
end
Tips
Set a max page count or a time budget (e.g., 2 hours).
Per-domain politeness: 1–3s delay; reduce concurrency if errors increase.
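The per-domain delay can be enforced with a small helper rather than scattered sleep calls. A sketch (`DomainThrottle` is a hypothetical name, not a gem):

```ruby
require "uri"

# Tracks the last request time per host and sleeps just enough to keep
# at least `min_interval` seconds between hits to the same host.
class DomainThrottle
  def initialize(min_interval: 1.5)
    @min_interval = min_interval
    @last_hit = {}
  end

  def wait(url)
    host = URI(url).host
    elapsed = Time.now - (@last_hit[host] || Time.at(0))
    sleep(@min_interval - elapsed) if elapsed < @min_interval
    @last_hit[host] = Time.now
  end
end
```

Create one `DomainThrottle` and call `throttle.wait(url)` before each fetch: same-host requests stay spaced out while different hosts proceed immediately.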
Dynamic Scraping: Capybara + Selenium (Headless) & Ferrum
When to use: Content rendered by JavaScript, with no usable XHR API endpoint. As with static fetches, wrap calls in backoff logic for reliability.
Capybara + Selenium
require "capybara"
require "capybara/dsl"
require "selenium-webdriver"
Capybara.register_driver :headless_chrome do |app|
opts = Selenium::WebDriver::Chrome::Options.new
opts.add_argument("--headless=new")
opts.add_argument("--disable-gpu")
opts.add_argument("--no-sandbox")
Capybara::Selenium::Driver.new(app, browser: :chrome, options: opts)
end
session = Capybara::Session.new(:headless_chrome)
session.visit("https://dynamic.example.com")
session.has_css?(".loaded-item", wait: 10)
items = session.all(".loaded-item").map(&:text)
session.quit
Notes: Headless browsers are resource-heavy—reuse sessions and limit concurrent instances.
Ferrum (Lighter Alternative)
Ferrum controls Chrome without Selenium—add gem 'ferrum'. Generally lighter for simple tasks.
require 'ferrum'
browser = Ferrum::Browser.new(headless: true)
browser.go_to("https://dynamic.example.com")
# Extract via browser.body
browser.quit
For testing selectors, use RSpec (gem 'rspec'):
describe 'selectors' do
  it 'extracts titles' do
    doc = Nokogiri::HTML(File.read('sample.html'))
    expect(doc.css('.title')).not_to be_empty
  end
end
Professional Advice for Proxy Usage: GoProxy
Proxies are key for avoiding IP bans, geo-targeting content, and distributing requests. Building on the fetch examples above, add them when blocks occur.
Proxy types & when to use
Rotating: New IP per request; excellent for stateless, high-volume scraping.
Sticky (session): Same IP across multiple requests; required for login flows or multi-step transactions.
Datacenter vs Residential: Residential is more natural-looking but costlier.
Environment & secrets
Use environment variables:
export GOPROXY_HOST="proxy.goproxy.example"
export GOPROXY_PORT="8000"
export GOPROXY_USER="USERNAME"
export GOPROXY_PASS="PASSWORD"
Faraday with proxy (recommended)
proxy_uri = "http://#{ENV['GOPROXY_USER']}:#{ENV['GOPROXY_PASS']}@#{ENV['GOPROXY_HOST']}:#{ENV['GOPROXY_PORT']}"
conn = Faraday.new(url: "https://target.example", proxy: proxy_uri) do |f|
f.adapter Faraday.default_adapter
end
resp = conn.get("/path") { |req| req.headers.merge!(HEADERS) }
Session affinity: Keep one Faraday connection object per worker so multi-step flows stay on the same sticky proxy session. Note that plain Faraday does not persist cookies by itself; add middleware such as faraday-cookie_jar if you need them.
HTTParty proxy example
HTTParty.get(url,
http_proxyaddr: ENV["GOPROXY_HOST"],
http_proxyport: ENV["GOPROXY_PORT"].to_i,
http_proxyuser: ENV["GOPROXY_USER"],
http_proxypass: ENV["GOPROXY_PASS"],
headers: HEADERS
)
Chrome + proxy in Selenium (auth)
opts.add_argument("--proxy-server=http://#{ENV['GOPROXY_HOST']}:#{ENV['GOPROXY_PORT']}")
Important: Chrome + username/password proxy is tricky. Options:
- Use provider session tokens or proxy hosts whitelisted by IP.
- Use a local proxy tunnel tool to inject auth.
- Or use an authenticated proxy gateway.
Quick proxy test (curl)
curl -x http://USER:[email protected]:8000 -I https://example.com.
Next Steps: Test latency with GoProxy trial on a blocked site.
Best Practice to Avoid Blocks
Exponential Backoff
def with_backoff(max_attempts = 5)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue => e
    raise if attempt >= max_attempts
    sleep_time = (2 ** attempt) + rand(0.0..0.5) # Exponential backoff + jitter
    sleep(sleep_time)
    retry
  end
end
Thread pool / Queue
require "thread"
queue = Queue.new
urls.each { |u| queue << u }
workers = Array.new(3) do
Thread.new do
while (url = queue.pop(true) rescue nil)
with_backoff { process_url(url) } # process_url does fetch/parse/save
sleep rand(0.5..2.0)
end
end
end
workers.each(&:join)
Tip: Start 1–3 threads per proxy; monitor error rate and scale cautiously.
Parallel gem (alternative)
Use Parallel.map(urls, in_threads: 5) { |u| ... } for quick parallelism, but be mindful of detection risk.
What to log
Timestamp, URL, HTTP status, latency (ms), proxy used, worker id, error message, captcha_detected (boolean).
Example:
require "logger"
LOGGER = Logger.new("scraper.log")
LOGGER.info({ts: Time.now.iso8601, url: url, status: resp.code, ms: ms, proxy: proxy}.to_json)
Checkpointing to resume
require "json"
File.write("visited.json", JSON.pretty_generate(visited.to_a))
# On startup: visited = Set.new(JSON.parse(File.read("visited.json")))
Metrics & alerts
- Requests/minute (RPM)
- Success rate (2xx%)
- Error rate (4xx/5xx)
- CAPTCHA hits (count and rate)
Alert example: if CAPTCHA rate > 2% over 10 minutes → pause jobs and investigate.
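The alert rule above can be expressed as a small guard over recent request outcomes (`should_pause?` and the outcome-hash shape are assumptions for illustration):

```ruby
# Decide whether to pause based on the CAPTCHA rate in a recent window.
# min_sample guards against overreacting to a handful of requests.
def should_pause?(outcomes, captcha_threshold: 0.02, min_sample: 50)
  return false if outcomes.size < min_sample
  rate = outcomes.count { |o| o[:captcha] }.fdiv(outcomes.size)
  rate > captcha_threshold
end
```

Feed it the last 10 minutes of logged outcomes; a true result means pause the workers and investigate.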
Troubleshooting: Common Fixes
| Problem | Likely Cause | Quick Fix |
| --- | --- | --- |
| 403 / 429 | Rate limiting or IP block | Add jitter, reduce concurrency, rotate proxies, change UA/Referer |
| Missing data after fetch | JS-rendered content | Check Network/XHR in browser; use Selenium/Ferrum or call API endpoints |
| Pagination loops | No visited set or unstable next link | Use visited set; set max pages |
| Proxy auth fails | Bad creds / whitelist | Test via curl; check provider dashboard; ensure IP whitelist |
| Headless too slow / memory-heavy | Too many browser instances | Reduce concurrent browsers; reuse sessions; try Ferrum |
Debug tips
Save HTML to sample.html and open locally.
Use Pry to test selectors: doc = Nokogiri::HTML(File.read('sample.html')) and doc.css(...).
Record failing requests via logs and replay with curl to isolate issues.
Common pitfalls for beginners
Windows: path issues with drivers; make sure the driver directory is on ENV['PATH'].
UTF-8 errors: normalize text with String#scrub (or encode with replacement options).
Selectors break silently: cover them with automated tests (e.g., RSpec).
2026 Trends & Advice
Anti-bot techniques are evolving (fingerprinting, behavioral detection). Use mixed strategies: Lightweight static fetches when possible and targeted headless sessions for critical pages.
Proxy providers that support session tokens and geo selectors, such as GoProxy, simplify production scraping.
Observability (error/captcha rates) will become the dominant signal for tuning scrapers automatically. Expect more AI-driven selectors by late 2026—prototype with gems like 'openai-ruby' for dynamic adaptation.
For production: Deploy with Sidekiq on Heroku for scheduled jobs—add gem 'sidekiq'.
Final Thoughts
Ruby is an excellent choice for building readable, maintainable scrapers. Start with the basics, add proxies when you hit blocks, and scale carefully. By following these steps, from simple requests to proxy-integrated crawlers, you'll end up with scrapers that hold up in production. Practice on public sites like quotes.toscrape.com first, and always scrape responsibly.