Ruby's elegant syntax and powerful libraries make it a fantastic choice for building scrapers that are both efficient and maintainable. In this comprehensive guide, we'll walk you through everything from setup to production, testing along the way on a public practice site like quotes.toscrape.com for quick wins such as scraping quotes and authors.
Short Summary
You can quickly build a working Ruby scraper. Use HTTParty and Nokogiri for static pages, saving results to CSV. For JavaScript sites, switch to Capybara + Selenium. To handle IP blocks or scale, integrate rotating proxies.
Why Choose Ruby for Web Scraping?

Ruby stands out due to its readability and rich ecosystem of gems.
- Readable syntax: Great for parsing-heavy tasks and quick prototyping.
- Mature gems: Nokogiri for fast HTML/XML parsing, Faraday/HTTParty for HTTP requests.
- Rails integration: Ideal for seeding databases in web apps.
Compared to Python, Ruby offers concise data handling, making it perfect for prototypes or Rails-integrated scrapers. For extremely high-concurrency needs, consider concurrency-optimized languages, but Ruby handles most business cases well.
Ethical & Legal Note
Respect robots.txt and the site's Terms of Service.
Scraping public data is often legal, but check local regulations (e.g., GDPR in the EU—anonymize data immediately; CCPA in the US for data privacy).
Avoid CAPTCHA: Slow down, use headless browsers, or request official data access.
Detect blocks: Look for 429/403 status codes or pages containing "captcha" or "access denied". Back off, rotate IPs, or review manually.
Store credentials in environment variables; don't scrape sensitive data without consent.
Note: This guide does not instruct on evading paywalls, login mechanisms, or bypassing access controls.
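A quick way to honor robots.txt before crawling is to fetch and check it with the standard library. This is a minimal sketch: it only handles plain Disallow prefixes (no Allow rules, wildcards, or per-agent groups), so prefer a dedicated parser gem in production; the host and path in the usage note are examples.

```ruby
require "net/http"
require "uri"

# Minimal robots.txt check: returns true if `path` is not matched by any
# Disallow prefix. Simplified on purpose -- ignores Allow rules, wildcard
# patterns, and per-user-agent groups.
def path_allowed?(robots_body, path)
  disallowed = robots_body.each_line
                          .map(&:strip)
                          .select { |l| l.downcase.start_with?("disallow:") }
                          .map { |l| l.split(":", 2).last.strip }
                          .reject(&:empty?)
  disallowed.none? { |rule| path.start_with?(rule) }
end

# Fetch a site's robots.txt (assumes HTTPS and a reachable host)
def fetch_robots(host)
  Net::HTTP.get(URI("https://#{host}/robots.txt"))
rescue StandardError
  "" # treat fetch failure as "no rules"; you may prefer to abort instead
end
```

For example, `path_allowed?(fetch_robots("quotes.toscrape.com"), "/page/1/")` tells you whether that path is disallowed for the wildcard agent.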
Prerequisites: Quick Setup
1. Install Ruby: Use 3.0+ (tested on 3.2).
macOS/Linux: rbenv/rvm; Windows: RubyInstaller.
Verify: ruby -v. (For Windows, ensure chromedriver is in PATH—download from chromedriver.chromium.org.)
2. Install Bundler: gem install bundler.
3. GoProxy Account: Sign up for credentials (host/port/username/password). Start with a free trial to test latency.
4. Dev Tools: Editor (VS Code), IRB/Pry for testing, Git.
Project Skeleton
ruby-scraper/
├── Gemfile
├── scrape.rb
├── README.md
└── Dockerfile # Optional for deployment
Gemfile:
source "https://rubygems.org"
gem "httparty"
gem "nokogiri"
gem "faraday"
gem "selenium-webdriver"
gem "capybara"
gem "parallel"
gem "logger"
Run bundle install.
Next Steps: Test setup with irb and require 'nokogiri'.
Common Libraries Overview
| Library | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| HTTParty | Simple HTTP requests | Easy, quick for basics | Less flexible for middleware |
| Faraday | Advanced HTTP (proxies, retries) | Modular, proxy-friendly | Slight learning curve |
| Nokogiri | HTML parsing | Fast, CSS/XPath support | Static only |
| Capybara + Selenium | Dynamic/JS sites | Browser simulation | Resource-heavy |
| Parallel | Concurrency | Speeds up large tasks | Risk of detection if overused |
Quick Decision
Page HTML contains target data on initial load → Static: HTTParty / Faraday + Nokogiri.
Page HTML is empty or data appears only after XHR requests → Examine Network/XHR; if API endpoints exist, call them; otherwise use Capybara + Selenium or Ferrum.
You see many 403/429 errors or geo-blocked content → Use rotating proxies for scale (sticky for sessions).
Need high throughput + low latency with stateful flows → Combine sticky proxies per worker + limited concurrency and robust retries.
Reusable Helpers
HEADERS = {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
"Accept" => "text/html",
"Referer" => "https://www.google.com"
}.freeze
Tip: In browser dev tools (Right-click > Inspect > Network tab), copy requests to refine headers.
Static Scraping: HTTParty/Faraday + Nokogiri
When to use: Static HTML (server-side rendered) or when the target exposes usable API endpoints.
1. Fetch page with reusable headers & timeouts
HTTParty example:
require 'httparty'
resp = HTTParty.get(url, headers: HEADERS, timeout: 10)
Faraday (preferred for production):
require "faraday"
conn = Faraday.new(url: base_url) do |f|
f.adapter Faraday.default_adapter
end
resp = conn.get(path) { |req| req.headers.merge!(HEADERS) }
2. Parse with Nokogiri
require 'nokogiri'
doc = Nokogiri::HTML(resp.body)
items = doc.css(".product").map do |p|
  {
    title: p.at_css(".title")&.text&.strip&.scrub, # scrub replaces invalid UTF-8 bytes
    price: p.at_css(".price")&.text&.gsub(/[^\d\.]/, "")
  }
end
Use safe navigation (&.) and .strip to avoid crashes and normalize text.
3. Save raw HTML for debugging
File.write("sample.html", resp.body)
Open sample.html in a browser to confirm selectors.
Next Steps: Test on quotes.toscrape.com—adapt for scraping quotes and authors.
Pagination & Crawler Logic
Building on static fetches, start with a simple loop, then move to a queue with a visited set for larger crawls to avoid duplicate visits and infinite loops.
Simple Loop
base_url = 'https://quotes.toscrape.com/page/'
1.upto(5) do |page|
resp = HTTParty.get("#{base_url}#{page}", headers: HEADERS, timeout: 10)
# Parse and extract
sleep rand(1.0..3.0)
end
Advanced Queue
require "set"
require "uri"
to_visit = [start_url]
visited = Set.new
results = []
MAX_PAGES = 500
while (url = to_visit.shift) && visited.size < MAX_PAGES && results.size < 50_000
  next if visited.include?(url)
  visited << url
  resp = HTTParty.get(url, headers: HEADERS, timeout: 10)
  next unless resp.code == 200
  doc = Nokogiri::HTML(resp.body)
  # Extract into results...
  next_link = doc.at_css("a.next")
  to_visit << URI.join(url, next_link["href"]).to_s if next_link # resolve relative hrefs
  sleep rand(1.0..3.0) # Polite delay with jitter
end
Tips
Set a max page count or a time budget (e.g., 2 hours).
Per-domain politeness: 1–3s delay; reduce concurrency if errors increase.
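The per-domain delay can be enforced with a small helper rather than scattered sleep calls. A sketch (`DomainThrottle` is a hypothetical name, not a gem):

```ruby
require "uri"

# Tracks the last request time per host and sleeps just enough to keep
# at least `min_interval` seconds between hits to the same host.
class DomainThrottle
  def initialize(min_interval: 1.5)
    @min_interval = min_interval
    @last_hit = {}
  end

  def wait(url)
    host = URI(url).host
    elapsed = Time.now - (@last_hit[host] || Time.at(0))
    sleep(@min_interval - elapsed) if elapsed < @min_interval
    @last_hit[host] = Time.now
  end
end
```

Create one `DomainThrottle` and call `throttle.wait(url)` before each fetch: same-host requests stay spaced out while different hosts proceed immediately.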
Dynamic Scraping: Capybara + Selenium (Headless) & Ferrum
When to use: Content rendered by JavaScript, with no usable XHR API endpoint. As with static fetches, wrap calls in backoff logic for reliability.
Capybara + Selenium
require "capybara"
require "capybara/dsl"
require "selenium-webdriver"
Capybara.register_driver :headless_chrome do |app|
opts = Selenium::WebDriver::Chrome::Options.new
opts.add_argument("--headless=new")
opts.add_argument("--disable-gpu")
opts.add_argument("--no-sandbox")
Capybara::Selenium::Driver.new(app, browser: :chrome, options: opts)
end
session = Capybara::Session.new(:headless_chrome)
session.visit("https://dynamic.example.com")
session.has_css?(".loaded-item", wait: 10)
items = session.all(".loaded-item").map(&:text)
session.quit
Notes: Headless browsers are resource-heavy—reuse sessions and limit concurrent instances.
Ferrum (Lighter Alternative)
Ferrum controls Chrome without Selenium—add gem 'ferrum'. Generally lighter for simple tasks.
require 'ferrum'
browser = Ferrum::Browser.new(headless: true)
browser.go_to("https://dynamic.example.com")
# Extract via browser.body
browser.quit
For testing selectors, use RSpec (gem 'rspec'):
describe 'selectors' do
  it 'extracts titles' do
    doc = Nokogiri::HTML(File.read('sample.html'))
    expect(doc.css('.title')).not_to be_empty
  end
end
Professional Advice for Proxy Usage: GoProxy
Proxies are key for avoiding IP bans, geo-targeting content, and distributing requests. Building on the fetch examples above, add them when blocks occur.
Proxy types & when to use
Rotating: New IP per request; excellent for stateless, high-volume scraping.
Sticky (session): Same IP across multiple requests; required for login flows or multi-step transactions.
Datacenter vs Residential: Residential is more natural-looking but costlier.
Environment & secrets
Use environment variables:
export GOPROXY_HOST="proxy.goproxy.example"
export GOPROXY_PORT="8000"
export GOPROXY_USER="USERNAME"
export GOPROXY_PASS="PASSWORD"
Faraday with proxy (recommended)
proxy_uri = "http://#{ENV['GOPROXY_USER']}:#{ENV['GOPROXY_PASS']}@#{ENV['GOPROXY_HOST']}:#{ENV['GOPROXY_PORT']}"
conn = Faraday.new(url: "https://target.example", proxy: proxy_uri) do |f|
f.adapter Faraday.default_adapter
end
resp = conn.get("/path") { |req| req.headers.merge!(HEADERS) }
Session affinity: Keep one Faraday connection object per worker so multi-step flows stay on the same sticky proxy session. Note that plain Faraday does not persist cookies by itself; add middleware such as faraday-cookie_jar if you need them.
HTTParty proxy example
HTTParty.get(url,
http_proxyaddr: ENV["GOPROXY_HOST"],
http_proxyport: ENV["GOPROXY_PORT"].to_i,
http_proxyuser: ENV["GOPROXY_USER"],
http_proxypass: ENV["GOPROXY_PASS"],
headers: HEADERS
)
Chrome + proxy in Selenium (auth)
opts.add_argument("--proxy-server=http://#{ENV['GOPROXY_HOST']}:#{ENV['GOPROXY_PORT']}")
Important: Chrome + username/password proxy is tricky. Options:
- Use provider session tokens or proxy hosts whitelisted by IP.
- Use a local proxy tunnel tool to inject auth.
- Or use an authenticated proxy gateway.
Quick proxy test (curl)
curl -x http://USER:[email protected]:8000 -I https://example.com.
Next Steps: Test latency with GoProxy trial on a blocked site.
Best Practice to Avoid Blocks
Exponential Backoff
def with_backoff(max_attempts = 5)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue => e
    raise if attempt >= max_attempts
    sleep_time = (2 ** attempt) + rand(0.0..0.5) # Exponential backoff + jitter
    sleep(sleep_time)
    retry
  end
end
Thread pool / Queue
require "thread"
queue = Queue.new
urls.each { |u| queue << u }
workers = Array.new(3) do
Thread.new do
while (url = queue.pop(true) rescue nil)
with_backoff { process_url(url) } # process_url does fetch/parse/save
sleep rand(0.5..2.0)
end
end
end
workers.each(&:join)
Tip: Start 1–3 threads per proxy; monitor error rate and scale cautiously.
Parallel gem (alternative)
Use Parallel.map(urls, in_threads: 5) { |u| ... } for quick parallelism, but be mindful of detection risk.
What to log
Timestamp, URL, HTTP status, latency (ms), proxy used, worker id, error message, captcha_detected (boolean).
Example:
require "logger"
LOGGER = Logger.new("scraper.log")
LOGGER.info({ts: Time.now.iso8601, url: url, status: resp.code, ms: ms, proxy: proxy}.to_json)
Checkpointing to resume
require "json"
File.write("visited.json", JSON.pretty_generate(visited.to_a))
# On startup: visited = Set.new(JSON.parse(File.read("visited.json")))
Metrics & alerts
- Requests/minute (RPM)
- Success rate (2xx%)
- Error rate (4xx/5xx)
- CAPTCHA hits (count and rate)
Alert example: if CAPTCHA rate > 2% over 10 minutes → pause jobs and investigate.
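The alert rule above can be expressed as a small guard over recent request outcomes (`should_pause?` and the outcome-hash shape are assumptions for illustration):

```ruby
# Decide whether to pause based on the CAPTCHA rate in a recent window.
# min_sample guards against overreacting to a handful of requests.
def should_pause?(outcomes, captcha_threshold: 0.02, min_sample: 50)
  return false if outcomes.size < min_sample
  rate = outcomes.count { |o| o[:captcha] }.fdiv(outcomes.size)
  rate > captcha_threshold
end
```

Feed it the last 10 minutes of logged outcomes; a true result means pause the workers and investigate.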
Troubleshooting: Common Fixes
| Problem | Likely Cause | Quick Fix |
| --- | --- | --- |
| 403 / 429 | Rate limiting or IP block | Add jitter, reduce concurrency, rotate proxies, change UA/Referer |
| Missing data after fetch | JS-rendered content | Check Network/XHR in browser; use Selenium/Ferrum or call API endpoints |
| Pagination loops | No visited set or unstable next link | Use visited set; set max pages |
| Proxy auth fails | Bad creds / whitelist | Test via curl; check provider dashboard; ensure IP whitelist |
| Headless too slow / memory-heavy | Too many browser instances | Reduce concurrent browsers; reuse sessions; try Ferrum |
Debug tips
Save HTML to sample.html and open locally.
Use Pry to test selectors: doc = Nokogiri::HTML(File.read('sample.html')) and doc.css(...).
Record failing requests via logs and replay with curl to isolate issues.
Common pitfalls for beginners
Windows: path issues with drivers; make sure the driver directory is on ENV['PATH'].
UTF-8 errors: normalize text with String#scrub (or encode with replacement options).
Selectors break silently: cover them with automated tests (e.g., RSpec).
2026 Trends & Advice
Anti-bot techniques are evolving (fingerprinting, behavioral detection). Use mixed strategies: Lightweight static fetches when possible and targeted headless sessions for critical pages.
Proxy providers that support session tokens and geo selectors, such as GoProxy, simplify production scraping.
Observability (error/captcha rates) will become the dominant signal for tuning scrapers automatically. Expect more AI-driven selectors by late 2026—prototype with gems like 'openai-ruby' for dynamic adaptation.
For production: Deploy with Sidekiq on Heroku for scheduled jobs—add gem 'sidekiq'.
Final Thoughts
Ruby is an excellent choice for building readable, maintainable scrapers. Start with the basics, add proxies when you hit blocks, and scale carefully. By following these steps, from simple requests to proxy-integrated crawlers, you'll end up with scrapers that hold up in production. Practice on public sites like quotes.toscrape.com first, and always scrape responsibly.