Scrape Web Data into Excel: Power Query, VBA, Python & More
Feb 5, 2026
Step-by-step guide to scrape web data into Excel using Power Query, VBA, Python, and more—includes ethics, tips, and fixes.
Extracting information from websites directly into Excel can save hours of manual work. This guide walks you through reliable methods to extract web data into Excel, starting with no-code options and advancing to scripts like VBA and Python (including Selenium/Playwright for dynamic sites). We'll cover decision criteria, step-by-step instructions, code examples, ethics, troubleshooting, and tips, so you can perform ethical, effective scraping.
Start with Power Query for simple HTML tables—it's fast and refreshable. Use VBA for custom extractions within Excel. Opt for Python (with Selenium/Playwright) for JavaScript-driven pages, logins, or scale. Always prefer APIs for stability and legality.

Save time: Automate repetitive copy-paste tasks.
Analyze data: Leverage Excel’s formulas, PivotTables, and charts on live or scheduled data.
Build reports: Create refreshable dashboards.
Bridge systems: Combine web data with internal sources in one workbook—e.g., merge scraped prices with your inventory for analysis.
Prioritize responsibility:
Read the site’s Terms of Service—if scraping is disallowed, stop.
Check robots.txt for crawl rules (good practice, not law).
Prefer APIs—they're designed for data access.
Avoid private/personal data—respect privacy laws (e.g., GDPR in EU if data includes personal info).
Be polite: Add delays (1–3s), use a clear User-Agent, and limit request volume (see the sketch after this list).
Avoid obfuscation that violates a site's terms (e.g., hiding behind proxies to dodge bans). If unsure, seek legal advice.
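To make the politeness rules concrete, here is a minimal Python sketch (assuming the requests library and the public practice site books.toscrape.com; the contact address is a placeholder) that checks robots.txt, identifies itself, and waits between requests:
import time
import urllib.robotparser
import requests

BASE = "https://books.toscrape.com"  # Public practice site
USER_AGENT = "Mozilla/5.0 (compatible; you@example.com)"  # Placeholder contact; identify yourself

# Check robots.txt first (good practice, not law)
rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()

url = BASE + "/index.html"
if rp.can_fetch(USER_AGENT, url):
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=15)
    print(resp.status_code)
    time.sleep(2)  # Polite 1-3 second delay before the next request
else:
    print("robots.txt disallows this path; skip it")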
| Scenario / Skill | Best Choice | Coding | Handles JS | Platform | Common Pitfalls |
| --- | --- | --- | --- | --- | --- |
| Simple HTML tables, refresh inside Excel | Power Query | No | Limited (improved in 365) | Windows, Mac (limited) | Brittle to site changes; no heavy JS |
| Small, custom extraction inside Excel | VBA (MSXML/DOM) | Yes | Only with browser automation | Windows only | Deprecated IE; Mac incompatible |
| Cloud auto-update / Mac users | Google Sheets (IMPORT) | No | Limited | Any (web) | Query limits; no auth for private pages |
| Dynamic pages, logins, scale | Python + Selenium/Playwright | Yes | Yes | Any | Setup overhead; potential blocks |
| Non-developer scheduled crawls | No-code cloud crawlers | No | Yes | Any | Dependency on extensions; free limits |
If an API exists, use it—stable, legal, and structured.
Excel (2010+ for Power Query; Microsoft 365 for best features, including 2026 AI previews).
Browser (Chrome/Edge) for inspecting elements (F12 key).
For Mac users: VBA/Power Query limited—favor Google Sheets or Python.
Test on public sites like Wikipedia or books.toscrape.com.
Method 1: Manual copy and paste. Ideal for one-off tasks like grabbing a news table.
Pros: Zero setup. Cons: Not scalable.
Difficulty: Very low.
When to Use: Small, static data without logins—e.g., a quick list for a report. For repeated pulls, move to one of the automated methods below.
Steps:
1. Open website, find data.
2. Select (Ctrl+A).
3. Copy (Ctrl+C).
4. Paste in Excel (Ctrl+V; Paste Special > Text if messy).
5. Clean: Data > Text to Columns.
Tips: Add a timestamp column so you can tell when each snapshot was taken.
Method 2: Power Query. Power Query imports HTML tables and can refresh them on demand or on a schedule. In 2026 Excel 365, the Edge-based engine improves JavaScript support, potentially with AI assistance for selectors.
Pros: Native, transformable. Cons: Struggles with heavy JS/logins.
Difficulty: Low.
When to Use: Tables needing updates—e.g., daily stats.
Steps:
1. Excel > Data tab.
2. Get Data > From Web.
3. Enter URL (e.g., https://en.wikipedia.org/wiki/List_of_countries_by_population).
4. Navigator: Select table.
5. Transform: Filter, remove columns, fix nulls.
6. Load. Refresh: Right-click > Refresh; Properties > Background every X minutes.
Tips:
Preview check: If the Navigator preview is incomplete, the page probably relies on JavaScript (try Method 5).
Automation: Pair query refreshes with macros for extra processing.
Mac: Power Query is available, but VBA is not.
Advanced: Pagination with Power Query M (In Advanced Editor):
let
    GetPage = (page as number) => // Function to fetch one page
        let
            Source = Web.Page(Web.Contents("https://example.com/products?page=" & Number.ToText(page))),
            Table = try Source{0}[Data] otherwise #table({}, {}) // Handle empty pages
        in
            Table,
    Pages = List.Numbers(1, 5), // Adjust page count
    Tables = List.Transform(Pages, each GetPage(_)),
    NonEmpty = List.Select(Tables, each Table.RowCount(_) > 0),
    Combined = if List.Count(NonEmpty) > 0 then Table.Combine(NonEmpty) else #table({}, {})
in
    Combined
Method 3: VBA (MSXML/DOM). VBA suits custom extractions where the data isn't in a standard HTML table. For JavaScript-heavy pages, consider SeleniumBasic (download it from GitHub and add the library reference).
Pros: Flexible and runs entirely inside Excel. Cons: Windows only.
Difficulty: Medium.
When to Use: DOM elements, small workflows—e.g., quotes database.
Prerequisites: Developer tab; References > Microsoft HTML Object Library.
Steps:
1. Alt+F11 > Insert Module.
2. Paste (MSXML for speed):
Sub ScrapeWithMSXML()
    On Error GoTo ErrHandler
    Dim http As Object, html As New HTMLDocument ' HTML parser
    Dim url As String
    Dim items As IHTMLElementCollection, itm As Object
    Dim r As Long

    url = "https://books.toscrape.com" ' Test site
    Set http = CreateObject("MSXML2.XMLHTTP.6.0") ' HTTP client
    http.Open "GET", url, False
    http.setRequestHeader "User-Agent", "Mozilla/5.0 (compatible; [email protected])" ' Polite header
    http.send

    If http.Status <> 200 Then ' Check response
        MsgBox "HTTP error: " & http.Status
        Exit Sub
    End If

    html.body.innerHTML = http.responseText ' Load HTML
    Set items = html.getElementsByClassName("product_pod") ' Target elements
    r = 2
    For Each itm In items
        On Error Resume Next
        ThisWorkbook.Sheets(1).Cells(r, 1).Value = itm.getElementsByTagName("h3")(0).innerText ' Title
        ThisWorkbook.Sheets(1).Cells(r, 2).Value = itm.getElementsByClassName("price_color")(0).innerText ' Price
        r = r + 1
        On Error GoTo ErrHandler
    Next itm

    MsgBox "Done. Rows: " & r - 2
    Exit Sub
ErrHandler:
    MsgBox "Error: " & Err.Description
End Sub
3. Run via Macros.
4. Refresh: Rerun.
Tips:
Set Application.ScreenUpdating = False while writing cells for speed.
JS sites: Use SeleniumBasic to drive a real browser.
Selectors: Keep class and tag names in one place so site changes are easy to update.
Mac alternative: Python (Method 5).
Method 4: Google Sheets (IMPORTHTML/IMPORTXML). Pros: Web-based. Cons: 50 imports per sheet; no authentication for private pages; refreshes on reopen but is rate-limited.
Difficulty: Low.
When to Use: Cloud/Mac—e.g., shared tracking.
Steps:
1. Sheets cell: =IMPORTHTML("url", "table", 1) or =IMPORTXML("url", "//xpath").
2. Data loads automatically.
3. File > Download > Microsoft Excel (.xlsx).
Tips: Test your XPath in the browser console (e.g., $x("//xpath")) before pasting it into IMPORTXML.
Method 5: Python (requests/BeautifulSoup, Selenium, or Playwright). Pros: Cross-platform and scalable. Cons: Requires environment setup.
Difficulty: High.
When to Use: Dynamic pages, pagination, logins, or larger jobs.
Installation: Python.org; pip install requests beautifulsoup4 pandas openpyxl selenium (for Selenium); Playwright as faster alt (pip install playwright).
Static-page example (requests + BeautifulSoup):
import requests
import time  # For delays
from bs4 import BeautifulSoup
import pandas as pd

BASE = "https://books.toscrape.com/catalogue/page-{}.html"  # Paginated URL
headers = {"User-Agent": "Mozilla/5.0 (compatible; [email protected])"}  # Polite
rows = []

for page in range(1, 6):  # Loop pages
    url = BASE.format(page)
    r = requests.get(url, headers=headers, timeout=15)  # Timeout to avoid hangs
    if r.status_code != 200:
        print("Stopped at", url, r.status_code)
        break
    soup = BeautifulSoup(r.text, "html.parser")
    for product in soup.select("article.product_pod"):  # Select elements
        title = product.h3.a["title"]
        price = product.select_one(".price_color").get_text(strip=True)
        rows.append({"title": title, "price": price})
    time.sleep(1.5)  # Polite delay

df = pd.DataFrame(rows)
df.to_excel("books_prices.xlsx", index=False)  # Export
Selenium Example (for JS; a Playwright alternative follows below):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd

opts = Options()
opts.add_argument("--headless=new")  # No browser window
driver = webdriver.Chrome(options=opts)  # Needs chromedriver
driver.get("https://example.com/dynamic-products")

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-row")))  # Wait for elements

rows = []
items = driver.find_elements(By.CSS_SELECTOR, ".product-row")
for it in items:
    title = it.find_element(By.CSS_SELECTOR, "h2").text
    price = it.find_element(By.CSS_SELECTOR, ".price").text
    rows.append({"title": title, "price": price})

driver.quit()
pd.DataFrame(rows).to_excel("dynamic.xlsx", index=False)
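If you installed Playwright instead, the equivalent scrape looks like this: a minimal sketch assuming the same hypothetical page and selectors as the Selenium example, and that you have run python -m playwright install chromium after pip install.
from playwright.sync_api import sync_playwright
import pandas as pd

rows = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # No browser window
    page = browser.new_page()
    page.goto("https://example.com/dynamic-products")  # Hypothetical dynamic page
    page.wait_for_selector(".product-row")  # Wait for JS-rendered rows
    for item in page.query_selector_all(".product-row"):
        title = item.query_selector("h2").inner_text()
        price = item.query_selector(".price").inner_text()
        rows.append({"title": title, "price": price})
    browser.close()

pd.DataFrame(rows).to_excel("dynamic_playwright.xlsx", index=False)  # Export to Excel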
Tips:
APIs: requests.get(api_url).
For larger crawls, send requests through proxies. Use rotating proxies to spread load and reduce blocking; keep polite delays and retries.
Browser automation can use a proxy per browser instance to distribute requests or access geo-restricted content. Many crawler setups create multiple browser workers, each configured with a different proxy endpoint. A minimal proxy-rotation sketch follows these tips.
Scheduling: Windows Task Scheduler > python script.py.
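As a rough illustration of the proxy tips above, here is a minimal requests sketch; the proxy URLs and contact address are placeholders you would replace with your provider's endpoints:
import itertools
import time
import requests

# Placeholder proxy endpoints -- substitute your provider's gateways and credentials
PROXIES = itertools.cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
])
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; you@example.com)"}  # Placeholder contact

def fetch(url):
    proxy = next(PROXIES)  # Rotate endpoints to spread load
    resp = requests.get(url, headers=HEADERS, timeout=15,
                        proxies={"http": proxy, "https": proxy})
    resp.raise_for_status()
    time.sleep(1.5)  # Keep polite delays even when rotating
    return resp.text
The same idea carries over to browser automation: start each Selenium or Playwright worker with a different proxy (for Chrome, via the --proxy-server argument).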
Method 6: No-code browser extensions and cloud crawlers. Pros: Beginner-friendly. Cons: Extension and free-tier limits.
Difficulty: Low.
When to Use: JavaScript-heavy sites when you don't want to write code.
Steps (using a free point-and-click scraper extension):
1. Install from Chrome Store.
2. Open site, create sitemap.
3. Select elements.
4. Scrape, export CSV.
5. Excel: Data > From Text/CSV.
Tips: Selection is point-and-click; many tools also offer scheduling options.
Quality checklist:
Data loads correctly?
Selectors match (no empties)?
No server bans (check ethics)?
Cleaned (TRIM, types)?
Common challenges and fixes:
JavaScript content → Selenium / Playwright / cloud crawler.
Pagination → Power Query M, Python loop, or crawler pagination settings.
Login/Authentication → Selenium or authenticated API (prefer API).
Rate limits/blocks → slow down, add headers and backoff, or consider an API or rotating proxies (they reduce blocking for legitimate crawls but add cost and complexity; use them responsibly). A retry/backoff sketch follows this list.
Data cleanup → Power Query or pandas.
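For the rate-limit case, a retry-with-backoff setup might look like the following sketch, which uses requests with urllib3's Retry; the status codes, delays, and contact address are assumptions to adapt to your target site:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=3,                                # Up to 3 retries per request
    backoff_factor=2,                       # Waits roughly 2s, 4s, 8s between attempts
    status_forcelist=[429, 500, 502, 503],  # Retry on rate limits and server errors
)
session.mount("https://", HTTPAdapter(max_retries=retry))
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; you@example.com)"})  # Placeholder contact

resp = session.get("https://books.toscrape.com", timeout=15)
print(resp.status_code)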
Troubleshooting:
No table in Power Query? View source (Ctrl+U). If no <table>, Power Query won’t see it.
VBA returns empty body? Check http.responseText and User-Agent.
IMPORTXML fails? Test XPath in console.
Excel date/number issues? Use Power Query or pandas to coerce types (see the pandas sketch below).
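As a cleanup example, here is a short pandas pass; a sketch assuming the books_prices.xlsx file produced by the earlier requests + BeautifulSoup example, with a hypothetical scraped_at column shown for dates:
import pandas as pd

df = pd.read_excel("books_prices.xlsx")  # File from the requests + BeautifulSoup example

# Strip currency symbols and whitespace, then coerce price to a real number
df["price"] = df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["title"] = df["title"].str.strip()

# If you scrape dates, coerce them the same way (column name is hypothetical):
# df["scraped_at"] = pd.to_datetime(df["scraped_at"], errors="coerce")

df.to_excel("books_prices_clean.xlsx", index=False)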
Q: Will Power Query handle JavaScript-rendered tables?
A: Usually, no — Power Query reads server HTML. For JS, use Selenium/Playwright or a cloud crawler.
Q: Can I schedule Power Query refresh?
A: Yes, on Windows Excel you can set queries to refresh on open or at intervals; Mac/Excel Online support is limited.
Q: Is scraping legal?
A: It depends on the site’s Terms and your jurisdiction. Always check Terms, prefer APIs, and avoid personal/private data.
In 2026, watch for AI-assisted scraping features in Excel. Try a practice project: scrape books.toscrape.com with Power Query, then with Python, and compare the results. Keep building your skills on public test sites.