Comprehensive step-by-step tutorial on ethically extracting Airbnb listings, calendars, and reviews. Explore alternatives, legal considerations, and build reliable pipelines for market analysis.
Accessing public rental data from platforms like Airbnb can provide insights for investors, researchers, and hosts—such as market trends, pricing optimization, and competitive analysis. However, due to platform restrictions, ethical and legal considerations are paramount. This guide prioritizes alternatives to scraping, then outlines responsible approaches if you proceed, from no-code tools to custom pipelines. We'll cover why, how, and what to watch out for, ensuring you can follow along step-by-step.

Disclaimer: This guide is for educational purposes only and focuses on ethical data collection practices. Airbnb's Terms of Service (ToS) explicitly prohibit automated scraping, bots, or crawlers without permission. Violating these can result in account suspension, IP bans, or legal action under laws like the CFAA (U.S.) or GDPR (EU). Always check Airbnb's robots.txt and ToS before proceeding—these disallow most search, listing, calendar, and review endpoints. We strongly recommend using public datasets or official channels first. Consult a lawyer for your specific use case, especially for commercial applications. This article does not constitute legal advice.
Common goals include tracking market trends, optimizing nightly pricing against comparable listings, and benchmarking competitors.
Define your business question upfront—it dictates data frequency, fields, and retention.
Pro Tip: If your goal is broad analysis, skip scraping and use public datasets (see below).
Public Data Only: Stick to publicly visible fields like titles, prices, and ratings. Avoid private info (e.g., contacts, messages) or data behind logins.
Platform Policies: Review Airbnb's robots.txt (disallows /s/* searches, /rooms/* details, /calendar/ except iCal) and ToS (bans automated access). As of 2026, scraping violates these, potentially leading to legal risks.
Data Retention and Consent: Create a retention policy (e.g., delete data after 30 days unless still needed). Ensure no PII (personally identifiable information) is stored without consent.
PII Handling: Hash or redact names, IDs, or locations (e.g., use approximate coordinates); a minimal hashing sketch follows this checklist.
Documentation: Log source URLs, timestamps, and methods for each record.
Commercial Use: If selling or redistributing data, seek legal counsel—it's often prohibited.
Ethical Lens: Ask: Does this harm users or the platform? Prioritize transparency and minimal impact.
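To make the PII-handling point above concrete, here is a minimal sketch of salted hashing for display names plus coordinate coarsening; the function and field names are illustrative, not part of any official API.

# pii_utils.py - hash display names, coarsen coordinates (illustrative sketch)
import hashlib

def hash_pii(value: str, salt: str = "change-me") -> str:
    # Store a salted SHA-256 digest instead of the raw name or ID.
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def approx_coords(lat: float, lng: float, precision: int = 2) -> tuple:
    # Two decimal places is roughly 1 km, enough for neighborhood-level analysis.
    return (round(lat, precision), round(lng, precision))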
Public Datasets: Use sites like Inside Airbnb for aggregated city data (listings, reviews, calendars)—free and legal.
Official Channels: Airbnb doesn't offer a public API for listings, but check for partnerships or developer programs. For calendars, use allowed iCal feeds (/calendar/ical/).
Third-Party Services: Tools like AirDNA or Mashvisor provide scraped/aggregated data legally (via subscriptions).
Manual Export: For small needs, browse and copy-paste, or use browser extensions like Web Scraper (Chrome).
If these suffice, great! If not, proceed cautiously with the steps below.
| Need | Recommended Approach | Effort Level |
| --- | --- | --- |
| Historical/low-friction | Public datasets | Low |
| Small/occasional | No-code tools (e.g., Octoparse, ParseHub) | Beginner |
| Custom/regular | Managed services (e.g., GoProxy proxies) | Intermediate |
| High scale/flexibility | In-house pipeline | Advanced |
Start with a minimal viable product (MVP) schema:
Search Result Row: {title, short_description, url, thumbnail, price_per_night, rating, review_count, listing_type, neighborhood, superhost}
Listing Detail: {full_description, amenities[], fees: {cleaning, service}, sleeping_arrangements, host_display_name (hashed), cancellation_policy, bedrooms, photos[], approx_coordinates}
Calendar: {date, available, nightly_price, min_stay, blocked}
Review: {text, reviewer_pseudonym (hashed), date, rating}
Checkpoint: Define your MVP schema and fields before coding.
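One way to satisfy this checkpoint before writing any parser is to pin the fields down as typed records. This is a sketch of the MVP schema above (a listing-detail record would follow the same pattern), not a definitive data model.

# schema.py - typed records mirroring the MVP fields above (sketch only)
from typing import Optional, TypedDict

class SearchResultRow(TypedDict, total=False):
    title: str
    short_description: str
    url: str
    thumbnail: str
    price_per_night: Optional[float]
    rating: Optional[float]
    review_count: Optional[int]
    listing_type: str
    neighborhood: str
    superhost: bool

class CalendarDay(TypedDict, total=False):
    date: str                  # ISO-8601, e.g. "2026-01-12"
    available: bool
    nightly_price: Optional[float]
    min_stay: Optional[int]
    blocked: bool

class Review(TypedDict, total=False):
    text: str
    reviewer_pseudonym: str    # hashed, never the raw name
    date: str
    rating: Optional[float]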
Language: Python 3.10+ (or Node.js/Go).
Setup: Create a virtual env: python -m venv env && source env/bin/activate.
Libraries: Install via pip: requests, beautifulsoup4, lxml, aiohttp, python-dotenv. For JS-rendered pages: playwright.
Secrets: Use .env for API keys/proxies.
Skill Levels: Beginner (no-code), Intermediate (XHR parsing), Advanced (distributed).
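Secrets such as proxy credentials should stay out of the code; a minimal sketch of loading them with python-dotenv (the variable names are assumptions):

# settings.py - load credentials from .env (variable names are illustrative)
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env in the working directory

PROXY_URL = os.getenv("PROXY_URL")                    # e.g. http://user:pass@host:port
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "12"))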
Choose area/city, time window, and MVP fields.
Create retention & sharing document.
Check robots.txt and platform policy.
Checkpoint: Approve scope & retention in writing.
Open DevTools → Network → XHR.
Perform searches and scroll while watching for JSON/XHR/GraphQL responses. Save any JSON responses you find.
Inspect page HTML for <script type="application/ld+json"> or other embedded JSON blobs.
Action: Identify at least one stable endpoint or selector and save 2–5 sample JSON/XHR responses.
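For the embedded-JSON route mentioned above, a short sketch of pulling application/ld+json blobs out of a saved page with BeautifulSoup (save them next to your XHR samples):

# discover_jsonld.py - extract embedded JSON-LD blobs from saved HTML (sketch)
import json
from bs4 import BeautifulSoup

def extract_jsonld(html: str) -> list:
    soup = BeautifulSoup(html, "lxml")
    blobs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            blobs.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue  # skip malformed blobs instead of crashing
    return blobs

# Usage: blobs = extract_jsonld(open("sample_search.html", encoding="utf-8").read())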
Stable JSON endpoint found: prefer direct HTTP requests.
No JSON, content rendered by JS: use headless browser or rendering service.
Non-dev: use a point-and-click extraction tool that exports CSV/Sheets.
Capture canonical search parameters: city, check-in/out, guests, offsets.
Use URL-encoding utilities; test in browser → copy XHR to script.
Checkpoint: You can reproduce one full search XHR and get consistent JSON.
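To hit that checkpoint reproducibly, a minimal sketch of rebuilding the search URL from the captured parameters; the parameter names below are placeholders, so copy the exact keys you saw in DevTools.

# build_search_url.py - rebuild a captured search request from its params (sketch)
from urllib.parse import urlencode

def build_search_url(base, city, checkin, checkout, guests=2, offset=0):
    params = {
        "query": city,
        "checkin": checkin,      # ISO dates, e.g. "2026-03-01"
        "checkout": checkout,
        "adults": guests,
        "items_offset": offset,  # whatever cursor/offset key the XHR actually uses
    }
    return f"{base}?{urlencode(params)}"

# Usage: build_search_url("https://www.example.com/api/search", "Lisbon", "2026-03-01", "2026-03-05")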
Prefer semantic hooks: schema.org JSON, data-*, data-testid, aria-label.
Use defensive parsing: return None when fields missing; avoid blind .group() usage.
Write small parser functions and unit tests.
Checkpoint: Extract 5–10 sample records and manually verify every one of them for accuracy.
If XHR-based pagination exists, call the same endpoints with updated cursor/offset.
For infinite scroll, replicate scrolling via browser automation (see the Playwright sketch below) or call the XHR that loads more results directly.
Hint: Emulate human-like scrolling and include small random delays.
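Where scrolling is unavoidable, a minimal Playwright sketch of human-like scrolling (scroll distances and delays are illustrative):

# scroll_results.py - emulate human-like scrolling with Playwright (sketch)
# requires: pip install playwright && playwright install chromium
import random, time
from playwright.sync_api import sync_playwright

def collect_scrolled_html(url, max_scrolls=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        for _ in range(max_scrolls):
            page.mouse.wheel(0, random.randint(800, 1500))  # scroll a human-ish distance
            time.sleep(random.uniform(0.8, 2.0))            # small random delay
        html = page.content()
        browser.close()
        return html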
Collect listing URLs, then fetch calendar endpoints (separate XHRs).
Batch calendar fetches (e.g., 10–20 concurrent) and use delays to avoid detection.
Checkpoint: Pull calendar for one listing, parse ISO dates & prices.
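A sketch of batching those calendar fetches with aiohttp and a semaphore; the calendar URL list is assumed to come from the endpoints you identified earlier, and the batch size follows the 10–20 guideline above.

# fetch_calendars.py - batched calendar fetches with a concurrency cap (sketch)
import asyncio, random
import aiohttp

CONCURRENCY = 10  # stay within the 10-20 batch guideline and keep it polite

async def fetch_one(session, sem, url):
    async with sem:
        await asyncio.sleep(random.uniform(0.5, 1.5))  # jitter between requests
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            resp.raise_for_status()
            return await resp.json()

async def fetch_calendars(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_one(session, sem, u) for u in urls))

# Usage: data = asyncio.run(fetch_calendars(calendar_urls))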
Concurrency limits: max 3 concurrent fetches per IP (adjust based on performance).
Randomized pacing: sleep(random.uniform(1.0, 3.0)) with jitter; add longer sleeps after page-heavy runs.
Backoff strategy: on 429/503, sleep base * (2 ** attempt) seconds; cap backoff (e.g., 10 min).
Proxies: Rotate residential IPs; retire failing ones (>10% errors/24h).
CAPTCHA handling: do not automate circumventing CAPTCHAs; pause and route to manual review or reduce intensity.
Headers & fingerprinting: randomize User-Agent, Accept-Language, and other header values; avoid sending identical header sets for many requests.
Logging & metrics: record response codes, parse success rates, and per-IP performance.
Checkpoint: run 50 requests from your pipeline without immediate blocks.
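Putting the pacing and backoff rules above into code, a minimal sketch:

# pacing.py - randomized pacing and capped exponential backoff (sketch)
import random, time

def polite_sleep(base_min=1.0, base_max=3.0):
    # Randomized delay with jitter between requests.
    time.sleep(random.uniform(base_min, base_max))

def backoff_delay(attempt, base=2.0, cap=600.0):
    # Exponential backoff for 429/503 responses, capped at 10 minutes.
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

# Usage: on a 429, sleep for backoff_delay(attempt) and retry; give up after a few attempts.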
Normalize price fields, currency codes, date format (ISO-8601).
Dedupe by listing URL or platform listing ID.
Storage tiers: CSV/SQLite for POC; Postgres or object storage + DB for scale.
Checkpoint: Run dedupe and ensure currency/date normalization works on sample.
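A rough sketch of that normalization and dedupe pass; the input date format and record keys are assumptions, so adapt them to what your parser actually emits.

# normalize.py - ISO dates, numeric prices, dedupe by URL/listing ID (sketch)
from datetime import datetime

def normalize_record(rec):
    rec = dict(rec)
    if rec.get("date"):
        # Assumes "DD/MM/YYYY" input; change the format string to match your source.
        rec["date"] = datetime.strptime(rec["date"], "%d/%m/%Y").date().isoformat()
    if rec.get("price_raw"):
        # Naive numeric extraction; handle thousands separators per locale.
        digits = "".join(ch for ch in rec["price_raw"] if ch.isdigit() or ch == ".")
        rec["price"] = float(digits) if digits else None
    return rec

def dedupe(records, key="url"):
    seen, out = set(), []
    for r in records:
        k = r.get(key)
        if k and k not in seen:
            seen.add(k)
            out.append(r)
    return out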
Export to CSV or sync to a spreadsheet for quick analysis.
For live systems, stream to a DB and add change-detection logic (diffs of nightly prices).
Checkpoint: Export a cleaned CSV and load into your analysis tool (e.g., Pandas).
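For the change-detection idea above, a short Pandas sketch that flags nightly-price differences between two cleaned exports (file and column names are illustrative):

# price_diff.py - detect nightly price changes between two exports (sketch)
import pandas as pd

old = pd.read_csv("listings_yesterday.csv")
new = pd.read_csv("listings_today.csv")

merged = new.merge(old[["url", "price"]], on="url", suffixes=("", "_old"))
changed = merged[merged["price"] != merged["price_old"]]
print(changed[["url", "price_old", "price"]])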
This example demonstrates safe session usage, retries, defensive parsing and unit-test hooks. Replace selectors with verified ones from Step 1.
# airbnb_scraper_sample.py
import requests, time, random, json, re
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

BASE_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; DataCollector/1.0)", "Accept-Language": "en-US,en;q=0.9"}

def create_session(retries=3, backoff=0.3):
    s = requests.Session()
    retry = Retry(total=retries, backoff_factor=backoff, status_forcelist=[429, 500, 502, 503, 504])
    s.mount("https://", HTTPAdapter(max_retries=retry))
    return s

def safe_get(session, url, headers=None, timeout=12):
    h = {**BASE_HEADERS, **(headers or {})}
    resp = session.get(url, headers=h, timeout=timeout)
    resp.raise_for_status()
    return resp

def extract_price(text):
    if not text:
        return None
    m = re.search(r'([€$£]\s?[\d,.]+|\d+[,.]?\s?(USD|EUR|GBP)?)', text)
    return m.group(0) if m else None

def parse_search_html(html):
    soup = BeautifulSoup(html, "lxml")
    out = []
    cards = soup.select('[data-testid="property-card"]')  # Validate selectors!
    for c in cards:
        try:
            title = c.select_one('[data-testid="listing-card-title"]')
            price_el = c.select_one('[aria-label*="price"]')
            rating = c.select_one('[aria-label*="rating"]')
            out.append({
                "title": title.get_text(strip=True) if title else None,
                "price_raw": price_el.get_text(strip=True) if price_el else None,
                "price": extract_price(price_el.get_text()) if price_el else None,
                "rating": rating.get_text(strip=True) if rating else None,
            })
        except Exception as e:
            print("parse error:", e)
    return out

if __name__ == "__main__":
    session = create_session()
    url = "https://www.airbnb.com/s/YourCity"  # Replace with validated URL
    resp = safe_get(session, url)
    results = parse_search_html(resp.text)
    with open("airbnb_sample.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    time.sleep(random.uniform(1.0, 2.0))
Notes:
Reuse the Session from create_session() so connections are pooled.
Rotate BASE_HEADERS['User-Agent'] and consider rotating proxies.
Unit test suggestion: write tests against saved sample HTML to assert the parser returns expected keys and values; pytest works well for this.
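A minimal pytest sketch against a saved sample page (the fixture path is illustrative):

# test_parser.py - run with `pytest`; uses a page saved during discovery (sketch)
from airbnb_scraper_sample import parse_search_html, extract_price

def test_extract_price_handles_symbols():
    assert extract_price("$123 per night") == "$123"
    assert extract_price("") is None

def test_parse_search_html_returns_expected_keys():
    html = open("tests/sample_search.html", encoding="utf-8").read()
    rows = parse_search_html(html)
    assert rows, "parser returned no rows - selectors may have changed"
    assert {"title", "price_raw", "price", "rating"} <= set(rows[0].keys())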
Cost tradeoff: managed rendering/proxy = higher OPEX, less engineering. In-house = upfront engineering, lower variable costs at scale.
Q: Pages load in my browser but my script gets empty HTML.
A: The site renders content client-side — use a headless browser or call the JSON/XHR endpoints discovered in DevTools.
Q: Prices differ from what I see in the browser.
A: Many platforms show base nightly price and add cleaning, service fees and taxes at checkout. Scrape full fee breakdowns when available.
Q: How do I avoid being blocked?
A: Lower concurrency, add randomized delays, rotate IPs, use session reuse, and monitor failure rates. Avoid abusive scraping patterns.
Q: Is scraping legal?
A: It depends. Collect public data responsibly, document use, and seek legal advice for commercial or redistributive uses.
1. Search for public datasets first (many city datasets exist) — if they meet your needs, use them.
2. For production scraping, start small: build a proof-of-concept that fetches and parses 100 listings reliably before scaling.
3. Automate testing for selectors and quotas so you detect breaks early.
4. Keep legality and ethics central — avoid harvesting private contact data and comply with local rules.
With these steps, you'll gather insights efficiently—test small, iterate, and scale responsibly.