Dec 8, 2025
Step-by-step PHP guide to curl_init GET requests: examples, proxy usage, error handling, retries, and troubleshooting for beginners.
Fetching data from external sources is a common task in web development. Whether you're building an API consumer, scraping web content for analysis, or simply testing endpoints, PHP's cURL library is a powerful tool. This step-by-step guide covers making reliable GET requests in PHP using curl_init(); each step tells you what to do and why now. It includes examples (API, auth, proxy), a reusable helper, debugging tips, and a troubleshooting checklist.
curl_init() → set options (curl_setopt() / curl_setopt_array()) → curl_exec() → check curl_errno() and HTTP status via curl_getinfo() → parse response (json_decode() for JSON) → curl_close(). Use http_build_query() for safe query parameters. Avoid sending request body with GET. For proxies, use CURLOPT_PROXY / CURLOPT_PROXYUSERPWD.
cURL (Client URL) is a command-line tool and library for transferring data using various protocols, including HTTP. In PHP, the cURL extension allows making network requests programmatically. A GET request, one of the fundamental HTTP methods, is used to retrieve data from a server without modifying it; you can think of it as "just reading" information.

Common Scenarios: Integrating third-party APIs, automating data collection, and debugging web services. Tip: If web scraping at scale, consider using a proxy service like GoProxy to rotate IP addresses and avoid rate limiting—more on that later.
Fetch JSON from REST APIs (read-only endpoints).
Download a public resource (image, JSON, HTML).
Simple scraping where you control rate and respect robots.txt.
When you need advanced options not supported by file_get_contents() (headers, proxy, redirect control, custom User-Agent); see the contrast sketch below.
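For contrast, a minimal file_get_contents() fetch (a sketch; it assumes allow_url_fopen is enabled, and it offers no timeout, header, or proxy control without extra stream-context setup):

$json = file_get_contents('https://jsonplaceholder.typicode.com/posts/1');
echo $json;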
Use environment variables for secrets (getenv() or server config).
Do not disable SSL verification in production (CURLOPT_SSL_VERIFYPEER should remain true).
Don’t scrape personal data unlawfully.
Always check and respect robots.txt and the website’s terms. For APIs, use the documented rate limits.
Add delays between requests (e.g., sleep() or usleep()), or implement exponential backoff on 429/5xx responses (doubling wait time each retry to avoid overwhelming the server).
Identify your crawler with a proper User-Agent and contact info if required.
Purpose: Confirm prerequisites so code won’t fail at runtime.
Why first: If cURL is missing, nothing else will work; stop and fix this immediately.
If missing: install or enable the cURL extension (e.g., sudo apt install php-curl on Debian/Ubuntu, then restart PHP or your web server).
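To verify, a quick check script (a minimal sketch; save it as, say, check.php and run php check.php):

<?php
// Prerequisite check: confirm the cURL extension is loaded.
if (!extension_loaded('curl')) {
    exit("cURL extension is missing - install/enable it first.\n");
}
echo 'cURL version: ' . curl_version()['version'] . "\n";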
Purpose: Choose a simple, public URL with predictable output (no auth).
Why now: Lets you verify network/environment quickly without extra complexity.
Test URL:
https://jsonplaceholder.typicode.com/posts/1
Expected Output:
{
"userId": 1,
"id": 1,
"title": "sunt aut facere ...",
"body": "quia et suscipit..."
}
Purpose: Run the smallest possible working cURL GET to prove the environment and network path work.
Why now: Quick feedback loop — you’ll see success or immediate, diagnosable error.
<?php
$ch = curl_init('https://jsonplaceholder.typicode.com/posts/1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $response;
}
curl_close($ch);
Run: php test.php (CLI) or open via local webserver. If you see the JSON above, continue.
Purpose: Properly encode and assemble GET parameters using http_build_query().
Why now: You’ll use a correct URL when configuring cURL options; avoid malformed queries early.
$base = 'https://api.example.com/items';
$params = ['q' => 'cat toys', 'page' => 2];
$url = $base . '?' . http_build_query($params);
Note: Use GET (URL) for small/read-only parameters. Switch to POST for large payloads or sensitive data.
Purpose: Create the cURL session handle you will configure.
Why now: Every subsequent option and execution requires this handle.
$ch = curl_init();
Purpose: Configure the essential behavior before executing.
Why now: These are fundamental and should be set before optional features. Timeouts prevent blocking.
curl_setopt_array($ch, [
    CURLOPT_URL            => $url,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true, // optional
    CURLOPT_TIMEOUT        => 15,
    CURLOPT_CONNECTTIMEOUT => 5,
]);
Purpose: Supply Accept, User-Agent and any auth headers using environment variables. Prevents leaking secrets.
Why now: Headers are needed for many APIs and must be set before the request. Security practice enforced here.
Set environment variables (examples):
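For example (the token value is a placeholder; the right mechanism depends on how PHP runs):

export MY_API_TOKEN="your-token-here"          # shell, for CLI scripts
# Apache (mod_env):   SetEnv MY_API_TOKEN "your-token-here"
# php-fpm pool file:  env[MY_API_TOKEN] = "your-token-here"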
Add to cURL:
$token = getenv('MY_API_TOKEN');
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'Accept: application/json',
    'Authorization: Bearer ' . ($token ?: ''),
    'User-Agent: MyApp/1.0 (+https://example.com)',
]);
Security tip: use getenv('MY_API_TOKEN') for secrets; never hardcode tokens in source.
Purpose: Route requests through a proxy for geo-ip, rotation, or bypassing restrictions.
Why now: Proxy changes network routing — must be set before execution.
GoProxy example:
curl_setopt($ch, CURLOPT_PROXY, 'proxy.goproxy.example:8000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'username:password'); // optional
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP); // or CURLPROXY_SOCKS5
Notes:
Sticky proxies keep the same IP (for sessions);
Rotating proxies change IPs per request (for large-scale scraping). GoProxy supports custom sticky time for specific demands.
Proxies add latency — increase CURLOPT_TIMEOUT as needed.
Purpose: Capture raw response headers along with the body.
Why now: Useful for debugging rate limits, cookies, and redirects; do this before execution.
curl_setopt($ch, CURLOPT_HEADER, true); // raw headers will be in the response
Note: If enabled, you must split headers/body later using CURLINFO_HEADER_SIZE.
Purpose: Handle transient network/server errors while avoiding useless retries on client errors; do not retry on 4xx.
Why now: Execution is the main work — retry logic improves robustness and should wrap the exec call.
$attempt = 0; $max = 3; $wait = 1;
do {
    $raw = curl_exec($ch);
    $errno = curl_errno($ch);
    $error = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($errno) {
        error_log("cURL transport error: $error (attempt $attempt)");
        $attempt++; sleep($wait); $wait *= 2;
        continue;
    }
    if ($httpCode >= 500) {
        error_log("Server error HTTP $httpCode (attempt $attempt)");
        $attempt++; sleep($wait); $wait *= 2;
        continue;
    }
    if ($httpCode >= 400 && $httpCode < 500) {
        error_log("Client error HTTP $httpCode - fix request/auth");
        break;
    }
    // Success (2xx) or other non-retriable status
    break;
} while ($attempt < $max);
Purpose: Detect transport-level problems (DNS, connection refused, timeouts).
Why now: These errors indicate network problems that retries or different handling may solve.
if (curl_errno($ch)) {
    $err = curl_error($ch);
    echo "cURL error: $err";
}
Purpose: Examine HTTP code, timings, and other useful metadata.
Why now: Distinguishing 2xx vs 4xx vs 5xx guides next actions (retry, fix request, etc.).
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$info = curl_getinfo($ch); // timings, URL, content_type, header_size...
print_r($info);
Purpose: Separate and extract headers from the response when CURLOPT_HEADER is used.
Why now: You need header size info from curl_getinfo() right after exec.
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$headersRaw = substr($raw, 0, $headerSize);
$body = substr($raw, $headerSize);
Purpose: Turn raw headers into a usable key→value map (e.g., Retry-After).
Why now: Makes programmatic checks for rate limits or cookies easy and safe.
function parse_headers(string $rawHeaders): array {
    $headers = [];
    $lines = preg_split('/\r\n|\n|\r/', trim($rawHeaders));
    foreach ($lines as $i => $line) {
        if ($i === 0) { $headers['status_line'] = $line; continue; }
        if (strpos($line, ':') !== false) {
            [$k, $v] = explode(':', $line, 2);
            $headers[trim($k)] = trim($v);
        }
    }
    return $headers;
}

$hdrs = parse_headers($headersRaw);
Example parsed headers (sample):
[
    'status_line' => 'HTTP/1.1 200 OK',
    'Content-Type' => 'application/json; charset=utf-8',
    'Retry-After' => '30',
    'Set-Cookie' => 'session=abc123; Path=/; HttpOnly'
]
Purpose: Read server-directed retry periods and wait accordingly instead of blind retries.
Why now: If server tells you when to retry, obey it to be polite and avoid bans.
if ($httpCode === 429 && !empty($hdrs['Retry-After'])) {
    $retryAfter = (int)$hdrs['Retry-After'];
    sleep($retryAfter);
    // then retry per your backoff strategy
}
Purpose: Convert the response into structured data for use in your application.
Why now: After ensuring transport & status are OK, parse so you can act on the data.
$data = json_decode($body, true);
if (json_last_error() === JSON_ERROR_NONE) {
    print_r($data);
} else {
    // Not JSON: fall back to HTML parsing; @ suppresses libxml warnings on messy markup
    $dom = new DOMDocument();
    @$dom->loadHTML($body);
    // DOM parsing here
}
Purpose: Handle large payloads without exhausting memory.
Why now: If the response is big, streaming avoids loading everything into memory — do this instead of CURLOPT_RETURNTRANSFER for large files.
$fp = fopen(__DIR__ . '/large.bin', 'w');
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_exec($ch);
fclose($fp);
Purpose: Speed up many independent requests by parallelizing.
Why now: After you’re comfortable with single requests and backoff, move to parallelism; it requires prior understanding of error handling and throttling.
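PHP's built-in way to do this is the curl_multi API. A minimal sketch, assuming the placeholder URLs below stand in for your own endpoints:

<?php
$urls = [
    'https://jsonplaceholder.typicode.com/posts/1',
    'https://jsonplaceholder.typicode.com/posts/2',
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers; curl_multi_select() waits for socket activity instead of busy-looping.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh);
    }
} while ($active && $status === CURLM_OK);

// Collect each body, then release the handles.
$results = [];
foreach ($handles as $ch) {
    $results[] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

Keep concurrency modest and apply the same backoff and politeness rules as for single requests.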
Purpose: Free cURL handles and close file pointers.
Why now: Always do this at the end of an execution path to avoid leaks and resource exhaustion.
curl_close($ch);
Purpose: Centralize logic in a reusable fetchGet() helper (below) so you don’t duplicate retry, proxy, header, and parse logic.
Why now: After verifying the full flow, encapsulate it so future work is simpler and more reliable.
Robust helper: fetchGet()
<?php
function fetchGet(string $url, array $opts = []): array {
    $maxRetries = $opts['max_retries'] ?? 3;
    $timeout = $opts['timeout'] ?? 15;
    $includeHeaders = $opts['include_headers'] ?? false;
    $customHeaders = $opts['headers'] ?? [];
    $proxy = $opts['proxy'] ?? null;

    $attempt = 0;
    $wait = 1;
    $lastError = null;

    do {
        $ch = curl_init();
        $curlOpts = [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => $timeout,
            CURLOPT_CONNECTTIMEOUT => 5,
            CURLOPT_HEADER => $includeHeaders,
            CURLOPT_HTTPHEADER => $customHeaders,
            CURLOPT_USERAGENT => $opts['user_agent'] ?? 'MyApp/1.0 (+https://example.com)',
        ];

        if ($proxy && !empty($proxy['host'])) {
            $curlOpts[CURLOPT_PROXY] = $proxy['host'];
            if (!empty($proxy['user'])) {
                $curlOpts[CURLOPT_PROXYUSERPWD] = $proxy['user'] . ':' . ($proxy['pass'] ?? '');
            }
            if (!empty($proxy['type'])) {
                $curlOpts[CURLOPT_PROXYTYPE] = $proxy['type']; // e.g., CURLPROXY_SOCKS5
            }
        }

        curl_setopt_array($ch, $curlOpts);

        $raw = curl_exec($ch);
        $errno = curl_errno($ch);
        $error = $errno ? curl_error($ch) : null;
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
        curl_close($ch);

        $lastError = $error;

        if ($errno) {
            $attempt++; sleep($wait); $wait *= 2; continue;
        }

        $headers = [];
        $body = $raw;
        if ($includeHeaders && $raw !== false) {
            $headersRaw = substr($raw, 0, $headerSize);
            $body = substr($raw, $headerSize);
            $lines = preg_split('/\r\n|\n|\r/', trim($headersRaw));
            foreach ($lines as $i => $line) {
                if ($i === 0) { $headers['status_line'] = $line; continue; }
                if (strpos($line, ':') !== false) {
                    [$k, $v] = explode(':', $line, 2);
                    $headers[trim($k)] = trim($v);
                }
            }
        }

        if ($httpCode >= 200 && $httpCode < 300) {
            return ['http_code' => $httpCode, 'headers' => $headers, 'body' => $body, 'error' => null];
        }
        if ($httpCode >= 500 && $attempt < $maxRetries - 1) {
            $attempt++; sleep($wait); $wait *= 2; continue;
        }
        return ['http_code' => $httpCode, 'headers' => $headers, 'body' => $body, 'error' => $error];
    } while ($attempt < $maxRetries);

    return ['http_code' => 0, 'headers' => [], 'body' => null, 'error' => $lastError ?? 'Max retries reached'];
}
Example usage:
$result = fetchGet('https://jsonplaceholder.typicode.com/posts/1', [
    'include_headers' => true,
    'max_retries' => 3,
    'timeout' => 10,
    // 'proxy' => ['host' => 'proxy.goproxy.example:8000', 'user' => 'u', 'pass' => 'p', 'type' => CURLPROXY_HTTP],
]);
var_dump($result);
CURLOPT_TIMEOUT — total timeout in seconds
CURLOPT_CONNECTTIMEOUT — connection phase timeout
CURLOPT_FOLLOWLOCATION — follow redirects
CURLOPT_HTTPHEADER — custom headers
CURLOPT_USERAGENT — user agent string
CURLOPT_SSL_VERIFYPEER — keep true in production; use CURLOPT_CAINFO for custom CA bundles
CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE — persist cookies
CURLOPT_MAXREDIRS — limit number of redirects
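To see several of these together, here is a hedged sketch (the URL and file paths are placeholders; the CA bundle path varies by OS):

<?php
$ch = curl_init('https://api.example.com/items');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS      => 5,                                    // cap redirect chains
    CURLOPT_SSL_VERIFYPEER => true,                                 // keep TLS verification on
    CURLOPT_CAINFO         => '/etc/ssl/certs/ca-certificates.crt', // custom CA bundle if needed
    CURLOPT_COOKIEJAR      => __DIR__ . '/cookies.txt',             // write cookies on close
    CURLOPT_COOKIEFILE     => __DIR__ . '/cookies.txt',             // read cookies on start
]);
$response = curl_exec($ch);
curl_close($ch);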
HTTP does not strictly forbid a body with GET, but it’s uncommon and many servers ignore it. In PHP cURL, CURLOPT_POSTFIELDS typically sets the method to POST. You can force a GET with CURLOPT_CUSTOMREQUEST = 'GET', but avoid this unless you control both client and server.
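For illustration only, a minimal sketch of that override (the endpoint and body are placeholders):

$ch = curl_init('https://api.example.com/search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, '{"q":"cat toys"}'); // would normally switch the method to POST
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');           // forces the request line back to GET
$response = curl_exec($ch);
curl_close($ch);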
1. Enable verbose logging:
curl_setopt($ch, CURLOPT_VERBOSE, true);
$verbose = fopen('php://temp', 'w+');
curl_setopt($ch, CURLOPT_STDERR, $verbose);
After exec:
rewind($verbose);
echo stream_get_contents($verbose);
2. Log curl_getinfo($ch) for timings and status.
3. For TLS issues, use CURLOPT_CAINFO with a valid CA bundle; do not disable CURLOPT_SSL_VERIFYPEER in production.
Empty response: check $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE) and verbose logs.
SSL errors: check CA bundle and OpenSSL versions; use CURLOPT_CAINFO.
403/401: check Authorization header and IP allowlist.
Requests turning into POST: remove CURLOPT_POSTFIELDS.
Rate-limited (429): read Retry-After and back off.
Memory issues: stream with CURLOPT_FILE.
Resource leaks: always curl_close($ch).
Use headless browsers (Playwright, Puppeteer) for JS-heavy pages or when you must execute client-side code. For extreme scale or anti-bot measures, consider managed scraping services.
Q: Can I send a request body with GET?
A: Technically possible but unreliable — use query parameters instead.
Q: Where to store API keys?
A: Environment variables or secret managers; never commit to source.
Q: Why was I blocked when scraping?
A: Too many requests, missing User-Agent, no IP rotation, or violating robots.txt.
curl_init() is a compact, reliable way to make GET requests in PHP. For beginners, focus on the canonical flow (init → set url/options → exec → error-check → close), use query strings for GET parameters, and add timeouts and headers. If you require proxy routing or IP rotation, plug in GoProxy using CURLOPT_PROXY and CURLOPT_PROXYUSERPWD. Keep your scraping polite and legal. Get your free trial here!