Aug 7, 2025
Step-by-step guide to design, implement, benchmark, and operate parallel concurrent processing with code, heuristics, and runbooks.
Combine concurrency (overlap) and parallelism (true simultaneity) to speed up and harden workloads — this guide shows how to choose the right model, run reproducible benchmarks, implement safe workers and PCP-style orchestration, deploy with canaries and chaos tests, and operate with concrete metrics, heuristics and runbooks.
What you’ll be able to do: decide between concurrency and parallelism for your workload, run a reproducible benchmark that reports median and P95, implement safe workers (graceful shutdown, idempotency, retry/backoff), and deploy and operate with heartbeat/failover, monitoring, and an operator runbook.
Beginners: Seeking a safe starting point with clear examples.
Developers: Deciding between threads, async, or processes.
SREs / DevOps: Deploying high-availability job managers with monitoring.
Architects: Designing distributed, PCP-style orchestration systems.
Concurrency — tasks overlap in time; they can make progress together but are not necessarily executing at the precise same instant.
Analogy: juggling — you rapidly switch between balls.
Parallelism — tasks execute at the same time on different CPU cores or machines.
Analogy: multiple burners cooking different dishes simultaneously.
PCP-style orchestration — an application/infra approach that distributes concurrent managers (controllers) across nodes so managers and workers can migrate on failure, balance load, and centralize logs and control.
CPython: Global Interpreter Lock (GIL) limits CPU-bound threading.
Other Languages: Go (goroutines), Java (threads), Rust (async/threads), C# (tasks/threads) support true parallelism. For example, Go’s goroutines are lightweight and excel at both concurrency and parallelism.
| Language | Concurrency model | Parallelism support | GIL? |
| --- | --- | --- | --- |
| Python (CPython) | Threads / asyncio | Processes (ProcessPool) | Yes |
| Go | Goroutines (lightweight threads) | True multi-core | No |
| Java | Threads / ExecutorService | True multi-threaded | No |
| Rust | Async / Threads | True multi-threaded | No |
| C# (.NET) | Async / Tasks / Threads | True multi-threaded | No |
Use this table when choosing library patterns: e.g., choose threads/async for I/O in Python, processes for CPU-heavy Python tasks, or a language/runtime that fits your compute model.
If tasks wait mostly on network/disk → concurrency (async or thread pools).
If tasks spend most time computing → parallelism (process pools or distributed compute).
If you need high availability across machines → PCP-style orchestration (distributed managers + centralized logs).
Real systems often combine both: concurrency inside workers, parallelism across workers/nodes.
Start
├─► Is task I/O-bound? ──► Yes ──► Use async/thread pool
│ No
├─► Is task CPU-bound? ──► Yes ──► Use process pool / parallel compute
│ No
└─► Need HA across nodes? ──► Yes ──► PCP-style orchestration
Note: Install dependencies with pip install requests aiohttp. Replace httpbin.org with your endpoints and respect terms of service.
python
# I/O-bound example (ThreadPoolExecutor + requests.Session)
from concurrent.futures import ThreadPoolExecutor, as_completed
import time, requests
session = requests.Session() # connection pooling
sites = ["https://httpbin.org/delay/1"] * 6
def fetch(url):
r = session.get(url, timeout=10)
return url, r.status_code
start = time.time()
with ThreadPoolExecutor(max_workers=6) as ex:
futures = [ex.submit(fetch, s) for s in sites]
for fut in as_completed(futures):
url, status = fut.result()
print(url, status)
print("Elapsed:", time.time() - start)
Expect: roughly 1–2 s of wall time for six 1-second requests, since the network waits overlap.
python
# Async I/O example
import asyncio, aiohttp, time
async def fetch(session, url):
async with session.get(url) as r:
return url, r.status
async def main():
urls = ["https://httpbin.org/delay/1"] * 20
start = time.time()
async with aiohttp.ClientSession() as session:
tasks = [asyncio.create_task(fetch(session, u)) for u in urls]
for t in asyncio.as_completed(tasks):
url, status = await t
print(url, status)
print("Elapsed:", time.time() - start)
asyncio.run(main())
Tip: do not call blocking I/O inside async loops; use async libraries.
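If you must call a blocking function from async code (for example, a legacy client with no async variant), hand it to a thread so the event loop keeps running. A minimal sketch using asyncio.to_thread (Python 3.9+); blocking_lookup is a hypothetical stand-in for your blocking call:
python
import asyncio, time
def blocking_lookup(x):
    # Hypothetical blocking call (e.g., a legacy SDK or a slow DNS lookup).
    time.sleep(1)
    return x * 2
async def main():
    start = time.time()
    # Each blocking call runs in the default thread pool, so the event loop stays responsive.
    results = await asyncio.gather(*(asyncio.to_thread(blocking_lookup, i) for i in range(5)))
    print(results, "elapsed:", round(time.time() - start, 2))
asyncio.run(main())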
python
from concurrent.futures import ProcessPoolExecutor, as_completed
import time, math
def process_image(n):
# Simulate image processing: compute pixel transformations
s = 0.0
for i in range(1, n):
s += math.sqrt(i) * math.sin(i)
return s
if __name__ == "__main__":  # guard required for ProcessPoolExecutor on spawn-based platforms (Windows, macOS)
    chunks = [600_000] * 4  # tune chunk count to the number of cores
    start = time.time()
    with ProcessPoolExecutor() as ex:
        futures = [ex.submit(process_image, c) for c in chunks]
        for fut in as_completed(futures):
            print("Done chunk:", fut.result())
    print("Elapsed:", time.time() - start)
Expect: with 4 cores, total wall time is roughly the time of a single chunk (near-linear speedup).
When parallelism is applied to network-intensive workloads, combining it with distributed request routing across multiple IPs can improve both throughput and reliability.
Save as benchmark.py. It runs sequential, threaded and async tests REPS times and writes medians & P95 to CSV.
python
# benchmark.py
import time, csv, statistics, asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests, aiohttp
SITES = ["https://httpbin.org/delay/1"] * 20
REPS = 5
def seq_fetch():
start = time.time()
for u in SITES:
requests.get(u, timeout=15)
return time.time() - start
def threaded_fetch(max_workers=20):
start = time.time()
session = requests.Session()
with ThreadPoolExecutor(max_workers=max_workers) as ex:
futures = [ex.submit(session.get, u, timeout=15) for u in SITES]
for f in as_completed(futures):
_ = f.result()
return time.time() - start
async def fetch_async(session, url):
    # Read the body so the connection returns to the pool cleanly.
    async with session.get(url) as r:
        await r.read()
        return r.status
async def async_once():
    async with aiohttp.ClientSession() as session:
        start = time.time()
        tasks = [asyncio.create_task(fetch_async(session, u)) for u in SITES]
        for t in asyncio.as_completed(tasks):
            await t
        return time.time() - start
def run_sync(func, *args, reps=REPS, **kwargs):
times = []
for _ in range(reps):
times.append(func(*args, **kwargs))
return times
def run_async(reps=REPS):
times = []
for _ in range(reps):
times.append(asyncio.run(async_once()))
return times
def summarize(name, times):
med = statistics.median(times)
p95 = sorted(times)[min(len(times)-1, max(0, int(0.95*len(times))-1))]
return [name, len(times), med, p95, min(times), max(times)]
if __name__ == "__main__":
results = []
seq_times = run_sync(seq_fetch)
results.append(summarize("sequential", seq_times))
thread_times = run_sync(threaded_fetch, 20)
results.append(summarize("threaded_20", thread_times))
async_times = run_async()
results.append(summarize("asyncio", async_times))
with open("benchmark_results.csv", "w", newline="") as f:
w = csv.writer(f)
w.writerow(["mode", "reps", "median_s", "p95_s", "min_s", "max_s"])
w.writerows(results)
print("Results:", results)
1. pip install requests aiohttp
2. python benchmark.py
3. Open benchmark_results.csv, review medians and P95. Run multiple times; use medians.
threaded ≈ async ≪ sequential → I/O-bound; concurrency helps.
If threading ≈ sequential and ProcessPool speeds up CPU examples → GIL / CPU-bound behavior.
1. Run representative jobs (5–20). Measure CPU%, iowait, network latency, task duration.
2. Compute WaitTime (time waiting) and ServiceTime (time processing).
3. Classify: I/O-bound, CPU-bound, or mixed.
4. Define SLOs: throughput (jobs/sec), latency P50/P95/P99, availability (e.g., 99.9%).
Deliverable: profiling CSV + classification table + SLO doc (a minimal profiling sketch follows below).
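One rough way to estimate ServiceTime and WaitTime for a single job: CPU time approximates ServiceTime, and wall-clock time minus CPU time approximates time spent waiting. A minimal sketch, assuming run_job is a placeholder for your job function:
python
import time
def profile_job(run_job, *args):
    # ServiceTime ~ CPU time spent computing; WaitTime ~ wall time minus CPU time.
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    run_job(*args)
    service = time.process_time() - cpu_start
    wait = max(0.0, (time.perf_counter() - wall_start) - service)
    kind = "I/O-bound" if wait > service else "CPU-bound"
    return {"service_s": round(service, 3), "wait_s": round(wait, 3), "class": kind}
# Example with a hypothetical job:
# print(profile_job(requests.get, "https://httpbin.org/delay/1"))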
I/O worker: single-process async loop or thread pool; use connection pooling.
CPU worker: one process per core or process pool; avoid oversubscription.
Hybrid pipeline (concept only): decouple I/O and compute. I/O stage (async/threaded) writes work into a queue; CPU stage (process pool) consumes and processes. This lets you scale I/O and compute independently and keeps components simple.
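A minimal sketch of that hybrid idea (not a full queue-based pipeline): an I/O thread pool feeds a CPU process pool as results arrive. Here fetch_payload and crunch are hypothetical stage functions standing in for your download and compute steps:
python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
import math, time
def fetch_payload(i):
    # Hypothetical I/O stage: pretend to download a payload.
    time.sleep(0.2)
    return 200_000 + i
def crunch(n):
    # Hypothetical CPU stage: heavy computation on the payload.
    return sum(math.sqrt(k) for k in range(1, n))
if __name__ == "__main__":
    start = time.time()
    with ThreadPoolExecutor(max_workers=8) as io_pool, ProcessPoolExecutor() as cpu_pool:
        # I/O stage runs concurrently; each payload is handed to a process as soon as it arrives.
        io_futures = [io_pool.submit(fetch_payload, i) for i in range(8)]
        cpu_futures = [cpu_pool.submit(crunch, f.result()) for f in as_completed(io_futures)]
        for f in as_completed(cpu_futures):
            f.result()
    print("Elapsed:", round(time.time() - start, 2))
The point of the split is independent scaling: widen the thread pool if fetches dominate, add processes if the compute stage dominates.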
Managers run across nodes. Each manager controls workers and can be migrated to another node if the node fails.
Primary/secondary assignment per manager ensures failover.
Centralized logs / shared storage so any node can access job output.
Admin UI/CLI for start/stop/migrate/reassign.
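To make those pieces concrete, here is a hypothetical manager-assignment config (illustrative only, not tied to any particular product) showing primary/secondary node assignment, a failover timeout, and centralized log storage:
yaml
managers:
  - name: billing-manager
    primary_node: node-a
    secondary_node: node-b      # workers migrate here if node-a fails
    max_workers: 20
  - name: report-manager
    primary_node: node-b
    secondary_node: node-c
    max_workers: 10
failover_timeout_seconds: 30    # how long a missed heartbeat is tolerated before migration
log_dir: /shared/pcp/logs       # shared storage so any node can read job output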
Use job_id. Before processing: check processed_jobs store (DB/Redis). If exists → skip. After success, atomically mark processed. Prefer marking processed in the same transaction that commits side effects.
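A minimal sketch of the dedupe check using Redis (assumes the redis package and a running Redis instance; process_once and handler are hypothetical names):
python
import redis
r = redis.Redis(host="localhost", port=6379)
def process_once(job_id, handler, payload):
    # SET with nx=True succeeds only for the first writer, so duplicates are skipped.
    claimed = r.set(f"processed:{job_id}", "in_progress", nx=True, ex=24 * 3600)
    if not claimed:
        return "skipped_duplicate"
    try:
        handler(payload)                      # side effects happen here
        r.set(f"processed:{job_id}", "done", ex=24 * 3600)
        return "processed"
    except Exception:
        r.delete(f"processed:{job_id}")       # allow a retry after failure
        raise
This keeps the marker in a separate store; when your job's side effects land in a database, prefer marking the job processed inside that same transaction, as noted above.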
python
import signal, threading
stop_event = threading.Event()
def handle_sigterm(signum, frame):
    # Ask the worker to stop after the current job; don't kill it mid-job.
    stop_event.set()
signal.signal(signal.SIGTERM, handle_sigterm)
signal.signal(signal.SIGINT, handle_sigterm)  # also cover Ctrl+C in local runs
while not stop_event.is_set():
    process_one_job()  # your job-handling function goes here
# flush logs, checkpoint state, then exit
python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time, threading
start_http_server(8000) # exposes /metrics
JOB_COUNT = Counter("jobs_processed_total", "Jobs processed")
JOB_TIME = Histogram("job_processing_seconds", "Job processing time")
HEARTBEAT = Gauge("worker_heartbeat_unixtime", "Last heartbeat (unix time)")
def heartbeat_loop(interval=10):
while True:
HEARTBEAT.set(time.time())
time.sleep(interval)
threading.Thread(target=heartbeat_loop, daemon=True).start()
def process_job(job):
start = time.time()
# ... work ...
JOB_TIME.observe(time.time() - start)
JOB_COUNT.inc()
Blocking calls inside async loops → hidden stalls.
Oversubscribing threads/processes → context-switch heavy, worse latency.
Missing backpressure → queues grow without bound (a bounded-queue sketch follows below).
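Backpressure can be as simple as a bounded queue: producers block (or shed load) when consumers fall behind. A minimal sketch with queue.Queue; the produce and handle steps are hypothetical placeholders:
python
import queue, threading, time
work_q = queue.Queue(maxsize=100)             # bounded: put() blocks when consumers fall behind
def producer():
    for i in range(200):
        work_q.put({"id": i}, timeout=5)      # hypothetical produce step; blocking here is the backpressure
    work_q.put(None)                          # sentinel to stop the consumer
def consumer():
    while True:
        item = work_q.get()
        if item is None:
            break
        time.sleep(0.01)                      # hypothetical handle_item(item)
threading.Thread(target=producer, daemon=True).start()
consumer()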
Deploy to a small percentage (5–10%) of capacity. Monitor for 30–60 minutes: errors, queue backlogs, CPU, latency.
Use k6, locust, or your benchmark harness; mock external dependencies when possible.
Kill a primary manager node, verify failover within failover_timeout.
Simulate network partition or slow downstream to exercise backpressure and retry logic.
jobs_processed_total, jobs_failed_total, job_processing_seconds P50/P95/P99, queue_depth, worker CPU/memory, worker_heartbeat_unixtime.
N = λ × W
Worked example:
λ = 10 jobs/sec, W = 0.5 s → N = 10 × 0.5 = 5 workers.
Result: need N = 5 concurrent workers.
OptimalThreads ≈ Cores × (1 + WaitTime / ServiceTime)
Worked example:
Cores = 4.
From profiling: WaitTime = 200 ms, ServiceTime = 50 ms.
Compute ratio: WaitTime / ServiceTime = 200 / 50 = 4.
Add 1: 1 + 4 = 5.
Multiply by cores: 4 × 5 = 20.
Start value: ~20 threads; tune from there (a small helper that computes both formulas is sketched below).
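Both formulas fit in a small helper; the numbers below match the worked examples. Treat the outputs as starting points, not guarantees:
python
import os
def workers_for_throughput(arrival_rate_per_s, avg_latency_s):
    # Little's Law: N = lambda x W
    return arrival_rate_per_s * avg_latency_s
def io_thread_pool_size(wait_ms, service_ms, cores=None):
    # OptimalThreads ~= Cores x (1 + WaitTime / ServiceTime)
    cores = cores or os.cpu_count() or 1
    return round(cores * (1 + wait_ms / service_ms))
print(workers_for_throughput(10, 0.5))        # -> 5.0 concurrent workers
print(io_thread_pool_size(200, 50, cores=4))  # -> 20 threads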
1. Start with Little’s Law for worker count.
2. Use thread-sizing for I/O pools.
3. Increase concurrency until CPU/network saturates.
4. Use batching for tiny tasks (see the batching sketch below).
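For batching, submit chunks instead of individual tiny tasks so per-task overhead (pickling, scheduling) is amortized. A minimal sketch; tiny_op is a hypothetical per-item function:
python
from concurrent.futures import ProcessPoolExecutor
def tiny_op(x):
    return x * x                        # hypothetical per-item work
def process_batch(batch):
    return [tiny_op(x) for x in batch]  # one task per batch, not per item
def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]
if __name__ == "__main__":
    items = list(range(100_000))
    with ProcessPoolExecutor() as ex:
        results = []
        for out in ex.map(process_batch, chunked(items, 5_000)):
            results.extend(out)
    print(len(results))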
yaml
groups:
- name: worker_alerts
rules:
- alert: WorkerHeartbeatMissing
expr: time() - worker_heartbeat_unixtime > 30
for: 1m
labels: { severity: "critical" }
- alert: HighQueueDepth
expr: queue_depth > (avg_over_time(queue_depth[5m]) * 2)
for: 5m
labels: { severity: "warning" }
yaml
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
env:
- name: HEARTBEAT_INTERVAL
value: "10"
- name: MAX_WORKERS
value: "20"
Use Prometheus Adapter to expose queue_depth as a custom metric. Scale pods on queue_depth or queue_depth / replicas thresholds.
Example: scale when queue_depth > 100 or queue_depth / replicas > threshold.
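Assuming the Prometheus Adapter exposes queue_depth as a Pods metric, an HPA manifest could look roughly like this sketch (names such as worker-deployment and the 100 target are placeholders to tune):
yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker-deployment        # placeholder name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_depth        # exposed via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "100"      # scale out when queue_depth / replicas exceeds 100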
1. Inspect job_processing_seconds and error traces.
2. Check worker_cpu_percent and worker_memory_bytes. If CPU saturated → scale out or increase processes (if cores free).
3. If network/I/O saturated → reduce concurrency and check upstream rate limits.
4. If poison messages → move to DLQ and fix handler.
1. Check worker_heartbeat_unixtime across nodes.
2. Validate failover_timeout and manager logs.
3. Manually reassign managers via admin CLI.
4. Restart failed node and run root cause analysis.
1. Aggregate errors and check for common patterns.
2. For transient errors, retry with exponential backoff and jitter (see the sketch after this list).
3. For systemic errors, patch handler and redeploy.
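A minimal retry-with-backoff sketch for transient errors; call is a hypothetical callable, and the base delay and cap should be tuned to your SLOs:
python
import random, time
def retry_with_backoff(call, attempts=5, base=0.5, cap=30.0):
    # Exponential backoff with full jitter: sleep a random amount up to base * 2^attempt, capped.
    for attempt in range(attempts):
        try:
            return call()
        except Exception:                 # narrow this to your transient error types
            if attempt == attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
# Usage (hypothetical): retry_with_backoff(lambda: session.get(url, timeout=10))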
Scaling up may require additional infrastructure, careful load balancing (including across IP pools), and monitoring. Over-provisioning can waste resources; under-provisioning can cause delays or failures.
Idempotency tokens and dedupe storage.
Graceful shutdown handlers implemented.
Heartbeat & health checks.
Rate limiting / backpressure.
Dead-letter queue (DLQ) for poison messages.
Unit tests for idempotency and error paths.
Integration tests with queue and DB.
Load tests that mimic peak behavior.
Chaos tests (node kill, disk full).
Failover/failback tests.
Canary passed
Runbook & alerts configured
Recovery test completed
Q: When should I use concurrency vs parallelism?
A: If tasks wait on network/disk, use concurrency (async/threads). If tasks are CPU-bound, use parallelism (processes / distributed compute). Most systems combine both.
Q: How many threads should I start with for an I/O-bound pool?
A: Use Threads ≈ Cores × (1 + WaitTime/ServiceTime). Profile to get WaitTime and ServiceTime.
Q: Does Python threading give true parallelism for CPU-bound work?
A: No; the GIL limits CPU-bound threads. Use processes or other runtimes.
Q: What does PCP-style orchestration add?
A: PCP-style orchestration distributes managers across nodes for failover and load distribution; it centralizes logs and gives per-manager control.
Q: Are there legal or ethical limits on high-volume concurrent requests?
A: Respect robots.txt, rate limits, terms of service, and applicable laws.
This guide offers a step-by-step path to mastering parallel concurrent processing. From first scripts to distributed systems, it provides the concepts, code, and tools to optimize performance, ensure reliability, and scale efficiently.