Q15 of 38 · Performance
How do you isolate whether a slow response is the database, application, or network?
Short answer
Use distributed tracing to break the request into spans: DB query time, app compute time, and network legs. An APM tool (Datadog, New Relic) shows the per-span breakdown. Compare against baselines, check error logs for timeouts, and test each layer in isolation.
Detail
Isolating a slowdown to one of three layers (database, application, network) is a core diagnostic skill expected of senior engineers.
Step 1 — Trace the request. A modern APM (Datadog, New Relic, Honeycomb, or open-source Jaeger/Tempo with OpenTelemetry) breaks one request into spans. A typical trace might show: HTTP IN 1100ms → app.compute 50ms → db.query 950ms → http_response 100ms. Now you know it's the DB.
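To make the span reading concrete, here is a minimal sketch that picks the dominant span out of the example trace above. The Span class and dominant_span function are hypothetical names for illustration, not part of any APM SDK; a real tool reports this breakdown for you.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float


def dominant_span(spans, total_ms):
    """Return (name, share of total) for the longest child span."""
    slowest = max(spans, key=lambda s: s.duration_ms)
    return slowest.name, slowest.duration_ms / total_ms


# Child spans from the example trace; the request total is 1100 ms.
spans = [
    Span("app.compute", 50),
    Span("db.query", 950),
    Span("http_response", 100),
]

name, share = dominant_span(spans, total_ms=1100)
print(f"{name}: {share:.0%} of request time")  # db.query: 86% of request time
```

A span that owns most of the request time is where to look first; if no span dominates, the time may be spread across layers or hidden in gaps between spans.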
Step 2 — If no tracing, deduce by elimination. Hit the slow endpoint with a tool that strips the network: curl from the same host, then from a remote host. If local-curl is fast and remote-curl is slow, network. If both slow, server-side. If both fast, suspect the test client or the ingress.
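The elimination rule above can be written down as a small decision function. This is a sketch under assumptions: classify and slow_threshold_ms are hypothetical names, the threshold is whatever "slow" means for your endpoint, and the timings could come from something like curl -o /dev/null -s -w '%{time_total}\n' run on each host. The fourth branch (local slow, remote fast) is not covered by the rule and is reported as inconclusive.

```python
def classify(local_ms, remote_ms, slow_threshold_ms):
    """Map local and remote request timings to a likely culprit."""
    local_slow = local_ms >= slow_threshold_ms
    remote_slow = remote_ms >= slow_threshold_ms

    if local_slow and remote_slow:
        return "server-side"
    if remote_slow:
        # Fast on the host itself, slow from far away.
        return "network"
    if local_slow:
        # Not covered by the rule of thumb; re-measure before concluding.
        return "inconclusive"
    return "test client or ingress"


print(classify(local_ms=40, remote_ms=900, slow_threshold_ms=300))   # network
print(classify(local_ms=900, remote_ms=950, slow_threshold_ms=300))  # server-side
```

Take several samples per host before classifying, since a single request can be skewed by a cold cache or a one-off retransmit.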
Step 3 — Layer-isolation tests:
- Database: run the slow query directly via psql/mysql client. If the raw query is fast but the app's call is slow, it's the ORM/connection pool/N+1, not the DB itself.
- Application: profile the code path with a sampling profiler (py-spy, pprof, async-profiler). Look for hot functions or unexpected lock contention.
- Network: mtr for path latency, iperf3 for throughput, ss -i for socket-level retransmit counts.
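For the database bullet, one way to see "the raw query is fast but the app's call is slow" is to count how many queries a code path actually issues. The sketch below uses an in-memory SQLite database as a stand-in for a real server; the authors/books tables, the QueryCounter wrapper, and both functions are hypothetical and only illustrate the N+1 pattern.

```python
import sqlite3


class QueryCounter:
    """Wraps a connection and counts execute() calls made through it."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", [(1, 1, "x"), (2, 2, "y"), (3, 3, "z")])


def titles_n_plus_one(db):
    # One query for the authors, then one more query per author.
    titles = []
    for (author_id,) in db.execute("SELECT id FROM authors ORDER BY id").fetchall():
        rows = db.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        titles.extend(title for (title,) in rows)
    return titles


def titles_single_query(db):
    # Same result from a single joined query.
    rows = db.execute(
        "SELECT b.title FROM books b JOIN authors a ON a.id = b.author_id ORDER BY b.id"
    ).fetchall()
    return [title for (title,) in rows]


slow, fast = QueryCounter(conn), QueryCounter(conn)
assert titles_n_plus_one(slow) == titles_single_query(fast)
print(slow.count, fast.count)  # 4 1
```

Each individual query here is cheap, which is why running one of them by hand in a database client looks fast; against a real server every extra query also pays a network round trip, so the cost grows with the number of rows.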
Step 4 — Confirm with a controlled change. A bottleneck hypothesis is just a hypothesis until you change it and the system improves. Add an index, expand the pool, or move the service closer to the DB — measure latency before/after.
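A before/after comparison is easier to trust when it uses a tail percentile rather than a single request. The following is a minimal sketch: percentile uses the nearest-rank method, improvement is a hypothetical helper, and the latency lists are made-up illustrative values, not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of samples, for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def improvement(before, after, p=95):
    """Fractional reduction in the p-th percentile latency."""
    b = percentile(before, p)
    a = percentile(after, p)
    return (b - a) / b


# Illustrative latencies in ms, e.g. before and after adding an index.
before = [900, 950, 1000, 1100, 1200]
after = [120, 130, 150, 160, 400]

print(f"p95 before: {percentile(before, 95)} ms")          # p95 before: 1200 ms
print(f"p95 after:  {percentile(after, 95)} ms")           # p95 after:  400 ms
print(f"p95 reduction: {improvement(before, after):.0%}")  # p95 reduction: 67%
```

Collect both sample sets under comparable load; if the percentile does not move after the change, the hypothesis about the bottleneck was probably wrong.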
The senior insight: bottlenecks shift. You fix the database, the app becomes the bottleneck. You fix the app, the network becomes the bottleneck. Always re-measure after a fix; the next limit may surprise you.