Q22 of 38 · Performance
How do you reproduce and root-cause a flaky performance result?
Short answer
Check for noisy-neighbour effects (shared infra, CI runner contention), GC pauses, connection-pool warm-up, and downstream rate limits. Re-run in a controlled environment (dedicated host, fixed time of day), capture full traces, and compare percentile shapes. Flaky usually means external variance, not stochastic system behaviour.
Detail
First principle: a "flaky" perf result almost always has a deterministic cause. The variance is real, but it's coming from somewhere — find the somewhere.
Common sources of variance:
- Noisy neighbours — shared CI runner, shared cloud host, other tests running simultaneously. Symptom: same test, same SHA, different result by time of day.
- GC stop-the-world — JVM/V8 occasionally pauses for hundreds of ms. Symptom: p50 stable, p99 jumps unpredictably.
- Cold cache / cold pool — the first run after a deploy is slower. Symptom: first run regresses, subsequent runs pass.
- Downstream rate limits — vendor APIs (Stripe, Auth0, Algolia) have per-second limits. Symptom: 1 in 10 runs has a tail of 429-induced timeouts.
- Network jitter — latency on the network path between the load tester and the system varies. Symptom: TCP-level latency spikes you can see in tcpdump but not in the app.
- Background processes — cron jobs, log rotation, backup, OS updates. Symptom: regression every Monday at 3am.
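Several of the symptoms above are statements about how percentiles move across repeated runs, for example "p50 stable, p99 jumps". A minimal sketch of that check, assuming you have raw latency samples per run; the function names and the `stable` and `jumpy` cut-offs are illustrative assumptions, not values from any tool:

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latencies (pct in 1..100)."""
    ordered = sorted(samples)
    # Nearest rank: the smallest sample with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def spread(values):
    """Relative spread across runs: (max - min) / median."""
    return (max(values) - min(values)) / statistics.median(values)


def summarize_runs(runs):
    """runs: dict mapping a run name to its list of latencies in ms."""
    p50s = [percentile(samples, 50) for samples in runs.values()]
    p99s = [percentile(samples, 99) for samples in runs.values()]
    return {"p50_spread": spread(p50s), "p99_spread": spread(p99s)}


def looks_tail_only(summary, stable=0.10, jumpy=0.50):
    """True when the median barely moves between runs but the tail moves a lot.

    The cut-offs are assumed defaults; tune them to your own baseline.
    """
    return summary["p50_spread"] <= stable and summary["p99_spread"] >= jumpy


# Hypothetical data: same test, same SHA, two runs.
runs = {
    "run_a": [10] * 99 + [12],
    "run_b": [10] * 98 + [400, 450],
}
summary = summarize_runs(runs)
tail_only = looks_tail_only(summary)
```

A tail-only pattern is consistent with GC pauses or rate-limit timeouts, but it does not prove either; it only narrows where to look next.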
Diagnostic playbook:
- Re-run on a dedicated runner, same time of day, in isolation. If variance disappears → noisy neighbour.
- Trace every request end to end — propagate trace context so each request's spans reach your APM. Compare a fast run's spans to a slow run's. Where did the time go?
- Capture system metrics on the load tester — CPU, network. If the test client is maxed out, it is the bottleneck, not the system under test.
- Check the target env's logs for warnings during the slow runs — vendor 429s, slow query alerts, GC log lines.
- Compare percentile shapes — flaky-but-bimodal (mostly fast, sometimes very slow) vs. flaky-with-drift (slowly degrading) point to different causes.
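The last step, telling bimodal from drifting results, can be roughed out in code. The sketch below is a heuristic under stated assumptions: it takes one p99 value per run in chronological order, and the `stable_ratio`, `gap_ratio`, and `drift_r` cut-offs are made-up defaults rather than established constants.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation, written out to avoid depending on newer stdlib helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def classify_shape(p99_by_run, stable_ratio=0.10, gap_ratio=0.6, drift_r=0.8):
    """Label a chronological series of per-run p99 latencies (ms)."""
    ordered = sorted(p99_by_run)
    value_range = ordered[-1] - ordered[0]
    # Small range relative to the median: nothing flaky to explain.
    if value_range <= stable_ratio * statistics.median(ordered):
        return "stable"
    # Bimodal: one gap between neighbouring sorted values covers most of the range.
    largest_gap = max(b - a for a, b in zip(ordered, ordered[1:]))
    if largest_gap >= gap_ratio * value_range:
        return "bimodal"
    # Drift: latency correlates strongly with run order.
    r = pearson(list(range(len(p99_by_run))), p99_by_run)
    if abs(r) >= drift_r:
        return "drift"
    return "unclear"


# Hypothetical series, one p99 per run.
bimodal = classify_shape([100, 102, 101, 400, 99, 100, 410, 101])
drift = classify_shape([100, 110, 120, 130, 140, 150, 160, 170])
stable = classify_shape([100, 101, 99, 100, 102, 100])
```

The bimodal check runs first on purpose: a handful of very slow runs clustered late in the series would otherwise register as drift.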
Knowing when to give up: some genuine variance is irreducible (real-world latency floor, network weather). The right response is wider thresholds, not more diagnosis. The line: if you've spent two weeks on root cause and the variance is bounded and bearable, document it and move on.
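If you do settle on a wider threshold, it helps to derive it from the measured variance rather than picking a round number. One possible sketch, assuming a set of baseline p99 values from known-good runs; using median plus `k` times the median absolute deviation (MAD) is a choice made here for illustration, and `k=3.0` is an assumed default:

```python
import statistics


def variance_tolerant_threshold(baseline_p99s, k=3.0):
    """Threshold = median + k * MAD of the baseline p99s (all in ms).

    MAD is used instead of standard deviation so that a single outlier run
    in the baseline does not inflate the threshold.
    """
    med = statistics.median(baseline_p99s)
    mad = statistics.median([abs(v - med) for v in baseline_p99s])
    return med + k * mad


# Hypothetical baseline from five runs on a dedicated runner.
baseline = [200, 210, 190, 205, 195]
threshold = variance_tolerant_threshold(baseline)
```

With this baseline the median is 200 ms and the MAD is 5 ms, so the threshold comes out at 215 ms.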
Senior insight: don't quietly loosen SLOs to mask variance. Either fix the cause, or widen the test threshold explicitly and tag it as "tolerating variance from cause X" so it surfaces in retros.
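One way to make such a tag hard to forget is to store it next to the threshold itself. The sketch below is purely illustrative: `PerfThreshold`, `tolerated_cause`, and `retro_items` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PerfThreshold:
    metric: str
    limit_ms: float
    # Set only when the limit was widened on purpose; None means no known tolerance.
    tolerated_cause: Optional[str] = None


def retro_items(thresholds):
    """List every threshold that is knowingly tolerating variance."""
    return [
        f"{t.metric}: tolerating variance from {t.tolerated_cause} "
        f"(limit {t.limit_ms:g} ms)"
        for t in thresholds
        if t.tolerated_cause
    ]


# Hypothetical thresholds for two endpoints.
thresholds = [
    PerfThreshold("search_p99", 300.0),
    PerfThreshold("checkout_p99", 900.0, tolerated_cause="vendor rate limits"),
]
items = retro_items(thresholds)
```

Because the cause sits in the same record as the limit, anyone who changes the number also sees why it was widened.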