How do you reproduce and root-cause a flaky performance result?

Question

Accepted Answer

Check noisy-neighbour effects (shared infra, CI runner contention), GC pauses, connection-pool warm-up, downstream rate limits. Re-run with a controlled environment (dedicated host, fixed time of day), capture full traces, compare percentile shapes. Flaky usually means external variance, not stochastic system behaviour.

First principle: a "flaky" perf result almost always has a deterministic cause. The variance is real, but it's coming from somewhere; find the somewhere.

Common sources of variance:

Noisy neighbours: shared CI runner, shared cloud host, other tests running simultaneously. Symptom: same test, same SHA, different result by time of day.

GC stop-the-world: JVM/V8 occasionally pauses for hundreds of ms. Symptom: p50 stable, p99 jumps unpredictably.

Cold cache / cold pool: the first run after a deploy is slower. Symptom: first run regresses, subsequent runs pass.

Downstream rate limits: vendor APIs (Stripe, Auth0, Algolia) have per-second limits. Symptom: 1 in 10 runs h
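
The advice to compare percentile shapes can be made mechanical. The sketch below is one possible way to do it, not a standard tool: it takes latency samples from several runs of the same test and reports whether only the tail moves (the "p50 stable, p99 jumps" symptom) or the whole distribution shifts between runs. The function names, the labels, and the 10% tolerance are all assumptions made for illustration; a real threshold should come from the variance you measure on a known-quiet host.

```python
import statistics


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, so no floating-point rounding surprises.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def relative_spread(values):
    """(max - min) / median of the values; 0.0 when the median is zero."""
    median = statistics.median(values)
    if median == 0:
        return 0.0
    return (max(values) - min(values)) / median


def classify_runs(runs, tolerance=0.10):
    """Label run-to-run variance by which percentiles move.

    runs: list of runs, each a list of latency samples (any one unit).
    tolerance: hypothetical threshold on relative spread, here 10%.
    """
    p50s = [percentile(run, 50) for run in runs]
    p99s = [percentile(run, 99) for run in runs]
    if relative_spread(p50s) > tolerance:
        # The median itself differs between runs: the whole shape shifted.
        return "body-shift"
    if relative_spread(p99s) > tolerance:
        # Median is steady but the tail is not.
        return "tail-only"
    return "stable"


if __name__ == "__main__":
    quiet = [10] * 100
    paused = [10] * 98 + [400, 400]  # two long stalls in an otherwise quiet run
    print(classify_runs([quiet, paused]))
```

A "tail-only" label points toward pause-like causes such as GC stop-the-world, while "body-shift" points toward causes that slow every request, such as a contended host or a cold cache. These labels are heuristics for deciding where to look first; the traces still have to confirm the cause.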
