Q22 of 38 · Performance

How do you reproduce and root-cause a flaky performance result?

Performance · Senior · performance · flake · diagnostics · variance · noisy-neighbour

Short answer

Check noisy-neighbour effects (shared infra, CI runner contention), GC pauses, connection-pool warm-up, and downstream rate limits. Re-run in a controlled environment (dedicated host, fixed time of day), capture full traces, and compare percentile shapes. A flaky result usually means external variance, not stochastic system behaviour.

Detail

First principle: a "flaky" perf result almost always has a deterministic cause. The variance is real, but it's coming from somewhere — find the somewhere.

Common sources of variance:

  • Noisy neighbours — shared CI runner, shared cloud host, other tests running simultaneously. Symptom: same test, same SHA, different result by time of day.
  • GC stop-the-world — JVM/V8 occasionally pauses for hundreds of ms. Symptom: p50 stable, p99 jumps unpredictably.
  • Cold cache / cold pool — the first run after a deploy is slower. Symptom: first run regresses, subsequent runs pass.
  • Downstream rate limits — vendor APIs (Stripe, Auth0, Algolia) have per-second limits. Symptom: 1 in 10 runs has a tail of 429-induced timeouts.
  • Network jitter: latency on the network path between the load tester and the system under test varies. Symptom: TCP-level latency spikes that show up in tcpdump but not in the application's own timings.
  • Background processes — cron jobs, log rotation, backup, OS updates. Symptom: regression every Monday at 3am.
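
The "p50 stable, p99 jumps" symptom above can be checked mechanically across repeated runs of the same SHA. The following is a minimal sketch, assuming each run is available as a plain list of latencies in milliseconds; the function names and the `stable` and `noisy` cut-offs are illustrative assumptions, not established values.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile of samples, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def spread(values):
    """Relative spread across runs: (max - min) / median."""
    return (max(values) - min(values)) / statistics.median(values)


def tail_only_variance(runs, stable=0.10, noisy=0.50):
    """True when p50 barely moves between runs but p99 moves a lot.

    The 10% and 50% cut-offs are hypothetical defaults; tune them
    against your own history of runs.
    """
    p50s = [percentile(run, 50) for run in runs]
    p99s = [percentile(run, 99) for run in runs]
    return spread(p50s) <= stable and spread(p99s) >= noisy


# Three runs of the same test: the second has a slow tail.
runs = [
    [10] * 98 + [12, 12],
    [10] * 98 + [300, 320],
    [10] * 98 + [13, 13],
]
tail_only = tail_only_variance(runs)
```

A True result is only a hint towards tail-specific causes such as pauses or throttling; it does not identify which one.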

Diagnostic playbook:

  1. Re-run on a dedicated runner, same time of day, in isolation. If variance disappears → noisy neighbour.
  2. Trace every request end to end and propagate the trace context to your APM. Compare a fast run's spans to a slow run's: where did the extra time go?
  3. Capture system metrics (CPU, network) on the load tester itself. If the test client is saturated, it is the bottleneck, not the system under test.
  4. Check the target env's logs for warnings during the slow runs — vendor 429s, slow query alerts, GC log lines.
  5. Compare percentile shapes — flaky-but-bimodal (mostly fast, sometimes very slow) vs. flaky-with-drift (slowly degrading) point to different causes.
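
Step 5 can be roughed out in code. This is a sketch under stated assumptions: latencies arrive as a list in request order, "drift" is approximated by comparing the median of the second half of the run to the first half, and "bimodal" by the share of requests far above the overall median. The function name and all three thresholds are hypothetical.

```python
import statistics


def classify_shape(latencies_ms, slow_factor=3.0, drift_factor=1.5,
                   min_slow_share=0.01):
    """Label one run as 'drift', 'bimodal', or 'stable'.

    latencies_ms must be in request order, otherwise the drift
    check is meaningless.
    """
    if len(latencies_ms) < 4:
        raise ValueError("need at least 4 samples to compare halves")

    half = len(latencies_ms) // 2
    first = statistics.median(latencies_ms[:half])
    second = statistics.median(latencies_ms[half:])
    if second > drift_factor * first:
        return "drift"  # slowly degrading over the run

    overall = statistics.median(latencies_ms)
    slow = sum(1 for x in latencies_ms if x > slow_factor * overall)
    if slow / len(latencies_ms) > min_slow_share:
        return "bimodal"  # mostly fast, sometimes very slow

    return "stable"


# Mostly 10 ms with an occasional 200 ms request, spread across the run.
spiky = [200 if i % 20 == 0 else 10 for i in range(100)]
# Latency climbing steadily from 10 ms to 109 ms.
climbing = [10 + i for i in range(100)]

spiky_label = classify_shape(spiky)
climbing_label = classify_shape(climbing)
```

A bimodal label suggests looking at intermittent events such as pauses, 429s, or contention, while drift suggests something that accumulates during the run; a histogram of the raw samples is still worth a look before trusting either label.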

Knowing when to give up: some genuine variance is irreducible (real-world latency floor, network weather). For that residue, the right response is wider thresholds, not more diagnosis. A rule of thumb: if you've spent two weeks on root cause and the variance is bounded and bearable, document it and move on.

Senior insight: don't quietly loosen SLOs to mask variance. Either fix the cause, or widen the test threshold explicitly and tag it as "tolerating variance from cause X" so it surfaces in retros.
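
One way to keep a widened threshold honest is to make the tolerance and its cause part of the check itself. The sketch below is illustrative only: `LatencyBudget`, its fields, and the numbers are assumptions, not an existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """A p99 target plus an explicitly justified tolerance."""
    slo_p99_ms: float
    tolerance_ms: float = 0.0
    tolerated_cause: str = ""

    def __post_init__(self):
        # Refuse a widened threshold that does not name its cause.
        if self.tolerance_ms > 0 and not self.tolerated_cause:
            raise ValueError("a widened threshold must name the cause it tolerates")

    def check(self, observed_p99_ms):
        if observed_p99_ms <= self.slo_p99_ms:
            return "pass"
        if observed_p99_ms <= self.slo_p99_ms + self.tolerance_ms:
            # Still green, but the report says why, so it shows up in retros.
            return f"pass (tolerating variance from {self.tolerated_cause})"
        return "fail"


budget = LatencyBudget(slo_p99_ms=250, tolerance_ms=40,
                       tolerated_cause="shared CI runner contention")
within_slo = budget.check(240)
within_tolerance = budget.check(280)
over_budget = budget.check(300)
```

The SLO value itself never changes here; only the test's pass band widens, and every run that relies on the extra band says so in its result.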

// WHAT INTERVIEWERS LOOK FOR

Deterministic-cause mindset, common variance categories, structured diagnosis playbook, and knowing when to stop diagnosing and live with bounded variance.

// COMMON PITFALL

Adding retries until the test is green. The flake is now invisible but the underlying variance still hits production users — it's just hidden from CI.
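
If retries are unavoidable, a less harmful variant keeps every attempt visible instead of reporting only the final green. A minimal sketch, assuming a hypothetical `measure` callable that runs the test once and returns a latency in milliseconds:

```python
def run_with_visible_retries(measure, threshold_ms, max_attempts=3):
    """Retry a perf check, but record every attempt.

    A run that only passes on a retry is reported as flaky rather
    than silently green.
    """
    attempts = []
    for _ in range(max_attempts):
        attempts.append(measure())
        if attempts[-1] <= threshold_ms:
            break

    passed = attempts[-1] <= threshold_ms
    return {
        "passed": passed,
        "flaky": passed and len(attempts) > 1,
        "attempts_ms": attempts,
    }


# Stand-in for a real measurement: slow first attempt, fast second.
samples = iter([310.0, 120.0])
result = run_with_visible_retries(lambda: next(samples), threshold_ms=200)
```

Tracking the rate of flaky results over time keeps the variance on the dashboard even while CI stays green.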