Load tests in CI: the honest version

qa.codes · 4 November 2025 · 8 min read

Advanced

performance-testing · ci-cd · k6 · opinion

The pitch: 'run load tests on every PR.' The reality: you'll have flaky thresholds in three days and disabled tests in two weeks. Here's what actually works for load testing in CI — and the four-tier strategy that survives contact with a real team.

Part of: Performance for QA engineers

Why "load tests on every PR" fails

The idea sounds reasonable: catch performance regressions early, before they reach production. If functional tests catch bugs in CI, why shouldn't performance tests catch slowness in CI?

The problem is the fundamental difference between functional results and performance results. A functional test is binary: the assertion either passes or fails, and the result is deterministic. A performance test produces a distribution: the p95 latency was 480ms this run and 520ms last run and 490ms the run before — depending on what else was running on the CI runner, whether the database had warm caches, whether a cron job happened to fire during the test window. Performance measurements are inherently noisy, and the noise source is the environment, not the code.

Absolute thresholds on noisy signals fail constantly. Set a p95 threshold at 500ms and a run that hit 502ms because the CI runner was under load fails the build. A developer sees a red CI that they didn't cause, reruns it, it passes, and they learn to ignore performance failures. That's the end of the programme — not a dramatic shutdown, just a slow accumulation of "eh, just rerun it" until the tests are disabled.
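To make the flip concrete, here is a minimal sketch in plain JavaScript. The three p95 values are hypothetical results for the same commit, and `limitMs` is the 500ms gate from the example above; nothing here is k6-specific.

```javascript
// Hypothetical p95 results (ms) from three CI runs of the SAME commit.
const p95PerRun = [480, 502, 490];
const limitMs = 500; // the absolute gate

// Absolute gate: a run passes only when its p95 is under the limit.
const verdicts = p95PerRun.map((p95) => (p95 < limitMs ? 'pass' : 'fail'));

// Run-to-run spread, relative to the fastest run.
const fastest = Math.min(...p95PerRun);
const slowest = Math.max(...p95PerRun);
const spreadPct = (((slowest - fastest) / fastest) * 100).toFixed(1);

console.log(verdicts);  // → [ 'pass', 'fail', 'pass' ]
console.log(spreadPct); // → 4.6
```

A spread under 5% is ordinary environmental noise, yet it is enough to turn the build red one run in three.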

The second problem is CI runner isolation. Load testing involves generating concurrent requests to measure how a service performs under load. On a shared CI runner, you're competing with other jobs for CPU and network. The load you can generate from a shared runner is not representative of real traffic. The latency you measure on a shared runner reflects both your code and whatever else happened to be running. The signal is real but fuzzy.

These are not solvable with better tooling. They're inherent to running performance tests in shared CI infrastructure. The strategy is to design around them, not fix them.

The four-tier strategy that survives

Tier 1: Smoke load test on every PR. Not a real load test — a catastrophic-regression check. One virtual user, 10–20 requests, loose thresholds. This catches disasters: an endpoint that now times out completely, a query that grew from 100ms to 30 seconds, a service that crashes under any load.

// k6 smoke test: runs fast, thresholds catch only catastrophes
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1,
  iterations: 15,
  thresholds: {
    http_req_duration: ['p(99)<5000'], // 5 seconds: only catches disasters
    http_req_failed: ['rate<0.1'],     // 10% error rate: catches crashes
  },
};

export default function () {
  // BASE_URL is supplied by CI, e.g. k6 run -e BASE_URL=<url> smoke.js
  const res = http.get(`${__ENV.BASE_URL}/api/users`);
  check(res, { 'status is 200': (r) => r.status === 200 });
}

This runs in under 30 seconds. It doesn't block PRs on noise. It catches regressions that are clearly wrong, not regressions that are subtly slower.

Tier 2: Full load test on a schedule. Real load, real concurrency, real duration — but not on every PR. Run on cron against a staging environment: nightly, or weekly for less-active services. Thresholds here are tighter and violations trigger investigation, not immediate CI failure. Results feed a dashboard that the team reviews periodically, not in the critical path of every merge.
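As a rough sketch of what a Tier 2 configuration might look like, the helper below builds a k6 options object with a ramp-up, a sustained hold, and a ramp-down. The helper name `buildLoadOptions`, the durations, the 50 virtual users, and the 400ms limit are all assumptions for illustration; pick numbers that match your own traffic.

```javascript
// Hypothetical helper: builds k6 options for a scheduled (Tier 2) run.
function buildLoadOptions(peakVus, p95LimitMs) {
  return {
    stages: [
      { duration: '2m', target: peakVus },  // ramp up to peak concurrency
      { duration: '10m', target: peakVus }, // hold at peak
      { duration: '2m', target: 0 },        // ramp down
    ],
    thresholds: {
      // Tighter than the smoke test, because this run is not in the PR path.
      http_req_duration: [`p(95)<${p95LimitMs}`],
      http_req_failed: ['rate<0.01'],
    },
  };
}

// In a k6 script this would be: export const options = buildLoadOptions(50, 400);
const options = buildLoadOptions(50, 400);
console.log(options.thresholds.http_req_duration[0]); // → p(95)<400
```

k6 exits with a non-zero code when a threshold is crossed, so the scheduled job should report that result to the dashboard or a channel rather than block anything.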

Tier 3: Pre-release load gate. Before a significant release, run the full suite manually and review the results with the team. This is a deliberate checkpoint, not an automated gate. The benefit: humans with context review the results and decide whether the numbers are acceptable. A misconfigured threshold can't block a release, and a run that technically passed its thresholds can't auto-approve a regression the summary numbers fail to show.

Tier 4: Production traffic observation. Real users, real infrastructure, real concurrency. The only fully representative "load test" is production. Distributed tracing and APM tools (Datadog, Honeycomb, New Relic) provide p95/p99 latency for real traffic. Alerting on production latency degradation is more reliable than any synthetic test because the environment is real. This tier doesn't use a load testing tool — it uses your observability stack.

The threshold-tuning problem — and the trick that works

Whatever absolute threshold you set, it will eventually need to be updated. Your service gets faster after a query optimisation. You add a feature that makes an endpoint slightly slower. A dependency update changes the performance profile. The p95 latency that was reliably 200ms is now reliably 280ms for legitimate reasons.

When thresholds need to be manually updated every few weeks, they become a maintenance burden. When they're not updated, they either block valid changes (the threshold is now too tight for the new baseline) or provide no signal (the threshold has been loosened until only catastrophes trip it).

The approach that avoids this: relative thresholds.

Instead of "p95 must be under 500ms," assert "p95 must not be more than 20% higher than the previous baseline." Relative thresholds absorb gradual drift and flag sudden changes — which is the signal you actually want. A service that's been running at 300ms p95 and is now at 310ms (3% increase) is probably fine. A service at 300ms that's now at 420ms (40% increase) probably broke something.
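The arithmetic behind those two cases, as a small sketch. The function name `checkAgainstBaseline` and its return shape are illustrative, not part of any tool.

```javascript
// Compare the current p95 to a stored baseline using a relative tolerance.
function checkAgainstBaseline(baselineMs, currentMs, tolerance = 0.20) {
  const change = (currentMs - baselineMs) / baselineMs;
  return {
    changePct: Math.round(change * 1000) / 10, // percent, one decimal place
    regressed: change > tolerance,
  };
}

console.log(checkAgainstBaseline(300, 310)); // → { changePct: 3.3, regressed: false }
console.log(checkAgainstBaseline(300, 420)); // → { changePct: 40, regressed: true }
```

The same 20% tolerance keeps working if the baseline later moves to 280ms or 350ms, which is the point of making the check relative.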

k6's native threshold system is absolute, but relative thresholds are implementable by reading the baseline at test time:

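import http from 'k6/http';

// Read the baseline from the environment (exported by CI from a previous run)
const parsed = parseFloat(__ENV.BASELINE_P95);
const baselineP95 = Number.isFinite(parsed) ? parsed : 500; // fall back if unset or malformed
const allowedDrift = 0.20; // 20% tolerance

export const options = {
  thresholds: {
    // 20% above baseline; recalibrates automatically as the baseline updates
    http_req_duration: [`p(95)<${Math.ceil(baselineP95 * (1 + allowedDrift))}`],
  },
};

export default function () {
  http.get(`${__ENV.BASE_URL}/api/users`);
}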

This requires storing baseline results between runs: a JSON file committed to the repository, a CI variable or cache entry, or a row in a simple database. The setup overhead is real, but it removes the manual threshold-tuning cycle.
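One way to produce that stored baseline is k6's `handleSummary` hook, which receives the end-of-test metrics and returns a map of file names to contents. The sketch below assumes the default summary stats (which include `p(95)`) and a hypothetical output file called `baseline.json`. It is written as a plain function so it can be exercised outside k6; in a real script it needs the `export` keyword in front.

```javascript
// Sketch: write this run's p95 to a file the next run can read as its baseline.
// In a k6 script, declare it as: export function handleSummary(data) { ... }
function handleSummary(data) {
  const p95 = data.metrics.http_req_duration.values['p(95)'];
  return {
    // 'baseline.json' is a hypothetical file name; CI decides where it is kept.
    'baseline.json': JSON.stringify({ p95: Math.round(p95) }, null, 2),
  };
}

// Exercising it with a hand-made summary object shaped like k6's:
const fakeSummary = {
  metrics: { http_req_duration: { values: { 'p(95)': 311.6 } } },
};
console.log(handleSummary(fakeSummary)['baseline.json']);
```

A single run is itself noisy, so it is probably safer to update the baseline only from scheduled Tier 2 runs, or from a median over the last few of them, than from every PR run.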

Tooling that fits each tier

Tier 1 (smoke, every PR): k6 open-source, on your existing CI runner. k6 ships as a single Go binary that starts fast, and one virtual user puts minimal load on the CI machine. Under 30 seconds total.

Tier 2 (full, scheduled): k6 open-source or k6 Cloud for distributed generation. Run against a dedicated staging environment, not the environment shared with functional tests. Output results to a Grafana dashboard or structured JSON for async review.

Tier 3 (pre-release, manual): same tool as Tier 2. The ceremony is the process around the tool, not the tool itself — someone runs the suite, someone reviews the results, someone makes a go/no-go call.

Tier 4 (production): your existing observability stack. No additional tooling unless you don't have APM yet, in which case this tier is the argument for adding it.

Performance flakes are functional flakes' meaner cousin

The reason performance tests in CI lose team trust faster than functional tests is that the causal chain is longer and more ambiguous. When a functional test fails, the failure usually points at the code that just changed. When a performance threshold fails, the cause might be the code, a hot CI runner, a cold database cache, a background job, or a threshold that was wrong for the current baseline.

Flaky tests cost you in developer trust, and the compounding interest on that trust loss is expensive. Performance tests in the PR merge path that fail unpredictably cost more trust than they save in regression prevention — unless the Tier 1 thresholds are loose enough that they only fire on genuine catastrophes.

Keep the catastrophe check in the PR path. Run the real signal on a schedule. Trust the signal; don't automate the response to noise.
