Trend Analysis and Performance Regression Detection


A single test run is a snapshot. It tells you whether the system met its SLAs today. A trend across multiple runs tells you whether performance is stable, improving, or slowly degrading — and it catches regressions that a single snapshot cannot see.

Storing baselines in git

The simplest approach: commit a JSON summary from a known-good run and compare future runs against it.

# Establish baseline
k6 run --summary-export=baselines/load-test-baseline.json tests/load-test.js
 
# Commit the baseline
git add baselines/load-test-baseline.json
git commit -m "perf: update load test baseline (p95=210ms)"

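The --summary-export flag writes the final aggregated metrics to a JSON file without streaming every individual sample. It covers the same metrics as handleSummary's data parameter, but in an older, flatter layout: statistics such as p(95) sit directly under each metric rather than under a values key. The k6 documentation steers new scripts towards handleSummary, so check which layout a file uses before reading values from it.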
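
Because a baseline file may come from either mechanism, it helps to read statistics through one small accessor. The following is a minimal sketch, assuming the --summary-export file stores statistics directly under each metric while handleSummary data nests them under values. The helper name readStat and the sample objects are hypothetical, not part of k6.

```javascript
// Hypothetical helper: read one statistic (for example 'p(95)') for a metric
// from a k6 summary object, whichever of the two layouts the file uses.
function readStat(summary, metric, stat) {
  const entry = summary?.metrics?.[metric];
  if (!entry) return undefined;
  // handleSummary layout:            metrics.<name>.values['p(95)']
  // Assumed --summary-export layout: metrics.<name>['p(95)']
  const stats = entry.values ?? entry;
  const value = stats[stat];
  return typeof value === 'number' ? value : undefined;
}

// Illustrative data only, not real test output.
const exported = { metrics: { http_req_duration: { 'p(95)': 210 } } };
const fromHandleSummary = {
  metrics: { http_req_duration: { values: { 'p(95)': 215 } } },
};

console.log(readStat(exported, 'http_req_duration', 'p(95)'));          // 210
console.log(readStat(fromHandleSummary, 'http_req_duration', 'p(95)')); // 215
```

Returning undefined for a missing metric lets the caller decide whether a missing value should fail the build or be skipped.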

Comparing runs in CI

A GitHub Actions workflow that compares the current run against the committed baseline:

- name: Run load test
  uses: grafana/k6-action@v0.3.1
  with:
    filename: tests/load-test.js
  env:
    K6_SUMMARY_EXPORT: current-run.json
 
- name: Compare against baseline
  run: |
    node scripts/compare-baseline.js \
      baselines/load-test-baseline.json \
      current-run.json \
      --tolerance 0.20

The comparison script (scripts/compare-baseline.js) checks whether current metrics are within 20% of the baseline:

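const fs = require('fs');

// Usage: node compare-baseline.js <baseline.json> <current.json> [--tolerance 0.20]
const baseline = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
const current  = JSON.parse(fs.readFileSync(process.argv[3], 'utf8'));

// The shell passes "--tolerance" and its value as two separate arguments.
const flagIndex = process.argv.indexOf('--tolerance');
const tolerance = (flagIndex !== -1 && parseFloat(process.argv[flagIndex + 1])) || 0.20;

// Only trend metrics expose p(95); counters such as http_reqs do not.
const metrics = ['http_req_duration'];
let hasRegression = false;

// Accept both layouts: handleSummary nests stats under "values", --summary-export does not.
const p95Of = (summary, metric) => {
  const entry = summary.metrics?.[metric];
  return (entry?.values ?? entry)?.['p(95)'];
};

for (const metric of metrics) {
  const baseP95 = p95Of(baseline, metric);
  const currP95 = p95Of(current, metric);

  if (baseP95 && currP95 && currP95 > baseP95 * (1 + tolerance)) {
    console.error(`REGRESSION: ${metric} p95 was ${baseP95.toFixed(0)}ms, now ${currP95.toFixed(0)}ms (${((currP95/baseP95 - 1) * 100).toFixed(1)}% slower)`);
    hasRegression = true;
  }
}

process.exit(hasRegression ? 1 : 0);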
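
Latency regresses when it goes up, while throughput regresses when it goes down, so one "greater than" check does not fit both kinds of metric. Below is a minimal sketch of a direction-aware check, assuming handleSummary-style data in which http_reqs exposes a rate value. The names CHECKS, isRegression and findRegressions are hypothetical, and the numbers are illustrative.

```javascript
// Hypothetical table: which statistic to compare and which direction is worse.
const CHECKS = [
  { metric: 'http_req_duration', stat: 'p(95)', higherIsWorse: true },
  { metric: 'http_reqs',         stat: 'rate',  higherIsWorse: false },
];

// True when `current` is worse than `baseline` by more than `tolerance` (0.20 = 20%).
function isRegression(baseline, current, tolerance, higherIsWorse) {
  return higherIsWorse
    ? current > baseline * (1 + tolerance)
    : current < baseline * (1 - tolerance);
}

// Returns a list such as ['http_reqs rate'] naming every regressed check.
function findRegressions(baselineSummary, currentSummary, tolerance) {
  const found = [];
  for (const { metric, stat, higherIsWorse } of CHECKS) {
    const base = baselineSummary.metrics[metric]?.values?.[stat];
    const curr = currentSummary.metrics[metric]?.values?.[stat];
    // Skip checks whose metric is missing from either file.
    if (typeof base !== 'number' || typeof curr !== 'number') continue;
    if (isRegression(base, curr, tolerance, higherIsWorse)) {
      found.push(`${metric} ${stat}`);
    }
  }
  return found;
}

// Illustrative numbers only: latency is 15% slower (within tolerance),
// throughput is 30% lower (outside tolerance).
const base = { metrics: {
  http_req_duration: { values: { 'p(95)': 200 } },
  http_reqs:         { values: { rate: 100 } },
} };
const curr = { metrics: {
  http_req_duration: { values: { 'p(95)': 230 } },
  http_reqs:         { values: { rate: 70 } },
} };

console.log(findRegressions(base, curr, 0.20)); // [ 'http_reqs rate' ]
```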

Automated regression detection inside the k6 script

Alternatively, embed the comparison inside handleSummary — no external script required:

import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
import { htmlReport } from 'https://raw.githubusercontent.com/benc-uk/k6-reporter/main/dist/bundle.js';

const BASELINE_P95_MS = 250;      // from last known-good run
const TOLERANCE = 1.20;           // multiplier: baseline plus 20%

export function handleSummary(data) {
  const currentP95 = data.metrics['http_req_duration']?.values['p(95)'];
  const regressionThreshold = BASELINE_P95_MS * TOLERANCE;

  let regressionAlert;
  if (typeof currentP95 !== 'number') {
    // No HTTP timings were recorded, so there is nothing to compare.
    // The REGRESSION: prefix is reused here so that CI still fails.
    regressionAlert = 'REGRESSION: http_req_duration p95 is missing from the summary';
  } else if (currentP95 > regressionThreshold) {
    regressionAlert = `REGRESSION: p95 is ${currentP95.toFixed(0)}ms, exceeds baseline ${BASELINE_P95_MS}ms + 20% tolerance (${regressionThreshold.toFixed(0)}ms)`;
  } else {
    regressionAlert = `OK: p95 is ${currentP95.toFixed(0)}ms, within baseline tolerance`;
  }

  return {
    'report.html': htmlReport(data),
    'regression-check.txt': regressionAlert,
    stdout: textSummary(data, { indent: ' ', enableColors: true }),
  };
}

In CI, read regression-check.txt and fail the pipeline if the file starts with the REGRESSION: prefix:

- name: Check for regression
  run: |
    if grep -q "^REGRESSION:" regression-check.txt; then
      cat regression-check.txt
      exit 1
    fi
    cat regression-check.txt

Trend dashboards in Grafana

When metrics are streamed to InfluxDB across multiple test runs, Grafana can show multi-run trend panels:

  • "p95 over last 30 test runs" — each data point is one test run's p95. A flat line means stability; a rising line means degradation.
  • "Throughput per build" — cross-reference with your deployment log to see whether each deploy maintained or changed RPS capacity.
  • "Error rate across weekly stress tests" — weekly stress test results on one panel; spot when the breaking point VU count changes.

To distinguish runs in Grafana, add a test run identifier tag when streaming:

k6 run \
  --out influxdb=http://localhost:8086/k6 \
  --tag testRun=$(date +%Y%m%d-%H%M) \
  --tag gitSha=$(git rev-parse --short HEAD) \
  tests/load-test.js

The testRun and gitSha tags appear on every metric point, making it possible to filter and compare individual runs in Grafana.
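
To make the filtering concrete, here is a hedged sketch of a helper that builds an InfluxQL query for one run's p95, the kind of query a Grafana panel could use. It assumes the InfluxDB v1 output stores each k6 metric as a measurement named after the metric, with the sample in a field called value and custom tags stored as InfluxDB tags. Verify those names against your own database. The function name p95QueryForRun is hypothetical.

```javascript
// Hypothetical helper: build an InfluxQL query that selects the p95 of
// http_req_duration for a single test run, identified by its testRun tag.
// The measurement name and the "value" field are assumptions about how the
// InfluxDB v1 output stores k6 metrics.
function p95QueryForRun(testRun) {
  // Accept only simple identifiers such as 20240101-1200, so the value can be
  // placed inside single quotes without any escaping.
  if (!/^[0-9A-Za-z_-]+$/.test(testRun)) {
    throw new Error(`Unexpected testRun value: ${testRun}`);
  }
  return (
    'SELECT percentile("value", 95) FROM "http_req_duration" ' +
    `WHERE "testRun" = '${testRun}'`
  );
}

console.log(p95QueryForRun('20240101-1200'));
```

In a dashboard, the literal run identifier would typically be replaced by a Grafana template variable so that one panel can switch between runs.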

When to update baselines

Baselines should be updated intentionally — not automatically overwritten on every passing run:

  • After a performance improvement: update the baseline to lock in the gain, then tighten the threshold
  • After a deliberate architectural change: a new database layer might change baseline latency; document and re-establish
  • Never automatically: auto-updating baselines on every passing run defeats the purpose — a gradual regression never triggers because the baseline moves with it

Treat baseline updates like dependency version bumps: intentional, reviewed, and merged via pull request.
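
The effect of auto-updating is easy to see in a small simulation. The sketch below is illustrative only: it assumes a p95 that gets 5% slower on every run, a 20% tolerance, and a hypothetical function name countFailures. It counts failing runs under a fixed baseline and under one that is overwritten after every pass.

```javascript
// Simulate a p95 that degrades by `growthPerRun` on every run and count how
// many runs fail the tolerance check under the chosen baseline policy.
function countFailures(startP95, growthPerRun, runs, tolerance, autoUpdate) {
  let baseline = startP95;
  let current = startP95;
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    current *= 1 + growthPerRun;
    if (current > baseline * (1 + tolerance)) {
      failures++;
    } else if (autoUpdate) {
      baseline = current; // a passing run overwrites the baseline
    }
  }
  return failures;
}

// 10 runs, each 5% slower than the previous one, checked with 20% tolerance.
console.log(countFailures(200, 0.05, 10, 0.20, true));  // 0: drift is never flagged
console.log(countFailures(200, 0.05, 10, 0.20, false)); // 7: flagged from run 4 onward
```

With the fixed baseline the fourth run (about 243ms against a 240ms limit) is the first to fail. With the moving baseline every run is compared against its immediate predecessor, so nothing ever fails.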

⚠️ Common mistakes

  • Auto-updating baselines on every CI run. If your CI workflow overwrites the baseline file after every passing run, a p95 that gets 5% slower on each of 10 consecutive runs produces 10 passing runs, even though the system ends up roughly 63% slower than where it started. Baselines must be updated manually and intentionally.
  • Using averages instead of percentiles for baselines. A baseline p(95) of 200ms with a tolerance of 20% flags anything above 240ms. A baseline average of 150ms looks similar but misses tail latency — the worst 5% of users might be at 800ms. Always baseline on p95 or p99.
  • Treating a threshold failure as the only signal. A slow-climb degradation of 5% per week takes about six weeks to cross a threshold set 30% above baseline (1.05^6 ≈ 1.34), and it never crosses one whose baseline is refreshed in the meantime. Add trend visualisation (Grafana) alongside thresholds. Thresholds catch step changes; trends catch gradual drift.
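
Gradual drift can also be checked numerically, not only by eye on a dashboard. One possible approach, sketched here under stated assumptions, is to fit a least-squares line through the p95 of recent runs and flag the series when the slope exceeds a limit. The function names, the sample history and the 2ms-per-run limit are all hypothetical.

```javascript
// Least-squares slope of `values` against run index 0, 1, 2, ...
// The result is in units of the metric per run (here: ms per run).
function slopePerRun(values) {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((sum, v) => sum + v, 0) / n;
  let num = 0;
  let den = 0;
  values.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

// Flag drift when the fitted slope exceeds a chosen limit in ms per run.
function isDrifting(p95History, maxSlopeMs) {
  return slopePerRun(p95History) > maxSlopeMs;
}

// Illustrative history: every run is within 20% of the first (200ms),
// so a threshold check passes each time, yet the trend is clearly upward.
const history = [200, 204, 209, 213, 218, 222];
console.log(slopePerRun(history).toFixed(1)); // 4.5
console.log(isDrifting(history, 2));          // true
```

A slope limit complements the tolerance check: the tolerance catches a single bad run, while the slope catches many slightly worse ones.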

🎯 Practice task

Build a baseline comparison workflow. 35 minutes.

Use https://test.k6.io.

  1. Write a k6 script with vus: 10, duration: '2m'. Add handleSummary that writes current-run.json using JSON.stringify(data, null, 2).
  2. Run the test. Copy current-run.json to baselines/load-test-baseline.json. Examine the file — find the http_req_duration metric and its p(95) value.
  3. Write a Node.js script compare.js (runs outside k6) that:
    • Reads both JSON files
    • Extracts p(95) from each
    • Prints PASS or REGRESSION based on 20% tolerance
    • Exits with code 1 on regression
  4. Run node compare.js baselines/load-test-baseline.json current-run.json. It should report PASS (same run).
  5. Artificially modify the baseline to have a lower p95 (e.g., divide by 2). Run the comparison again — verify it reports REGRESSION.
  6. Add --tag testRun=$(date +%Y%m%d) and --out json=points.json to your k6 run command. The end-of-test summary holds only aggregates, so open points.json to see how the tag appears on each metric point. Describe in a comment how you would use this tag in Grafana to filter for a specific run.
