Trend Analysis and Performance Regression Detection

A single test run is a snapshot. It tells you whether the system met its SLAs today. A trend across multiple runs tells you whether performance is stable, improving, or slowly degrading — and it catches regressions that a single snapshot cannot see.

Why trends matter

Trend patterns and what they mean

Pattern	What you see in Grafana / comparison data	What to do
Stable	p(95) consistently 180–210ms across 30 daily runs. Normal variance.	No action needed. Document the baseline. Continue monitoring.
Sudden jump (regression)	p(95) was 200ms Monday, 580ms Tuesday. Threshold fails on Tuesday.	Bisect Tuesday's commits to find the change that introduced the regression. Revert or fix it before production.
Slow climb (degradation)	p(95) goes 200ms → 220ms → 245ms → 280ms → 320ms over 5 weeks. Never crosses threshold.	This is the most dangerous pattern — no threshold trips, but performance is steadily worsening. Check for memory leaks, unbounded caches, growing tables.
Improvement	p(95) drops from 400ms to 180ms after an optimisation is deployed.	Update the baseline to lock in the improvement. Tighten the threshold to match the new normal.
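
The four patterns in the table can be approximated in code. The following is a minimal sketch, not a k6 feature: the function name classifyTrend and the cut-off values jumpRatio and driftRatio are assumptions chosen for illustration, and real data would need tuning for normal run-to-run variance.

```javascript
// Classify a series of p95 values (ms, oldest first) into one of the table's patterns.
// jumpRatio and driftRatio are illustrative cut-offs, not k6 settings.
function classifyTrend(p95Series, { jumpRatio = 1.5, driftRatio = 1.2 } = {}) {
  if (p95Series.length < 2) return 'insufficient data';

  // Track the largest upward and downward run-to-run change.
  let maxStep = 1;
  let minStep = 1;
  for (let i = 1; i < p95Series.length; i++) {
    const step = p95Series[i] / p95Series[i - 1];
    maxStep = Math.max(maxStep, step);
    minStep = Math.min(minStep, step);
  }
  // Total change from the first run to the last.
  const overall = p95Series[p95Series.length - 1] / p95Series[0];

  if (maxStep >= jumpRatio) return 'sudden jump';       // one big step up
  if (minStep <= 1 / jumpRatio) return 'improvement';   // one big step down
  if (overall >= driftRatio) return 'slow climb';       // small steps that add up
  return 'stable';
}

console.log(classifyTrend([200, 220, 245, 280, 320])); // slow climb
```

With the table's own numbers, no single step in the slow-climb series exceeds 15%, yet the series ends 60% above where it started, which is why it is classified by total drift rather than by any one run.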

Storing baselines in git

The simplest approach: commit a JSON summary from a known-good run and compare future runs against it.

# Establish baseline
k6 run --summary-export=baselines/load-test-baseline.json tests/load-test.js
 
# Commit the baseline
git add baselines/load-test-baseline.json
git commit -m "perf: update load test baseline (p95=210ms)"

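The --summary-export flag writes the final aggregated metrics to a JSON file without streaming every individual sample. The data is similar to handleSummary's data parameter, but the layout is not identical: the export stores each statistic directly on the metric (for example metrics.http_req_duration["p(95)"]), while handleSummary nests statistics under a values key. The k6 documentation recommends handleSummary for new setups.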

Comparing runs in CI

A GitHub Actions workflow that compares the current run against the committed baseline:

- name: Run load test
  uses: grafana/k6-action@v0.3.1
  with:
    filename: tests/load-test.js
  env:
    K6_SUMMARY_EXPORT: current-run.json
 
- name: Compare against baseline
  run: |
    node scripts/compare-baseline.js \
      baselines/load-test-baseline.json \
      current-run.json \
      --tolerance 0.20

The comparison script (scripts/compare-baseline.js) checks whether current metrics are within 20% of the baseline:

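const fs = require('fs');

// Usage: node compare-baseline.js <baseline.json> <current.json> [--tolerance 0.20]
const baseline = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
const current  = JSON.parse(fs.readFileSync(process.argv[3], 'utf8'));

// The shell passes "--tolerance" and its value as two separate arguments.
const flagIndex = process.argv.indexOf('--tolerance');
const tolerance = (flagIndex !== -1 && parseFloat(process.argv[flagIndex + 1])) || 0.20;

// handleSummary output nests statistics under "values"; --summary-export stores them directly on the metric.
const p95 = (summary, metric) => {
  const entry = summary.metrics?.[metric];
  return entry ? (entry.values ?? entry)['p(95)'] : undefined;
};

// Only trend metrics have a p(95); counters such as http_reqs do not.
const metrics = ['http_req_duration'];
let hasRegression = false;

for (const metric of metrics) {
  const baseP95 = p95(baseline, metric);
  const currP95 = p95(current, metric);

  if (baseP95 && currP95 && currP95 > baseP95 * (1 + tolerance)) {
    console.error(`REGRESSION: ${metric} p95 was ${baseP95.toFixed(0)}ms, now ${currP95.toFixed(0)}ms (${((currP95/baseP95 - 1) * 100).toFixed(1)}% slower)`);
    hasRegression = true;
  }
}

process.exit(hasRegression ? 1 : 0);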
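
Throughput needs the opposite comparison from latency: for http_reqs, a lower request rate is the regression, and the metric is a counter with a rate rather than a p(95). The following is a minimal sketch of a direction-aware check. The helper names readStat and findRegressions are hypothetical, and the code assumes the summary exposes rate for counters either under values (handleSummary) or directly on the metric.

```javascript
// Read one statistic from a summary, accepting both the handleSummary shape
// (metric.values.stat) and a flat shape (metric.stat).
function readStat(summary, metric, stat) {
  const entry = summary.metrics?.[metric];
  if (!entry) return undefined;
  return (entry.values ?? entry)[stat];
}

// Latency regresses upward, throughput regresses downward.
function findRegressions(baseline, current, tolerance = 0.20) {
  const checks = [
    { metric: 'http_req_duration', stat: 'p(95)', higherIsWorse: true },
    { metric: 'http_reqs', stat: 'rate', higherIsWorse: false },
  ];
  const regressions = [];
  for (const { metric, stat, higherIsWorse } of checks) {
    const base = readStat(baseline, metric, stat);
    const curr = readStat(current, metric, stat);
    // Skip metrics that are missing from either file.
    if (typeof base !== 'number' || typeof curr !== 'number') continue;
    const regressed = higherIsWorse
      ? curr > base * (1 + tolerance)
      : curr < base * (1 - tolerance);
    if (regressed) regressions.push({ metric, stat, base, curr });
  }
  return regressions;
}
```

A caller would exit with code 1 when the returned array is non-empty, in the same way the script above does for latency.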

Automated regression detection inside the k6 script

Alternatively, embed the comparison inside handleSummary — no external script required:

import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
import { htmlReport } from 'https://raw.githubusercontent.com/benc-uk/k6-reporter/main/dist/bundle.js';
 
const BASELINE_P95_MS = 250;      // from last known-good run
const TOLERANCE = 1.20;           // 20% regression tolerance
 
export function handleSummary(data) {
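  const currentP95 = data.metrics['http_req_duration']?.values?.['p(95)'];
  const regressionThreshold = BASELINE_P95_MS * TOLERANCE;

  // Treat a missing p(95) as a failure so the check cannot pass silently.
  let regressionAlert;
  if (typeof currentP95 !== 'number') {
    regressionAlert = 'REGRESSION: http_req_duration p(95) is missing from the summary';
  } else if (currentP95 > regressionThreshold) {
    regressionAlert = `REGRESSION: p95 is ${currentP95.toFixed(0)}ms, exceeds baseline ${BASELINE_P95_MS}ms + 20% tolerance (${regressionThreshold.toFixed(0)}ms)`;
  } else {
    regressionAlert = `OK: p95 is ${currentP95.toFixed(0)}ms, within baseline tolerance`;
  }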
 
  return {
    'report.html': htmlReport(data),
    'regression-check.txt': regressionAlert,
    stdout: textSummary(data, { indent: ' ', enableColors: true }),
  };
}

In CI, read regression-check.txt and fail the pipeline if it starts with the REGRESSION: prefix:

- name: Check for regression
  run: |
    if grep -q "^REGRESSION:" regression-check.txt; then
      cat regression-check.txt
      exit 1
    fi
    cat regression-check.txt

Trend dashboards in Grafana

When metrics are streamed to InfluxDB across multiple test runs, Grafana can show multi-run trend panels:

"p95 over last 30 test runs" — each data point is one test run's p95. A flat line means stability; a rising line means degradation.
"Throughput per build" — cross-reference with your deployment log to see whether each deploy maintained or changed RPS capacity.
"Error rate across weekly stress tests" — weekly stress test results on one panel; spot when the breaking point VU count changes.

To distinguish runs in Grafana, add a test run identifier tag when streaming:

k6 run \
  --out influxdb=http://localhost:8086/k6 \
  --tag testRun=$(date +%Y%m%d-%H%M) \
  --tag gitSha=$(git rev-parse --short HEAD) \
  tests/load-test.js

The testRun and gitSha tags appear on every metric point, making it possible to filter and compare individual runs in Grafana.
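
Filtering on the tag usually means adding a WHERE clause to the panel query. The following is a minimal sketch that builds such a query, assuming k6's InfluxDB v1 layout (one measurement per metric, a numeric "value" field, and run tags stored as InfluxDB tags). The function name p95QueryForRun and the run identifier are hypothetical.

```javascript
// Build an InfluxQL query that returns the p95 for a single tagged run.
// Assumes the measurement is named after the k6 metric and samples are in "value".
function p95QueryForRun(testRun, metric = 'http_req_duration') {
  // Escape single quotes so the tag value cannot break out of the string literal.
  const safeRun = String(testRun).replace(/'/g, "\\'");
  return `SELECT percentile("value", 95) FROM "${metric}" WHERE "testRun" = '${safeRun}'`;
}

console.log(p95QueryForRun('20240115-0930'));
// SELECT percentile("value", 95) FROM "http_req_duration" WHERE "testRun" = '20240115-0930'
```

In a Grafana panel, the same clause would normally be combined with the dashboard's time filter, and the run identifier would come from a dashboard variable rather than a hard-coded string.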

When to update baselines

Baselines should be updated intentionally — not automatically overwritten on every passing run:

After a performance improvement: update the baseline to lock in the gain, then tighten the threshold
After a deliberate architectural change: a new database layer might change baseline latency; document and re-establish
Never automatically: auto-updating baselines on every passing run defeats the purpose — a gradual regression never triggers because the baseline moves with it

Treat baseline updates like dependency version bumps: intentional, reviewed, and merged via pull request.

⚠️ Common mistakes

Auto-updating baselines on every CI run. If your CI workflow overwrites the baseline file after every passing run, a regression of 5% per run across 10 runs still looks like 10 passing runs, even though the last run is roughly 63% slower than the first. Baselines must be updated manually and intentionally.
Using averages instead of percentiles for baselines. A baseline p(95) of 200ms with a tolerance of 20% flags anything above 240ms. A baseline average of 150ms looks similar but misses tail latency — the worst 5% of users might be at 800ms. Always baseline on p95 or p99.
Treating a threshold failure as the only signal. The slow-climb degradation pattern, 5% worse per week, stays under a threshold set 30% above baseline for the first five weeks, and a slower drift stays hidden for longer. Add trend visualisation (Grafana) alongside thresholds. Thresholds catch step changes; trends catch gradual drift.
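
The first mistake can be demonstrated with a small simulation. The following is a sketch with illustrative numbers only (200ms starting p95, 5% growth per run, 20% tolerance). The function simulateDrift is hypothetical and models nothing beyond the comparison logic.

```javascript
// Simulate a steady drift and compare a fixed baseline with one that is
// overwritten after every passing run. All numbers are illustrative.
function simulateDrift({ startP95 = 200, growthPerRun = 0.05, runs = 10, tolerance = 0.20, autoUpdate }) {
  let baseline = startP95;
  let p95 = startP95;
  for (let run = 1; run <= runs; run++) {
    p95 *= 1 + growthPerRun;
    // Same rule as the comparison script: fail when above baseline + tolerance.
    if (p95 > baseline * (1 + tolerance)) {
      return { failedAtRun: run, finalP95: Math.round(p95) };
    }
    if (autoUpdate) baseline = p95; // the baseline follows the drift
  }
  return { failedAtRun: null, finalP95: Math.round(p95) };
}

console.log(simulateDrift({ autoUpdate: true }));  // { failedAtRun: null, finalP95: 326 }
console.log(simulateDrift({ autoUpdate: false })); // { failedAtRun: 4, finalP95: 243 }
```

With the auto-updated baseline, every run is only 5% above the previous one, so nothing fails while p95 climbs from 200ms to about 326ms. With the fixed baseline, the fourth run fails at about 243ms.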

🎯 Practice task

Build a baseline comparison workflow. 35 minutes.

Use https://test.k6.io.

Write a k6 script with vus: 10, duration: '2m'. Add handleSummary that writes current-run.json using JSON.stringify(data, null, 2).
Run the test. Copy current-run.json to baselines/load-test-baseline.json. Examine the file — find the http_req_duration metric and its p(95) value.
Write a Node.js script compare.js (runs outside k6) that:
- Reads both JSON files
- Extracts p(95) from each
- Prints PASS or REGRESSION based on 20% tolerance
- Exits with code 1 on regression
Run node compare.js baselines/load-test-baseline.json current-run.json. It should report PASS (same run).
Artificially modify the baseline to have a lower p95 (e.g., divide by 2). Run the comparison again — verify it reports REGRESSION.
Add --tag testRun=$(date +%Y%m%d) and --out json=samples.json to your k6 run command. Examine how the tag appears on the metric points in samples.json (the handleSummary data does not include run tags). Describe in a comment how you would use this tag in Grafana to filter for a specific run.