Why average response time lies
If your performance report leads with average response time, it is hiding the problem. Here is why the mean lies and what to report instead.
I'll say it plainly: average response time is the single most misleading number in performance testing, and it's the one most reports lead with. Not because anyone's being dishonest: the average is easy to compute, easy to explain, and feels rigorous. But it systematically hides the exact thing you're testing for. If I could ban one metric from sign-off, it'd be the standalone average.
The mean assumes a shape your data doesn't have
The average is a great summary when data is roughly symmetric — heights, say. Response times are nothing like that. They're a pile of fast requests with a long tail of slow ones, and the mean of a skewed distribution lands in a no-man's-land that describes almost nobody. Take nine requests at 100ms and one at 5000ms: the average is 590ms. No request was anywhere near 590ms. The fast majority experienced 100ms; the unlucky one experienced 5000ms; the "average user" — the 590ms user — does not exist.
Now flip it: that one 5000ms request is the bug you were hired to find, and the average quietly diluted it into a number that still looks like it might be acceptable. The mean doesn't just fail to surface the tail — it actively launders it.
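The arithmetic above is easy to check for yourself. Here is a minimal Python sketch using the same ten requests; the `percentile` helper is my own illustration using the nearest-rank definition, and your load-testing tool may interpolate and give slightly different values.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    the samples at or below it. p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# The ten requests from the example: nine fast, one very slow (milliseconds).
latencies_ms = [100] * 9 + [5000]

avg = mean(latencies_ms)             # 590
p50 = percentile(latencies_ms, 50)   # 100
p95 = percentile(latencies_ms, 95)   # 5000

# How far is the closest real request from the "average user"?
nearest_gap = min(abs(ms - avg) for ms in latencies_ms)  # 490

print(f"mean={avg}ms  p50={p50}ms  p95={p95}ms")
print(f"closest real request is {nearest_gap}ms away from the mean")
```

The mean sits 490ms away from the nearest request anyone actually experienced, while p50 and p95 each land on a real one.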
Why teams keep using it anyway
Three reasons, and none of them are good once you say them out loud. It's the default in a lot of tooling, so it's what's on the dashboard. It's a single number, and single numbers feel like answers. And it almost always looks better than the percentiles, so — consciously or not — it's the flattering number to put in the deck. "Average response time: 180ms ✅" closes the meeting. "p99: 2.3s" opens an awkward one. Guess which gets reported.
What to report instead
- Percentiles, always. p50 for the typical user, p95 and p99 for the ones having a bad time. If you report one number, report p95, not the mean. (The p95 explainer covers the mechanics.)
- The tail's shape across load, not a single snapshot. A p99 that's flat until 200 users and then rockets is the headline finding — the average will glide along smoothly right through that cliff.
- The max, and what caused it. The single slowest request often points straight at a real bug (a cold cache, an unindexed query, a lock).
- A count. A scary percentile over 50 requests is noise: with that few samples, p99 is effectively the single slowest request. Say how many requests are behind the number.
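One way to make that list hard to skip is to compute all of it in one place. The sketch below is an assumption about how you might structure it, not a feature of any particular tool: the `Summary` type, the `summarize` function, and the endpoint labels are hypothetical, and the sample data is invented for illustration.

```python
import math
from typing import NamedTuple


class Summary(NamedTuple):
    count: int
    p50: float
    p95: float
    p99: float
    max_ms: float
    slowest: str  # label of the single slowest request


def percentile(ordered, p):
    """Nearest-rank percentile over an already-sorted list."""
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """requests: list of (label, latency_ms) pairs from one test run."""
    if not requests:
        raise ValueError("no requests to summarize")
    ordered = sorted(ms for _, ms in requests)
    slowest_label, slowest_ms = max(requests, key=lambda r: r[1])
    return Summary(
        count=len(ordered),
        p50=percentile(ordered, 50),
        p95=percentile(ordered, 95),
        p99=percentile(ordered, 99),
        max_ms=slowest_ms,
        slowest=slowest_label,
    )


# Invented sample: 100 requests, mostly fast, with one extreme outlier.
sample = (
    [("GET /home", 100)] * 90
    + [("GET /search", 400)] * 9
    + [("GET /search (large account)", 5000)]
)

report = summarize(sample)
print(report)
```

Keeping the label of the slowest request next to the max is the point of the design: the number alone says something was slow, the label says where to start looking.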
The honest version of a performance report
"p50 is 120ms and steady. p95 is 480ms, under our 500ms target. But p99 climbs from 600ms to 2.4s past 150 concurrent users, and the slowest requests are all the search endpoint with a large account — that's our risk." That report changes a decision. "Average is 190ms, we're fine" ends a conversation that should have continued. The difference between those two reports is the difference between performance testing and performance theatre.
Where this fits
This is the opinionated companion to the p95 latency explainer, which covers how percentiles actually work. For what belongs in your pipeline, see Load tests in CI: the honest version. The glossary has latency, throughput, and percentile definitions for when you write the report up.
Before you report a performance number
- Lead with percentiles (p50/p95/p99), not the average
- If you cite one number, make it p95 — never the standalone mean
- Show the tail across load levels, not a single snapshot
- Include the max and what caused it
- State the request count behind the percentiles
- Ask: does this report change a decision, or just close the meeting?
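The "tail across load levels" item is the one that is hardest to eyeball, so here is a rough Python sketch of it. Everything in it is an assumption for illustration: the data is synthetic (200 requests per load level, with a small tail that degrades at 200 concurrent users), and `first_breach` and the 1000ms `BUDGET_MS` are hypothetical names and values, not part of any tool.

```python
import math
from statistics import mean


def p99(samples):
    """Nearest-rank 99th percentile of a list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def first_breach(metric, runs, budget_ms):
    """Lowest load level where metric(samples) exceeds budget_ms, else None."""
    for users in sorted(runs):
        if metric(runs[users]) > budget_ms:
            return users
    return None


# Synthetic runs keyed by concurrent users; latencies in milliseconds.
runs = {
    50: [120] * 196 + [600] * 4,
    100: [120] * 196 + [600] * 4,
    150: [120] * 196 + [600] * 4,
    200: [120] * 196 + [2400] * 4,
}

BUDGET_MS = 1000  # hypothetical budget, applied to both metrics

for users in sorted(runs):
    samples = runs[users]
    print(
        f"{users:>4} users  n={len(samples)}  "
        f"mean={mean(samples):.1f}ms  p99={p99(samples)}ms"
    )

p99_breach = first_breach(p99, runs, BUDGET_MS)    # 200
mean_breach = first_breach(mean, runs, BUDGET_MS)  # None

print(f"p99 breaches the budget at: {p99_breach} users")
print(f"mean breaches the budget at: {mean_breach}")
```

On this data the p99 quadruples at 200 users while the mean drifts from about 130ms to about 166ms, so a check against the mean never fires.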