Running a performance test without understanding the metrics is like driving a car while only watching the speedometer — you are missing most of the information you need. Performance test reports produce dozens of numbers. This lesson covers the ones that matter, explains what they actually mean, and shows you why the most commonly cited metric — the average — is almost always the wrong one to report.
The metrics that matter
Response time is the duration from when a client sends a request to when it receives the complete response. It is the metric most directly connected to user experience. Users do not think about throughput or CPU utilisation — they feel response time.
Latency is the network delay component of response time — the time for a packet to travel from client to server and back, excluding processing. In practice, "latency" and "response time" are often used interchangeably; in performance testing, response time includes server processing, while latency strictly refers to network round-trip delay.
Throughput is the number of requests the system successfully handles per second (or per minute). A system with high throughput processes many requests in a given time. A system with high response time processes each request slowly. The two can diverge: a system might have high throughput but still feel slow to individual users if response times are high.
Error rate is the percentage of requests that fail under load. An error rate below 1% is typically the target for most systems; above 5% signals serious problems. Zero errors at normal load, with error rate climbing as load increases, is a normal pattern — the question is where the threshold sits.
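As a rough illustration of the throughput and error rate definitions, the sketch below computes both from a list of request results. The `RequestResult` record and the sample run are made up for this example; real tools export results in their own formats.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """One request from a test run (hypothetical record shape)."""
    ok: bool            # True if the request succeeded
    duration_ms: float  # response time in milliseconds


def throughput_per_second(results, window_seconds):
    """Successful requests handled per second over the test window."""
    successes = sum(1 for r in results if r.ok)
    return successes / window_seconds


def error_rate_percent(results):
    """Share of requests that failed, as a percentage of all requests."""
    if not results:
        return 0.0
    failures = sum(1 for r in results if not r.ok)
    return 100.0 * failures / len(results)


# Made-up run: 600 requests in a 60-second window, 12 of them failed.
run = [RequestResult(ok=True, duration_ms=120.0)] * 588 \
    + [RequestResult(ok=False, duration_ms=950.0)] * 12

print(throughput_per_second(run, window_seconds=60))  # 9.8
print(error_rate_percent(run))                        # 2.0
```

Note that failed requests count toward the error rate but not toward throughput, which matches the "successfully handles" wording in the definition.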
Concurrent users is the number of simulated users active at the same time. This is your primary load lever — increase it to apply more load, decrease it to reduce load.
Resource utilisation covers CPU, memory, network bandwidth, and disk I/O on the server side. A response time that meets SLA while CPU sits at 95% means you have no headroom for traffic growth. Resource metrics explain why a system behaves as it does, not just what it does.
Percentiles: the only way to read response time correctly
Averages are dangerous. A system with an average response time of 200ms might have 90% of requests completing in 100ms and 10% completing in 1,100ms (0.9 × 100 + 0.1 × 1,100 = 200). The average reports 200ms; the user experience for one in ten users is terrible.
Percentiles tell the real story:
- p50 (median): half of all requests complete faster than this value. A reasonable general indicator, but it hides the tail.
- p95: 95% of requests complete faster than this value. The standard SLA metric. If your p95 is 2 seconds, 5% of requests take longer than 2 seconds, which at scale can mean thousands of people.
- p99: 99% of requests complete faster than this value. The tail. At 10,000 requests per minute, 100 requests every minute are slower than your p99.
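To see how the mean and the percentiles can disagree, here is a minimal sketch using the nearest-rank method (one of several common percentile definitions; load testing tools may interpolate instead). The sample is hypothetical: 90 requests at 100 ms and 10 at 1,100 ms, chosen so the mean is exactly 200 ms.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it (p is an integer 1-100)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 slow ones.
response_times_ms = [100] * 90 + [1100] * 10

print("mean:", sum(response_times_ms) / len(response_times_ms))  # 200.0
print("p50: ", percentile(response_times_ms, 50))                # 100
print("p95: ", percentile(response_times_ms, 95))                # 1100
print("p99: ", percentile(response_times_ms, 99))                # 1100
```

The mean sits at a value that no request in this sample actually experienced, while p95 and p99 land on the slow group.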
[Figure: Example response time distribution, same system, different views]
The chart above shows a real-world pattern: median and average look reasonable; the p95 and p99 tell a very different story. Reporting only the average is misleading — and makes it easy to miss that a meaningful slice of your users is having a terrible experience.
Always report p50, p95, and p99. Always define your SLA against a percentile, not an average.
The Apdex score
Apdex (Application Performance Index) converts raw response times into a satisfaction score between 0 and 1. It requires a single target: the response time threshold T that defines "satisfying."
- Satisfied: response time ≤ T
- Tolerating: response time > T and ≤ 4T
- Frustrated: response time > 4T
Apdex = (Satisfied + Tolerating/2) / Total
If T = 500ms and you have 100 requests:
- 70 complete in 500ms or less → 70 satisfied
- 20 complete in more than 500ms but no more than 2s → 20 tolerating
- 10 take longer than 2s → 10 frustrated
Apdex = (70 + 20/2) / 100 = 80/100 = 0.80
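The same calculation can be sketched as a small function. The individual timings below are invented; only the 70 / 20 / 10 split and T = 500 ms come from the example above.

```python
def apdex(response_times_ms, t_ms):
    """Apdex = (satisfied + tolerating / 2) / total, with threshold T = t_ms."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for rt in response_times_ms if rt <= t_ms)
    tolerating = sum(1 for rt in response_times_ms if t_ms < rt <= 4 * t_ms)
    # Frustrated requests (rt > 4T) contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(response_times_ms)


# Invented timings: 70 satisfied, 20 tolerating, 10 frustrated at T = 500 ms.
samples_ms = [300] * 70 + [1200] * 20 + [2500] * 10
print(apdex(samples_ms, t_ms=500))  # 0.8
```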
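Apdex scores of 0.94 and above are typically rated "excellent"; scores below 0.70 are rated "poor," and scores below 0.50 "unacceptable." The 0.80 in the example above falls in the "fair" band (0.70 to 0.84). Apdex is useful because it converts a complex distribution into a single trackable number that has business meaning: it represents user satisfaction, not just raw timing.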
Industry benchmarks to orient your targets
These are common targets — your specific requirements will depend on your product and SLA:
- API response time (interactive): under 200ms at p95
- API response time (data-heavy queries): under 1 second at p95
- Web page fully loaded: under 3 seconds on a standard connection
- Search results: under 300ms at p95 (instant-feeling to users)
- Error rate under normal load: under 0.1%
- Error rate under peak load: under 1%
These benchmarks are starting points for conversation, not universal requirements. A healthcare system might accept 2-second searches; a trading platform might require 50ms APIs.
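One way to make targets like these actionable is to encode them and compare each test run against them. The sketch below is only an illustration: the metric names are hypothetical, the limits are copied from the list above, and the measured values are made up.

```python
# Limits from the benchmark list; the key names are illustrative, not a standard.
TARGETS = {
    "api_interactive_p95_ms": 200,
    "api_data_heavy_p95_ms": 1000,
    "search_p95_ms": 300,
    "error_rate_normal_pct": 0.1,
    "error_rate_peak_pct": 1.0,
}


def missed_targets(measured, targets=TARGETS):
    """Return the names of metrics whose measured value is not under its limit.
    Metrics absent from `measured` are skipped rather than treated as failures."""
    return [name for name, limit in targets.items()
            if name in measured and measured[name] >= limit]


# Made-up results from a single run.
measured = {
    "api_interactive_p95_ms": 180,
    "search_p95_ms": 420,
    "error_rate_normal_pct": 0.05,
}
print(missed_targets(measured))  # ['search_p95_ms']
```

Replace the limits with values agreed for your own product before using a check like this as a pass/fail gate.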
⚠️ Common mistakes
- Reporting only the average. Average response time is almost always the wrong metric to highlight. Always include percentiles — at minimum p95. Averages let bad tail latency hide behind good median performance.
- Measuring at too-low load. A p95 of 200ms with 10 concurrent users tells you nothing about behaviour under 500 concurrent users. Always measure at the load level that reflects your actual use case.
- Ignoring resource metrics. A system that meets response time SLAs at 95% CPU utilisation has no headroom. Even a 10% rise in traffic can push it into saturation and failure. Resource metrics predict future capacity problems before they become user-visible.
- Setting targets without user context. A 500ms p95 target that comes from a user research study is meaningful. A 500ms p95 target that someone picked arbitrarily is not. Anchor performance targets to user perception data where possible.
🎯 Practice task
Find a public API or a service you have access to. Use a simple tool (curl with timing flags, Postman, or a basic script) to collect 20 requests and record the response times.
- Calculate the mean, p50, p95, and p99 from your 20 samples. With only 20 samples, the p95 and p99 are effectively your slowest one or two requests, but you will likely still see them diverge from the mean and p50.
- Pick a T value for Apdex (e.g., 500ms) and calculate the score.
- If the p95 is higher than you expected, investigate: is it consistent? Does it happen on specific request types? Does it correlate with specific times of day?
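If you choose the basic script route, the following sketch is one possible starting point using only the Python standard library. The function names are invented for this example, requests are sent one at a time, and the percentiles use the nearest-rank method.

```python
import time
import urllib.request


def fetch_once(url):
    """Time one GET request, including reading the full body; returns milliseconds.
    urlopen raises an exception on network errors and on 4xx/5xx responses."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()
    return (time.perf_counter() - start) * 1000.0


def collect_timings(url, n=20, fetch=fetch_once):
    """Call `fetch` n times sequentially and return the durations in ms."""
    return [fetch(url) for _ in range(n)]


def summarise(timings_ms):
    """Mean plus nearest-rank p50, p95, and p99 for a list of durations."""
    ordered = sorted(timings_ms)
    n = len(ordered)

    def pct(p):
        # Integer ceiling of p * n / 100, converted to a zero-based index.
        return ordered[max(1, (p * n + 99) // 100) - 1]

    return {"mean": sum(ordered) / n, "p50": pct(50), "p95": pct(95), "p99": pct(99)}
```

Calling `summarise(collect_timings("https://example.com/"))` would run the 20 requests and return the four values; the URL is a placeholder for a service you are allowed to test. With 20 samples, nearest-rank p95 is the 19th-slowest value and p99 is the maximum.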
The goal is not a rigorous performance test — it is to develop intuition for how percentiles reveal the user experience that averages conceal. The k6 tool page covers how to run proper distributed load tests that measure these metrics at scale.