Stress Testing — Finding Breaking Points — Performance Testing with K6

A stress test asks a different question than a load test. A load test asks: "Does the system meet SLAs at expected traffic?" A stress test asks: "Where does it break, and how does it break?" The answer shapes autoscaling policies, circuit breaker thresholds, and on-call runbooks.

What a stress test reveals

Stress test observation points by VU count

Stage	Expected http_req_duration p(95)	Expected error rate	Watch for
100 VUs (baseline)	< 500ms — system within SLA	< 0.1% — healthy	Stable baseline. Note p(95) and error rate to compare against higher stages.
200 VUs (push past normal)	500ms–1000ms — latency climbing	0.1%–1% — some timeouts starting	Connection pool saturation, database query queue backing up, GC pauses increasing.
300 VUs (degradation zone)	1000ms–3000ms — sharp inflection	1%–10% — errors accumulating fast	Thread pool exhaustion, memory pressure, upstream dependency timeouts cascading.
400 VUs (breaking point)	> 5000ms or timeout — system overwhelmed	> 10% — cascading failures	OOM kills, connection refused errors, downstream services failing. This is the breaking point.

The incremental stage pattern

A stress test is defined by its incremental ramp. Load is added in steps so that you can observe exactly where degradation begins:

export const options = {
  stages: [
    { duration: '3m', target: 100 },   // establish baseline
    { duration: '5m', target: 100 },   // hold baseline — confirm system is stable
    { duration: '3m', target: 200 },   // push past normal load
    { duration: '5m', target: 200 },   // hold — observe if latency stabilises or keeps growing
    { duration: '3m', target: 300 },   // approach capacity limit
    { duration: '5m', target: 300 },   // hold — watch error rate
    { duration: '3m', target: 400 },   // find the breaking point
    { duration: '5m', target: 400 },   // hold — observe failure mode
    { duration: '5m', target: 0 },     // ramp down — observe recovery
  ],
  thresholds: {
    http_req_duration: ['p(95)<5000'],   // wide threshold — we expect degradation
    http_req_failed:   ['rate<0.50'],    // allow up to 50% errors — stress test expects failure
  },
};

The hold stages after each ramp matter. Without them, you are observing a system constantly adapting to new load — not a system's steady-state behaviour at any specific VU level. Hold for at least 3–5 minutes at each plateau to see whether the system stabilises or continues degrading.
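The ramp-then-hold pattern is regular enough to generate rather than type by hand. The following is a minimal sketch, assuming a hypothetical helper named buildStressStages (it is not part of the K6 API) whose defaults mirror the durations used in the example above:

```javascript
// Build a stepped stress profile: ramp to each plateau, hold there, then ramp down.
// buildStressStages is a hypothetical helper for illustration, not a K6 function.
function buildStressStages(plateaus, { ramp = '3m', hold = '5m', rampDown = '5m' } = {}) {
  const stages = [];
  for (const target of plateaus) {
    stages.push({ duration: ramp, target }); // climb to the next VU level
    stages.push({ duration: hold, target }); // hold so steady-state behaviour is visible
  }
  stages.push({ duration: rampDown, target: 0 }); // ramp down and observe recovery
  return stages;
}

const stages = buildStressStages([100, 200, 300, 400]);
console.log(stages.length); // 9: four ramp/hold pairs plus the ramp-down
```

In a K6 script the result would be assigned to the stages key of the exported options object. Generating the stages this way makes it harder to drop a hold stage by accident when the plateau list changes.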

Recognising healthy vs unhealthy degradation

Not all degradation is equal. The failure mode tells you as much as the breaking point.

Healthy degradation (graceful)

VUs: 100 → p(95): 200ms,  errors: 0.0%
VUs: 200 → p(95): 400ms,  errors: 0.2%
VUs: 300 → p(95): 900ms,  errors: 1.1%
VUs: 400 → p(95): 2800ms, errors: 8.0%

Response times climb steadily with no sudden cliff, roughly doubling to tripling at each step. Errors appear slowly. The system is queueing requests and processing them: it is overloaded but not crashing. This is autoscaler territory: deploy more instances.

Unhealthy degradation (cascading failure)

VUs: 100 → p(95): 200ms,  errors: 0.0%
VUs: 200 → p(95): 210ms,  errors: 0.1%
VUs: 300 → p(95): 5800ms, errors: 42% ← cliff
VUs: 400 → p(95): timeout, errors: 89%

The system appears healthy until a threshold is crossed, then collapses suddenly. This pattern indicates a shared resource hitting a hard limit — database connection pool exhaustion, a mutex bottleneck, or a downstream service hitting its own connection limit.
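The difference between the two runs above can be checked mechanically by comparing each plateau with the one before it. This is a rough sketch, not a standard method: the function name findCliff and its limits (a 5x latency jump or a 20-point error-rate jump between plateaus) are illustrative assumptions that you would tune for your own system. The data is copied from the two example runs, with Infinity standing in for "timeout".

```javascript
// Plateau results from the two example runs. errors is a percentage, p95 is in ms.
const graceful = [
  { vus: 100, p95: 200, errors: 0.0 },
  { vus: 200, p95: 400, errors: 0.2 },
  { vus: 300, p95: 900, errors: 1.1 },
  { vus: 400, p95: 2800, errors: 8.0 },
];
const cascading = [
  { vus: 100, p95: 200, errors: 0.0 },
  { vus: 200, p95: 210, errors: 0.1 },
  { vus: 300, p95: 5800, errors: 42 },
  { vus: 400, p95: Infinity, errors: 89 }, // Infinity represents a timeout
];

// Return the VU level of the first plateau where latency or errors jump past
// the limits relative to the previous plateau, or null if there is no cliff.
// Both limits are assumed values chosen for illustration.
function findCliff(plateaus, { maxLatencyRatio = 5, maxErrorJump = 20 } = {}) {
  for (let i = 1; i < plateaus.length; i++) {
    const prev = plateaus[i - 1];
    const cur = plateaus[i];
    const latencyRatio = cur.p95 / prev.p95;
    const errorJump = cur.errors - prev.errors;
    if (latencyRatio > maxLatencyRatio || errorJump > maxErrorJump) {
      return cur.vus;
    }
  }
  return null;
}

console.log(findCliff(graceful));  // null
console.log(findCliff(cascading)); // 300
```

In the graceful run no step exceeds about a 3x latency increase, so nothing is flagged. In the cascading run p(95) jumps roughly 28x between 200 and 300 VUs, which is where the shared resource most likely hit its limit.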

Watching the recovery phase

The ramp-down after a stress test is as diagnostic as the peak. Add a recovery observation stage:

stages: [
  // ... incremental ramp stages ...
  { duration: '5m', target: 400 },   // breaking point
  { duration: '3m', target: 100 },   // drop to normal load
  { duration: '5m', target: 100 },   // observe recovery at baseline
  { duration: '2m', target: 0 },     // ramp down
],

A system that recovers to baseline latency within 2 minutes of dropping to normal load is resilient. A system where http_req_duration p(95) stays at 3000ms even at 100 VUs after peak load indicates a resource that was exhausted and has not released — connection pool connections not being returned, heap memory not being garbage collected, thread pool threads stuck waiting.
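The 2-minute rule can be applied to exported results with a small helper. The sketch below assumes you have p(95) samples taken at intervals after load drops back to 100 VUs. The function name recoveryTimeSeconds, the 20% tolerance, and both sample series are made up for illustration and are not measurements.

```javascript
// Seconds after the drop to baseline load at which p95 returns to within
// `tolerance` of the pre-stress baseline and stays there for the rest of
// the window. Returns null if the system never settles in the window.
function recoveryTimeSeconds(samples, baselineP95, tolerance = 1.2) {
  const limit = baselineP95 * tolerance;
  let lastBad = -1;
  samples.forEach((s, i) => {
    if (s.p95 > limit) lastBad = i; // remember the last sample still above the limit
  });
  if (lastBad === samples.length - 1) return null; // still degraded at the end
  return samples[lastBad + 1].t;
}

// Hypothetical samples: t is seconds since load dropped to 100 VUs, p95 in ms.
const recovered = [
  { t: 0, p95: 4200 },
  { t: 30, p95: 1900 },
  { t: 60, p95: 600 },
  { t: 90, p95: 230 },
  { t: 120, p95: 210 },
];
const stuck = [
  { t: 0, p95: 4200 },
  { t: 60, p95: 3100 },
  { t: 120, p95: 3000 },
  { t: 300, p95: 3000 },
];

console.log(recoveryTimeSeconds(recovered, 200)); // 90
console.log(recoveryTimeSeconds(stuck, 200));     // null
```

The helper looks for the last sample above the limit rather than the first sample below it, so a brief dip followed by another spike does not count as recovery.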

Using abortOnFail as a safety valve

Stress testing against production-adjacent environments risks leaving the system in a bad state. Use abortOnFail to stop the test if error rates exceed a safe threshold:

export const options = {
  stages: [
    { duration: '3m', target: 100 },
    { duration: '5m', target: 200 },
    { duration: '3m', target: 300 },
    { duration: '5m', target: 400 },
    { duration: '3m', target: 0 },
  ],
  thresholds: {
    http_req_failed: [{
      threshold: 'rate<0.30',
      abortOnFail: true,
      delayAbortEval: '2m',   // skip evaluation of this threshold for the first 2 minutes of the test
    }],
  },
};

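delayAbortEval: '2m' postpones evaluation of this threshold until 2 minutes into the test, so a handful of early failures, recorded while few samples exist, cannot trigger an abort. Once that window has passed, the test aborts as soon as the cumulative failure rate for the run reaches 30%; the failures do not need to be sustained for 2 minutes.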

What to do with the results

The output of a stress test is a set of numbers that feed directly into infrastructure decisions:

Breaking point VU count → set autoscaler scale-out threshold at 70–80% of this
Failure mode (graceful queue vs cliff) → determines whether you need rate limiting or just more instances
Recovery time → informs the unhealthy grace period in your health check configuration
Error type at breaking point → 503s vs 502s vs timeouts point to different components to scale
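As a worked example of the first rule, a breaking point of 400 VUs gives a scale-out band of 280 to 320 VUs. The sketch below does that arithmetic. The function name scaleOutBand is hypothetical, and translating a VU count into a metric your autoscaler actually observes (requests per second, CPU, queue depth) is a separate step it does not cover.

```javascript
// Scale-out band at 70-80% of the measured breaking point.
// scaleOutBand is a hypothetical helper; the fractions follow the rule of thumb above.
function scaleOutBand(breakingPointVUs, low = 0.7, high = 0.8) {
  return {
    min: Math.round(breakingPointVUs * low),
    max: Math.round(breakingPointVUs * high),
  };
}

console.log(scaleOutBand(400)); // { min: 280, max: 320 }
```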

⚠️ Common mistakes

No hold stages between ramp steps. If you ramp from 100 to 200 to 300 to 400 VUs with no hold time, you are measuring the system under continuous load increase — not at any stable operating point. You cannot identify where degradation actually starts.
Setting thresholds that abort too early. A stress test intentionally pushes past the system's limits. If your threshold aborts the test at 5% error rate, you will never observe the failure mode or breaking point. Either set thresholds wide, or keep abortOnFail only as a high safety limit (such as the 30% used earlier) rather than as a pass/fail gate.
Only running stress tests against staging with tiny databases. A problem such as memory pressure from scanning a 50M-row table at 300 VUs will not appear if staging has 1,000 rows. Database size relative to production matters for stress tests.
Not capturing server-side metrics during the test. K6 tells you when the system degraded — your APM tool or server logs tell you why. Running a stress test without simultaneously watching CPU, memory, GC pause time, and DB connection count produces half an answer.

🎯 Practice task

Run a stress test against a public API and find its inflection point. 40 minutes.

Use https://test.k6.io — Grafana's public K6 test endpoint, designed for load testing practice.

Write a stress test with this stage pattern: 30s ramp to 10 VUs → hold 1m → 30s ramp to 25 VUs → hold 1m → 30s ramp to 50 VUs → hold 1m → 30s ramp to 100 VUs → hold 2m → 2m ramp down.
Set thresholds wide enough not to abort: http_req_duration: ['p(95)<10000'] and http_req_failed: ['rate<0.50'].
Add sleep(1) between requests. Tag each request with { tags: { name: 'Homepage' } }.
Run the test. Record http_req_duration p(95) and error rate at each VU plateau — observe where (if anywhere) latency inflects sharply.
Add a second endpoint: GET /news.php. Tag it separately. Compare how each endpoint degrades under the same load increase.
Add abortOnFail: true, delayAbortEval: '30s' to the error rate threshold and set it to rate<0.20. Run again — notice whether the test completes or aborts before reaching 100 VUs.