Reliability and Stability Testing — Non-Functional Testing Overview

A system that works correctly in a 5-minute test can still fail in production after running for 14 hours. Memory leaks, connection pool exhaustion, cache invalidation bugs, and race conditions often become visible only under sustained operation. Reliability testing asks a different question from functional testing: not "does it work?" but "does it keep working?"

Key reliability metrics

Before testing reliability, you need to know what you are measuring. Four metrics dominate reliability conversations:

MTBF (Mean Time Between Failures) — the average time a system operates between failures. A payment service with an MTBF of 720 hours fails roughly once a month. Higher is better.

MTTR (Mean Time To Recovery) — the average time from failure to full restoration. Lower is better. MTTR says nothing about how often failures happen: a system that fails rarely but takes 8 hours to recover has a worse MTTR than one that fails more often but recovers in 10 minutes.

Availability — the percentage of time the system is operational. It is typically expressed in "nines": 99% availability allows 87.6 hours of downtime per year, while "five nines" (99.999%) allows about 5.3 minutes. Define the target for your system explicitly and test against it.

Error rate — the percentage of operations that fail. Tracking error rate over time reveals gradual degradation that availability metrics alone may not catch.
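
The relationship between these metrics can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the Incident record, the reliability_metrics function, and the sample outage times are all hypothetical, and it assumes every outage falls entirely inside the observation window.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start_hour: float  # hours since the observation window began
    end_hour: float    # hour at which service was fully restored


def reliability_metrics(incidents, window_hours):
    """Return (mtbf, mttr, availability) for one observation window."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTBF and MTTR")
    downtime = sum(i.end_hour - i.start_hour for i in incidents)
    uptime = window_hours - downtime
    mtbf = uptime / len(incidents)    # average operating time per failure
    mttr = downtime / len(incidents)  # average time to restore service
    availability = uptime / window_hours
    return mtbf, mttr, availability


# Hypothetical 30-day window (720 hours) with two outages.
incidents = [Incident(100.0, 101.5), Incident(500.0, 500.5)]
mtbf, mttr, availability = reliability_metrics(incidents, window_hours=720)
print(f"MTBF {mtbf:.1f} h, MTTR {mttr:.1f} h, availability {availability:.3%}")
```

Under these definitions availability equals MTBF / (MTBF + MTTR), which is why shortening recovery time raises availability even when the failure rate stays the same.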

Availability targets in practice

Availability tiers — what each SLA target means in practice

Availability target	Downtime per year	Typical use case	Requires
99% (two nines)	87.6 hours	Internal tools, dev environments	Basic stability tests
99.9% (three nines)	8.76 hours	Most consumer web apps	Soak + recovery testing
99.99% (four nines)	52.6 minutes	E-commerce, SaaS platforms	Failover tests, auto-scaling
99.999% (five nines)	5.3 minutes	Banking, healthcare, payments	Active-active, full DR drills
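
The downtime figures in the table follow from simple arithmetic on an 8,760-hour year. A small sketch, assuming a non-leap year and a hypothetical helper named allowed_downtime_hours:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def allowed_downtime_hours(availability_percent):
    """Yearly downtime budget, in hours, for an availability target."""
    if not 0 < availability_percent <= 100:
        raise ValueError("availability must be in (0, 100]")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

The same function works for any target that sits between the tiers, for example a contractual 99.95%.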

The right availability target is a business decision, not a technical one. Five nines costs significantly more than three nines — in infrastructure redundancy, engineering investment, and operational complexity. The target should reflect what actual downtime costs the business in lost revenue, SLA penalties, and user trust. Most consumer products target 99.9%; financial and healthcare systems often require 99.99% or higher.

Types of reliability testing

Stability testing runs the application for an extended period under normal or moderate load and monitors for signs of degradation — increasing response times, rising memory usage, growing error rates. A test that runs for 8 hours catches problems that no 30-minute test will see. This is often called a soak test when combined with load (covered in the performance chapter).
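
One way to turn "signs of degradation" into a pass/fail check is to fit a trend line to a metric sampled during the run. The sketch below is an assumption about how that could be done, not a prescribed method: slope_per_hour and is_degrading are hypothetical helpers, and the memory readings are made up.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours_elapsed, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def is_degrading(samples, max_slope):
    """True when the metric grows faster than max_slope units per hour."""
    return slope_per_hour(samples) > max_slope


# Hypothetical hourly memory readings (MB) from an 8-hour run.
leaky = [(h, 512 + 20 * h) for h in range(9)]                 # climbs 20 MB/hour
steady = [(h, 512 + (3 if h % 2 else -3)) for h in range(9)]  # noise, no trend

print(is_degrading(leaky, max_slope=5.0))
print(is_degrading(steady, max_slope=5.0))
```

A slope threshold tolerates normal noise while still flagging steady growth; the same check applies to response times or error counts.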

Recovery testing deliberately induces specific failures and verifies that the system recovers correctly. Cut the database connection and verify the application handles it gracefully rather than throwing unhandled exceptions. Restart the app server and verify it comes back cleanly. Simulate a network partition and verify that queued operations are not lost.
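
A recovery scenario like the database cut described above can be rehearsed against in-memory stand-ins before it is run on real infrastructure. Everything in this sketch is hypothetical: FlakyDatabase simulates the outage with a flag, and OrderService shows one possible graceful-handling strategy (queue and replay).

```python
class FlakyDatabase:
    """Stand-in for a real database; `available` simulates a cut connection."""

    def __init__(self):
        self.available = True
        self.rows = []

    def insert(self, row):
        if not self.available:
            raise ConnectionError("database unreachable")
        self.rows.append(row)


class OrderService:
    """Queues writes while the database is down and replays them on recovery."""

    def __init__(self, db):
        self.db = db
        self.pending = []

    def place_order(self, order):
        try:
            self.db.insert(order)
            return "accepted"
        except ConnectionError:
            self.pending.append(order)  # handled, not an unhandled exception
            return "queued"

    def flush_pending(self):
        while self.pending:
            self.db.insert(self.pending[0])  # raises if still down; order stays queued
            self.pending.pop(0)


def run_recovery_test():
    db = FlakyDatabase()
    service = OrderService(db)
    assert service.place_order("order-1") == "accepted"
    db.available = False                                # inject the failure
    assert service.place_order("order-2") == "queued"
    db.available = True                                 # restore the dependency
    service.flush_pending()
    return db.rows                                      # nothing should be lost


print(run_recovery_test())
```

The pass criterion is explicit: every order submitted during the outage is present after recovery, and no call raised an unhandled exception.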

Failover testing verifies that redundancy works when called upon. If the primary database node is terminated, does the replica promote and take traffic within the defined recovery time objective (RTO)? If a load balancer node fails, does traffic shift to the healthy node without user-visible impact? These tests should be run periodically in production-equivalent environments, not just once at setup.
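
Checking a failover against a time limit comes down to polling a health check after the failure is triggered and timing how long the service takes to come back. A minimal sketch, assuming a caller-supplied is_healthy function; the function names and the injectable clock and sleep parameters are illustrative choices, not part of any particular tool.

```python
import time


def measure_failover_seconds(is_healthy, timeout_s, poll_interval_s=0.5,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy until it returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_interval_s)


def failover_within_limit(is_healthy, limit_s, **kwargs):
    """True when the service reports healthy within limit_s seconds."""
    elapsed = measure_failover_seconds(is_healthy, timeout_s=limit_s, **kwargs)
    return elapsed is not None and elapsed <= limit_s
```

In a real run, is_healthy would wrap whatever health endpoint or query the system exposes. Making the clock and sleep injectable lets the timing logic itself be tested without waiting.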

Common reliability failure patterns

Memory leaks are gradual: memory usage climbs slowly over hours as objects are allocated and never released. The symptom is increasing response times and eventually an out-of-memory crash. Catch them with extended soak tests where memory metrics are monitored over time.
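
The leak pattern can be reproduced at small scale with Python's standard tracemalloc module. The two handlers below are hypothetical: one deliberately keeps a reference to every payload, the other does not. Real soak tests watch process-level memory over hours; this sketch only illustrates what "allocated and never released" looks like in a measurement.

```python
import tracemalloc

_cache = []  # deliberate leak: grows on every call and is never cleared


def leaky_handler(payload):
    _cache.append(payload * 100)  # reference is kept forever
    return len(payload)


def clean_handler(payload):
    return len(payload * 100)     # temporary string is freed after the call


def memory_growth_bytes(handler, iterations=2000):
    """Net growth in traced allocations across repeated calls to handler."""
    tracemalloc.start()
    try:
        handler("warmup")  # let one-time allocations happen first
        before, _peak = tracemalloc.get_traced_memory()
        for i in range(iterations):
            handler(f"request-{i}")
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


print("leaky:", memory_growth_bytes(leaky_handler))
print("clean:", memory_growth_bytes(clean_handler))
```

The leaky handler's growth scales with the number of calls, while the clean handler's stays close to zero. That difference is the signal to look for in a soak test's memory graph.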

Connection pool exhaustion happens when database or HTTP connections are not released after use. The pool fills up, new requests wait for a connection that never becomes free, and the application effectively halts. Connection counts should be tracked during long-duration tests.
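
The mechanism is easy to demonstrate with a toy pool. This sketch uses a hypothetical ConnectionPool built on queue.Queue, with strings standing in for connections; real pools differ in detail, but the failure mode is the same.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool; connections are plain strings for illustration."""

    def __init__(self, size):
        self.size = size
        self._free = queue.Queue()
        for n in range(size):
            self._free.put(f"conn-{n}")

    def acquire(self, timeout_s=0.05):
        try:
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            raise TimeoutError("connection pool exhausted") from None

    def release(self, conn):
        self._free.put(conn)

    def in_use(self):
        return self.size - self._free.qsize()

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)  # always returned, even if the body raises


def leaky_request(pool):
    pool.acquire()  # bug: the connection is never released


def safe_request(pool):
    with pool.connection():
        pass  # the real query would run here
```

Calling leaky_request three times on a pool of size three leaves in_use() at 3, and the fourth call raises TimeoutError. Tracking in_use() over the course of a long test is the connection-count metric referred to above.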

Race conditions are timing-dependent bugs that appear only when concurrent operations interleave in a particular order, which is far more likely under concurrent load. Stress tests and soak tests surface race conditions that single-threaded unit tests rarely trigger.
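
A lost-update race can be shown with a shared counter. In this sketch the Counter class and run_concurrently helper are hypothetical, and the short sleep is an artificial device that widens the gap between read and write so the race shows up reliably; in real code the window is far smaller and the failure correspondingly rarer.

```python
import threading
import time


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        current = self.value       # read
        time.sleep(0.001)          # another thread can run here
        self.value = current + 1   # write, possibly overwriting an update

    def safe_increment(self):
        with self._lock:           # read and write happen as one step
            current = self.value
            time.sleep(0.001)
            self.value = current + 1


def run_concurrently(method_name, threads=8, calls_per_thread=5):
    """Call the named Counter method from several threads; return the total."""
    counter = Counter()

    def worker():
        for _ in range(calls_per_thread):
            getattr(counter, method_name)()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value


print("expected:", 8 * 5)
print("safe:    ", run_concurrently("safe_increment"))
print("unsafe:  ", run_concurrently("unsafe_increment"))
```

The locked version always reaches the expected total. The unlocked version usually falls short, by an amount that varies from run to run, which is what makes this class of bug hard to reproduce.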

Cache invalidation bugs cause stale data to be served indefinitely when cache entries are not invalidated on update. A user updates their profile and sees the old version for hours. Test cache behaviour explicitly — particularly after update operations.
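
The profile example translates directly into a read-after-update test. The ProfileStore below is a hypothetical read-through cache over a dict, with a flag that switches invalidation off to reproduce the bug.

```python
class ProfileStore:
    """Read-through cache in front of a dict standing in for the database."""

    def __init__(self, invalidate_on_update=True):
        self._db = {}
        self._cache = {}
        self._invalidate_on_update = invalidate_on_update

    def get(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._db[user_id]  # cache miss: load from db
        return self._cache[user_id]

    def update(self, user_id, profile):
        self._db[user_id] = profile
        if self._invalidate_on_update:
            self._cache.pop(user_id, None)  # drop the stale entry


def read_after_update(store):
    """Create, read (warming the cache), update, then read again."""
    store.update("u1", {"name": "Old Name"})
    store.get("u1")
    store.update("u1", {"name": "New Name"})
    return store.get("u1")["name"]


print(read_after_update(ProfileStore(invalidate_on_update=True)))
print(read_after_update(ProfileStore(invalidate_on_update=False)))
```

The important step is the read that warms the cache before the update. Without it, the test passes even when invalidation is broken.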

Chaos engineering

Chaos engineering is the practice of deliberately injecting failures into a system to verify that resilience mechanisms work as designed. Netflix's Chaos Monkey (which randomly terminates production instances) popularised the approach, but the principle applies at any scale.

For most QA engineers, chaos engineering means simulating infrastructure failures in staging environments (killing a container, dropping network packets, filling a disk) and observing the application's response. The tools (Chaos Monkey, Gremlin, AWS Fault Injection Simulator) are typically operated by SRE and platform teams. QA's contribution is verifying the user-visible behaviour during and after failures: do users see appropriate error messages, or do they see raw stack traces? Do queued operations complete when the service recovers?
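
The QA side of that check can be prototyped without any chaos tooling by wrapping a dependency in a fault injector and inspecting what the user would see. This sketch does not use or imitate the tools named above; FaultInjector, fetch_balance, and handle_request are hypothetical names for illustration.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a configurable fraction of calls fail."""

    def __init__(self, func, failure_rate, seed=0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def fetch_balance(account_id):
    return 100.0  # stand-in for a real backend call


def handle_request(account_id, backend):
    """Request handler that turns backend failures into a friendly response."""
    try:
        return {"status": 200, "body": f"Balance: {backend(account_id):.2f}"}
    except ConnectionError:
        return {"status": 503,
                "body": "Service temporarily unavailable. Please try again."}


def run_experiment(failure_rate, requests=200):
    backend = FaultInjector(fetch_balance, failure_rate)
    return [handle_request("acct-1", backend) for _ in range(requests)]


responses = run_experiment(failure_rate=0.3)
assert all("Traceback" not in r["body"] for r in responses)
print(sum(r["status"] == 503 for r in responses), "of", len(responses), "degraded")
```

The assertion encodes the user-visible requirement: whatever fails underneath, the response body never exposes a stack trace.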

⚠️ Common mistakes

Treating a passing 5-minute test as proof of reliability. Short-duration tests cannot surface time-dependent failures. Any system claiming reliability must be validated with extended soak tests — 8 hours minimum for most production services.
Testing failover only at initial setup. Infrastructure changes, certificate rotations, and software updates can silently break failover configurations. Failover tests should be run regularly — quarterly for most systems.
Monitoring the wrong metrics. Error rate and response time measured at the load balancer miss database connection exhaustion and backend queue depth. Monitor at multiple layers — application, database, and infrastructure — during reliability tests.
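
Multi-layer monitoring can be expressed as a single check over a metrics snapshot. The layer names, metric names, and limits below are hypothetical placeholders to be replaced with whatever the system under test actually exposes.

```python
# Hypothetical per-layer limits; tune them to the system under test.
THRESHOLDS = {
    "load_balancer": {"error_rate_pct": 1.0, "p95_latency_ms": 500},
    "application": {"heap_used_pct": 85, "queue_depth": 1000},
    "database": {"connections_in_use_pct": 90},
}


def find_breaches(snapshot, thresholds=THRESHOLDS):
    """Return (layer, metric, value, limit) for every metric above its limit.

    A metric missing from the snapshot is reported with value None, since
    an unmonitored layer is itself a finding.
    """
    breaches = []
    for layer, limits in thresholds.items():
        for metric, limit in limits.items():
            value = snapshot.get(layer, {}).get(metric)
            if value is None or value > limit:
                breaches.append((layer, metric, value, limit))
    return breaches


# The load balancer looks healthy while the database is close to exhaustion.
snapshot = {
    "load_balancer": {"error_rate_pct": 0.2, "p95_latency_ms": 310},
    "application": {"heap_used_pct": 70, "queue_depth": 40},
    "database": {"connections_in_use_pct": 98},
}
print(find_breaches(snapshot))
```

In this snapshot only the database layer is flagged, which is the situation a load-balancer-only dashboard would miss.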

🎯 Practice task

Design a basic reliability test plan for a product you test.

State the availability target for the system (or propose one if it is not defined). Calculate what that availability target means in hours of allowed downtime per year.
Write one stability test scenario: what load level, what duration, and what metrics would you monitor?
Write one recovery test scenario: what failure would you inject, how would you inject it, and what would pass/fail look like?
Identify one scenario where a reliability problem in your product would be invisible to functional tests but would affect real users. That gap is where reliability testing adds value.