SLO (Service Level Objective)
// Definition
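An internal target for service performance, usually stricter than the SLA, which gives an early-warning buffer before the contractual line is crossed. If the SLA promises 99.9% uptime, the SLO might be 99.95%, so the team reacts before the contractual commitment is breached. Each SLO is paired with an "error budget": the amount of unreliability the target allows (1 minus the SLO). A 99.95% availability SLO leaves a budget of 0.05%, which is about 21.6 minutes of downtime in a 30-day window.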
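The error-budget arithmetic can be sketched in a few lines. This is a minimal illustration, not a standard API: the function name `error_budget_minutes` and the 30-day window are assumptions, and the targets are the 99.95% / 99.9% figures from the definition above.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime a given availability target allows in the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction between 0 and 1, e.g. 0.9995")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


# Example figures from the definition; the 30-day window is an assumption.
slo_budget = error_budget_minutes(0.9995)  # internal target
sla_budget = error_budget_minutes(0.999)   # contractual target
reaction_gap = sla_budget - slo_budget     # full-outage minutes between the two lines

print(f"SLO budget: {slo_budget:.1f} min")
print(f"SLA budget: {sla_budget:.1f} min")
print(f"Gap:        {reaction_gap:.1f} min")
```

With these inputs the budgets come to roughly 21.6 and 43.2 minutes, so a total outage that exhausts the SLO budget leaves about 21.6 further minutes before the SLA line is crossed.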
// Why it matters
The SLO is the internal line QA and SRE actually steer by — it's deliberately tighter than the SLA so breaches get caught before they become customer-facing SLA violations. QA cares because testing and alerting against the SLO (not the looser SLA) is what gives the team time to fix degradation before it costs money.
// How to test
SLO-based testing/monitoring (load tools + observability):
- set the SLO threshold stricter than the SLA (the buffer)
- track the error budget: how much of the allowed miss is consumed
- alert/fail when burn rate threatens the SLO, before the SLA is at risk
- the gap between SLO and SLA is your reaction time; test that it's enough
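One way to turn the burn-rate check into something a load test or alert rule can evaluate is sketched below. It assumes burn rate is defined as the observed error rate divided by the error rate the SLO allows; the function names, the request counts, and the `ALERT_THRESHOLD` value are all hypothetical and would need tuning for a real service.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget would last exactly one SLO window;
    higher values mean it runs out sooner.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.9995")
    if total <= 0:
        return 0.0  # no traffic observed, so nothing was burned
    allowed_error_rate = 1.0 - slo
    return (failed / total) / allowed_error_rate


def hours_until_budget_exhausted(rate: float, window_days: int = 30) -> float:
    """How long a full budget lasts if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


SLO = 0.9995           # example target from the definition
ALERT_THRESHOLD = 2.0  # hypothetical; pick a value that matches your reaction time

# Hypothetical measurement: 30 failed requests out of 10,000 in the sample window.
rate = burn_rate(failed=30, total=10_000, slo=SLO)
should_alert = rate > ALERT_THRESHOLD
print(f"burn rate {rate:.1f}x, alert={should_alert}, "
      f"budget lasts {hours_until_budget_exhausted(rate):.0f} h at this pace")
```

In a pre-release load test the same check can act as a pass/fail gate: fail the run when `should_alert` is true rather than waiting for production traffic to reveal the problem.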
// Common mistakes
- Setting the SLO equal to the SLA (no early-warning buffer left)
- No error budget, so there's no objective answer to "are we spending reliability too fast?"
- Measuring the SLO only in production, never in pre-release load tests
// Related terms
SLA (Service Level Agreement)
A formal, often contractual commitment about service performance — e.g. "99.9% uptime" or "p95 response under 500ms" — with consequences (credits, penalties) if missed. The SLA is the externally-promised number; it's what the business has told customers it will deliver.
Percentile (p95, p99)
A statistic that reports the value below which a given percentage of measurements fall. p95 is the value that 95% of requests were at or below; the remaining 5% were slower. Performance teams report tail percentiles because an average hides the slow tail.
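A small sketch shows why the average misleads. It uses the nearest-rank method, which is only one of several percentile definitions (monitoring tools often interpolate or estimate from histograms), and the latency samples are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 slow ones, in milliseconds.
latencies_ms = [100] * 90 + [2000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms:.0f} ms  p50={p50_ms} ms  p95={p95_ms} ms")
```

Here the mean is 290 ms and the median is 100 ms, while p95 is 2000 ms: one request in ten is twenty times slower than typical, and only the tail percentile shows it.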
Observability
The ability to understand the internal state of a system from the signals it emits externally, without needing to redeploy or modify the system. The three pillars are logs (timestamped records of discrete events), metrics (numeric measurements aggregated over time, such as request rate, error rate, and latency percentiles), and traces (end-to-end records of a request's path through distributed services, linked by a correlation ID). High observability means a QA engineer can diagnose a failure purely from existing output, without attaching a debugger or reproducing the issue locally. In test environments, observability enables post-run failure analysis: instead of re-running a flaky test with extra logging, query structured logs for the test's correlation ID and see exactly which service call failed and why. Contrast with monitoring, which alerts on known failure thresholds — observability enables exploration of previously unknown failure modes.
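The post-run analysis described above can be sketched as a filter over structured logs. This assumes JSON-lines logs; the field names (`correlation_id`, `ts`, `service`, `level`, `msg`), the helper names, and the sample records are all hypothetical, and a real setup would normally run the equivalent query in a log backend rather than in a script.

```python
import json


def events_for(log_lines, correlation_id):
    """Return parsed JSON log records for one correlation ID, in timestamp order."""
    matches = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines instead of failing the whole query
        if isinstance(record, dict) and record.get("correlation_id") == correlation_id:
            matches.append(record)
    # ISO-8601 timestamps in the same format and timezone sort correctly as strings.
    return sorted(matches, key=lambda r: r.get("ts", ""))


def first_failure(events):
    """Return the earliest error-level record, or None if the request succeeded."""
    return next((e for e in events if e.get("level") == "error"), None)


# Made-up log output from a test run.
sample_logs = [
    '{"ts": "2024-01-01T10:00:02Z", "correlation_id": "test-42", "service": "payments", "level": "error", "msg": "upstream timeout"}',
    '{"ts": "2024-01-01T10:00:00Z", "correlation_id": "test-42", "service": "gateway", "level": "info", "msg": "request received"}',
    '{"ts": "2024-01-01T10:00:01Z", "correlation_id": "test-42", "service": "orders", "level": "info", "msg": "order created"}',
    '{"ts": "2024-01-01T10:00:01Z", "correlation_id": "test-43", "service": "gateway", "level": "info", "msg": "request received"}',
    "plain text line that is not JSON",
]

trace = events_for(sample_logs, "test-42")
failure = first_failure(trace)
if failure is not None:
    print(f"{failure['service']}: {failure['msg']}")
```

The point of the design is that the failing service call is found from output the system already emitted, with no re-run and no extra logging.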