Q38 of 38 · Performance

How do you define and maintain performance SLOs across multiple services in a growing organisation?

Performance · Lead · performance · slo · sla · governance · leadership · reliability · observability

Short answer

Define SLOs at the service level, derived from user-facing business requirements rather than infrastructure capabilities. Review and update them quarterly. Tie SLO breaches to an on-call escalation process so they are treated as incidents, not just metrics.

Detail

The common failure mode: SLOs are set once by the first engineer who ran a load test, never reviewed, and gradually become either irrelevant (the system is 10x faster now) or unachievable (the system has grown and the SLO was set against a tiny dataset).

Define SLOs from user behaviour. What latency causes a meaningful increase in abandonment? For a checkout flow, research suggests conversions drop noticeably above 3 s. That informs the SLO — not "what latency can our servers achieve?"

Three layers of SLO:

  • Product SLO: user-perceived behaviour ("checkout completes in under 3 s for 95% of users in production"). Owned by product.
  • Service SLO: per-service technical target ("payment API p95 under 400 ms at 500 RPS"). Owned by the service team.
  • Infrastructure SLO: resource-level targets ("database p99 under 50 ms"). Owned by platform.
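The three layers can be written down as data so that each target has an explicit owner and can be checked against measured latencies. The following is a minimal sketch, not a prescribed format: the `Slo` class, the `SLOS` list, and the helper names are hypothetical, and the nearest-rank percentile method is an assumption (monitoring tools compute percentiles in different ways).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One SLO entry; field names are illustrative, not a standard schema."""
    layer: str           # "product", "service", or "infrastructure"
    owner: str           # team accountable for the target
    description: str     # human-readable statement of the objective
    percentile: float    # 95.0 means p95
    threshold_ms: float  # latency must stay under this value


# The three example targets from the list above.
SLOS = [
    Slo("product", "product", "checkout completes in under 3 s for 95% of users", 95.0, 3000.0),
    Slo("service", "service team", "payment API p95 under 400 ms at 500 RPS", 95.0, 400.0),
    Slo("infrastructure", "platform", "database p99 under 50 ms", 99.0, 50.0),
]


def nearest_rank_percentile(samples_ms, percentile):
    """Return the nearest-rank percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def is_met(slo, samples_ms):
    """True if the measured percentile is strictly under the SLO threshold."""
    return nearest_rank_percentile(samples_ms, slo.percentile) < slo.threshold_ms
```

Keeping the owner next to the target makes it unambiguous who responds when a given layer misses its objective.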

Review cadence: quarterly review of actual field data (RUM, APM) versus SLO. If production p95 has been 200 ms for 6 months and the SLO is 500 ms, tighten the SLO — otherwise it provides no signal.
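A quarterly review can start from a simple comparison of the SLO target against recent field data. The sketch below is one possible rule of thumb: the function name, the `slack_ratio` of 2.0, and the `headroom` of 1.25 are assumptions chosen for illustration, and any proposed target should still be checked against what users actually need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewResult:
    action: str                # "tighten", "keep", or "investigate"
    proposed_target_ms: float  # unchanged unless action is "tighten"


def review_slo(target_ms, monthly_p95_ms, slack_ratio=2.0, headroom=1.25):
    """Compare an SLO target with observed monthly p95 values.

    slack_ratio and headroom are hypothetical tuning knobs, not standard values.
    """
    if not monthly_p95_ms:
        raise ValueError("no field data to review")
    worst = max(monthly_p95_ms)  # be conservative: use the worst month
    if worst > target_ms:
        # The target was missed in at least one month: look into it first.
        return ReviewResult("investigate", target_ms)
    if target_ms >= slack_ratio * worst:
        # The target is so loose that it gives no signal: propose a tighter one.
        return ReviewResult("tighten", worst * headroom)
    return ReviewResult("keep", target_ms)


# The example from the text: p95 of 200 ms for 6 months against a 500 ms SLO.
result = review_slo(500.0, [200.0] * 6)
print(result.action, result.proposed_target_ms)
```

Using the worst month rather than the average keeps the proposed target achievable during the busiest period in the window.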

Error budget: when a service exhausts its error budget (the SLO is breached for more than X% of a rolling window), feature work freezes and reliability work takes priority. This gives SLOs operational teeth and makes them more than aspirational numbers.
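The error budget arithmetic can be sketched as follows. This version assumes the budget is counted in requests rather than in time: with an SLO target of 95% of requests under the latency threshold, up to 5% of requests in the rolling window may miss it. The function names and the freeze rule at 100% consumption are illustrative assumptions.

```python
def budget_consumed(total_requests, bad_requests, slo_target):
    """Fraction of the error budget used in the window (1.0 means fully spent).

    slo_target is the required share of good requests, e.g. 0.95.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    if total_requests == 0:
        return 0.0  # no traffic, so nothing has been spent
    allowed_bad = total_requests * (1 - slo_target)
    return bad_requests / allowed_bad


def should_freeze_features(total_requests, bad_requests, slo_target):
    """Hypothetical policy: freeze feature work once the budget is spent."""
    return budget_consumed(total_requests, bad_requests, slo_target) >= 1.0


# 1,000,000 requests at a 95% target allow roughly 50,000 slow requests.
print(budget_consumed(1_000_000, 25_000, 0.95))         # about half the budget
print(should_freeze_features(1_000_000, 60_000, 0.95))  # budget exceeded
```

Reporting consumption as a fraction lets a team alert early, for example at half the budget, instead of only reacting once the freeze threshold is reached.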

// WHAT INTERVIEWERS LOOK FOR

SLOs derived from user behaviour, not infrastructure. Three layers: product, service, infrastructure. Quarterly review with real field data. Error budget as the accountability mechanism.