What is Mean time to recovery (incidents)?

⚙️ Process

Mean time to recovery (incidents)

The average time to restore production service after an incident — the DORA metric for operational resilience.

process · dora · recovery-time

// Formula

avg(incident-resolution time)

// About this metric

Mean time to recovery (MTTR) for production incidents measures how quickly a team can restore normal service after a production failure. It is one of the four DORA metrics, measuring operational resilience and recovery capability.

This metric is distinct from MTTR for defects, which measures repair throughput for pre-release and post-release bugs. Incident MTTR is the time from when an incident is declared (typically when a monitoring alert fires or a customer reports an outage) to when service is restored to normal operating parameters.
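
As a minimal sketch of the formula, assuming each incident is recorded as a (declared, restored) pair of timestamps, the average can be computed as below. The function name `mean_time_to_recovery` and the sample incidents are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the average (restored - declared) duration as a timedelta.

    `incidents` is an iterable of (declared, restored) datetime pairs.
    This record shape is an assumption made for illustration.
    """
    durations = [restored - declared for declared, restored in incidents]
    if not durations:
        raise ValueError("need at least one incident to compute MTTR")
    # sum() needs a timedelta start value; dividing by an int yields a timedelta.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data: one 45-minute and one 3 h 15 min incident.
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 45)),
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 17, 15)),
]

mttr = mean_time_to_recovery(incidents)
mttr_hours = mttr.total_seconds() / 3600
print(f"MTTR: {mttr_hours:.1f} hours")  # MTTR: 2.0 hours
```

The clock starts at declaration rather than at the underlying failure, so time spent undetected is not counted here.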

The DORA 2024 bands place Elite performance at under one hour, High at one hour to one day, Medium at one day to one week, and Low at over one week. These are dramatic differences — a team that takes a week to recover from incidents is operating at qualitatively different reliability than one that recovers in 45 minutes.
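
The bands above can be sketched as a small lookup. The function name `dora_recovery_tier` is hypothetical, and the handling of exact boundary values (for example, whether exactly 24 hours counts as High or Medium) is an assumption, since the text does not specify it.

```python
def dora_recovery_tier(mttr_hours: float) -> str:
    """Map an MTTR in hours to the band names quoted in the text.

    Boundary handling is an assumption: exactly 1 hour is treated as
    High, exactly 24 hours as High, and exactly one week as Medium.
    """
    if mttr_hours < 0:
        raise ValueError("MTTR cannot be negative")
    if mttr_hours < 1:
        return "Elite"
    if mttr_hours <= 24:
        return "High"
    if mttr_hours <= 24 * 7:
        return "Medium"
    return "Low"


for hours in (0.75, 2.0, 72.0, 200.0):
    print(hours, dora_recovery_tier(hours))
```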

Improvement drivers include: better observability (faster detection means faster response), documented runbooks (faster diagnosis and remediation), feature flags and gradual rollouts (smaller blast radius), and practised incident response processes (on-call runbooks, blameless postmortems, chaos engineering).

// Calculator

Average time to recover (hours)

Your MTTR: 2.0 hours

// Benchmark

You're in the DORA 'High' tier: 2.0 hours falls between 1 and 24 hours (2024 State of DevOps Report).

Source: DORA State of DevOps Report 2024

This is MTTR for PRODUCTION INCIDENTS, distinct from MTTR for defects (Defect category). Same name, different scope.

// When to use this metric

Use MTTR to evaluate your on-call and incident management capability. Track it by severity: P1/SEV1 MTTR and P2 MTTR should be separate metrics, since customer SLAs and internal expectations differ significantly between severity levels.
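
One way to keep severities separate is to group recovery times by severity label before averaging. This is a sketch under stated assumptions: the record shape (severity, hours) and the function name `mttr_by_severity` are hypothetical, and the sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mttr_by_severity(records):
    """Return {severity: mean recovery hours} for (severity, hours) records."""
    buckets = defaultdict(list)
    for severity, hours in records:
        buckets[severity].append(hours)
    return {severity: mean(values) for severity, values in buckets.items()}


# Hypothetical sample data.
records = [("P1", 0.5), ("P1", 1.5), ("P2", 4.0), ("P2", 8.0)]
per_severity = mttr_by_severity(records)
print(per_severity)  # {'P1': 1.0, 'P2': 6.0}
```

A single blended average over these four incidents would be 3.5 hours, which describes neither severity level well.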

Review MTTR trends after postmortems and system improvements to validate that remediation investments are working. Declining MTTR after a systematic improvement — better runbooks, new monitoring — is one of the clearest signals that the investment paid off.
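
A before/after comparison around the date an improvement shipped gives a rough version of this check. The sketch below assumes records of (incident date, recovery hours); the function name `mttr_before_after` and the sample data are hypothetical. With only a handful of incidents on either side, treat the result as a signal to investigate rather than proof.

```python
from datetime import date
from statistics import mean


def mttr_before_after(records, change_date):
    """Return (mean hours before, mean hours on or after) change_date."""
    before = [hours for day, hours in records if day < change_date]
    after = [hours for day, hours in records if day >= change_date]
    if not before or not after:
        raise ValueError("need incidents on both sides of the change date")
    return mean(before), mean(after)


# Hypothetical sample data: new runbooks assumed to ship on 2024-03-01.
records = [
    (date(2024, 1, 10), 6.0),
    (date(2024, 2, 2), 4.0),
    (date(2024, 4, 5), 2.0),
    (date(2024, 5, 20), 1.0),
]
before, after = mttr_before_after(records, date(2024, 3, 1))
print(f"before: {before:.1f} h, after: {after:.1f} h")  # before: 5.0 h, after: 1.5 h
```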

// Common pitfall

This is MTTR for PRODUCTION INCIDENTS — distinct from MTTR for defects (in the Defect category). Same acronym, entirely different scope, different benchmarks. Confusing the two produces misleading reports. If a stakeholder asks "what's your MTTR?", clarify whether they mean incident recovery time or defect repair time — the answer will be very different.

// Related metrics

Change failure rate · Process

Deployment frequency · Process

Mean time to repair (MTTR) · Defect

Mean time to detect (MTTD) · Defect