
# Mean Time To Recovery (MTTR)

MTTR measures the average time it takes to restore a system to full operation after a failure or incident, from the moment the failure occurs to the moment normal service resumes.

It is one of the four core **DORA metrics** used to assess software delivery and operational performance, alongside:

- **Deployment frequency**: how often code is deployed to production
- **Lead time for changes**: time from commit to production
- **Change failure rate**: percentage of deployments that cause incidents

A lower MTTR indicates a more resilient, better-operated system.

## Formula

```
MTTR = Total downtime / Number of incidents
```

## What drives MTTR up

- Poor observability (failures are hard to detect and diagnose)
- Unfamiliar or opaque code (no one knows how it works)
- Lack of runbooks or incident playbooks
- Tightly coupled systems (the blast radius of a failure is wide)
- Manual deployment and rollback processes
- High [[Cognitive load]] on the team during incidents

## What drives MTTR down

- Strong observability (logs, traces, metrics)
- Clear ownership and on-call rotation
- Automated rollbacks and feature flags
- Well-understood codebases
- Blameless post-mortems that improve playbooks

## MTTR and AI-generated code

MTTR for AI-generated code tends to be significantly higher than for hand-written code, particularly when developers accept code without fully understanding it.

When an incident occurs in a system containing opaque AI-generated code:

- The debugging surface is larger (the developer neither wrote the code nor reviewed it deeply)
- Mental models of the code are weaker or absent
- Root cause analysis takes longer because the code's intent isn't obvious
- Confidence in fixes is lower, which slows decision-making

This directly undercuts a core argument for AI-assisted development: the velocity gains.
If you ship faster but recover more slowly (and, in rarer cases, fail more catastrophically), the net productivity story degrades, especially in complex, evolving systems where incidents compound and the codebase drifts further from anyone's mental model over time.

The implication: **velocity metrics alone don't capture the true cost of AI-generated code**. MTTR, change failure rate, and [[Technical debt]] accumulation are the counterweights that reveal the real trade-off.

## References

- DORA metrics: https://dora.dev/guides/dora-metrics-four-keys/

## Related

- [[DevOps]]
- [[Technical debt]]
- [[Cognitive load]]
- [[Large Language Models (LLMs)]]
- [[AI Agents]]
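
## Worked example: computing MTTR

To make the formula in the Formula section concrete, here is a minimal Python sketch that computes MTTR from a small incident log. The timestamps and the `mean_time_to_recovery` helper are hypothetical, invented purely for illustration; a real setup would more likely pull these values from an incident tracker or monitoring system.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure occurred, normal service resumed).
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 45)),       # 45 min
    (datetime(2024, 3, 8, 14, 30), datetime(2024, 3, 8, 16, 0)),     # 90 min
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 20, 22, 30)),  # 15 min
]


def mean_time_to_recovery(incidents):
    """Return MTTR = total downtime / number of incidents, as a timedelta."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    # Sum the downtime of each incident, starting from a zero timedelta.
    total_downtime = sum(
        (restored - failed for failed, restored in incidents), timedelta()
    )
    return total_downtime / len(incidents)


mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # 150 min / 3 = 50 minutes
```

Because MTTR is a mean, a single long outage can dominate the result, so it is worth looking at the individual incident durations alongside the average.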