Operations Failure Patterns

Scope

Covers common operational failure patterns including runbook drift, missing on-call rotations, knowledge silos around the one engineer who built the system, deployment freezes that miss the security patch, post-incident lessons that never make it back into the runbook, the "we'll document it later" pattern, change-control processes that block legitimate changes while allowing dangerous ones, the bus-factor failure mode, and the diagnostic patterns for catching operational drift before incidents force the issue. Does not cover deployment failure patterns (see failures/deployment.md) or operational architecture (see general/operational-runbooks.md).

Checklist

  • [Critical] Runbook drift: the runbook says one thing, the system does another. Goes wrong: an incident occurs and the on-call follows the runbook step by step. Step 4 says "restart the service via systemctl". The service no longer runs under systemd because it was migrated to a Kubernetes deployment 8 months ago. The runbook step fails. The on-call has to figure out the actual recovery procedure during the incident. Happens because: runbooks are written once and then the system evolves without anyone updating the runbook. Prevent by: every change to the system that affects how it is operated must include the runbook update in the same PR; periodic runbook review (quarterly) where each runbook is read against the current system; runbook tests where the on-call walks through each runbook on a non-incident day.

  • [Critical] No on-call rotation, or "the one engineer" is on call for everything. Goes wrong: the engineer who built the system is the only person who knows how to operate it. They are paged for every incident, day or night, and they cannot take a vacation. After two years they burn out and leave. The next incident happens and there is nobody who knows the system. Happens because: the system was built by a single engineer and operational ownership was never deliberately transferred; the team grew but the on-call rotation did not grow with it. Prevent by: every production system must have an on-call rotation with at least 3 people, none of whom is the original author; the rotation must be tested with simulated incidents to ensure all rotation members can respond; document the system to a level where someone who did not build it can recover it.

  • [Critical] Knowledge silos: the bus factor of one. Goes wrong: the engineer who knows how the legacy authentication system works leaves the company. Two weeks later, the system has an incident and nobody can debug it. The company contacts the original engineer, who is not obligated to help and does so only out of professional courtesy; if the help continues, the company pays a consulting rate for work the engineer used to do for salary. Happens because: knowledge transfer was never prioritized; the engineer was the documentation. Prevent by: identify single-point-of-knowledge systems via "what would happen if this person left tomorrow" exercises; pair-program critical work; require written documentation as a condition of approval for any system that only one person understands; treat bus factor as a tracked metric.

  • [Critical] Deployment freeze that misses the security patch. Goes wrong: a year-end deployment freeze is in effect from December 15 to January 5 to reduce risk during the holidays. On December 20, a critical CVE is announced affecting a dependency in production. The deployment freeze prevents the patch from being deployed. By January 5, the vulnerability has been exploited. Happens because: the deployment freeze was designed to prevent unnecessary changes but does not have a documented exception path for security patches. Prevent by: every deployment freeze must explicitly carve out emergency security patches with a documented approval path; the carve-out must be exercised on a non-emergency basis to verify the path works; track the time-to-deploy for security patches as a metric.

  • [Critical] Post-incident lessons that never make it back into the runbook. Goes wrong: an incident is resolved. The post-incident review identifies five contributing factors and assigns five action items. The action items are added to a backlog. Six months later, none of them have been completed because they were lower priority than feature work. The same incident pattern recurs because nothing was actually changed. Happens because: post-incident action items are treated as suggestions rather than requirements; there is no enforcement mechanism for completion. Prevent by: post-incident action items have a hard deadline (typically 30–90 days) and a named owner; the team's velocity budget includes time for action item completion; track the percentage of action items completed within the deadline as a metric; for incidents with low completion rates, escalate to engineering management.

  • [Critical] Change-control process that blocks legitimate changes while allowing dangerous ones. Goes wrong: the change advisory board meets weekly and reviews every change with the same level of scrutiny. Routine config changes get blocked or delayed because the CAB is overloaded. Risky changes (e.g., a database schema migration) get approved without sufficient review because they are batched with the routine changes and nobody on the CAB has the context. Happens because: change control is treated as a uniform process rather than risk-tiered; the CAB does not have the right reviewers for the risky changes. Prevent by: tier changes by risk (routine / standard / high-impact); routine changes are pre-approved templates; high-impact changes require named experts in the review; reduce CAB load by automating the routine path so the CAB only sees the high-impact tier.

  • [Critical] Documentation that nobody reads. Goes wrong: the team has extensive documentation in Confluence (or similar), spread across hundreds of pages written over years. New team members are told to "read the docs". They open the docs, find them disorganized, find broken links, find pages last updated 4 years ago, and give up. Happens because: documentation accumulates without curation; the cost of updating old pages is real and the benefit is invisible. Prevent by: aggressively delete or archive old documentation; maintain a small set of "canonical" docs that are kept current; on-call onboarding includes a documentation walkthrough where the new on-call must verify the docs match the system; treat documentation as code, with PR review and a decay rule under which pages that have not been re-verified are flagged for update or archival.

  • [Critical] The "we'll document it later" pattern. Goes wrong: a system is built quickly to meet a deadline. The intention is to "document it later". The deadline passes, the team moves to the next thing, and the documentation never happens. Six months later the system is in production, the original author has moved teams, and no documentation exists. Happens because: documentation is treated as a separate task from the work itself, and the time to do it is always "later". Prevent by: documentation is a deliverable of every project, not a follow-up; PR review includes documentation review; "done" includes documentation being reviewable.

  • [Recommended] No fire drill: the on-call has never practiced incident response. Goes wrong: an incident happens and the on-call has all the right tools but has never used them under pressure. They fumble through the runbook, struggle with the alerting tool, and miss steps because they have not internalized the workflow. The incident takes longer than necessary. Happens because: fire drills are seen as overhead and skipped; the assumption is that the on-call will "figure it out" when needed. Prevent by: monthly or quarterly fire drills where the on-call walks through a simulated incident from page to resolution; track time-to-resolve in the drill and use it as a baseline; rotate the drill scenarios so different failure modes are practiced.

  • [Recommended] Permission to do operational work but not to change the system. Goes wrong: the operations team has the permissions to start, stop, and restart services, but not to change the underlying configuration. When an incident requires a config change, they have to wake up an on-call developer to make the change in the IaC repository, wait for the build, and deploy. The incident extends by an hour because of the permission split. Happens because: the principle of least privilege was applied without considering operational responsiveness. Prevent by: define operational scenarios that require configuration changes and grant the operations team scoped permissions for those scenarios; or, if security requires the strict split, ensure the on-call developer rotation is fast and reliable.

  • [Critical] Backup that has never been restored. Goes wrong: backups have been running successfully for years. The team has never actually restored from backup. An incident requires a restore. The restore fails because the backup format is incompatible with the current schema, the destination has changed, the credentials have expired, or some other reason that only manifests at restore time. Happens because: backup success is monitored but restore success is not; the restore path is not exercised. Prevent by: scheduled restore tests (quarterly or monthly) that actually restore a backup to a test environment and verify the data is usable; treat the restore path as a tested feature, not a hope.

  • [Recommended] Runbook that requires access the on-call does not have. Goes wrong: the runbook says "use the production database admin role to run this query". The on-call does not have the production database admin role because that role is restricted to a different team. The incident extends while access is granted. Happens because: runbook access requirements are documented at the time the runbook is written and not validated against the current access of the on-call rotation. Prevent by: every runbook step is tagged with the access required; access for runbook steps is granted to the on-call rotation by role, not by individual; runbook tests verify that the on-call can complete every step with their actual permissions.

  • [Optional] The "tribal knowledge" anti-pattern. Goes wrong: the team has a set of "things you have to know" that are passed down orally. New engineers learn by asking questions and making mistakes. The tribal knowledge changes over time as the system changes, and there is no record of what the current truth is. Happens because: writing tribal knowledge down is hard, since the act of writing forces a clarity that the oral tradition does not require; nobody wants to be the person who writes incorrect documentation. Prevent by: turn tribal knowledge into documentation deliberately; treat "things only the senior engineer knows" as a backlog item; reward documentation in performance reviews.
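
The deployment-freeze item in the checklist can be illustrated with a small decision function. This is a minimal sketch rather than a real policy engine: the names `FreezeWindow`, `ChangeRequest`, and `deployment_decision`, the four decision strings, and the `security-lead` approver are all hypothetical, and the years are illustrative. It shows the one property the checklist asks for: a security patch is never simply blocked by the freeze, but it ships only through a named approval path.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FreezeWindow:
    start: date  # first day of the freeze, inclusive
    end: date    # last day of the freeze, inclusive

    def covers(self, day: date) -> bool:
        return self.start <= day <= self.end


@dataclass(frozen=True)
class ChangeRequest:
    summary: str
    requested_on: date
    is_security_patch: bool = False
    approved_by: Optional[str] = None  # named approver on the exception path


def deployment_decision(change: ChangeRequest, freeze: FreezeWindow) -> str:
    """Return 'allow', 'allow-exception', 'needs-approval', or 'blocked'."""
    if not freeze.covers(change.requested_on):
        return "allow"
    if not change.is_security_patch:
        return "blocked"
    # Carve-out: security patches may ship during the freeze, but only
    # once someone on the documented approval path has signed off.
    return "allow-exception" if change.approved_by else "needs-approval"


# Scenario from the checklist: freeze from December 15 to January 5,
# critical CVE announced on December 20. Years are illustrative.
freeze = FreezeWindow(start=date(2024, 12, 15), end=date(2025, 1, 5))
patch = ChangeRequest(
    "patch vulnerable dependency",
    date(2024, 12, 20),
    is_security_patch=True,
    approved_by="security-lead",  # hypothetical approver
)
feature = ChangeRequest("new dashboard widget", date(2024, 12, 20))

print(deployment_decision(patch, freeze))    # allow-exception
print(deployment_decision(feature, freeze))  # blocked
```

The `needs-approval` result is deliberate: an unapproved security patch is routed to the exception path instead of being rejected, which is the difference between a freeze with a carve-out and a freeze without one.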

Why This Matters

Operational failures are the failures that happen when the team meets reality. The system was designed correctly. The code was reviewed correctly. The infrastructure was provisioned correctly. The operational layer is where all of that runs into the actual humans who have to operate it under pressure, with imperfect information, at 2 AM. Operational failures are mostly about the humans, not about the code.

The highest-leverage controls are the ones that prevent the human from being asked to do something they cannot do: practice the runbook so it is muscle memory, distribute the on-call so no single person is irreplaceable, exercise the backup restore so it actually works when needed, document the system so the next on-call can recover it without the original author. Each of these is more work at design time and dramatically less work at incident response time.

The audit posture, "we have a runbook", is satisfied by a document. The operational posture, "the on-call can recover the system using the runbook", is satisfied only by the practiced ability to do so. The two are very different things, and the second is what matters.
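
One way to move from "we have a runbook" toward "the runbook still works" is to record when each runbook was last walked through and compare that with when the system last changed. The sketch below assumes a hypothetical `Runbook` record and a 90-day threshold standing in for a quarterly review; neither comes from a real tool, and the dates are illustrative.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

REVIEW_INTERVAL = timedelta(days=90)  # assumed stand-in for "quarterly"


@dataclass(frozen=True)
class Runbook:
    name: str
    last_verified: date        # last time someone walked through every step
    system_last_changed: date  # last change affecting how the system is operated


def stale_reasons(runbook: Runbook, today: date) -> List[str]:
    """List the reasons a runbook should not be trusted without re-verification."""
    reasons = []
    if runbook.system_last_changed > runbook.last_verified:
        reasons.append("system changed after last verification")
    if today - runbook.last_verified > REVIEW_INTERVAL:
        reasons.append("not verified within the review interval")
    return reasons


today = date(2024, 11, 1)
# The service was migrated after the runbook was last checked.
drifted = Runbook("restart-service", date(2024, 1, 10), date(2024, 3, 1))
# Verified recently, and nothing has changed since.
current = Runbook("rotate-certs", date(2024, 10, 1), date(2024, 9, 1))

for rb in (drifted, current):
    print(rb.name, stale_reasons(rb, today) or "ok")
```

A check like this only detects that verification is overdue. It does not replace the walkthrough itself, which is where a step such as an obsolete `systemctl` restart actually gets caught.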

Common Failure Combinations

  • Single engineer + no documentation + the engineer leaves = the system that becomes unmaintainable overnight
  • Runbook drift + no fire drills + the next incident = the recovery procedure that fails at the worst possible time
  • Post-incident action items + no enforcement + the same incident again = the lessons that were not learned
  • Backup success monitoring + no restore tests + the restore that fails = the data loss that "could not happen because we have backups"
  • Deployment freeze + no security carve-out + a critical CVE = the vulnerability that could not be patched because of the freeze
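
The combinations above can be treated as a lookup: given the standing conditions that are true for a system, list the failures that are waiting for a trigger. This is an illustrative sketch only. The condition labels and the `latent_failures` helper are hypothetical, and the trigger events (the engineer leaves, the next incident, the CVE) are left out because the team does not control them.

```python
from typing import Dict, FrozenSet, List, Set

# Standing conditions mapped to the outcome they set up.
COMBINATIONS: Dict[FrozenSet[str], str] = {
    frozenset({"single_engineer", "no_documentation"}):
        "system becomes unmaintainable when the engineer leaves",
    frozenset({"runbook_drift", "no_fire_drills"}):
        "recovery procedure fails during the next incident",
    frozenset({"open_action_items", "no_enforcement"}):
        "the same incident recurs",
    frozenset({"backups_monitored", "no_restore_tests"}):
        "restore fails when it is finally needed",
    frozenset({"deployment_freeze", "no_security_carve_out"}):
        "critical CVE cannot be patched during the freeze",
}


def latent_failures(conditions: Set[str]) -> List[str]:
    """Return the outcomes whose required conditions are all present."""
    return [
        outcome
        for required, outcome in COMBINATIONS.items()
        if required <= conditions  # subset test: every required condition holds
    ]


observed = {
    "single_engineer",
    "no_documentation",
    "backups_monitored",
    "no_restore_tests",
    "runbook_drift",  # drift alone does not complete a combination here
}
for outcome in latent_failures(observed):
    print(outcome)
```

Requiring the full set of conditions reflects the point of the list: each condition is survivable alone, and the failure appears only when they coincide.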

See Also

  • failures/deployment.md — deployment-specific failure patterns
  • failures/identity.md — identity failures including credential management
  • general/operational-runbooks.md — operational architecture and runbook patterns
  • general/incident-response.md — incident response practices (when added)
  • general/governance.md — broader cloud governance