Operational Runbooks

Scope

This file covers operational runbook and playbook design decisions: runbook structure and standardization, incident response playbooks, automated vs. manual execution, runbook-as-code tooling, on-call documentation, post-incident review practices, runbook maintenance, integration with monitoring and alerting, game day validation, and SRE operational practices. For alerting and monitoring tool selection, see general/observability.md. For disaster recovery runbooks specifically, see general/disaster-recovery.md.

Checklist

Why This Matters

Operational runbooks are the bridge between alerting systems that detect problems and engineers who fix them. Without well-structured runbooks, incident response depends entirely on tribal knowledge — whichever engineer happens to remember how to fix a particular problem. This creates single points of failure in the on-call rotation, extends mean time to resolution (MTTR) when the knowledgeable engineer is unavailable, and makes onboarding new team members unnecessarily painful. A 15-minute fix becomes a 2-hour investigation when the engineer has never seen the problem before and has no documented procedure to follow.

The distinction between automated and manual runbooks is a critical cost-benefit decision. Automating every runbook sounds appealing but carries real risks. Automated remediation for a misdiagnosed problem can make things worse: restarting a service that is crash-looping because of a configuration error only repeats the crash loop. Maintaining automation for rarely triggered runbooks can also cost more in engineering time than the manual execution it replaces. The most effective approach is to automate high-frequency, low-risk remediations (disk cleanup, certificate renewal, service restart) while keeping human judgment in the loop for destructive actions, complex failovers, and novel failure modes.

Post-incident review is where operational maturity compounds. Organizations that rigorously conduct blameless postmortems, track action items to completion, and feed findings back into runbooks and architecture decisions experience fewer repeat incidents over time. Organizations that skip postmortems or treat them as blame exercises repeat the same incidents, burn out their on-call engineers, and accumulate operational debt that eventually manifests as a major outage. The postmortem is also the primary mechanism for connecting incidents to SLO impact — without this feedback loop, error budgets are just numbers on a dashboard rather than operational decision-making tools.
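To make the link between an incident and SLO impact concrete, the arithmetic can be sketched as below. The 99.9% availability target, the 30-day window, and the function names are illustrative assumptions rather than values prescribed by this document, and the sketch treats the incident as a full outage for its whole duration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(incident_minutes: float, slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the window's error budget used by one full-outage incident."""
    return incident_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over 30 days.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"15-minute incident: {budget_consumed(15, 0.999):.0%} of budget")
    print(f"2-hour incident: {budget_consumed(120, 0.999):.0%} of budget")
```

Under these assumptions the monthly budget is roughly 43 minutes, so a 15-minute fix uses about a third of it while a 2-hour investigation exceeds it several times over. A postmortem that records this number gives the error budget a direct role in prioritizing follow-up work.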

Common Decisions (ADR Triggers)

ADR: Runbook Hosting and Tooling

Context: The organization needs to decide where runbooks are authored, stored, and executed.

Options:

| Criterion | Wiki/Docs (Confluence, GitBook) | Git Repository (Markdown) | Runbook-as-Code (SSM, Ansible, Rundeck) | Integrated Platform (PagerDuty Runbook Automation, Shoreline) |
| --- | --- | --- | --- | --- |
| Authoring Experience | Rich editor, non-engineers can contribute | Plain text, requires git workflow | Code/YAML, requires engineering skills | Guided UI with code blocks |
| Version Control | Built-in but limited diff/review | Full git history, PR review | Full git history, PR review | Platform-managed versioning |
| Execution | Manual (copy-paste commands) | Manual (copy-paste commands) | Automated or semi-automated | Automated with approval gates |
| Alert Integration | Link from alert to wiki page | Link from alert to repo page | Trigger from alert webhook | Native alert-to-runbook binding |
| Searchability | Full-text search built in | Requires separate search tooling | Searchable via platform | Built-in search and tagging |
| Maintenance Burden | Low authoring effort, high staleness risk | Moderate (PR process enforces review) | Higher (code must be tested and maintained) | Moderate (platform manages execution) |
| Best Fit | Small teams, simple operations | Engineering-heavy teams, GitOps culture | Mature SRE teams automating remediation | Organizations wanting turnkey automation |

Decision drivers: Team size and technical depth, frequency of runbook execution (manual is fine for rare events, automation is essential for frequent ones), existing tooling ecosystem (GitOps shop vs. wiki-centric), and budget for dedicated runbook platforms.
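Whichever hosting option is chosen, the staleness risk noted in the comparison can be reduced with a small check that runs in CI or on a schedule. The sketch below is hypothetical: the metadata fields (`owner`, `alert_name`, `last_reviewed`) and the 180-day review threshold are assumptions for illustration, not a schema defined by any of the tools listed.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RunbookMeta:
    """Hypothetical metadata kept alongside each runbook."""
    title: str
    owner: str
    alert_name: str
    last_reviewed: date


def lint_runbook(meta: RunbookMeta, today: date,
                 max_age_days: int = 180) -> list[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    if not meta.owner.strip():
        problems.append("missing owner")
    if not meta.alert_name.strip():
        problems.append("no linked alert")
    if today - meta.last_reviewed > timedelta(days=max_age_days):
        problems.append("review overdue")
    return problems


if __name__ == "__main__":
    meta = RunbookMeta("Disk full on db host", "storage-team",
                       "DiskUsageHigh", date(2024, 1, 1))
    print(lint_runbook(meta, today=date(2024, 12, 1)))
```

In a git-hosted setup a check like this could fail the pull request; in a wiki it could feed a periodic report to runbook owners.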

ADR: Incident Severity Classification Model

Context: The team must define severity levels that determine response urgency, staffing, communication requirements, and escalation timing.

Options:

- 4-level model (SEV1-SEV4): Most common. SEV1 for total outage, SEV2 for degraded service, SEV3 for minor issues with workarounds, SEV4 for cosmetic or backlog items. Simple to understand, sufficient for most organizations.
- 5-level model with P0: Adds a P0/SEV0 for existential threats (data breach, complete platform failure, safety-critical systems). Useful for organizations where the distinction between "major outage" and "company-threatening event" drives materially different responses.
- Impact/urgency matrix (ITIL-style): Classifies incidents on two dimensions (impact: how many users affected; urgency: how quickly must it be resolved) to derive priority. More nuanced but more complex to apply under pressure. Common in enterprises with ITSM processes.

Decision drivers: Organizational complexity, number of on-call teams that need a shared classification language, regulatory requirements for incident classification, and whether the model must integrate with an existing ITSM platform (ServiceNow, Jira Service Management).
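The impact/urgency option is easier to apply under pressure when the lookup is encoded rather than recalled. The sketch below assumes a three-level scale on each axis and a P1 to P5 result computed as the sum of the two levels minus one. That mapping is one common convention, not something this document mandates, and real ITSM platforms let teams define their own matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> str:
    """Derive a priority label from impact and urgency.

    HIGH/HIGH maps to P1 and LOW/LOW maps to P5 under the assumed matrix.
    """
    return f"P{impact + urgency - 1}"


if __name__ == "__main__":
    for impact in Level:
        row = [priority(impact, urgency) for urgency in Level]
        print(f"{impact.name:<6} {row}")
```

Publishing the resulting table next to the paging policy gives every on-call team the same classification language without requiring them to reason through both dimensions mid-incident.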

ADR: Post-Incident Review Process

Context: The organization needs a structured process for learning from incidents and preventing recurrence.

Options:

- Lightweight postmortem (template in Slack/doc): Fill out a brief template (what happened, root cause, action items) within 48 hours. Low overhead, high completion rate. Risk of shallow analysis that misses systemic issues.
- Formal blameless postmortem (dedicated meeting + document): Scheduled meeting with all responders, structured timeline reconstruction, 5 Whys root cause analysis, action items with owners and due dates, published to the organization. Higher overhead but deeper analysis. Standard SRE practice.
- Learning review (resilience engineering approach): Focuses on what went right in addition to what went wrong, examines how the system adapted, and identifies systemic contributors rather than single root causes. Most thorough analysis. Requires facilitator training and organizational maturity.

Decision drivers: Incident frequency (high-frequency environments need lightweight processes to avoid postmortem fatigue), organizational culture (blameless culture is a prerequisite for effective postmortems), and whether action item completion is tracked and enforced.
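Tracking action item completion needs very little machinery regardless of which review format is chosen. The following sketch is illustrative only: the `ActionItem` fields and the two helper functions are hypothetical names, and in practice this data would usually live in an issue tracker rather than in code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical record of one postmortem action item."""
    incident_id: str
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


def completion_rate(items: list[ActionItem]) -> float:
    """Share of items marked done; an empty list counts as fully complete."""
    if not items:
        return 1.0
    return sum(item.done for item in items) / len(items)
```

Reviewing the overdue list and the completion rate at a regular cadence is one way to check that postmortem findings actually reach runbooks and architecture decisions.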

ADR: Automated Remediation Strategy

Context: The team must decide which operational responses to automate and what safeguards to implement.

Options:

- No automation (manual runbooks only): All remediation requires human execution. Simplest to implement. Highest MTTR. Acceptable for small environments with low incident frequency.
- Auto-remediation for known issues: Automate specific, well-understood remediations (restart crashed service, clear disk space, rotate expiring certificate) triggered by alerts. Requires guardrails: cooldown periods to prevent repeated execution, circuit breakers to stop automation if the issue recurs, and audit logging of all automated actions.
- Full autonomous remediation (AIOps): ML-driven anomaly detection and automated response. Highest investment. Risk of unexpected automated actions. Appropriate only for very large-scale environments where manual response cannot keep pace with incident volume.
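The three guardrails named for auto-remediation (cooldown, circuit breaker, audit logging) can be sketched as a thin wrapper around any remediation action. This is a minimal in-memory illustration under stated assumptions: the class name, the default thresholds, and the outcome strings are all hypothetical, and a production version would persist its state and send audit records to a durable log.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GuardedRemediation:
    name: str
    action: Callable[[], None]        # the remediation, e.g. restart a service
    cooldown_s: float = 300.0         # minimum gap between two executions
    max_runs: int = 3                 # circuit breaker: executions per window
    window_s: float = 3600.0          # length of the circuit-breaker window
    clock: Callable[[], float] = time.monotonic
    runs: list = field(default_factory=list)       # timestamps of executions
    audit_log: list = field(default_factory=list)  # (timestamp, name, outcome)

    def trigger(self) -> str:
        """Run the action if the guardrails allow it; always record the outcome."""
        now = self.clock()
        recent = [ts for ts in self.runs if now - ts < self.window_s]
        if len(recent) >= self.max_runs:
            outcome = "circuit_open"  # issue keeps recurring: hand off to a human
        elif recent and now - recent[-1] < self.cooldown_s:
            outcome = "cooldown"      # too soon after the last execution
        else:
            self.action()
            self.runs.append(now)
            outcome = "executed"
        self.audit_log.append((now, self.name, outcome))
        return outcome
```

An alert webhook handler would call `trigger()` and page a human whenever the result is `"circuit_open"`, since repeated recurrence suggests the automated fix is treating a symptom rather than the cause.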

Recommendation: Start with manual runbooks for all scenarios. Automate the top 3-5 most frequent, lowest-risk remediations. Add human-in-the-loop approval for anything destructive. Expand automation incrementally based on confidence gained from game day testing.
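Choosing the first automation targets can be reduced to a simple ranking once execution counts are available. The sketch below assumes each runbook records how often it ran in the past month and whether it is destructive. Both fields and the function name are hypothetical, and treating risk as a single boolean is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass
class Remediation:
    """Hypothetical usage record for one manual runbook."""
    name: str
    runs_per_month: int
    destructive: bool


def automation_candidates(items: list[Remediation],
                          top_n: int = 5) -> list[Remediation]:
    """Most frequently executed non-destructive remediations, highest first."""
    safe = [r for r in items if not r.destructive]
    return sorted(safe, key=lambda r: r.runs_per_month, reverse=True)[:top_n]


if __name__ == "__main__":
    inventory = [
        Remediation("disk cleanup", 40, destructive=False),
        Remediation("service restart", 25, destructive=False),
        Remediation("database failover", 30, destructive=True),
        Remediation("certificate renewal", 4, destructive=False),
    ]
    for r in automation_candidates(inventory, top_n=3):
        print(r.name, r.runs_per_month)
```

Destructive remediations are excluded from the ranking on purpose. Under the recommendation above they stay behind human approval regardless of how often they run.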