Observability Failure Patterns

Scope

Covers common observability failure patterns including alert fatigue, missing signals, log loss (sampling, ingestion limits, retention misconfig), trace sampling that drops the interesting requests, dashboards nobody looks at, metrics that disappear during the incident because the metric pipeline depends on the failing system, and the diagnostic patterns for detecting blind spots before they bite. Does not cover observability architecture design (see general/observability.md) or specific tooling (see providers/aws/observability.md, providers/azure/observability.md, vendor pages).

Checklist

[Critical] Alert fatigue: noisy alerts trained the team to ignore everything. Goes wrong: a critical alert fires during an incident but nobody on call investigates because the alert is one of 500 the team has seen this week and 99% of them are noise. The incident extends for hours longer than necessary. Happens because: alerts were created for every metric anomaly anyone ever wanted to see, the team never tuned them down, and "I'll look at it tomorrow" became the default response. Prevent by: enforce a strict signal-to-noise budget per on-call rotation (e.g., max 5 pages per week per person on average); review alert cause and disposition weekly; auto-suppress or delete any alert with a >80% false-positive rate; the rule is "every page must be actionable, every actionable signal must page".
[Critical] Missing signals: the metric you need does not exist. Goes wrong: an incident happens, the on-call needs to answer "is this affecting all users or just some users", and the answer is "we don't have a per-user-segment success rate metric". The incident response stalls because the data is not collected. Happens because: the metric was not needed at the time the service was built, and no one added it during normal development because there was never a triggering event. Prevent by: post-incident review must include a "what signal would have made this faster" question, and the answer must become a backlog ticket with a deadline; instrument signals proactively before incidents force them to be added retroactively under pressure.
[Critical] Log loss from sampling, ingestion limits, or retention. Goes wrong: an incident happens and the on-call goes to investigate the relevant logs, only to find that the logs for the affected service were either dropped at the agent (sampling), rejected at the destination (rate-limit / quota exceeded), or already aged out (retention too short). The incident is now blind. Happens because: log volume management is done at the cost layer (sampling and short retention reduce cost), and no one verifies that the cost-driven defaults still leave enough visibility for incident response. Prevent by: define the minimum retention and sampling rate per log source based on incident response needs, not on cost; alert when a log destination starts dropping or rejecting; budget for log volume as a function of expected incident frequency.
[Critical] Trace sampling drops the interesting requests. Goes wrong: distributed tracing is configured with a uniform 1% sample rate. An incident happens that affects 50 requests out of 10,000. The expected number of affected requests in the trace sample is 50 × 0.01 = 0.5, and the probability that at least one of them was sampled is 1 − 0.99^50 ≈ 40%. The per-request data needed for diagnosis is in those traces, so roughly 60% of the time the answer is "we have no traces for the affected requests". Happens because: uniform sampling is the default and nobody changed it. Prevent by: use head-based sampling rules for properties that are known when the request starts (always-sample requests from known-affected user segments); use tail-based sampling, where the trace export decision happens after the request completes, so that all error and slow traces are kept regardless of base rate (always-sample errors, always-sample slow requests).
[Critical] Dashboards that nobody looks at. Goes wrong: an incident happens and the on-call goes to "the dashboard" to see the system state. The dashboard takes 30 seconds to load, has 40 panels covering everything anyone has ever wanted to see, and the relevant signal is buried in panel 27. The on-call gives up and goes to the metrics directly. Happens because: dashboards accumulate panels over time as different people add what they care about; nobody removes old panels; the result is unusable for the incident response use case. Prevent by: separate "browse" dashboards (everything anyone might care about) from "incident response" dashboards (the 5–10 signals on-call needs first); the IR dashboard must load in <5 seconds and answer "is the service up, how many users affected, what are the top error types" without scrolling.
[Critical] Metrics that disappear during the incident because the metric pipeline depends on the failing system. Goes wrong: an incident affects the metrics pipeline itself (a Prometheus server, the Datadog agent, CloudWatch Metrics ingestion). All the dashboards show "no data" exactly when the on-call needs them most. The incident response is now blind to its own progress because the tool used to monitor recovery is itself broken. Happens because: the metrics pipeline runs on the same infrastructure (or depends on the same services) as the workloads it monitors, so a regional outage takes both down together. Prevent by: design the metrics pipeline so that it survives the failure modes it monitors: out-of-region or out-of-account metrics destinations, dual ingestion paths, separate identity for the metrics agent, no dependency on the application's own services for monitoring delivery.
[Recommended] Alert that pages on a transient signal (a single occurrence) instead of a sustained one. Goes wrong: an alert fires every time a single error occurs in a high-volume service. The team is paged at 2 AM for an error that happened once and self-recovered. After three weeks of this, the on-call disables the alert. Happens because: the alert was written as "fire when error rate > 0" without thinking about the noise floor of a real service. Prevent by: every alert threshold must be based on the actual baseline of the service, not on the absence of errors; use rate or percentage thresholds with appropriate windows (e.g., "error rate > 1% sustained over 5 minutes"); test the alert against historical data before deploying it.
[Critical] Logs forwarded to a destination nobody can access. Goes wrong: logs are correctly captured and forwarded to an S3 bucket / Storage Account / Cloud Storage bucket in a security tooling account. The bucket has a restrictive policy that allows only the SIEM service to read it. During an incident, the on-call cannot read the logs without going through the security team's escalation, which takes 30 minutes during business hours and longer at night. Happens because: log centralization is treated as a security requirement (lock everything down) without considering the operational requirement (on-call must be able to read these in an incident). Prevent by: grant read access on log destinations to the on-call rotation's IAM role / Entra group; document the access path in the runbook; test it as part of the on-call onboarding.
[Recommended] Cardinality explosion in custom metrics. Goes wrong: a developer adds a custom metric tagged with user_id and request_id. Each unique combination becomes a new time series. After a week, the metrics service is rejecting writes due to cardinality limits, the cost has 10x'd, and the team finds out from the bill. Happens because: the documentation for custom metrics rarely explains cardinality limits clearly, and the developer experience for adding tags is "just add another tag". Prevent by: enforce cardinality budgets per service (e.g., max 1000 unique tag combinations per metric); reject high-cardinality dimensions in the metrics SDK at compile time where possible; include cardinality cost in the design review for any new metric.
[Critical] Alert routing that depends on a single integration that breaks silently. Goes wrong: alerts are routed via PagerDuty, which is integrated with the on-call tool via an API key that is revoked or expired. Alerts fire and are silently dropped by the broken integration. The team finds out about the missed page when the customer calls. Happens because: integrations are configured once and never tested; the failure mode is "no alert fired" which is indistinguishable from "everything is healthy". Prevent by: synthetic test alerts on a regular schedule (daily or weekly) that fire from the metrics system through the entire alert pipeline to the on-call tool, with a check that the synthetic alert was received; alert when the synthetic test fails.
[Recommended] No correlation ID across services. Goes wrong: an incident traces through five services. The on-call needs to follow a single user request through all five to understand what happened. The services log independently with their own request IDs and there is no shared correlation. The trace cannot be reconstructed. Happens because: correlation ID propagation was never enforced because "we use distributed tracing for that" and the trace turned out to be sampled out (see above). Prevent by: every entry-point service generates a correlation ID and propagates it to every downstream call (HTTP header, message attribute, gRPC metadata); every log line includes the correlation ID; the correlation ID is the primary key for cross-service investigation.
[Optional] The "we have observability" assumption based on having the tools. Goes wrong: leadership asks "are we observable" and the answer is "yes, we have Datadog/New Relic/Prometheus". The reality is that the agents are deployed but the signals are not used, the alerts are noisy, the dashboards are stale, and the on-call has not run a fire drill in a year. Happens because: tooling adoption is treated as the goal rather than the means. Prevent by: measure observability outcomes, not tooling adoption — mean time to detect, mean time to diagnose, percentage of incidents where the dashboard provided the answer, percentage of post-mortems where "missing observability" is in the contributing factors.
[Recommended] No diagnostic settings forwarding for managed services. Goes wrong: a managed service (Azure Front Door, AWS RDS, GCP Cloud SQL) has its own diagnostic logs that capture detailed operational events. The diagnostic settings are not configured to forward those logs to the central log destination. During an incident, the relevant logs only exist in the per-service portal view and are subject to that service's default retention. Happens because: diagnostic settings are off by default and require explicit per-resource configuration; the team relied on the managed service to "just log" and never checked. Prevent by: enforce diagnostic settings via Azure Policy / AWS Config rules / GCP Org Policy; verify forwarding is working by checking the central destination for each new service deployment.
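
For the alert fatigue item, the weekly disposition review can be reduced to a small calculation. The sketch below assumes each page has been recorded as a pair of alert name and whether it turned out to be actionable; the record shape and the function name `noisy_alerts` are illustrative, not part of any particular paging tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def noisy_alerts(
    dispositions: Iterable[Tuple[str, bool]],
    max_fp_rate: float = 0.8,
) -> Dict[str, float]:
    """Return alerts whose false-positive rate exceeds max_fp_rate.

    dispositions: (alert_name, was_actionable) pairs, one per page.
    The 0.8 default mirrors the ">80% false-positive" rule in the checklist.
    """
    totals: Dict[str, int] = defaultdict(int)
    false_positives: Dict[str, int] = defaultdict(int)
    for name, was_actionable in dispositions:
        totals[name] += 1
        if not was_actionable:
            false_positives[name] += 1

    flagged = {}
    for name, total in totals.items():
        rate = false_positives[name] / total
        if rate > max_fp_rate:
            flagged[name] = rate
    return flagged


# Hypothetical week of pages: one noisy alert, one useful alert.
week = [("disk_usage_anomaly", False)] * 9 + [("disk_usage_anomaly", True)]
week += [("checkout_error_rate", True)] * 4 + [("checkout_error_rate", False)]
print(noisy_alerts(week))
```

Alerts returned by this function are candidates for suppression or deletion; the threshold is a policy choice and may need to be stricter for small rotations.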
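
For the trace sampling item, a tail-based decision can be sketched as a function that runs after the request has completed. This is a minimal illustration, not the API of any tracing library: the `CompletedTrace` type, the 1000 ms slow threshold, and the 1% base rate are all assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompletedTrace:
    trace_id: str
    duration_ms: float
    had_error: bool


def keep_trace(
    trace: CompletedTrace,
    slow_threshold_ms: float = 1000.0,   # assumed definition of "slow"
    base_rate: float = 0.01,             # assumed rate for ordinary traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to export a trace once its outcome is known."""
    if trace.had_error:
        return True                      # always keep error traces
    if trace.duration_ms >= slow_threshold_ms:
        return True                      # always keep slow traces
    return rng() < base_rate             # sample the rest at the base rate


print(keep_trace(CompletedTrace("t1", 35.0, had_error=True)))
print(keep_trace(CompletedTrace("t2", 2400.0, had_error=False)))
```

The decision needs the whole trace, so spans have to be buffered until the request finishes; that buffering cost is the usual trade-off against head-based sampling.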
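
For the transient-signal item, "test the alert against historical data" can be approximated by replaying per-minute counts through the candidate rule. The sketch below assumes history is available as (errors, requests) pairs per minute; `count_firings` is a hypothetical helper, and real alerting systems evaluate rules with their own semantics.

```python
from typing import Sequence, Tuple


def count_firings(
    history: Sequence[Tuple[int, int]],
    threshold: float = 0.01,
    window: int = 5,
) -> int:
    """Count how often 'error rate > threshold for `window` consecutive
    minutes' would have fired over per-minute (errors, requests) history."""
    firings = 0
    streak = 0
    for errors, requests in history:
        rate = errors / requests if requests else 0.0
        if rate > threshold:
            streak += 1
            if streak == window:   # fire once when the rule first holds
                firings += 1
        else:
            streak = 0
    return firings


quiet = [(0, 1000)] * 10
blip = quiet + [(1, 1000)] + quiet        # one error, self-recovered
outage = quiet + [(50, 1000)] * 7 + quiet  # 5% errors for 7 minutes

print(count_firings(blip, threshold=0.0, window=1))  # "error rate > 0" rule
print(count_firings(blip))                           # sustained rule
print(count_firings(outage))                         # sustained rule
```

Replaying both the naive rule and the sustained rule over the same history shows how many pages each would have produced before either is deployed.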
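
For the cardinality item, the number of time series a metric creates is the number of distinct label combinations it is emitted with. The sketch below counts them against a budget; the budget of 1000 comes from the checklist example, while the label names and the helper names are assumptions.

```python
from itertools import product
from typing import Dict, Iterable, Tuple


def count_series(label_sets: Iterable[Dict[str, str]]) -> int:
    """Each distinct combination of label values is one time series."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


def within_budget(
    label_sets: Iterable[Dict[str, str]], budget: int = 1000
) -> Tuple[int, bool]:
    n = count_series(label_sets)
    return n, n <= budget


# Bounded labels: 3 endpoints x 2 status classes.
bounded = [
    {"endpoint": e, "status": s}
    for e, s in product(["/a", "/b", "/c"], ["2xx", "5xx"])
]
# Unbounded labels: every request carries a unique request_id.
unbounded = [
    {"user_id": str(u), "request_id": f"{u}-{r}"}
    for u, r in product(range(50), range(50))
]

print(within_budget(bounded))
print(within_budget(unbounded))
```

A check like this could run in a test or a design review against a sample of emitted labels; per-request identifiers generally belong in logs or traces rather than metric labels.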
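
For the correlation ID item, propagation has three parts: accept or generate the ID at the entry point, attach it to every log line, and forward it on downstream calls. The sketch below uses only the Python standard library; the header name `X-Correlation-ID` and the helper names are assumptions, and header lookup here is case-sensitive for brevity.

```python
import contextvars
import logging
import uuid
from typing import Dict, Optional

HEADER = "X-Correlation-ID"  # assumed header name, not a standard
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True


def begin_request(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's ID if present, otherwise generate one."""
    cid = incoming_headers.get(HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Headers for a downstream call, carrying the same correlation ID."""
    headers = dict(extra or {})
    headers[HEADER] = _correlation_id.get()
    return headers


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
log = logging.getLogger("svc")
log.addHandler(handler)
log.setLevel(logging.INFO)

begin_request({HEADER: "req-123"})
log.info("charging card")        # log line is prefixed with req-123
print(outgoing_headers())
```

The same pattern applies to message attributes and gRPC metadata: read the ID on the way in, write it on the way out.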

Why This Matters

Observability failures are the failures you do not see until you need the signal. The metrics, logs, and traces are working "fine" until an incident happens, and then the failure modes reveal themselves all at once. The recovery from observability failure is much harder than the prevention because the failure is in the past — by the time you notice the missing signal, the data that would have answered the question is gone.

The most expensive observability failures are not the ones where the tool is broken. They are the ones where the tool is working but the signal is wrong: the alert that paged on the wrong threshold, the trace that was sampled out, the metric that has the wrong labels, the dashboard that takes 30 seconds to load. Each of these is technically functional. Each of them is functionally useless. The audit posture of "we have observability" is the tooling; the operational posture of "we can answer questions during incidents" is the signal quality, and the two are very different things.

The highest-leverage controls are post-incident reviews that ask "what signal would have made this faster" and treat the answer as a deliverable. Every incident should leave the observability surface measurably better than it was before. Without that discipline, observability stays where it was when the team built it, and the gaps accumulate.

Common Failure Combinations

Alert fatigue + a real critical alert = the missed page that becomes the incident that becomes the post-mortem
Trace sampling + low error rate = no traces for the affected requests, no diagnostic signal for the incident
Metrics pipeline depends on the failing region = no monitoring during the only time monitoring matters
High-cardinality custom metric + ingestion limit = the cost surprise that arrives as a partial outage
No correlation ID + sampled-out traces = the investigation that has to be done by hand from raw logs
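
The "trace sampling + low error rate" combination can be quantified. Assuming each request is sampled independently at a uniform rate, the chance that no affected request was traced is (1 − rate) raised to the number of affected requests. The sketch below uses a 1% rate and 50 affected requests as illustrative inputs.

```python
def prob_no_sampled_trace(sample_rate: float, affected_requests: int) -> float:
    """Probability that none of the affected requests is in the sample,
    assuming independent uniform sampling per request."""
    return (1.0 - sample_rate) ** affected_requests


rate, affected = 0.01, 50
p_none = prob_no_sampled_trace(rate, affected)
print(f"expected sampled affected requests: {rate * affected:.1f}")
print(f"P(no trace for any affected request): {p_none:.3f}")
print(f"P(at least one trace): {1 - p_none:.3f}")
```

With these inputs the investigation has no relevant trace in roughly three cases out of five, which is the argument for keeping error and slow traces regardless of the base rate.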
