Data Failure Patterns

Scope

Covers common data-layer failure patterns including replication lag, untested backups, encryption key management gaps, connection pool exhaustion, split-brain scenarios, and schema migration failures. Does not cover general data architecture design (see general/data.md) or database-specific provider configurations (see providers/ files).

Checklist

  • [Critical] Unmonitored replication lag causing stale reads — Goes wrong: read replicas fall behind the primary by seconds or minutes, and applications reading from replicas return stale or inconsistent data, leading to incorrect business decisions or user-visible bugs. Happens because: replication lag is not monitored or alerting thresholds are too lenient. Prevent by: monitoring replication lag with alerts at meaningful thresholds (e.g., >1 second), routing critical reads to the primary, and investigating sustained lag as a capacity signal.
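
    The thresholds above can be sketched as a small alert/routing check. A minimal illustration — the threshold values, function names, and the idea of a lag measurement arriving as a float are assumptions, not any particular monitoring product's API (on Postgres the measurement itself might come from pg_stat_replication):

    ```python
    # Illustrative thresholds; tune to your workload's staleness tolerance.
    WARN_SECONDS = 1.0       # sustained lag worth investigating
    CRITICAL_SECONDS = 10.0  # actively route critical reads away from replicas

    def classify_lag(lag_seconds: float) -> str:
        """Map a replica's measured lag to an alert level."""
        if lag_seconds >= CRITICAL_SECONDS:
            return "critical"
        if lag_seconds >= WARN_SECONDS:
            return "warn"
        return "ok"

    def choose_endpoint(lag_seconds: float, read_is_critical: bool) -> str:
        """Send critical reads to the primary whenever the replica is lagging."""
        if read_is_critical and lag_seconds >= WARN_SECONDS:
            return "primary"
        return "replica"
    ```

    Treating sustained "warn"-level lag as a capacity signal (not just noise) is the point: the routing function only papers over lag while you investigate.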

  • [Critical] Backups exist but have never been tested with a restore — Goes wrong: during an actual data loss event, the restore process fails due to corrupted backups, missing permissions, incompatible versions, or nobody knowing the procedure, turning a recoverable incident into permanent data loss. Happens because: backup configuration is set-and-forget, and restore drills are never scheduled. Prevent by: performing quarterly restore drills to a separate environment, automating restore validation, and documenting the full recovery procedure in a runbook.
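
    Automated restore validation can be as simple as comparing per-table checksums between the source and the restored copy. A sketch under assumed names — the rows-as-tuples representation is hypothetical; in a real drill the rows would be read from the restored database:

    ```python
    import hashlib

    def table_checksum(rows) -> str:
        """Order-insensitive checksum over a table's rows (hypothetical
        rows-as-tuples shape; real drills would stream rows from the DB)."""
        h = hashlib.sha256()
        for encoded in sorted(repr(r).encode() for r in rows):
            h.update(encoded)
        return h.hexdigest()

    def validate_restore(source_tables: dict, restored_tables: dict) -> list:
        """Return the names of tables whose restored contents differ."""
        failures = []
        for name, rows in source_tables.items():
            restored = restored_tables.get(name)
            if restored is None or table_checksum(rows) != table_checksum(restored):
                failures.append(name)
        return failures
    ```

    A quarterly drill that runs this (or row-count spot checks) against a restore into a separate environment catches corrupted backups before an incident does.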

  • [Critical] Encryption key rotation gaps or lost key material — Goes wrong: encrypted data becomes permanently inaccessible when keys expire or are deleted, or compliance audits fail because keys have not been rotated within the required period. Happens because: key rotation is manual, key lifecycle is not tracked, or KMS policies allow key deletion without safeguards. Prevent by: enabling automatic key rotation, setting deletion waiting periods (minimum 30 days), and alerting when keys approach expiry or rotation deadlines.
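
    The rotation-deadline alerting can be expressed as a date calculation. A minimal sketch — the 365-day period and 30-day lead time are illustrative placeholders for whatever your compliance requirement actually specifies:

    ```python
    from datetime import date, timedelta

    ROTATION_PERIOD_DAYS = 365  # assumed compliance requirement
    ALERT_LEAD_DAYS = 30        # warn this far before the deadline

    def rotation_status(last_rotated: date, today: date) -> str:
        """Classify a key against its rotation deadline."""
        deadline = last_rotated + timedelta(days=ROTATION_PERIOD_DAYS)
        if today >= deadline:
            return "overdue"
        if today >= deadline - timedelta(days=ALERT_LEAD_DAYS):
            return "due-soon"
        return "ok"
    ```

    Running a check like this over the full key inventory, rather than relying on memory, is what turns key lifecycle into something tracked instead of discovered.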

  • [Critical] Connection pool exhaustion under load — Goes wrong: application instances run out of database connections, causing requests to queue and eventually time out, leading to cascading failures across the application tier. Happens because: connection pool sizes are set to defaults without load testing, connection leaks exist in application code, or too many microservices independently connect to the same database. Prevent by: right-sizing connection pools based on load testing, using connection pooling proxies (PgBouncer, RDS Proxy), monitoring active connection counts, and fixing connection leaks.
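
    One cheap sanity check: sum every service's worst-case connections (instances × per-instance pool size) and compare against the database's connection limit with headroom reserved. A sketch with a hypothetical service inventory shape:

    ```python
    def total_connections(services: dict) -> int:
        """services: {name: (instance_count, pool_size)} — a hypothetical
        inventory of everything that connects to this database."""
        return sum(instances * pool for instances, pool in services.values())

    def pool_budget_ok(services: dict, max_connections: int,
                       headroom: float = 0.8) -> bool:
        """Keep total demand under a fraction of max_connections, leaving
        headroom for admin sessions, migrations, and failover reconnects."""
        return total_connections(services) <= max_connections * headroom
    ```

    When the sum exceeds the budget, that is the signal to add a pooling proxy (so services share a small server-side pool) rather than to raise max_connections.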

  • [Critical] Split-brain in multi-region database setups — Goes wrong: both regions accept writes simultaneously after a network partition, creating conflicting data that is extremely difficult or impossible to reconcile. Happens because: multi-region active-active writes are configured without proper conflict resolution, or failover automation promotes a replica while the primary is still accepting writes. Prevent by: using single-writer architectures with read replicas, implementing conflict-free replicated data types (CRDTs) for active-active, and requiring manual confirmation for failover when network status is ambiguous.
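
    The failover-safety rule — never promote on quorum alone when the old primary's status is ambiguous — can be sketched as a guard in the promotion automation. Names and shape are illustrative, not any orchestrator's actual API:

    ```python
    def may_promote(replica_votes: int, total_nodes: int,
                    primary_confirmed_down: bool) -> bool:
        """Promote a replica only with a strict majority AND positive
        confirmation (fencing) that the old primary can no longer accept
        writes. An ambiguous partition returns False, forcing a human
        decision instead of risking two writers."""
        has_quorum = replica_votes > total_nodes // 2
        return has_quorum and primary_confirmed_down
    ```

    The asymmetry is deliberate: a delayed failover costs minutes of availability, while a premature one can cost days of manual conflict reconciliation.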

  • [Critical] Storage volume fills to 100% capacity — Goes wrong: the database crashes or enters read-only mode, writes fail, and the application goes down. Recovery may require downtime to resize storage. Happens because: storage growth is not monitored, auto-scaling storage is not enabled, or log/temp files consume unexpected space. Prevent by: enabling storage auto-scaling where available, alerting at 70% and 85% capacity thresholds, setting up automated cleanup of old logs and temp tables, and capacity planning based on growth trends.
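
    The 70%/85% thresholds pair naturally with a growth-trend projection that estimates days until the volume is full. A minimal sketch (a linear growth model is an assumption; real usage may be bursty):

    ```python
    def capacity_alert(used_fraction: float) -> str:
        """Alert levels matching the 70% / 85% thresholds above."""
        if used_fraction >= 0.85:
            return "critical"
        if used_fraction >= 0.70:
            return "warn"
        return "ok"

    def days_until_full(used_gb: float, total_gb: float,
                        daily_growth_gb: float) -> float:
        """Linear projection from recent growth; infinity if not growing."""
        if daily_growth_gb <= 0:
            return float("inf")
        return (total_gb - used_gb) / daily_growth_gb
    ```

    The projection matters because 70% of a slowly growing volume is routine, while 70% with two weeks of runway is an emergency.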

  • [Critical] Missing point-in-time recovery (PITR) capability — Goes wrong: an accidental DELETE or UPDATE without a WHERE clause destroys data, and the only recovery option is restoring the last nightly backup, losing hours of transactions. Happens because: PITR is not enabled due to cost or oversight, or retention is too short. Prevent by: enabling PITR with sufficient retention (minimum 7 days, 35 days recommended), and documenting the exact steps to restore to a specific timestamp.
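
    The cost of skipping PITR is easy to quantify as a worst-case data-loss window (RPO). A back-of-the-envelope sketch — the 5-minute PITR granularity is an assumed log-shipping interval, not a property of any specific database:

    ```python
    def worst_case_loss_minutes(pitr_enabled: bool,
                                backup_interval_hours: float = 24,
                                pitr_granularity_minutes: float = 5) -> float:
        """With only nightly backups, the worst case loses a full backup
        interval; with PITR, loss is bounded by the (assumed) granularity
        at which transaction logs are shipped."""
        if pitr_enabled:
            return pitr_granularity_minutes
        return backup_interval_hours * 60
    ```

    Framing it this way (1,440 minutes of lost transactions versus roughly 5) usually settles the cost argument against enabling PITR.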

  • [Critical] Schema migration breaks running application — Goes wrong: a database migration adds a NOT NULL column without a default, renames a column still referenced by running code, or locks a large table for minutes, causing application errors or downtime. Happens because: migrations are not tested against production-scale data, or the migration and code deployment are not coordinated. Prevent by: using expand-and-contract migration patterns, testing migrations against production-size datasets, avoiding destructive schema changes in a single step, and running migrations independently from application deploys.
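
    The expand-and-contract pattern for a column rename can be sketched as separately shipped phases. The table and column names here are entirely hypothetical, and each phase is a separate deploy with the application updated in between:

    ```python
    # Expand-and-contract for renaming users.fullname -> users.display_name.
    # Phase 1 (expand): additive, safe while old code still runs.
    EXPAND = [
        "ALTER TABLE users ADD COLUMN display_name TEXT",
        # Backfill; in practice run in batches to avoid long locks.
        "UPDATE users SET display_name = fullname WHERE display_name IS NULL",
    ]
    # ...deploy code that writes both columns and reads the new one...
    # Phase 2 (contract): destructive, only after no code references fullname.
    CONTRACT = [
        "ALTER TABLE users DROP COLUMN fullname",
    ]
    MIGRATION_PHASES = EXPAND + CONTRACT
    ```

    The key property: at every point in the sequence, both the previous and the current application version run correctly, so migration and code deploy never need to be atomic.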

  • [Critical] No read replica promotion plan for primary failure — Goes wrong: the primary database fails and the team scrambles to manually promote a replica, taking 30-60 minutes instead of the expected 1-2 minutes, because the process has never been practiced. Happens because: managed failover is assumed to be automatic but was never configured or tested. Prevent by: configuring automated failover (Multi-AZ, Aurora failover), testing promotion during maintenance windows, and documenting manual promotion steps as a backup procedure.

  • [Critical] Sensitive data stored without encryption at rest — Goes wrong: a compromised storage volume, snapshot, or backup exposes plaintext sensitive data (PII, financial records, health data), resulting in regulatory penalties and breach notifications. Happens because: encryption is not enabled by default, or teams assume network security is sufficient. Prevent by: enabling encryption at rest on all data stores (databases, object storage, EBS volumes, backups), using customer-managed keys for sensitive workloads, and scanning for unencrypted resources with automated compliance tools.
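
    Automated scanning for unencrypted resources reduces to filtering an inventory by its encryption flag. A sketch over a hypothetical inventory shape (real scans would pull this from a cloud provider's resource listing or a compliance tool):

    ```python
    def unencrypted_resources(inventory: list) -> list:
        """inventory: list of dicts with 'id' and a boolean 'encrypted'
        key — a hypothetical shape for cloud inventory output. Resources
        missing the flag are treated as unencrypted (fail closed)."""
        return [r["id"] for r in inventory if not r.get("encrypted", False)]
    ```

    Treating a missing flag as a finding is the important choice: the scan should fail closed rather than assume the best.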

  • [Recommended] Cache stampede on cache expiry or failure — Goes wrong: when a cache key expires or the cache layer fails, all application instances simultaneously query the database to rebuild the cache, overwhelming the database and causing a cascading outage. Happens because: no cache warming strategy exists, and all instances treat cache misses identically. Prevent by: implementing cache warming on deployment, using staggered TTLs to avoid simultaneous expiry, adding request coalescing (single-flight) for cache rebuilds, and designing the application to degrade gracefully when the cache is unavailable.
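
    Two of the preventions above — staggered TTLs and single-flight rebuilds — are small enough to sketch directly. This is a minimal in-process illustration (it omits error propagation to followers, and a multi-instance deployment would need a distributed lock for the same effect):

    ```python
    import random
    import threading

    def jittered_ttl(base_seconds: float, jitter_fraction: float = 0.1) -> float:
        """Stagger expiry so keys written together don't expire together."""
        return base_seconds * (1 + random.uniform(-jitter_fraction, jitter_fraction))

    class SingleFlight:
        """Collapse concurrent rebuilds of the same cache key into one call."""
        def __init__(self):
            self._lock = threading.Lock()
            self._inflight = {}  # key -> (event, result holder)

        def do(self, key, fn):
            with self._lock:
                entry = self._inflight.get(key)
                if entry is None:
                    event, holder = threading.Event(), {}
                    self._inflight[key] = (event, holder)
            if entry is not None:        # follower: wait for the leader's result
                follower_event, follower_holder = entry
                follower_event.wait()
                return follower_holder["value"]
            try:                         # leader: hit the database exactly once
                holder["value"] = fn()
                return holder["value"]
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
    ```

    With this in place, a hot key expiring causes one database query per process instead of one per request, which is usually the difference between a blip and a stampede.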

  • [Critical] Database credentials shared or hardcoded — Goes wrong: credentials leaked in source control, logs, or config files are used to access the database, and because one set of credentials is shared across services, revocation requires coordinating changes across all consumers. Happens because: using a single shared credential is simpler than managing per-service credentials. Prevent by: issuing per-service credentials with least-privilege grants, storing credentials in a secrets manager with automatic rotation, and auditing database access logs for anomalous patterns.

  • [Recommended] No monitoring on database deadlocks or long-running queries — Goes wrong: deadlocks cause transaction failures and user-facing errors; long-running queries consume connections and block other operations, degrading performance for all users. Happens because: database performance monitoring is not set up, or default monitoring does not surface these metrics. Prevent by: enabling database performance insights, alerting on deadlock counts and query duration thresholds, implementing query timeouts, and regularly reviewing slow query logs.

  • [Recommended] Orphaned snapshots and backups accumulating cost — Goes wrong: hundreds of old database snapshots and backups accumulate over months, driving storage costs into thousands of dollars per month with no retention policy governing their lifecycle. Happens because: automated backups create snapshots but no lifecycle policy deletes old ones. Prevent by: implementing snapshot retention policies (e.g., 35 days for automated, tagged manual snapshots with expiry dates), automating cleanup of untagged snapshots, and reviewing storage costs monthly.
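
    The retention policy above can be sketched as a cleanup filter: automated snapshots age out at a fixed retention, while manual snapshots are deleted only past an explicit expiry tag (untagged manual snapshots are kept here, though flagging them for review is a reasonable stricter policy). The dict shape is a hypothetical stand-in for a cloud inventory listing:

    ```python
    from datetime import date, timedelta

    def snapshots_to_delete(snapshots: list, today: date,
                            automated_retention_days: int = 35) -> list:
        """Return ids of snapshots eligible for cleanup under the policy:
        automated snapshots past retention; manual snapshots past their
        'expires' tag. Untagged manual snapshots are left alone."""
        doomed = []
        for s in snapshots:
            if s["kind"] == "automated":
                if today - s["created"] > timedelta(days=automated_retention_days):
                    doomed.append(s["id"])
            else:  # manual: honor an explicit expiry tag only
                expires = s.get("expires")
                if expires is not None and today > expires:
                    doomed.append(s["id"])
        return doomed
    ```

    Run on a schedule alongside a monthly cost review, this keeps snapshot spend proportional to the retention policy rather than to the age of the account.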

Why This Matters

Data is the most valuable and hardest-to-replace asset in any system. Unlike compute or networking, which can be rebuilt in minutes, lost or corrupted data may be unrecoverable. Replication lag, untested backups, and missing PITR turn routine incidents into catastrophic data loss events. Connection pool exhaustion and split-brain scenarios cause outages that are difficult to diagnose and resolve under pressure.

Common Decisions (ADR Triggers)

  • Backup and PITR retention policy — retention length, restore drill frequency, RPO/RTO targets
  • Multi-region data strategy — single-writer vs active-active, conflict resolution approach
  • Connection pooling architecture — application-level pools vs sidecar proxies vs managed proxies
  • Encryption key management — provider-managed vs customer-managed keys, rotation policy
  • Cache failure strategy — degrade gracefully vs fail closed, cache warming approach
  • Schema migration workflow — expand-and-contract vs lock-and-migrate, migration tooling selection

See Also

  • general/data.md — Data architecture patterns and database selection
  • general/disaster-recovery.md — Backup, recovery, and business continuity planning
  • failures/scaling.md — Scaling failures that compound data-layer bottlenecks
  • general/security.md — Encryption and credential management