Database High Availability

Scope

This file covers database high availability patterns and failover strategies including replication topologies, connection management during failover, split-brain prevention, and read scaling. These patterns are cloud-agnostic and apply to both managed and self-managed database deployments. For database migration strategies, see general/database-migration.md. For broader disaster recovery planning including RPO/RTO definition and DR testing, see general/disaster-recovery.md. For database design decisions and storage engine selection, see general/data.md.

Checklist

  • [Critical] What replication topology is used? (active-passive with a single primary and one or more standbys is simplest and avoids write conflicts; active-active or multi-writer enables writes at multiple nodes but requires conflict resolution such as last-write-wins, CRDTs, or application-level merge -- only justified when write availability across regions is a hard requirement)
  • [Critical] Is replication synchronous or asynchronous? (synchronous guarantees zero data loss but adds write latency proportional to round-trip time to the replica -- only practical within a single region or metro area at < 5ms RTT; asynchronous eliminates write latency impact but introduces replication lag and potential data loss during failover equal to uncommitted transactions; semi-synchronous offers a middle ground where the primary waits for at least one replica acknowledgment before confirming the write)
  • [Critical] How is failover detected and executed? (automated failover via health checks reduces RTO but risks false positives during network partitions; manual failover adds human judgment but increases RTO by 5-30 minutes; hybrid approach with automated detection and manual approval is most common for production databases; tools include Patroni for PostgreSQL, MySQL Group Replication, MongoDB election protocol, or managed service automatic failover such as RDS Multi-AZ)
  • [Critical] How is split-brain prevented? (when network partitions occur, multiple nodes may believe they are the primary; prevention requires quorum-based consensus with an odd number of voters, witness nodes in a third failure domain, and STONITH/fencing to forcibly shut down the old primary before promoting a new one; without fencing, both nodes accept writes and data diverges irrecoverably)
  • [Critical] How do applications connect during and after failover? (DNS-based failover updates a CNAME or uses Route 53 health checks with 60-300 second TTL but client DNS caching may delay reconnection; proxy-based failover via PgBouncer, ProxySQL, HAProxy, or cloud-native proxies like RDS Proxy provides transparent rerouting without application changes; driver-based failover uses multi-host connection strings where the driver discovers the new primary automatically -- supported by libpq, MySQL Connector, MongoDB drivers)
  • [Recommended] Is connection pooling configured? (connection pooling reduces database connection overhead and improves failover behavior; PgBouncer for PostgreSQL, ProxySQL for MySQL, RDS Proxy for AWS managed databases, Azure SQL connection pooling for Azure; pool size must be tuned to match database max_connections minus headroom for administrative connections; transaction-mode pooling is preferred over session-mode for connection efficiency but does not support session-level features like prepared statements or advisory locks)
  • [Recommended] Are read replicas used for read scaling? (read replicas offload SELECT queries from the primary; requires query routing at the application, ORM, or proxy layer to direct reads to replicas and writes to the primary; replication lag means replicas may serve stale data -- applications must tolerate eventual consistency or use primary for read-after-write scenarios; monitor replication lag as an SLI with alerting thresholds)
  • [Recommended] How does the application handle transient connection failures? (implement retry logic with exponential backoff and jitter for connection errors and query timeouts; use circuit breakers to avoid overwhelming a recovering database with reconnection storms; ensure write operations are idempotent so retries do not create duplicate data -- use idempotency keys, upserts, or conditional writes)
  • [Recommended] Is cross-region replication configured for DR? (cross-region async replication enables regional failover but RPO equals replication lag, typically seconds to minutes; network latency between regions makes synchronous replication impractical for most workloads; managed services offer cross-region read replicas that can be promoted -- RDS cross-region replicas, Cloud SQL cross-region replicas, Azure SQL geo-replication; evaluate whether cross-region HA is needed vs cross-region backup restore which has higher RTO but lower cost)
  • [Recommended] What managed database HA features are used? (RDS Multi-AZ provides synchronous standby with automatic failover in 60-120 seconds; Cloud SQL HA uses regional instances with automatic failover; Azure SQL zone-redundant deployment distributes replicas across availability zones; understand what the managed service handles automatically vs what requires configuration -- some require manual promotion of cross-region replicas, some pause replication during maintenance windows)
  • [Recommended] How is self-managed database HA orchestrated? (Patroni with etcd for PostgreSQL provides leader election, automatic failover, and REST API for health checks; MySQL Group Replication or InnoDB Cluster with MySQL Router for MySQL; MongoDB replica sets with built-in election protocol; all require careful configuration of timeouts, health check intervals, and failure thresholds to balance fast failover against false positive risk)
  • [Optional] Is connection string management automated? (hardcoded connection strings break during failover; use service discovery such as Consul or Kubernetes services, environment-injected connection strings, or secrets management with automatic rotation; for Kubernetes deployments, database endpoints can be managed via ExternalName services or operator-managed secrets that update on failover)
  • [Optional] Are failover procedures tested regularly? (scheduled failover tests validate that detection, promotion, connection rerouting, and application reconnection all work end-to-end; test during maintenance windows initially, then progress to unannounced tests; measure actual RTO and compare against targets; verify that monitoring and alerting triggers correctly during failover events)
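
The quorum and fencing logic behind split-brain prevention reduces to a small invariant, which the following sketch makes concrete. `candidate` and `fence_old_primary` are hypothetical hooks standing in for whatever your orchestrator provides; real tools such as Patroni implement this sequence internally.

```python
def has_quorum(reachable_voters, total_voters):
    """A node may hold or take the primary role only while it can
    reach a strict majority of voters. With an odd total, at most
    one side of any network partition can have a majority, so two
    primaries cannot coexist."""
    return reachable_voters > total_voters // 2


def safe_promote(candidate, reachable_voters, total_voters, fence_old_primary):
    """Promotion order matters: check quorum, fence (STONITH) the old
    primary so it can no longer accept writes, and only then promote.
    `candidate` and `fence_old_primary` are illustrative placeholders."""
    if not has_quorum(reachable_voters, total_voters):
        raise RuntimeError("no quorum -- refusing to promote during a possible partition")
    fence_old_primary()  # e.g. power off the node, revoke its VIP, or block it at the switch
    candidate.promote()
```

Note that `has_quorum(1, 2)` is false: with an even voter count, a clean 1-1 partition leaves neither side able to promote, which is why an odd number of voters (or a witness in a third failure domain) is recommended.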
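
Driver-based failover via multi-host connection strings can be illustrated with libpq's `target_session_attrs` feature; the hostnames, database, and user below are placeholders, and the commented-out connect call requires a reachable cluster.

```python
def build_failover_dsn(hosts, port=5432, dbname="app", user="app"):
    """Build a libpq-style DSN listing every candidate primary.
    With target_session_attrs=read-write, the driver tries the hosts
    in order and keeps the first one that accepts writes, so it
    discovers the new primary after failover without a config change."""
    return (
        f"host={','.join(hosts)} "
        f"port={','.join(str(port) for _ in hosts)} "
        f"dbname={dbname} user={user} "
        f"target_session_attrs=read-write connect_timeout=3"
    )


dsn = build_failover_dsn(["db-a.internal", "db-b.internal"])
print(dsn)
# conn = psycopg2.connect(dsn)  # requires a reachable cluster
```

MySQL Connector and MongoDB drivers expose equivalent multi-host mechanisms under different option names.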
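
Application-level read/write routing from the read-replica item can be sketched as a minimal router; `primary` and `replicas` stand in for whatever connection handles or factories the application actually uses.

```python
class ReadWriteRouter:
    """Minimal application-level router: writes and read-after-write
    reads go to the primary; other reads round-robin across replicas
    and must tolerate replication lag (eventual consistency)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._next = 0

    def connection_for(self, *, write=False, read_after_write=False):
        # Route writes, and reads that must see the caller's own
        # writes, to the primary; everything else to a replica.
        if write or read_after_write or not self.replicas:
            return self.primary
        conn = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return conn
```

In practice this logic often lives in the ORM or a proxy such as ProxySQL rather than hand-rolled code, but the routing decision is the same.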
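
Retry with exponential backoff and full jitter, as recommended for transient connection failures, might look like the sketch below; `TransientDBError` is a stand-in for whatever driver-specific connection and timeout exceptions apply.

```python
import random
import time


class TransientDBError(Exception):
    """Stand-in for driver-specific connection/timeout errors."""


def with_retries(op, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Run op(), retrying transient failures with exponential backoff
    and full jitter so reconnecting clients do not stampede a
    recovering database in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientDBError:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to this attempt's
            # capped exponential backoff.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

Only retry operations that are idempotent, or make them so with idempotency keys or upserts, since a retried write may have already committed on the server.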

Why This Matters

Database downtime is disproportionately impactful compared to application-tier outages. When an application server fails, a load balancer routes traffic to healthy instances within seconds. When a database fails without HA, every application instance loses access to state simultaneously, and recovery requires manual intervention -- restoring from backup, replaying transaction logs, and reconfiguring connection strings. Even brief database outages cascade into extended application outages because connection pools fill with stale connections, retry storms overwhelm the recovering database, and caches go cold.

The choice between replication modes has profound consequences. Synchronous replication guarantees zero data loss but couples application write latency to network distance -- a decision that constrains where replicas can be placed. Asynchronous replication eliminates the latency penalty but introduces a window of potential data loss that must be quantified, communicated to stakeholders, and accounted for in business continuity planning. Many teams discover their actual replication lag only during an incident, when it is too late to change the architecture.

Split-brain is the most dangerous failure mode in database HA. If two nodes both accept writes as primary, the resulting data divergence is extremely difficult to reconcile -- some transactions will be lost regardless of how the merge is performed. Prevention requires consensus mechanisms, fencing, and witness nodes, all of which add complexity. Teams that skip split-brain prevention to simplify their architecture often learn its importance through a production incident that causes permanent data loss.

Connection management during failover is frequently the weakest link. Even when the database fails over in under a minute, applications may take much longer to reconnect if they cache DNS records, hold stale connection pool entries, or lack retry logic. The effective RTO is not how fast the database promotes a new primary -- it is how long until the last application instance successfully reconnects and resumes serving traffic.
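
A back-of-envelope way to reason about effective RTO treats the worst case as the phases serializing: database promotion, then the longest a client can keep resolving the old address from its DNS cache, then pool recovery. The numbers below are illustrative, not benchmarks.

```python
def effective_rto_seconds(promotion, dns_ttl, pool_retry_window):
    """Worst-case client-observed outage: promotion time, plus the
    cached DNS TTL during which a client may still dial the old
    primary, plus the time for the pool to evict stale connections
    and re-establish healthy ones."""
    return promotion + dns_ttl + pool_retry_window


# A 90-second promotion can still mean a 7.5-minute outage for the
# unluckiest client behind a 300-second TTL.
print(effective_rto_seconds(promotion=90, dns_ttl=300, pool_retry_window=60))  # 450
```

This is why proxy-based or driver-based routing, which removes the DNS term entirely, often shrinks effective RTO more than any tuning of the database itself.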

Common Decisions (ADR Triggers)

  • Replication topology -- active-passive (simpler operations, no write conflicts, standard for most workloads) vs active-active (write availability in multiple regions, requires conflict resolution, justified only when cross-region write latency is unacceptable)
  • Replication mode -- synchronous (zero data loss, write latency penalty, viable within a region) vs asynchronous (no latency penalty, potential data loss equal to replication lag) vs semi-synchronous (compromise, at least one replica acknowledges)
  • Failover mechanism -- automated (lowest RTO, false positive risk, requires fencing) vs manual (human judgment, higher RTO) vs hybrid (automated detection, manual approval, most common for production)
  • Connection routing strategy -- DNS-based (simple, TTL-dependent, client caching issues) vs proxy-based (transparent failover, additional infrastructure to manage) vs driver-based (application-aware, no additional infrastructure, requires driver support)
  • Connection pooler selection -- PgBouncer (lightweight, PostgreSQL-specific, transaction vs session mode tradeoff) vs ProxySQL (MySQL, query routing and caching built in) vs RDS Proxy (AWS-managed, no infrastructure to operate, adds cost) vs application-side pooling (simpler deployment, per-instance pools do not share connections)
  • Managed vs self-managed HA -- managed database HA (less operational burden, limited customization, provider-specific behavior) vs self-managed with Patroni/Group Replication/replica sets (full control, requires expertise, portable across environments)
  • Read replica consistency model -- eventual consistency acceptable for most reads (simpler, better performance) vs read-after-write consistency required for specific flows (route those reads to primary, adds complexity to query routing)
  • Cross-region replication scope -- cross-region HA with promotable replicas (lower RTO, higher cost) vs cross-region backup restore only (higher RTO, lower steady-state cost, acceptable for Tier 2/3 workloads)
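
The pool-sizing guidance behind the connection pooler decision reduces to simple arithmetic: divide the database's connection budget, minus administrative headroom, across application instances. The figures in the example are illustrative.

```python
def per_instance_pool_size(max_connections, admin_headroom, app_instances):
    """Share the database's connection budget across per-instance
    pools, reserving headroom for superuser and monitoring sessions.
    Without this cap, scaling out the application tier can exhaust
    max_connections and lock out administrators during an incident."""
    budget = max_connections - admin_headroom
    return max(1, budget // app_instances)


print(per_instance_pool_size(max_connections=500, admin_headroom=20, app_instances=12))  # 40
```

A shared pooler such as PgBouncer or RDS Proxy sidesteps this division by multiplexing all instances onto one server-side pool, which is one reason to prefer it over purely application-side pooling at scale.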

See Also

  • disaster-recovery.md -- RPO/RTO definition, failover model selection, DR testing methodology
  • data.md -- database engine selection, storage design, data modeling
  • database-migration.md -- migration strategies, cutover planning, and rollback
  • enterprise-backup.md -- backup strategies, retention, and restore procedures
  • networking.md -- network architecture relevant to cross-region replication and latency
  • observability.md -- monitoring replication lag, failover events, and connection pool health