Database High Availability

Scope

This file covers database high availability patterns and failover strategies including replication topologies, connection management during failover, split-brain prevention, and read scaling. These patterns are cloud-agnostic and apply to both managed and self-managed database deployments. For database migration strategies, see general/database-migration.md. For broader disaster recovery planning including RPO/RTO definition and DR testing, see general/disaster-recovery.md. For database design decisions and storage engine selection, see general/data.md.

Checklist

  • [Critical] What replication topology is used? (active-passive with a single primary and one or more standbys is simplest and avoids write conflicts; active-active or multi-writer enables writes at multiple nodes but requires conflict resolution such as last-write-wins, CRDTs, or application-level merge -- only justified when write availability across regions is a hard requirement)
  • [Critical] Is replication synchronous or asynchronous? (synchronous guarantees zero data loss but adds write latency proportional to round-trip time to the replica -- only practical within a single region or metro area at < 5ms RTT; asynchronous eliminates write latency impact but introduces replication lag and potential data loss during failover equal to uncommitted transactions; semi-synchronous offers a middle ground where the primary waits for at least one replica acknowledgment before confirming the write)
  • [Critical] How is failover detected and executed? (automated failover via health checks reduces RTO but risks false positives during network partitions; manual failover adds human judgment but increases RTO by 5-30 minutes; hybrid approach with automated detection and manual approval is most common for production databases; tools include Patroni for PostgreSQL, MySQL Group Replication, MongoDB election protocol, or managed service automatic failover such as RDS Multi-AZ)
  • [Critical] How is split-brain prevented? (when network partitions occur, multiple nodes may believe they are the primary; prevention requires quorum-based consensus with an odd number of voters, witness nodes in a third failure domain, and STONITH/fencing to forcibly shut down the old primary before promoting a new one; without fencing, both nodes accept writes and data diverges irrecoverably)
  • [Critical] How do applications connect during and after failover? (DNS-based failover updates a CNAME or uses Route 53 health checks with 60-300 second TTL but client DNS caching may delay reconnection; proxy-based failover via PgBouncer, ProxySQL, HAProxy, or cloud-native proxies like RDS Proxy provides transparent rerouting without application changes; driver-based failover uses multi-host connection strings where the driver discovers the new primary automatically -- supported by libpq, MySQL Connector, MongoDB drivers)
  • [Recommended] Is connection pooling configured? (connection pooling reduces database connection overhead and improves failover behavior; PgBouncer for PostgreSQL, ProxySQL for MySQL, RDS Proxy for AWS managed databases, Azure SQL connection pooling for Azure; pool size must be tuned to match database max_connections minus headroom for administrative connections; transaction-mode pooling is preferred over session-mode for connection efficiency but does not support session-level features like prepared statements or advisory locks)
  • [Recommended] Are read replicas used for read scaling? (read replicas offload SELECT queries from the primary; requires query routing at the application, ORM, or proxy layer to direct reads to replicas and writes to the primary; replication lag means replicas may serve stale data -- applications must tolerate eventual consistency or use primary for read-after-write scenarios; monitor replication lag as an SLI with alerting thresholds)
  • [Recommended] How does the application handle transient connection failures? (implement retry logic with exponential backoff and jitter for connection errors and query timeouts; use circuit breakers to avoid overwhelming a recovering database with reconnection storms; ensure write operations are idempotent so retries do not create duplicate data -- use idempotency keys, upserts, or conditional writes)
  • [Recommended] Is cross-region replication configured for DR? (cross-region async replication enables regional failover but RPO equals replication lag, typically seconds to minutes; network latency between regions makes synchronous replication impractical for most workloads; managed services offer cross-region read replicas that can be promoted -- RDS cross-region replicas, Cloud SQL cross-region replicas, Azure SQL geo-replication; evaluate whether cross-region HA is needed vs cross-region backup restore which has higher RTO but lower cost)
  • [Recommended] What managed database HA features are used? (RDS Multi-AZ provides synchronous standby with automatic failover in 60-120 seconds; Cloud SQL HA uses regional instances with automatic failover; Azure SQL zone-redundant deployment distributes replicas across availability zones; understand what the managed service handles automatically vs what requires configuration -- some require manual promotion of cross-region replicas, some pause replication during maintenance windows)
  • [Recommended] How is self-managed database HA orchestrated? (Patroni with etcd for PostgreSQL provides leader election, automatic failover, and REST API for health checks; MySQL Group Replication or InnoDB Cluster with MySQL Router for MySQL; MongoDB replica sets with built-in election protocol; all require careful configuration of timeouts, health check intervals, and failure thresholds to balance fast failover against false positive risk)
  • [Optional] Is connection string management automated? (hardcoded connection strings break during failover; use service discovery such as Consul or Kubernetes services, environment-injected connection strings, or secrets management with automatic rotation; for Kubernetes deployments, database endpoints can be managed via ExternalName services or operator-managed secrets that update on failover)
  • [Optional] Are failover procedures tested regularly? (scheduled failover tests validate that detection, promotion, connection rerouting, and application reconnection all work end-to-end; test during maintenance windows initially, then progress to unannounced tests; measure actual RTO and compare against targets; verify that monitoring and alerting triggers correctly during failover events)
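
The quorum and fencing logic behind split-brain prevention reduces to a small invariant, which the following sketch makes concrete. `candidate` and `fence_old_primary` are hypothetical hooks standing in for whatever your orchestrator provides; real tools such as Patroni implement this sequence internally.

```python
def has_quorum(reachable_voters, total_voters):
    """A node may hold or take the primary role only while it can
    reach a strict majority of voters. With an odd total, at most
    one side of any network partition can have a majority, so two
    primaries cannot coexist."""
    return reachable_voters > total_voters // 2


def safe_promote(candidate, reachable_voters, total_voters, fence_old_primary):
    """Promotion order matters: check quorum, fence (STONITH) the old
    primary so it can no longer accept writes, and only then promote.
    `candidate` and `fence_old_primary` are illustrative placeholders."""
    if not has_quorum(reachable_voters, total_voters):
        raise RuntimeError("no quorum -- refusing to promote during a possible partition")
    fence_old_primary()  # e.g. power off the node, revoke its VIP, or block it at the switch
    candidate.promote()
```

Note that `has_quorum(1, 2)` is false: with an even voter count, a clean 1-1 partition leaves neither side able to promote, which is why an odd number of voters (or a witness in a third failure domain) is recommended.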
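
Driver-based failover via multi-host connection strings can be illustrated with libpq's `target_session_attrs` feature; the hostnames, database, and user below are placeholders, and the commented-out connect call requires a reachable cluster.

```python
def build_failover_dsn(hosts, port=5432, dbname="app", user="app"):
    """Build a libpq-style DSN listing every candidate primary.
    With target_session_attrs=read-write, the driver tries the hosts
    in order and keeps the first one that accepts writes, so it
    discovers the new primary after failover without a config change."""
    return (
        f"host={','.join(hosts)} "
        f"port={','.join(str(port) for _ in hosts)} "
        f"dbname={dbname} user={user} "
        f"target_session_attrs=read-write connect_timeout=3"
    )


dsn = build_failover_dsn(["db-a.internal", "db-b.internal"])
print(dsn)
# conn = psycopg2.connect(dsn)  # requires a reachable cluster
```

MySQL Connector and MongoDB drivers expose equivalent multi-host mechanisms under different option names.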
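
Application-level read/write routing from the read-replica item can be sketched as a minimal router; `primary` and `replicas` stand in for whatever connection handles or factories the application actually uses.

```python
class ReadWriteRouter:
    """Minimal application-level router: writes and read-after-write
    reads go to the primary; other reads round-robin across replicas
    and must tolerate replication lag (eventual consistency)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._next = 0

    def connection_for(self, *, write=False, read_after_write=False):
        # Route writes, and reads that must see the caller's own
        # writes, to the primary; everything else to a replica.
        if write or read_after_write or not self.replicas:
            return self.primary
        conn = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return conn
```

In practice this logic often lives in the ORM or a proxy such as ProxySQL rather than hand-rolled code, but the routing decision is the same.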
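
Retry with exponential backoff and full jitter, as recommended for transient connection failures, might look like the sketch below; `TransientDBError` is a stand-in for whatever driver-specific connection and timeout exceptions apply.

```python
import random
import time


class TransientDBError(Exception):
    """Stand-in for driver-specific connection/timeout errors."""


def with_retries(op, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Run op(), retrying transient failures with exponential backoff
    and full jitter so reconnecting clients do not stampede a
    recovering database in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientDBError:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to this attempt's
            # capped exponential backoff.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

Only retry operations that are idempotent, or make them so with idempotency keys or upserts, since a retried write may have already committed on the server.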

Why This Matters

Database downtime is disproportionately impactful compared to application-tier outages. When an application server fails, a load balancer routes traffic to healthy instances within seconds. When a database fails without HA, every application instance loses access to state simultaneously, and recovery requires manual intervention -- restoring from backup, replaying transaction logs, and reconfiguring connection strings. Even brief database outages cascade into extended application outages because connection pools fill with stale connections, retry storms overwhelm the recovering database, and caches go cold.

The choice between replication modes has profound consequences. Synchronous replication guarantees zero data loss but couples application write latency to network distance -- a decision that constrains where replicas can be placed. Asynchronous replication eliminates the latency penalty but introduces a window of potential data loss that must be quantified, communicated to stakeholders, and accounted for in business continuity planning. Many teams discover their actual replication lag only during an incident, when it is too late to change the architecture.

Split-brain is the most dangerous failure mode in database HA. If two nodes both accept writes as primary, the resulting data divergence is extremely difficult to reconcile -- some transactions will be lost regardless of how the merge is performed. Prevention requires consensus mechanisms, fencing, and witness nodes, all of which add complexity. Teams that skip split-brain prevention to simplify their architecture often learn its importance through a production incident that causes permanent data loss.

Connection management during failover is frequently the weakest link. Even when the database fails over in under a minute, applications may take much longer to reconnect if they cache DNS records, hold stale connection pool entries, or lack retry logic. The effective RTO is not how fast the database promotes a new primary -- it is how long until the last application instance successfully reconnects and resumes serving traffic.
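
A back-of-envelope way to reason about effective RTO treats the worst case as the phases serializing: database promotion, then the longest a client can keep resolving the old address from its DNS cache, then pool recovery. The numbers below are illustrative, not benchmarks.

```python
def effective_rto_seconds(promotion, dns_ttl, pool_retry_window):
    """Worst-case client-observed outage: promotion time, plus the
    cached DNS TTL during which a client may still dial the old
    primary, plus the time for the pool to evict stale connections
    and re-establish healthy ones."""
    return promotion + dns_ttl + pool_retry_window


# A 90-second promotion can still mean a 7.5-minute outage for the
# unluckiest client behind a 300-second TTL.
print(effective_rto_seconds(promotion=90, dns_ttl=300, pool_retry_window=60))  # 450
```

This is why proxy-based or driver-based routing, which removes the DNS term entirely, often shrinks effective RTO more than any tuning of the database itself.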

Common Decisions (ADR Triggers)

  • Replication topology -- active-passive (simpler operations, no write conflicts, standard for most workloads) vs active-active (write availability in multiple regions, requires conflict resolution, justified only when cross-region write latency is unacceptable)
  • Replication mode -- synchronous (zero data loss, write latency penalty, viable within a region) vs asynchronous (no latency penalty, potential data loss equal to replication lag) vs semi-synchronous (compromise, at least one replica acknowledges)
  • Failover mechanism -- automated (lowest RTO, false positive risk, requires fencing) vs manual (human judgment, higher RTO) vs hybrid (automated detection, manual approval, most common for production)
  • Connection routing strategy -- DNS-based (simple, TTL-dependent, client caching issues) vs proxy-based (transparent failover, additional infrastructure to manage) vs driver-based (application-aware, no additional infrastructure, requires driver support)
  • Connection pooler selection -- PgBouncer (lightweight, PostgreSQL-specific, transaction vs session mode tradeoff) vs ProxySQL (MySQL, query routing and caching built in) vs RDS Proxy (AWS-managed, no infrastructure to operate, adds cost) vs application-side pooling (simpler deployment, per-instance pools do not share connections)
  • Managed vs self-managed HA -- managed database HA (less operational burden, limited customization, provider-specific behavior) vs self-managed with Patroni/Group Replication/replica sets (full control, requires expertise, portable across environments)
  • Read replica consistency model -- eventual consistency acceptable for most reads (simpler, better performance) vs read-after-write consistency required for specific flows (route those reads to primary, adds complexity to query routing)
  • Cross-region replication scope -- cross-region HA with promotable replicas (lower RTO, higher cost) vs cross-region backup restore only (higher RTO, lower steady-state cost, acceptable for Tier 2/3 workloads)
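
The pool-sizing guidance behind the connection pooler decision reduces to simple arithmetic: divide the database's connection budget, minus administrative headroom, across application instances. The figures in the example are illustrative.

```python
def per_instance_pool_size(max_connections, admin_headroom, app_instances):
    """Share the database's connection budget across per-instance
    pools, reserving headroom for superuser and monitoring sessions.
    Without this cap, scaling out the application tier can exhaust
    max_connections and lock out administrators during an incident."""
    budget = max_connections - admin_headroom
    return max(1, budget // app_instances)


print(per_instance_pool_size(max_connections=500, admin_headroom=20, app_instances=12))  # 40
```

A shared pooler such as PgBouncer or RDS Proxy sidesteps this division by multiplexing all instances onto one server-side pool, which is one reason to prefer it over purely application-side pooling at scale.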

See Also

  • disaster-recovery.md -- RPO/RTO definition, failover model selection, DR testing methodology
  • data.md -- database engine selection, storage design, data modeling
  • database-migration.md -- migration strategies, cutover planning, and rollback
  • enterprise-backup.md -- backup strategies, retention, and restore procedures
  • networking.md -- network architecture relevant to cross-region replication and latency
  • observability.md -- monitoring replication lag, failover events, and connection pool health