AWS Disaster Recovery (Elastic Disaster Recovery, Cross-Region, Backup)

Scope

AWS business continuity and disaster recovery services. Covers Elastic Disaster Recovery (DRS) for agent-based replication and failover, cross-region RDS/Aurora replication, Aurora Global Database, S3 Cross-Region Replication (CRR), Route 53 health checks and DNS failover, AWS Backup with cross-region vault copy, multi-region EKS/ECS patterns, DynamoDB Global Tables, CloudFormation/Terraform for environment reconstruction, and Fault Injection Service (FIS) for DR testing.

Checklist

Why This Matters

AWS regions are independent infrastructure deployments designed for fault isolation, but regional outages have occurred and can last hours. Without a tested DR strategy, recovery depends entirely on AWS restoring the affected region — which is outside your control. AWS provides extensive DR building blocks (DRS, Aurora Global Database, DynamoDB Global Tables, S3 CRR, Route 53 failover), but each must be explicitly configured, tested, and maintained to deliver the promised RPO/RTO.

The most common DR failure is untested recovery plans. Organizations configure cross-region replication but never execute a full failover sequence: promoting a database replica, updating DNS routing, reconfiguring application connection strings, validating data consistency, and confirming end-to-end application functionality. Elastic Disaster Recovery drill instances and Aurora Global Database switchover testing exist specifically for non-disruptive validation — use them regularly.
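The failover sequence described above can be rehearsed as an ordered, timed runbook. The following is a minimal sketch, not a definitive implementation: the step functions (`promote_replica`, `update_dns`, and so on) are hypothetical placeholders that always succeed, and in a real drill each would wrap the actual call for your environment. The runner stops at the first failing step so that later steps never run against a half-failed state, and it records how long each step took so the measured total can be compared with the RTO target.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float


def run_drill(steps: List[Step]) -> List[StepResult]:
    """Run failover steps in order and stop at the first failure."""
    results: List[StepResult] = []
    for name, action in steps:
        started = time.monotonic()
        try:
            ok = bool(action())
        except Exception:
            # Any exception counts as a failed step in this sketch.
            ok = False
        results.append(StepResult(name, ok, time.monotonic() - started))
        if not ok:
            break
    return results


# Hypothetical placeholders: replace each body with the real operation.
def promote_replica() -> bool:
    return True


def update_dns() -> bool:
    return True


def reconfigure_connection_strings() -> bool:
    return True


def validate_data_consistency() -> bool:
    return True


def confirm_end_to_end() -> bool:
    return True


FAILOVER_STEPS: List[Step] = [
    ("promote database replica", promote_replica),
    ("update DNS routing", update_dns),
    ("reconfigure connection strings", reconfigure_connection_strings),
    ("validate data consistency", validate_data_consistency),
    ("confirm end-to-end functionality", confirm_end_to_end),
]
```

Summing `seconds` over the returned results gives a measured recovery time for the drill, which is the figure to compare against the documented RTO.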

Multi-region cost is the primary constraint on DR strategy selection. A pilot light approach (database replicas plus infrastructure-as-code for on-demand reconstruction) adds roughly 10-15% to infrastructure cost, while active-active multi-region deployment can double or triple it. The cost difference between pilot light (RTO 1-4 hours) and hot standby (RTO < 15 minutes) is substantial. These tradeoffs must be driven by business impact analysis — what does one hour of downtime actually cost? — not by engineering preference for the most resilient architecture.
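The business impact question can be turned into simple arithmetic. The sketch below is illustrative only: it takes the upper-bound figures from the paragraph above (pilot light at 15% overhead with a 4-hour RTO, hot standby at roughly 100% overhead with a 15-minute RTO), and every other input (annual infrastructure spend, outages per year, downtime cost per hour) is a hypothetical value you would replace with numbers from your own analysis. It also assumes downtime equals the tier's RTO, which ignores detection and decision time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost_overhead: float  # extra spend as a fraction of baseline infrastructure cost
    rto_hours: float      # assumed downtime per regional outage


# Upper-bound figures taken from the text; treat them as rough planning inputs.
PILOT_LIGHT = Tier("pilot light", 0.15, 4.0)
HOT_STANDBY = Tier("hot standby", 1.00, 0.25)


def expected_annual_cost(tier: Tier, annual_infra_cost: float,
                         outages_per_year: float,
                         downtime_cost_per_hour: float) -> float:
    """DR spend plus expected downtime loss for one year."""
    dr_spend = tier.cost_overhead * annual_infra_cost
    downtime_loss = outages_per_year * tier.rto_hours * downtime_cost_per_hour
    return dr_spend + downtime_loss


def break_even_downtime_cost(cheap: Tier, fast: Tier,
                             annual_infra_cost: float,
                             outages_per_year: float) -> float:
    """Hourly downtime cost above which the faster tier pays for itself."""
    extra_spend = (fast.cost_overhead - cheap.cost_overhead) * annual_infra_cost
    hours_saved = outages_per_year * (cheap.rto_hours - fast.rto_hours)
    return extra_spend / hours_saved
```

With a hypothetical 1,000,000 annual infrastructure spend, one regional outage per year, and 50,000 per hour of downtime, pilot light totals about 350,000 and hot standby about 1,012,500. Under the same assumptions, hot standby only becomes the cheaper option once downtime costs more than roughly 226,667 per hour.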

AWS Backup cross-region vault copy provides a safety net independent of replication. Even with real-time replication configured, logical corruption (application bugs, accidental deletions, ransomware) replicates to the secondary region within seconds. Restoring from recovery points held in AWS Backup vaults provides an independent recovery path to a state that predates the corruption. AWS Backup Vault Lock in compliance mode enforces immutable retention, so recovery points cannot be deleted before their retention period ends, even with compromised administrative credentials.
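A cross-region copy is expressed as a copy action on a backup rule. The sketch below only builds the plan document in the shape the AWS Backup `CreateBackupPlan` API accepts; it does not call AWS. The plan name, vault name, vault ARN, schedule, and 35-day retention are all hypothetical values chosen for illustration, and `build_backup_plan` is a helper invented for this example.

```python
from typing import Any, Dict


def build_backup_plan(plan_name: str, source_vault: str,
                      dr_vault_arn: str, retain_days: int = 35) -> Dict[str, Any]:
    """Build a backup plan with one daily rule that copies to a DR-region vault."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily-with-dr-copy",
                "TargetBackupVaultName": source_vault,
                # Daily at 05:00 UTC (hypothetical schedule).
                "ScheduleExpression": "cron(0 5 ? * * *)",
                "Lifecycle": {"DeleteAfterDays": retain_days},
                # Each recovery point is also copied to the vault in the DR region.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retain_days},
                    }
                ],
            }
        ],
    }


plan = build_backup_plan(
    plan_name="tier1-daily",                     # hypothetical
    source_vault="primary-vault",                # hypothetical
    dr_vault_arn="arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",
)
# With boto3 installed and credentials configured, the plan could be submitted as:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Keeping the copy retention explicit on the destination matters because the DR-region copy has its own lifecycle, separate from the source recovery point.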

Common Decisions (ADR Triggers)

Pilot light vs warm standby vs hot standby vs active-active: Pilot light maintains only database replicas and core networking in the DR region, reconstructing compute from infrastructure-as-code during failover (RTO 1-4 hours, lowest cost). Warm standby runs a scaled-down copy of production (smaller instance types, reduced auto-scaling) for faster failover (RTO 15-60 min, moderate cost). Hot standby mirrors production capacity in the DR region for near-instant failover (RTO < 15 min, near-double cost). Active-active serves traffic from multiple regions simultaneously (near-zero RTO, highest cost, requires multi-region data strategy). Choose based on business-defined RTO requirements and downtime cost analysis; not all workloads need the same tier.
Aurora Global Database vs cross-region RDS read replica: Aurora Global Database provides managed cross-region replication with RPO typically < 1 second and RTO typically < 1 minute when a secondary region is promoted, supports up to 5 secondary regions, and handles replication infrastructure automatically. Its managed switchover (formerly called managed planned failover) is meant for planned region changes and drills and loses no data; an unplanned regional outage uses failover instead, which can lose the last unreplicated writes. Cross-region RDS read replicas work with non-Aurora engines (PostgreSQL, MySQL, MariaDB), provide asynchronous replication with RPO of seconds-to-minutes, but require manual promotion (irreversible) and application reconfiguration. Use Aurora Global Database for Tier 1 workloads requiring the fastest RPO/RTO. Use cross-region read replicas when Aurora is not available for the engine or when cost constraints apply (Aurora Global Database requires Aurora pricing in every region).
DynamoDB Global Tables vs application-level replication: Global Tables provide fully managed multi-region, multi-active replication with sub-second latency and last-writer-wins conflict resolution. Application-level replication (DynamoDB Streams to Lambda to cross-region writes) offers custom conflict resolution logic and selective replication (filter which items replicate) but adds operational complexity, failure modes, and Lambda execution costs. Use Global Tables for most multi-region DynamoDB workloads. Use application-level replication only when custom conflict resolution or selective replication is required.
Elastic Disaster Recovery (DRS) vs infrastructure-as-code reconstruction: DRS provides block-level continuous replication for servers (EC2, on-premises, other clouds) with an RPO of seconds and an RTO of minutes, maintaining a near-real-time copy of server state including OS, applications, and data. IaC reconstruction (CloudFormation, Terraform) rebuilds infrastructure from templates and restores data from backups, with RTO dependent on provisioning time and data restore duration (typically hours). DRS is the better fit for stateful workloads, legacy applications, and lift-and-shift servers. IaC reconstruction is better for cloud-native, stateless services where the application state lives entirely in managed databases and object storage.
Route 53 failover vs Global Accelerator failover: Route 53 failover routing uses DNS-based failover with RTO dependent on DNS TTL and client-side caching (typically 60-300 seconds after failure detection). Global Accelerator uses static anycast IP addresses and shifts traffic away from an unhealthy endpoint within seconds, independent of DNS. Route 53 is simpler and cheaper, and is sufficient for most workloads. Global Accelerator provides faster failover and is preferred for latency-sensitive applications, TCP/UDP workloads, or applications where DNS caching causes unacceptable failover delays.
S3 CRR with Replication Time Control vs standard CRR: Standard CRR replicates most objects within minutes but provides no SLA on replication time. S3 Replication Time Control (RTC) is designed to replicate 99.99% of new objects within 15 minutes and is backed by an SLA (the SLA commitment itself covers 99.9% of objects within 15 minutes per billing month), with replication metrics and event notifications for compliance monitoring. RTC adds cost (~$0.015/GB replicated). Use standard CRR for non-critical data or when best-effort replication timing is acceptable. Use RTC when regulatory requirements or RPO targets demand a committed replication time.
AWS Backup centralized vs per-service native backup: AWS Backup provides unified backup policies, cross-region vault copy, vault lock for immutability, and audit reporting across supported services (EC2, EBS, RDS, DynamoDB, EFS, FSx, S3). Per-service native backup (RDS automated snapshots, DynamoDB on-demand backups, EBS snapshots) offers tighter integration and service-specific features (RDS PITR, DynamoDB PITR). Use AWS Backup as the centralized policy layer for cross-region DR and compliance. Supplement with native backup features where AWS Backup does not cover specific needs (e.g., RDS PITR granularity).
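For the Route 53 versus Global Accelerator decision above, it helps to estimate how long DNS-based failover takes end to end. The sketch below is a rough approximation under stated assumptions: detection time is taken as the health check request interval multiplied by the failure threshold, and clients are assumed to honor the record TTL exactly. Route 53 health checks use a 10-second or 30-second request interval and a failure threshold between 1 and 10; the function name and the default TTL of 60 seconds are choices made for this example.

```python
def dns_failover_seconds(request_interval_s: int = 30,
                         failure_threshold: int = 3,
                         record_ttl_s: int = 60) -> int:
    """Approximate worst-case time from endpoint failure to clients resolving the DR endpoint."""
    if request_interval_s not in (10, 30):
        raise ValueError("Route 53 health check interval is 10 or 30 seconds")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure threshold must be between 1 and 10")
    if record_ttl_s < 0:
        raise ValueError("TTL cannot be negative")
    # Time for the health check to mark the endpoint unhealthy.
    detection = request_interval_s * failure_threshold
    # Resolvers may keep serving the old answer until the cached record expires.
    return detection + record_ttl_s


standard = dns_failover_seconds()                       # 30 s interval, threshold 3, TTL 60
fast = dns_failover_seconds(request_interval_s=10)      # fast interval, same threshold and TTL
long_ttl = dns_failover_seconds(record_ttl_s=300)       # a 5-minute TTL dominates the total
```

Under these assumptions the three cases come to 150, 90, and 390 seconds. Resolvers and clients that cache beyond the TTL will take longer, which is the situation where Global Accelerator's DNS-independent failover becomes attractive.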
