AWS Disaster Recovery (Elastic Disaster Recovery, Cross-Region, Backup)

Scope

AWS business continuity and disaster recovery services. Covers Elastic Disaster Recovery (DRS) for agent-based replication and failover, cross-region RDS/Aurora replication, Aurora Global Database, S3 Cross-Region Replication (CRR), Route 53 health checks and DNS failover, AWS Backup with cross-region vault copy, multi-region EKS/ECS patterns, DynamoDB Global Tables, CloudFormation/Terraform for environment reconstruction, and Fault Injection Service (FIS) for DR testing.

Checklist

  • [Critical] Define RPO and RTO targets per workload tier and map to AWS service capabilities: Tier 1 (mission-critical, RPO < 1 min, RTO < 15 min) requires Aurora Global Database or DynamoDB Global Tables with pre-provisioned standby; Tier 2 (important, RPO < 1 hour, RTO < 4 hours) uses cross-region RDS read replicas with automated promotion; Tier 3 (non-critical, RPO < 24 hours, RTO < 24 hours) relies on AWS Backup cross-region vault copy and infrastructure-as-code reconstruction
  • [Critical] Configure Elastic Disaster Recovery (DRS) for server replication: install the replication agent on source servers (EC2, on-premises, or other clouds); replicate continuously at block level to a staging area in the DR region backed by low-cost EBS volumes; configure launch settings (instance type, VPC, subnet, security groups) and post-launch actions (scripts for application configuration, DNS updates); test with non-disruptive drill instances that launch from the latest recovery point without affecting replication; plan failback by reversing replication direction after primary site recovery
  • [Critical] Design Aurora Global Database for Tier 1 database DR: managed cross-region replication with typical replication lag under 1 second (the effective RPO for an unplanned failover) and typical RTO under 1 minute; up to 5 secondary regions with up to 16 read replicas each; choose between managed planned failover (zero data loss, for controlled switchover) and detach-and-promote (potential data loss, for unplanned region failure); monitor replication lag via the AuroraGlobalDBReplicationLag CloudWatch metric and alarm if lag exceeds the RPO target
  • [Critical] Configure cross-region RDS read replicas for non-Aurora databases: asynchronous replication to DR region (RPO = replication lag, typically seconds), promote read replica to standalone primary during region failure (RTO = minutes for promotion + DNS/application reconfiguration); note that promotion is irreversible — the promoted instance becomes independent and a new replica must be created for future DR; automate promotion and DNS update via Lambda and Step Functions
  • [Critical] Set up Route 53 health checks and DNS failover: configure health checks against primary region endpoints (HTTP, HTTPS, TCP, or CloudWatch alarm-based), use failover routing policy to direct traffic to DR region when primary health check fails; set health check intervals (10s or 30s) and failure threshold (1-10 consecutive failures); design health check endpoints that verify downstream dependency health (database connectivity, cache availability), not just application process status; account for DNS TTL (60-300s) and client-side caching in RTO calculations
  • [Critical] Configure S3 Cross-Region Replication (CRR) for data DR: enable versioning on source and destination buckets, create replication rules with appropriate scope (entire bucket or prefix/tag filters), choose replication time control (S3 RTC) for SLA-backed replication within 15 minutes for 99.99% of objects; monitor replication metrics (bytes pending, replication latency) via S3 Replication Metrics and CloudWatch; note that CRR does not replicate existing objects retroactively — use S3 Batch Replication for initial sync
  • [Critical] Implement AWS Backup for centralized cross-region data protection: create backup plans with lifecycle policies (transition to cold storage, retention period), configure cross-region vault copy rules for all critical resources (EC2 snapshots, EBS volumes, RDS snapshots, DynamoDB tables, EFS file systems), enable vault lock for immutable backups (compliance mode for regulatory requirements, governance mode for operational protection); use AWS Backup Audit Manager to verify backup compliance and generate audit-ready reports
  • [Critical] Design DynamoDB Global Tables for multi-region active-active data: automatic multi-region, multi-active replication with typical replication latency under 1 second; all replicas accept reads and writes with last-writer-wins conflict resolution; provision capacity independently per region based on regional traffic patterns; monitor ReplicationLatency metric per region pair; consider cost — each replica consumes write capacity for replicated writes (replicated WCU billed separately from standard WCU)
  • [Recommended] Plan multi-region EKS/ECS disaster recovery: replicate cluster configuration via GitOps (Flux, Argo CD), store container images in ECR with cross-region replication enabled, use Route 53 or Global Accelerator for traffic routing between clusters; for EKS, consider Velero for persistent volume backup and cluster state migration; for ECS, maintain task definitions and service configurations in CloudFormation/Terraform for rapid redeployment; pre-provision DR cluster (warm standby) or deploy on-demand (pilot light) depending on RTO target
  • [Recommended] Automate environment reconstruction with CloudFormation or Terraform: maintain all infrastructure as code in version control, use CloudFormation StackSets or Terraform workspaces for multi-region deployment, pre-validate templates in the DR region periodically to catch resource limits or service availability issues; store Terraform state in S3 with cross-region replication and DynamoDB state locking replicated via Global Tables; keep AMI copies current in the DR region via automated cross-region AMI copy pipelines
  • [Recommended] Test DR regularly using Fault Injection Service (FIS) and planned failovers: create FIS experiment templates simulating region-level failures (network disruption, AZ unavailability), run DRS recovery drills without impacting production, execute Aurora Global Database switchover tests, test Route 53 failover by intentionally failing health checks; schedule quarterly DR drills, document measured RTO/RPO vs targets, automate drill orchestration with Step Functions; use FIS guardrails (stop conditions) to limit blast radius during testing
  • [Recommended] Select multi-region pattern based on cost and RTO tradeoffs: pilot light (minimal DR footprint — database replicas and core networking only, RTO 1-4 hours, ~10-15% additional cost), warm standby (scaled-down copy of production running in DR region, RTO 15-60 min, ~30-50% additional cost), hot standby / active-passive (full-scale DR environment receiving replicated data, RTO < 15 min, ~80-100% additional cost), active-active (traffic served from multiple regions simultaneously via Route 53 or Global Accelerator, near-zero RTO, ~100%+ additional cost with added complexity for data consistency)
  • [Optional] Configure multi-region active-active patterns with Global Accelerator: use AWS Global Accelerator for anycast IP-based traffic distribution across regions with automatic failover in seconds (faster than DNS-based failover); combine with DynamoDB Global Tables or Aurora Global Database for multi-region write capability; design application tier for statelessness with session data in ElastiCache Global Datastore or DynamoDB Global Tables; account for conflict resolution in application logic for concurrent multi-region writes
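The tier mapping in the first checklist item can be sketched as a small lookup. This is a minimal illustration, assuming the thresholds stated in the checklist; the `Tier` class and `select_tier` function are hypothetical names, and the rule of "strictest tier demanded by either target wins" is an assumption about how to combine RPO and RTO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_bound_s: int   # workloads in this tier need RPO below this bound
    rto_bound_s: int   # workloads in this tier need RTO below this bound
    mechanism: str


# Bounds and mechanisms copied from the checklist; ordered strictest first.
TIERS = (
    Tier("Tier 1", 60, 15 * 60,
         "Aurora Global Database or DynamoDB Global Tables with pre-provisioned standby"),
    Tier("Tier 2", 60 * 60, 4 * 60 * 60,
         "cross-region RDS read replicas with automated promotion"),
    Tier("Tier 3", 24 * 60 * 60, 24 * 60 * 60,
         "AWS Backup cross-region vault copy and infrastructure-as-code reconstruction"),
)


def _index_for(target_s, bounds):
    """Index of the first (strictest) tier whose bound exceeds the target."""
    for i, bound in enumerate(bounds):
        if target_s < bound:
            return i
    return len(bounds) - 1  # looser than every bound: fall back to the last tier


def select_tier(rpo_target_s, rto_target_s):
    """Return the tier demanded by the stricter of the two targets."""
    rpo_idx = _index_for(rpo_target_s, [t.rpo_bound_s for t in TIERS])
    rto_idx = _index_for(rto_target_s, [t.rto_bound_s for t in TIERS])
    return TIERS[min(rpo_idx, rto_idx)]
```

For example, `select_tier(30, 2 * 3600)` lands in Tier 1 because the 30-second RPO target is stricter than anything Tier 2 offers, even though the RTO target alone would fit Tier 2.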

Why This Matters

AWS regions are independent infrastructure deployments designed for fault isolation, but regional outages have occurred and can last hours. Without a tested DR strategy, recovery depends entirely on AWS restoring the affected region — which is outside your control. AWS provides extensive DR building blocks (DRS, Aurora Global Database, DynamoDB Global Tables, S3 CRR, Route 53 failover), but each must be explicitly configured, tested, and maintained to deliver the promised RPO/RTO.

The most common DR failure is an untested recovery plan. Organizations configure cross-region replication but never execute a full failover sequence: promoting a database replica, updating DNS routing, reconfiguring application connection strings, validating data consistency, and confirming end-to-end application functionality. Elastic Disaster Recovery drill instances and Aurora Global Database switchover testing exist specifically for non-disruptive validation, so use them regularly.
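The failover sequence described above can be rehearsed with a simple drill runner that executes the steps in order, stops at the first failure, and records elapsed time as a measured RTO. This is a sketch only: `run_drill` is a hypothetical helper, and each step's callable is assumed to wrap whatever real automation (Lambda, Step Functions, scripts) an organization already has.

```python
import time

# Step names taken from the failover sequence listed in the text.
FAILOVER_STEPS = (
    "promote database replica",
    "update DNS routing",
    "reconfigure application connection strings",
    "validate data consistency",
    "confirm end-to-end application functionality",
)


def run_drill(actions):
    """Run (name, callable) pairs in order; a step passes if its callable returns truthy.

    Stops at the first failing or raising step so later steps never run
    against a half-failed environment.
    """
    results = []
    started = time.monotonic()
    for name, action in actions:
        step_start = time.monotonic()
        try:
            ok = bool(action())
        except Exception:
            ok = False
        results.append({"step": name, "ok": ok,
                        "seconds": time.monotonic() - step_start})
        if not ok:
            break
    completed = len(results) == len(actions) and all(r["ok"] for r in results)
    return {
        "completed": completed,
        "measured_rto_s": time.monotonic() - started,
        "results": results,
    }
```

Comparing `measured_rto_s` from each drill against the documented RTO target gives a concrete record of whether the plan still works.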

Multi-region cost is the primary constraint on DR strategy selection. A pilot light approach (database replicas plus infrastructure-as-code for on-demand reconstruction) adds roughly 10-15% to infrastructure cost, while active-active multi-region deployment can double or triple it. The cost difference between pilot light (RTO 1-4 hours) and hot standby (RTO < 15 minutes) is substantial. These tradeoffs must be driven by business impact analysis — what does one hour of downtime actually cost? — not by engineering preference for the most resilient architecture.
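To make the cost and RTO tradeoff concrete, the sketch below picks the cheapest multi-region pattern whose worst-case RTO still meets a required RTO. The RTO and cost figures are the approximate ranges quoted in this document, not AWS pricing; treating active-active as an RTO of zero and the function names are illustrative assumptions.

```python
# (pattern, worst-case RTO in minutes, approximate added cost range in percent)
# Ordered cheapest first. Figures are the rough ranges from this document.
PATTERNS = (
    ("pilot light", 240, (10, 15)),
    ("warm standby", 60, (30, 50)),
    ("hot standby", 15, (80, 100)),
    ("active-active", 0, (100, None)),  # "100%+": no upper estimate given
)


def cheapest_pattern(required_rto_min):
    """Return (name, cost_range_pct) of the cheapest pattern meeting the RTO."""
    if required_rto_min < 0:
        raise ValueError("required RTO must be non-negative")
    for name, worst_rto_min, cost_range in PATTERNS:
        if worst_rto_min <= required_rto_min:
            return name, cost_range
    raise ValueError("no pattern satisfies the required RTO")


def added_monthly_cost(base_monthly_cost, cost_range_pct):
    """Translate a percentage range into (low, high) added cost; high may be None."""
    low_pct, high_pct = cost_range_pct
    low = base_monthly_cost * low_pct / 100
    high = None if high_pct is None else base_monthly_cost * high_pct / 100
    return low, high
```

With a 2-hour RTO requirement, `cheapest_pattern(120)` selects warm standby, since pilot light's 4-hour worst case is too slow. The result is a starting point for the business impact discussion, not a substitute for it.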

AWS Backup cross-region vault copy provides a safety net independent of replication. Even with real-time replication configured, logical corruption (application bugs, accidental deletions, ransomware) replicates to the secondary region within seconds. Point-in-time recovery from AWS Backup vaults provides an independent recovery path. Vault lock with immutable retention prevents backup deletion even by compromised administrative credentials.
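A backup plan with a cross-region copy rule might be assembled as shown below. The dictionary follows the shape accepted by the AWS Backup `CreateBackupPlan` API (for example via boto3's `create_backup_plan(BackupPlan=...)`), but the helper name, vault names, ARN, schedule, and retention values are all placeholder assumptions. The check reflects the AWS Backup rule that cold-storage recovery points must be retained at least 90 days after transition.

```python
def build_backup_plan(plan_name, source_vault, dr_vault_arn,
                      cold_after_days=30, delete_after_days=365,
                      schedule="cron(0 5 ? * * *)"):
    """Build a BackupPlan payload with one daily rule and a cross-region copy.

    All argument values are illustrative; nothing here calls AWS.
    """
    if delete_after_days < cold_after_days + 90:
        raise ValueError(
            "DeleteAfterDays must be at least 90 days after MoveToColdStorageAfterDays"
        )
    lifecycle = {
        "MoveToColdStorageAfterDays": cold_after_days,
        "DeleteAfterDays": delete_after_days,
    }
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": f"{plan_name}-daily",
                "TargetBackupVaultName": source_vault,
                "ScheduleExpression": schedule,
                "Lifecycle": dict(lifecycle),
                # Copy every recovery point to the vault in the DR region.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": dict(lifecycle),
                    }
                ],
            }
        ],
    }
```

Vault lock is configured on the vault itself rather than in the plan, so it is deliberately left out of this payload.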

Common Decisions (ADR Triggers)

  • Pilot light vs warm standby vs hot standby vs active-active — Pilot light maintains only database replicas and core networking in the DR region, reconstructing compute from infrastructure-as-code during failover (RTO 1-4 hours, lowest cost). Warm standby runs a scaled-down copy of production (smaller instance types, reduced auto-scaling) for faster failover (RTO 15-60 min, moderate cost). Hot standby mirrors production capacity in the DR region for near-instant failover (RTO < 15 min, near-double cost). Active-active serves traffic from multiple regions simultaneously (near-zero RTO, highest cost, requires multi-region data strategy). Choose based on business-defined RTO requirements and downtime cost analysis — not all workloads need the same tier.
  • Aurora Global Database vs cross-region RDS read replica — Aurora Global Database provides managed cross-region replication with RPO typically < 1 second and RTO typically < 1 minute, offers managed planned failover with zero data loss for controlled switchovers, supports up to 5 secondary regions, and handles replication infrastructure automatically. Cross-region RDS read replicas work with non-Aurora engines (PostgreSQL, MySQL, MariaDB) and provide asynchronous replication with RPO of seconds to minutes, but require manual promotion (irreversible) and application reconfiguration. Use Aurora Global Database for Tier 1 workloads requiring the fastest RPO/RTO. Use cross-region read replicas when Aurora is not available for the engine or when cost constraints apply (Aurora Global Database requires Aurora pricing in every region).
  • DynamoDB Global Tables vs application-level replication — Global Tables provide fully managed multi-region, multi-active replication with sub-second latency and last-writer-wins conflict resolution. Application-level replication (DynamoDB Streams to Lambda to cross-region writes) offers custom conflict resolution logic and selective replication (filter which items replicate) but adds operational complexity, failure modes, and Lambda execution costs. Use Global Tables for most multi-region DynamoDB workloads. Use application-level replication only when custom conflict resolution or selective replication is required.
  • Elastic Disaster Recovery (DRS) vs infrastructure-as-code reconstruction — DRS provides block-level continuous replication for servers (EC2, on-premises, other clouds) with an RPO of seconds and an RTO of minutes, maintaining a near-real-time copy of server state including OS, applications, and data. IaC reconstruction (CloudFormation, Terraform) rebuilds infrastructure from templates and restores data from backups, with RTO dependent on provisioning time and data restore duration (typically hours). DRS is superior for stateful workloads, legacy applications, and lift-and-shift servers. IaC reconstruction is better for cloud-native, stateless services where the application state lives entirely in managed databases and object storage.
  • Route 53 failover vs Global Accelerator failover — Route 53 failover routing uses DNS-based failover with RTO dependent on DNS TTL and client-side caching (typically 60-300 seconds after failure detection). Global Accelerator uses anycast IP addresses and fails over within seconds, independent of DNS. Route 53 is simpler and cheaper, and sufficient for most workloads. Global Accelerator provides faster failover and is preferred for latency-sensitive applications, TCP/UDP workloads, or applications where DNS caching causes unacceptable failover delays.
  • S3 CRR with Replication Time Control vs standard CRR — Standard CRR replicates most objects within minutes but provides no SLA on replication time. S3 Replication Time Control (RTC) is designed to replicate 99.99% of objects within 15 minutes, is backed by an SLA, and includes replication metrics and notifications for compliance monitoring. RTC adds cost (~$0.015/GB replicated). Use standard CRR for non-critical data or when best-effort replication timing is acceptable. Use RTC when regulatory requirements or RPO targets demand predictable replication timing.
  • AWS Backup centralized vs per-service native backup — AWS Backup provides unified backup policies, cross-region vault copy, vault lock for immutability, and audit reporting across supported services (EC2, EBS, RDS, DynamoDB, EFS, FSx, S3). Per-service native backup (RDS automated snapshots, DynamoDB on-demand backups, EBS snapshots) offers tighter integration and service-specific features (RDS PITR, DynamoDB PITR). Use AWS Backup as the centralized policy layer for cross-region DR and compliance. Supplement with native backup features where AWS Backup does not cover specific needs (e.g., RDS PITR granularity).
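For the Route 53 option above, a rough failover-time budget is the health check detection time (interval multiplied by failure threshold) plus the record TTL. The sketch below encodes that arithmetic using the interval and threshold ranges from the checklist. The function name is hypothetical, and the estimate deliberately ignores resolvers or clients that cache beyond the TTL, so treat it as an approximation rather than a guarantee.

```python
def estimate_dns_failover_seconds(interval_s, failure_threshold, ttl_s):
    """Approximate time from failure to clients resolving the DR endpoint.

    detection = interval * consecutive failures required
    total     = detection + DNS TTL (cached answers must expire)
    """
    if interval_s not in (10, 30):
        raise ValueError("health check interval must be 10 or 30 seconds")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure threshold must be between 1 and 10")
    if ttl_s < 0:
        raise ValueError("TTL must be non-negative")
    detection_s = interval_s * failure_threshold
    return detection_s + ttl_s
```

With a 30-second interval, a threshold of 3, and a 60-second TTL, the estimate is 150 seconds; dropping to a 10-second interval brings it to 90 seconds. If even that budget exceeds the RTO target, that is a signal to evaluate Global Accelerator.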

See Also

  • general/disaster-recovery.md — General DR planning (RPO/RTO tiering, failover models, testing methodology)
  • general/enterprise-backup.md — Backup tool selection, storage tiering, and ransomware protection
  • providers/aws/rds-aurora.md — RDS and Aurora configuration, HA, and replication details
  • providers/aws/dynamodb.md — DynamoDB capacity planning, Global Tables, and backup strategies
  • providers/aws/s3.md — S3 storage classes, versioning, and replication configuration
  • providers/aws/route53.md — Route 53 routing policies, health checks, and DNS failover
  • providers/aws/containers.md — EKS and ECS deployment patterns, multi-region considerations
  • providers/aws/ec2-asg.md — EC2 Auto Scaling and cross-AZ deployment for resilience