
Azure Disaster Recovery (Site Recovery, Geo-Redundancy, Backup)

Scope

Azure business continuity and disaster recovery services. Covers Azure Site Recovery (VM replication and recovery plans), storage redundancy (LRS/ZRS/GRS/GZRS), Azure SQL geo-replication and failover groups, Cosmos DB multi-region writes, Azure Backup, and DR drill automation.

Checklist

  • [Critical] Define RPO and RTO targets per workload tier: Tier 1 (mission-critical, RPO < 5 min, RTO < 1 hour), Tier 2 (important, RPO < 1 hour, RTO < 4 hours), Tier 3 (non-critical, RPO < 24 hours, RTO < 24 hours); map each Azure service's DR capabilities to these targets
  • [Critical] Configure Azure Site Recovery (ASR) for VM replication: continuous replication from primary region to DR region, crash-consistent recovery points every 5 minutes, app-consistent snapshots every 1-4 hours; test failover monthly without impacting production using isolated VNet
  • [Critical] Design ASR recovery plans: group VMs into ordered boot sequences (database tier first, then application tier, then web tier), add pre/post-actions (Azure Automation runbooks for DNS updates, load balancer reconfiguration, connection string changes), and test the full plan regularly
  • [Critical] Select storage redundancy based on DR requirements: LRS (3 copies in a single datacenter), ZRS (3 copies across availability zones), GRS (6 copies across two regions, async replication with RPO typically < 15 min but no SLA), GZRS (ZRS in the primary region + async cross-region copy), RA-GRS/RA-GZRS (read access to the secondary region for read-heavy workloads)
  • [Critical] Configure Azure SQL geo-replication: active geo-replication for up to 4 readable secondaries in any region with RPO < 5 seconds, or auto-failover groups for automatic failover with read/write and read-only listener endpoints that follow the primary; failover groups support graceful failover (no data loss) and forced failover (potential data loss)
  • [Critical] Set up Cosmos DB multi-region writes or single-write with automatic failover: multi-region writes keep every region writable (RTO near 0) with conflict resolution (last-writer-wins, or a custom policy implemented as a merge stored procedure), but writes not yet replicated out of a failed region can be lost, so do not assume RPO = 0; single-write with multi-region reads provides automatic failover with RPO based on consistency level (RPO = 0 for strong consistency, RPO bounded by K updates or T seconds for bounded staleness)
  • [Critical] Implement Traffic Manager or Front Door failover: Traffic Manager priority routing for DNS-level failover (failover time is probe detection time plus the DNS TTL, commonly set to 30-60s); Front Door origin groups with health probes for Layer 7 failover (no DNS TTL wait; detection time depends on probe interval and sample settings); design health probe endpoints that verify downstream dependency health, not just application process health
  • [Critical] Configure Azure Backup for all stateful resources: VMs (daily snapshots, instant restore from snapshot tier), Azure SQL (automated backups with PITR up to 35 days, LTR policies for compliance), Azure Files (share snapshots), managed disks (incremental snapshots); store backups in Recovery Services vault with geo-redundancy
  • [Critical] Design Recovery Services vault strategy: one vault per region per workload group, GRS replication for cross-region restore capability, soft delete enabled (14-day retention of deleted backup data), immutable vaults for ransomware protection (backup data cannot be deleted before expiry)
  • [Recommended] Plan AKS disaster recovery: replicate cluster configuration (Helm charts, Kubernetes manifests) via GitOps, use Velero for persistent volume backup and cluster state migration, and either deploy a standby cluster in the DR region or provision the cluster on demand from IaC as a recovery plan step
  • [Critical] Implement cross-region failover testing: schedule quarterly DR drills using ASR test failover, Azure SQL failover group planned failover, and Cosmos DB manual region failover; document actual RTO/RPO achieved vs targets; automate drill execution with runbooks
  • [Recommended] Configure backup monitoring and compliance: Azure Backup Center for centralized monitoring across vaults and subscriptions, Azure Policy to enforce backup on all VMs (built-in policy: "Azure Backup should be enabled for Virtual Machines"), backup compliance reports for audit
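
The tier targets in the first checklist item can be kept as data so that drill results are compared against them mechanically. The following is a minimal sketch, not a prescribed implementation; the tier keys (`tier1`, `tier2`, `tier3`) and the function name `meets_target` are illustrative assumptions, while the thresholds are the ones listed above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierTarget:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated downtime


# Thresholds copied from the checklist; the dictionary keys are hypothetical names.
TIER_TARGETS = {
    "tier1": TierTarget(rpo=timedelta(minutes=5), rto=timedelta(hours=1)),
    "tier2": TierTarget(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    "tier3": TierTarget(rpo=timedelta(hours=24), rto=timedelta(hours=24)),
}


def meets_target(tier: str, measured_rpo: timedelta, measured_rto: timedelta) -> bool:
    """Return True when both measured values are strictly below the tier's targets."""
    target = TIER_TARGETS[tier]
    return measured_rpo < target.rpo and measured_rto < target.rto
```

For example, a Tier 1 drill that measured 30 seconds of data loss and 40 minutes of downtime passes, while one that measured 6 minutes of data loss does not.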

Why This Matters

Azure regions can experience outages lasting hours to days. Without a tested DR strategy, recovery depends on Azure restoring the affected region -- which is outside your control. Azure Site Recovery, geo-redundant storage, and database geo-replication provide the building blocks, but they must be configured, tested, and maintained to deliver the promised RPO/RTO.

The most common DR failure is untested recovery plans. Organizations configure ASR replication but never run test failover, discovering during an actual outage that recovery plans have missing steps, incorrect boot ordering, or stale connection strings. Monthly test failovers in an isolated VNet are essential and cost only compute for the duration of the test.

Storage redundancy choices have significant cost implications: GRS costs roughly 2x LRS but provides cross-region durability. RA-GRS adds read access to the secondary, but reads there are only eventually consistent because replication is asynchronous (lag is typically under 15 minutes, with no SLA). For databases, the cost of geo-replication (an additional database instance in the DR region) must be budgeted alongside the primary.
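
The redundancy options reduce to three yes/no questions: is zone resilience needed, is region resilience needed, and must the secondary be readable. A small sketch of that decision follows; the function name and parameters are assumptions made for illustration, and cost is deliberately left out.

```python
def choose_redundancy(zone_resilient: bool, region_resilient: bool,
                      read_secondary: bool = False) -> str:
    """Map resilience requirements to a storage redundancy SKU name."""
    if read_secondary and not region_resilient:
        # Only geo-redundant options have a secondary region to read from.
        raise ValueError("read access to a secondary requires geo-redundancy")
    if region_resilient:
        base = "GZRS" if zone_resilient else "GRS"
        return f"RA-{base}" if read_secondary else base
    return "ZRS" if zone_resilient else "LRS"
```

For instance, `choose_redundancy(True, True, read_secondary=True)` yields `RA-GZRS`, and `choose_redundancy(False, False)` yields `LRS`.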

Azure SQL failover groups vs active geo-replication is a key decision: failover groups provide DNS-based connection endpoints that follow failover automatically (no connection string changes), while active geo-replication requires application-level failover logic but supports more than one secondary and finer-grained control.
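
To illustrate why failover groups need no connection string changes, the sketch below builds connection strings against the group's listener names rather than a server name. It assumes the public Azure cloud DNS suffix and the ODBC Driver 18 keyword syntax; the group name `fog-orders` and database name `orders` in the usage note are hypothetical, and authentication settings are omitted.

```python
def failover_group_endpoints(group_name: str) -> dict[str, str]:
    """Listener host names for a failover group (public cloud suffix assumed)."""
    return {
        # Follows whichever server is currently primary.
        "read_write": f"{group_name}.database.windows.net",
        # Routes to the readable geo-secondary.
        "read_only": f"{group_name}.secondary.database.windows.net",
    }


def odbc_connection_string(host: str, database: str, read_only: bool = False) -> str:
    """Build an ODBC connection string without credentials."""
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{host},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        # Declares read intent for connections made through the read-only listener.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts)
```

An application would call `odbc_connection_string(failover_group_endpoints("fog-orders")["read_write"], "orders")` once at startup; after a failover the same host name resolves to the new primary.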

Common Decisions (ADR Triggers)

  • Active/active vs active/passive DR -- Active/active deploys application instances in two or more regions serving traffic simultaneously via Front Door or Traffic Manager. Provides near-zero RTO (traffic shifts to the healthy region as soon as health probes detect the failure) but requires stateless application design, a multi-region write strategy for data (Cosmos DB multi-region writes; Azure SQL has a single writable primary per database, so it needs application-level partitioning or conflict handling), and double infrastructure cost. Active/passive keeps a standby region with replicated data but no active compute until failover; lower cost but RTO depends on startup time and DNS propagation.
  • Azure Site Recovery vs application-level DR -- ASR provides infrastructure-level replication for VMs with automated failover and recovery plans. Application-level DR uses IaC (Bicep/Terraform) to deploy fresh infrastructure in the DR region with database geo-replication for data. ASR is simpler for lift-and-shift workloads. Application-level DR is more resilient for cloud-native applications (no VM state dependency) and integrates with CI/CD. Many organizations use both: ASR for legacy VMs, application-level DR for cloud-native services.
  • Azure SQL failover groups vs active geo-replication -- Failover groups provide automatic failover, DNS listener endpoints that follow the primary (no connection string changes), and support for both single databases and elastic pools. Active geo-replication supports up to 4 readable secondaries in any region, allows failover of individual databases (not the whole group), and provides finer control over replication lag. Use failover groups for most applications needing automatic failover. Use active geo-replication for read-scale scenarios or when failover granularity is needed.
  • Cosmos DB consistency level for DR -- Strong consistency (RPO = 0) provides guaranteed zero data loss across regions but increases write latency (requires cross-region quorum) and reduces availability during regional outages. Bounded staleness provides bounded RPO with better write performance. Session consistency (default) performs well but can lose recent writes that had not yet replicated when the region failed. Choose based on data loss tolerance: financial transactions may need strong; user-generated content can tolerate session.
  • GRS vs GZRS vs cross-region replication -- GRS replicates from LRS primary to LRS secondary region (6 copies total, protects against region failure). GZRS replicates from ZRS primary to LRS secondary region (protects against both zone and region failure). Cross-region replication (for services like ACR, Key Vault) independently replicates service data. Use GZRS for data requiring both zone and region resilience. Use GRS when zone resilience is not needed and cost savings matter.
  • Recovery Services vault vs Azure Backup vault -- Recovery Services vault is the traditional backup infrastructure supporting VMs, SQL, Files, and SAP. Azure Backup vault (newer) supports Azure Disks, Azure Blobs, Azure Database for PostgreSQL, and Kubernetes. Some workloads only support one vault type. Plan vault topology early: per-region, per-subscription, aligned with management group structure.
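
Because some workloads only support one vault type, the last decision above can be encoded as a lookup during vault topology planning. This is a sketch only: the workload keys are shorthand invented here for the workloads named in that bullet, and the mapping should be checked against current Azure Backup documentation before use.

```python
# Workload keys are hypothetical shorthand; the grouping follows the bullet above.
VAULT_FOR_WORKLOAD = {
    "vm": "Recovery Services vault",
    "sql": "Recovery Services vault",
    "files": "Recovery Services vault",
    "sap": "Recovery Services vault",
    "disks": "Azure Backup vault",
    "blobs": "Azure Backup vault",
    "postgresql": "Azure Backup vault",
    "kubernetes": "Azure Backup vault",
}


def vault_type(workload: str) -> str:
    """Return the vault type for a workload, or raise for unknown workloads."""
    try:
        return VAULT_FOR_WORKLOAD[workload.lower()]
    except KeyError:
        raise ValueError(f"no vault mapping recorded for workload {workload!r}") from None
```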

Reference Architectures

Multi-Region Active/Passive with Automated Failover

Primary region (East US): Application Gateway -> AKS cluster -> Azure SQL with auto-failover group -> Blob Storage (RA-GZRS). DR region (West US): standby AKS cluster (scaled to minimum), Azure SQL geo-secondary (readable), Blob Storage secondary read endpoint (available because the account is RA-GZRS). Traffic Manager priority routing: primary endpoint health-checked, secondary endpoint activated on primary failure. ASR for any IaaS VMs. Recovery plan runbook: scale up DR AKS cluster, verify SQL failover, update app configuration, validate health endpoints. Target: RPO < 5 min, RTO < 30 min.
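
The health endpoints that Traffic Manager probes in this architecture should report on downstream dependencies, not only on the web process. The sketch below shows one way to aggregate dependency checks into a probe response; the function name `deep_health` and the check names in the usage note are assumptions, and real checks would need short timeouts so the probe itself stays fast.

```python
from typing import Callable, Dict, Tuple


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-check detail)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises counts as unhealthy rather than crashing the probe.
            results[name] = f"error: {exc}"
    # 200 only when every dependency is healthy; 503 tells the prober to fail over.
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

A web handler for a path such as `/healthz` would call `deep_health({"sql": ping_sql, "storage": ping_storage})` and return the status code with the detail dictionary as the body, where `ping_sql` and `ping_storage` are hypothetical functions supplied by the application.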

Database-Tier DR Strategy

Tier 1 (Cosmos DB): multi-region writes with session consistency, automatic failover priority list, RTO near 0 because every region accepts writes; writes not yet replicated out of a failed region can be lost, so RPO is small but not guaranteed to be 0. Tier 2 (Azure SQL): auto-failover group with grace period (1 hour, allowing transient issues to resolve before failover), read-only listener for reporting workloads in DR region. Tier 3 (Azure Database for PostgreSQL): geo-redundant backup with PITR, no hot standby (restore from backup on failover, RTO = 1-2 hours). All tiers: Azure Backup with geo-redundant vault for point-in-time recovery independent of replication.
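
With multi-region writes, two regions can update the same item concurrently, and the last-writer-wins policy keeps the version with the highest value at the conflict resolution path. The sketch below models that rule locally to show what gets discarded; it is not Cosmos DB SDK code. `_ts` is the system timestamp property used by default, the sample documents are made up, and the tie-breaking behavior (first version wins) is a property of this sketch only.

```python
from typing import Any, Iterable, Mapping


def last_writer_wins(versions: Iterable[Mapping[str, Any]],
                     path: str = "_ts") -> Mapping[str, Any]:
    """Return the conflicting version with the highest value at `path`."""
    return max(versions, key=lambda doc: doc[path])


# Two regions wrote the same item before replication caught up (hypothetical data).
east = {"id": "order-1", "status": "paid", "_ts": 1700000100}
west = {"id": "order-1", "status": "cancelled", "_ts": 1700000105}
winner = last_writer_wins([east, west])
```

Here `winner` is the West US version, and the `paid` update is silently dropped, which is why workloads that cannot tolerate lost updates need a custom policy or a single write region.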

Ransomware-Resilient Backup Architecture

Recovery Services vault with immutable policy (backup data cannot be deleted or reduced before retention expiry). Soft delete enabled with 14-day retention. Multi-user authorization requiring two identities to delete backup items. Azure Backup for VMs (daily, 30-day retention), Azure SQL (PITR + LTR weekly/monthly/yearly), Azure Files (daily snapshots, 30-day retention). Vault stored in separate subscription with restricted RBAC (Backup Operator role only, no Contributor). Azure Policy enforcing backup on all VMs and immutable vaults across all subscriptions.
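
The controls in this architecture can be checked as a list of findings against an exported vault configuration. The sketch below assumes a simplified dictionary shape; the keys `immutable`, `soft_delete_days`, `multi_user_authorization`, and `roles` are hypothetical and do not correspond to Azure API property names.

```python
def vault_findings(config: dict) -> list[str]:
    """Return the ransomware-resilience controls that the vault config is missing."""
    findings = []
    if not config.get("immutable", False):
        findings.append("immutability is not enabled")
    if config.get("soft_delete_days", 0) < 14:
        findings.append("soft delete retention is below 14 days")
    if not config.get("multi_user_authorization", False):
        findings.append("multi-user authorization is not enabled")
    if "Contributor" in config.get("roles", []):
        findings.append("Contributor role is assigned on the vault")
    return findings
```

An empty list means the vault matches the architecture described above; any other result can be surfaced in an audit report.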

Full-Stack DR Drill Automation

Azure Automation runbook triggered monthly: (1) ASR test failover to isolated VNet in DR region, (2) Azure SQL failover group planned failover (graceful, no data loss), (3) Cosmos DB manual region failover, (4) smoke test suite against DR endpoints, (5) record actual RTO/RPO metrics to Log Analytics, (6) ASR test failover cleanup, (7) SQL failback to primary, (8) Cosmos DB failback. An Azure Monitor workbook displays a DR readiness dashboard with last drill date, measured RTO/RPO vs targets, and replication health status.
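
The drill sequence has one property worth making explicit: the cleanup and failback steps must run even when an earlier step fails, otherwise a broken drill leaves workloads in the DR region. The sketch below shows that control flow with per-step timing. It is an assumption-laden outline rather than an Azure Automation runbook; each step is a caller-supplied function, and nothing here calls Azure.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_drill(steps: List[Step], cleanup: List[Step]) -> dict:
    """Run drill steps in order, stop at the first failure, always run cleanup."""
    report = {"steps": [], "succeeded": True}

    def run(name: str, action: Callable[[], None]) -> bool:
        started = time.monotonic()
        try:
            action()
            ok = True
        except Exception:
            ok = False
            report["succeeded"] = False
        # Per-step durations are the raw data for the measured RTO.
        report["steps"].append(
            {"name": name, "ok": ok, "seconds": time.monotonic() - started}
        )
        return ok

    for name, action in steps:
        if not run(name, action):
            break  # skip the remaining forward steps
    for name, action in cleanup:
        run(name, action)  # cleanup and failback run regardless of outcome
    return report
```

Steps (1) through (5) would be passed as `steps` and steps (6) through (8) as `cleanup`; the returned report is what would be written to Log Analytics.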


See Also

  • general/disaster-recovery.md -- General disaster recovery patterns, RPO/RTO planning, and testing strategies
  • providers/azure/storage.md -- Storage redundancy tiers (LRS/ZRS/GRS/GZRS) and failover behavior
  • providers/azure/database.md -- Azure SQL failover groups and Cosmos DB multi-region configuration
  • providers/azure/dns.md -- Traffic Manager (DNS-level) and Front Door (Layer 7) failover routing