Azure Disaster Recovery (Site Recovery, Geo-Redundancy, Backup)

Scope

Azure business continuity and disaster recovery services. Covers Azure Site Recovery (VM replication and recovery plans), storage redundancy (LRS/ZRS/GRS/GZRS), Azure SQL geo-replication and failover groups, Cosmos DB multi-region writes, Azure Backup, and DR drill automation.

Checklist

Why This Matters

Azure regions can experience outages lasting hours to days. Without a tested DR strategy, recovery depends on Azure restoring the affected region -- which is outside your control. Azure Site Recovery, geo-redundant storage, and database geo-replication provide the building blocks, but they must be configured, tested, and maintained to deliver the promised RPO/RTO.

The most common DR failure is an untested recovery plan. Organizations configure ASR replication but never run a test failover, then discover during an actual outage that recovery plans have missing steps, incorrect boot ordering, or stale connection strings. Monthly test failovers into an isolated VNet are essential, and the added cost is limited to compute for the test VMs while the test runs.

Storage redundancy choices have significant cost implications: GRS costs roughly 2x LRS but provides cross-region durability. RA-GRS adds read access to the secondary, but replication is asynchronous, so the secondary is only eventually consistent: the lag is typically under 15 minutes, and Azure Storage provides no SLA on it. For databases, the cost of geo-replication (an additional database instance in the DR region) must be budgeted alongside the primary.

Choosing between Azure SQL failover groups and active geo-replication is a key decision: failover groups provide DNS-based listener endpoints that follow the primary after failover (no connection string changes), while active geo-replication requires application-level failover logic but supports more than one secondary and finer-grained control.
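As a minimal sketch of why failover groups avoid connection string changes: the application connects to the failover group listener rather than to a specific logical server. The listener host-name pattern below is the documented one for Azure SQL Database failover groups; the failover group name `contoso-fog`, the database name, and the helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverGroupEndpoints:
    read_write: str  # always resolves to the current primary
    read_only: str   # resolves to the readable geo-secondary


def listener_endpoints(failover_group: str) -> FailoverGroupEndpoints:
    """Build listener host names for an Azure SQL Database failover group."""
    suffix = "database.windows.net"
    return FailoverGroupEndpoints(
        read_write=f"{failover_group}.{suffix}",
        read_only=f"{failover_group}.secondary.{suffix}",
    )


def connection_string(host: str, database: str, read_only: bool = False) -> str:
    """Assemble an ODBC-style connection string (illustrative keywords only)."""
    parts = [f"Server=tcp:{host},1433", f"Database={database}", "Encrypt=yes"]
    if read_only:
        # Route reporting traffic to the secondary via the read-only listener.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


endpoints = listener_endpoints("contoso-fog")  # hypothetical group name
app_conn = connection_string(endpoints.read_write, "orders")
report_conn = connection_string(endpoints.read_only, "orders", read_only=True)
```

Because both strings reference the group name and not a server name, they stay valid after a failover; with active geo-replication alone, the application would have to switch server names itself.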

Common Decisions (ADR Triggers)

Active/active vs active/passive DR -- Active/active deploys application instances in two or more regions serving traffic simultaneously via Front Door or Traffic Manager. Provides near-zero RTO (traffic shifts to the surviving region as soon as health probes mark the failed region unhealthy) but requires stateless application design, multi-region database writes (Cosmos DB multi-region writes or Azure SQL with conflict handling), and double infrastructure cost. Active/passive keeps a standby region with replicated data but no active compute until failover; lower cost but RTO depends on startup time and DNS propagation.
Azure Site Recovery vs application-level DR -- ASR provides infrastructure-level replication for VMs with automated failover and recovery plans. Application-level DR uses IaC (Bicep/Terraform) to deploy fresh infrastructure in the DR region with database geo-replication for data. ASR is simpler for lift-and-shift workloads. Application-level DR is more resilient for cloud-native applications (no VM state dependency) and integrates with CI/CD. Many organizations use both: ASR for legacy VMs, application-level DR for cloud-native services.
Azure SQL failover groups vs active geo-replication -- Failover groups provide automatic failover, DNS listener endpoints that follow the primary (no connection string changes), and support for both single databases and elastic pools. Active geo-replication supports up to 4 readable secondaries in any region, allows failover of individual databases (not the whole group), and provides finer control over replication lag. Use failover groups for most applications needing automatic failover. Use active geo-replication for read-scale scenarios or when failover granularity is needed.
Cosmos DB consistency level for DR -- Strong consistency (RPO = 0) provides guaranteed zero data loss across regions but increases write latency (requires cross-region quorum), is not available on accounts with multi-region writes, and reduces availability during regional outages. Bounded staleness limits the RPO to the configured staleness window with better write performance. Session consistency (default) replicates asynchronously, so writes that had not yet replicated when the region failed can be lost, in exchange for good performance. Choose based on data loss tolerance: financial transactions may need strong; user-generated content can tolerate session.
GRS vs GZRS vs cross-region replication -- GRS replicates from LRS primary to LRS secondary region (6 copies total, protects against region failure). GZRS replicates from ZRS primary to LRS secondary region (protects against both zone and region failure). Cross-region replication (for services like ACR, Key Vault) independently replicates service data. Use GZRS for data requiring both zone and region resilience. Use GRS when zone resilience is not needed and cost savings matter.
Recovery Services vault vs Azure Backup vault -- Recovery Services vault is the traditional backup infrastructure supporting Azure VMs, SQL Server and SAP HANA running in Azure VMs, and Azure Files. Azure Backup vault (newer) supports Azure Disks, Azure Blobs, Azure Database for PostgreSQL, and Kubernetes (AKS). Some workloads only support one vault type. Plan vault topology early: per-region, per-subscription, aligned with management group structure.
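The GRS vs GZRS decision above reduces to three yes/no questions, which can be sketched as a small lookup. The returned strings are the standard Azure Storage SKU names; the function name and its parameters are hypothetical, and cost is deliberately left out of this sketch.

```python
def pick_storage_sku(zone_resilient: bool, region_resilient: bool,
                     read_secondary: bool = False) -> str:
    """Map resilience requirements to an Azure Storage redundancy SKU name."""
    if read_secondary and not region_resilient:
        # Read access to a secondary only exists for geo-replicated accounts.
        raise ValueError("read access to the secondary requires GRS or GZRS")
    if region_resilient:
        base = "GZRS" if zone_resilient else "GRS"
        prefix = "RA" if read_secondary else ""
        return f"Standard_{prefix}{base}"
    return "Standard_ZRS" if zone_resilient else "Standard_LRS"


# Zone and region resilience, plus reads from the DR region:
sku = pick_storage_sku(zone_resilient=True, region_resilient=True,
                       read_secondary=True)
```

A helper like this is mainly useful inside IaC templates or policy checks, where the resilience requirement is recorded once per data classification and the SKU is derived from it.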

Reference Architectures

Multi-Region Active/Passive with Automated Failover

Primary region (East US): Application Gateway -> AKS cluster -> Azure SQL with auto-failover group -> Blob Storage (RA-GZRS, because the DR region reads from the secondary endpoint). DR region (West US): standby AKS cluster (scaled to minimum), Azure SQL geo-secondary (readable), Blob Storage read-only secondary endpoint. Traffic Manager priority routing: the primary endpoint is health-checked, and traffic moves to the secondary endpoint when the primary fails its probes. ASR for any IaaS VMs. Recovery plan runbook: scale up DR AKS cluster, verify SQL failover, update app configuration, validate health endpoints. Target: RPO < 5 min, RTO < 30 min.
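The priority routing behaviour in this architecture can be modelled in a few lines. This is a simplified sketch of the selection rule only (lowest priority number among healthy endpoints wins); it does not call Traffic Manager, and the `Endpoint` type, endpoint names, and `route` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int  # lower number = preferred, as in Traffic Manager
    healthy: bool  # result of the most recent health probe


def route(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


normal = [Endpoint("eastus", 1, True), Endpoint("westus", 2, True)]
outage = [Endpoint("eastus", 1, False), Endpoint("westus", 2, True)]

active_normal = route(normal)   # primary serves traffic
active_outage = route(outage)   # secondary takes over
```

In practice the switch is not instantaneous: it takes effect only after the probes mark the primary as degraded and clients refresh their DNS answers, which is why probe interval and DNS TTL belong in the RTO budget.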

Database-Tier DR Strategy

Tier 1 (Cosmos DB): multi-region writes with session consistency. Replication between regions is asynchronous at this consistency level, so writes that had not yet replicated when a region fails can be lost; RPO = 0 requires strong consistency, which is not available with multi-region writes. A service-managed failover priority list applies to single-write-region accounts. Tier 2 (Azure SQL): auto-failover group with grace period (1 hour, allowing transient issues to resolve before failover), read-only listener for reporting workloads in DR region. Tier 3 (Azure Database for PostgreSQL): geo-redundant backup with PITR, no hot standby (restore from backup on failover, RTO = 1-2 hours). All tiers: Azure Backup with geo-redundant vault for point-in-time recovery independent of replication.

Ransomware-Resilient Backup Architecture

Recovery Services vault with immutable policy (backup data cannot be deleted or reduced before retention expiry). Soft delete enabled with 14-day retention. Multi-user authorization requiring two identities to delete backup items. Azure Backup for VMs (daily, 30-day retention), Azure SQL (PITR + LTR weekly/monthly/yearly), Azure Files (daily snapshots, 30-day retention). Vault stored in separate subscription with restricted RBAC (Backup Operator role only, no Contributor). Azure Policy enforcing backup on all VMs and immutable vaults across all subscriptions.
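The controls listed above lend themselves to an automated audit. The sketch below checks a vault description against them; `VaultConfig` is a hypothetical data shape invented for illustration (it is not an Azure SDK type), and in a real setup the values would come from Azure Policy compliance data or a vault inventory export.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VaultConfig:  # hypothetical shape, not an Azure SDK type
    immutable: bool
    soft_delete_days: int
    multi_user_authorization: bool
    role_assignments: List[str] = field(default_factory=list)


def audit(vault: VaultConfig, min_soft_delete_days: int = 14) -> List[str]:
    """Return a list of findings; an empty list means all checks passed."""
    findings = []
    if not vault.immutable:
        findings.append("immutability is not enabled")
    if vault.soft_delete_days < min_soft_delete_days:
        findings.append(
            f"soft delete retention is {vault.soft_delete_days} days, "
            f"expected at least {min_soft_delete_days}"
        )
    if not vault.multi_user_authorization:
        findings.append("multi-user authorization is not enabled")
    # The architecture allows only the Backup Operator role on the vault.
    extra = sorted(set(vault.role_assignments) - {"Backup Operator"})
    if extra:
        findings.append("unexpected roles assigned: " + ", ".join(extra))
    return findings


hardened = VaultConfig(True, 14, True, ["Backup Operator"])
weak = VaultConfig(False, 7, False, ["Contributor", "Backup Operator"])
```

Running `audit(hardened)` yields no findings, while `audit(weak)` reports one finding per missing control, which makes the result easy to surface on a readiness dashboard.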

Full-Stack DR Drill Automation

Azure Automation runbook triggered monthly: (1) ASR test failover to isolated VNet in DR region, (2) Azure SQL failover group planned failover (graceful, no data loss), (3) Cosmos DB manual region failover, (4) smoke test suite against DR endpoints, (5) record actual RTO/RPO metrics to Log Analytics, (6) ASR test failover cleanup, (7) SQL failback to primary, (8) Cosmos DB failback. Azure Monitor workbook displaying DR readiness dashboard with last drill date, measured RTO/RPO vs targets, and replication health status.
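The drill sequence above can be sketched as a small orchestrator that times the failover phase and always attempts cleanup and failback, even when a step fails. This is a sketch under stated assumptions: each step is a hypothetical callable standing in for the real ASR, Azure SQL, or Cosmos DB operation, and writing the result to Log Analytics is left to the caller.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action)


def run_drill(failover_steps: List[Step], cleanup_steps: List[Step],
              rto_target_seconds: float,
              clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Run failover steps in order, measure elapsed time, then clean up."""
    started = clock()
    completed: List[str] = []
    error = None
    for name, action in failover_steps:
        try:
            action()
        except Exception as exc:  # stop the failover phase at first failure
            error = f"{name}: {exc}"
            break
        completed.append(name)
    measured = clock() - started

    # Cleanup and failback run regardless, so a failed drill does not
    # leave test VMs running or databases pointing at the DR region.
    cleanup_errors: List[str] = []
    for name, action in cleanup_steps:
        try:
            action()
        except Exception as exc:
            cleanup_errors.append(f"{name}: {exc}")

    return {
        "completed": completed,
        "error": error,
        "rto_seconds": measured,
        "rto_met": error is None and measured <= rto_target_seconds,
        "cleanup_errors": cleanup_errors,
    }
```

A caller would pass steps (1) to (4) as `failover_steps` and steps (6) to (8) as `cleanup_steps`, then forward the returned dictionary to the metrics store as step (5). Measuring until the smoke tests pass, not until failover is merely initiated, keeps the recorded RTO honest.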

Reference Links

Azure Site Recovery documentation -- VM replication, recovery plans, failover, and DR drill automation
Azure Backup documentation -- VM backup, Recovery Services vaults, immutable vaults, and soft delete
Azure SQL geo-replication and failover groups -- automatic failover, listener endpoints, and grace periods
Azure Storage redundancy -- LRS, ZRS, GRS, GZRS, and read-access options