GCP Disaster Recovery (Cross-Region, Backup, Multi-Region Services)¶
Scope¶
GCP cross-region disaster recovery patterns: Cloud SQL HA and replica promotion, Spanner multi-region, Cloud Storage dual/multi-region, GKE multi-cluster failover, Backup and DR Service, global load balancing failover, and DR testing procedures.
Checklist¶
- [Critical] Define RPO and RTO targets per workload tier and map to GCP service capabilities: Tier 1 (RPO < 5 min, RTO < 15 min) requires multi-region services and pre-provisioned standby; Tier 2 (RPO < 1 hour, RTO < 4 hours) uses cross-region replication with automated failover; Tier 3 (RPO < 24 hours, RTO < 24 hours) relies on backup and restore
- [Critical] Design Cloud SQL high availability: regional HA with automatic failover to standby instance in a different zone (RPO = 0, RTO = 1-2 minutes for same-region zone failure); cross-region read replicas for DR (promote replica to primary on region failure, RPO = replication lag typically seconds, RTO = minutes for promotion + DNS update)
- [Critical] Configure Spanner for multi-region deployments: multi-region instance configurations (nam6, eur6, nam-eur-asia1) provide automatic replication across regions with RPO = 0 and RTO = 0 (transparent failover); single-region configurations require manual DR planning; multi-region costs roughly 3x single-region
- [Recommended] Plan GKE multi-cluster disaster recovery: Multi Cluster Ingress or Gateway API for cross-cluster traffic routing, Backup for GKE (managed Velero) for cluster state and persistent volume snapshots, GitOps (Config Sync, Argo CD) for declarative cluster configuration replication; standby cluster can be scaled-down or full-capacity depending on RTO requirements
- [Critical] Configure Cloud Storage dual-region or multi-region for data resilience: dual-region buckets (e.g., us-central1+us-east1) provide automatic replication with turbo replication option (RPO < 15 minutes vs default RPO < 12 hours); multi-region buckets (us, eu, asia) provide broadest geo-redundancy; both provide 99.95% availability SLA vs 99.9% for regional
- [Recommended] Enable turbo replication for dual-region Cloud Storage buckets storing critical data: guarantees RPO < 15 minutes for 100% of objects (vs best-effort < 12 hours default); additional cost but essential for data that cannot tolerate longer replication lag (a bucket-creation sketch follows this checklist)
- [Critical] Set up Backup and DR Service (formerly Actifio) for VM and database backup: policy-driven backup scheduling, application-consistent snapshots, instant mount for rapid recovery, cross-region backup vault for regional DR; supports Compute Engine VMs, Cloud SQL, and SAP HANA
- [Critical] Design cross-region failover patterns: Global External Application Load Balancer (automatic backend failover across regions for HTTP), Cloud DNS failover routing policy (DNS-level failover for non-HTTP), and Traffic Director (service mesh failover with Envoy); health checks determine backend availability at each layer
- [Critical] Plan Firestore/Datastore multi-region: choose multi-region location (nam5, eur3) at database creation for automatic multi-region replication with strong consistency; single-region databases require export/import for cross-region DR; multi-region location cannot be changed after creation
- [Recommended] Implement deployment automation for DR: Terraform or Deployment Manager for infrastructure provisioning in DR region, Cloud Build triggers for automated DR environment deployment, pre-built container images in multi-region Artifact Registry for fast service deployment during failover
- [Optional] Configure Memorystore Redis cross-region replication: create secondary instance in DR region with automatic async replication; manual failover promotes secondary to primary; useful for cache warming in DR region before application failover
- [Critical] Test DR regularly: schedule quarterly failover drills using Cloud SQL replica promotion (followed by rebuild), GKE cluster failover with Multi Cluster Ingress, and Cloud DNS routing policy switch; document measured RTO/RPO vs targets; automate with Cloud Workflows
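As referenced in the Cloud Storage items above, a minimal sketch of creating a dual-region bucket and enabling turbo replication, assuming the google-cloud-storage client library (2.x) and Application Default Credentials; the project and bucket names are placeholders:

```python
from google.cloud import storage
from google.cloud.storage.constants import RPO_ASYNC_TURBO

client = storage.Client(project="my-project")  # placeholder project ID

# location="US" plus data_locations selects a specific dual-region pair
# within the US multi-region.
bucket = client.create_bucket(
    "critical-orders-bucket",  # placeholder bucket name
    location="US",
    data_locations=["US-CENTRAL1", "US-EAST1"],
)

# Switch from default (best-effort) replication to turbo replication
# for the RPO < 15 minutes guarantee.
bucket.rpo = RPO_ASYNC_TURBO
bucket.patch()
print(bucket.rpo, bucket.location)
```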
Why This Matters¶
GCP regions are designed for independent failure, but regional outages do occur (networking issues, power events, cooling failures). Without a tested DR strategy, recovery time depends entirely on Google restoring the affected region. GCP provides multi-region services (Spanner, Firestore, Cloud Storage multi-region) with built-in cross-region replication, but many commonly used services (Cloud SQL, Memorystore, GKE, Compute Engine) are regional and require explicit DR configuration.
The gap between DR capability and DR readiness is the testing gap. Organizations configure cross-region replication but never test the full failover sequence: promoting a Cloud SQL replica, reconfiguring application connection strings, switching DNS or load balancer routing, validating data consistency, and confirming full application functionality. Untested DR plans fail under real outage conditions due to missing steps, stale configurations, or unexpected dependencies.
Cloud Storage dual-region with turbo replication vs multi-region is a critical cost/RPO trade-off. Standard dual-region replication has a best-effort RPO of 12 hours for geo-redundancy (most objects replicate much faster, but the SLA is 12 hours). Turbo replication guarantees RPO < 15 minutes but costs approximately 2x the standard dual-region storage price. For data where a 12-hour RPO is unacceptable, turbo replication is essential.
Common Decisions (ADR Triggers)¶
- Multi-region services vs cross-region replication -- Multi-region services (Spanner multi-region, Firestore multi-region, Cloud Storage multi-region) handle replication automatically with zero operational overhead and near-zero RPO/RTO. Cross-region replication (Cloud SQL read replicas, Memorystore replication, GKE multi-cluster) requires configuration, monitoring, and manual or scripted failover. Multi-region services cost more (Spanner multi-region is 3x single-region) but eliminate DR operational burden. Use multi-region services for data requiring the strongest DR guarantees. Use cross-region replication for cost-sensitive workloads with acceptable RTO.
- Hot standby vs pilot light vs cold standby -- Hot standby maintains a fully running DR environment (duplicate compute, databases, load balancers) for near-instant failover but at full infrastructure cost. Pilot light maintains the minimum core infrastructure (database replicas, base GKE cluster) with application deployment on failover. Cold standby relies on infrastructure-as-code to provision from scratch during DR events. Hot standby for RPO/RTO < 15 minutes. Pilot light for RTO < 1 hour. Cold standby for RTO < 4 hours where cost savings outweigh recovery speed.
- Cloud SQL HA vs cross-region read replica for DR -- Cloud SQL HA provides automatic failover within a region (zone failure protection) with RPO = 0 and RTO = 1-2 minutes. Cross-region read replicas protect against region failure but require manual promotion (RTO = minutes for promotion + application reconfiguration) and have non-zero RPO (replication lag). Use both: HA for zone failures (automatic), cross-region replica for region failures (manual failover). Cloud SQL HA alone does not protect against region-level outages. (A replica-creation sketch follows this list.)
- Cloud Storage dual-region vs multi-region -- Dual-region stores data in two specific regions (e.g., us-central1 + us-east1) with deterministic data placement and optional turbo replication. Multi-region stores data across a broad geographic area (US, EU, Asia) for maximum availability but without control over specific regions. Dual-region is preferred when data residency requirements dictate specific regions or when turbo replication SLA is needed. Multi-region provides broadest availability for globally accessed data.
- Backup and DR Service vs native backup tools -- Backup and DR Service provides centralized policy management, application-consistent backups, instant mount recovery, and cross-region vault. Native tools (Cloud SQL automated backups, GKE Backup, Compute Engine snapshots) are simpler and tightly integrated with individual services. Use native tools for single-service backup needs. Use Backup and DR Service for centralized management across multiple services, compliance-driven backup policies, or SAP/Oracle workloads.
- GKE multi-cluster strategy -- Active/active multi-cluster with Multi Cluster Ingress distributes traffic across clusters in different regions for both performance and DR. Active/passive with a standby cluster scaled to minimum (or zero with cluster autoscaler) reduces cost but increases failover time. Stateless workloads favor active/active. Stateful workloads (with persistent volumes) are more complex and may favor active/passive with Backup for GKE for PV replication.
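As referenced in the Cloud SQL decision above, a sketch of creating the cross-region read replica, assuming google-api-python-client against the Cloud SQL Admin API; instance names, machine tier, and regions are placeholders:

```python
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")

# A read replica is created with instances.insert plus masterInstanceName;
# placing it in another region makes it a cross-region DR replica.
replica = {
    "name": "orders-db-replica-east",           # placeholder replica name
    "masterInstanceName": "orders-db-primary",  # existing primary instance
    "region": "us-east1",                       # DR region
    "settings": {"tier": "db-custom-4-16384"},  # placeholder machine tier
}
op = sqladmin.instances().insert(project="my-project", body=replica).execute()
print(op.get("status"))
```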
Reference Architectures¶
Multi-Region Active/Active Web Application¶
Global External Application Load Balancer -> GKE clusters in us-central1 and europe-west1 (both serving traffic). Spanner multi-region instance (nam-eur-asia1) for database with RPO = 0 and transparent failover. Cloud Storage multi-region (us) for static assets. Memorystore Redis in each region for session cache (no cross-region replication needed for stateless sessions). Pub/Sub (global) for async messaging. If one region fails, the load balancer automatically routes all traffic to the healthy region. No manual intervention required.
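A minimal sketch of provisioning the Spanner multi-region instance in this architecture, assuming the google-cloud-spanner client library; the instance ID, display name, and node count are placeholders:

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")  # placeholder project

# Multi-region instance configs are referenced by full resource name.
config = f"projects/{client.project}/instanceConfigs/nam-eur-asia1"
instance = client.instance(
    "payments-db",               # placeholder instance ID
    configuration_name=config,
    node_count=3,                # placeholder capacity
    display_name="Payments (multi-region)",
)
instance.create().result(timeout=1800)  # long-running create; blocks until ready
```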
Cloud SQL DR with Cross-Region Failover¶
Primary: Cloud SQL PostgreSQL in us-central1 with HA enabled (zone failover). DR: cross-region read replica in us-east1. Application connection via Cloud SQL Proxy. Failover procedure (automated via Cloud Workflows): (1) promote us-east1 replica to standalone primary, (2) update Cloud DNS private zone CNAME from primary to promoted instance, (3) restart application pods to pick up the new connection, (4) validate data consistency and application health, (5) create new read replica from promoted instance for future DR. Target: RPO < 30 seconds (replication lag), RTO < 15 minutes.
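A sketch of steps (1) and (2) of this failover procedure, assuming google-api-python-client for the Cloud SQL Admin API and the google-cloud-dns client; all resource, zone, and DNS names are placeholders, and the restart, validation, and replica-rebuild steps are omitted:

```python
from googleapiclient import discovery
from google.cloud import dns

PROJECT = "my-project"  # placeholder project ID

# (1) Promote the us-east1 replica to a standalone primary.
sqladmin = discovery.build("sqladmin", "v1")
sqladmin.instances().promoteReplica(
    project=PROJECT, instance="orders-db-replica-east"  # placeholder name
).execute()

# (2) Repoint the private-zone CNAME from the old primary to the
# promoted instance.
dns_client = dns.Client(project=PROJECT)
zone = dns_client.zone("internal", "internal.example.")  # placeholder zone
old_rrs = zone.resource_record_set(
    "db.internal.example.", "CNAME", 300, ["orders-db-primary.gcp.example."]
)
new_rrs = zone.resource_record_set(
    "db.internal.example.", "CNAME", 60, ["orders-db-replica-east.gcp.example."]
)
change = zone.changes()
change.delete_record_set(old_rrs)  # a DNS change deletes the old record set
change.add_record_set(new_rrs)     # and adds the replacement in one change
change.create()
```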
GKE Multi-Cluster with Backup for GKE¶
Primary GKE cluster in us-central1 with Backup for GKE (daily backup of cluster configuration + persistent volumes). Standby GKE cluster in us-east1 with minimal node pool (scale-to-zero with autoscaler). Config Sync replicating Kubernetes manifests from Git to both clusters. Failover: (1) scale up standby cluster node pool, (2) restore persistent volumes from latest Backup for GKE snapshot, (3) Multi Cluster Ingress shifts traffic to the standby cluster, (4) validate workload health. Post-failover: rebuild primary cluster and re-establish as standby.
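A sketch of step (1), scaling the standby node pool, assuming the google-cloud-container client library; the project, location, cluster, and pool names are placeholders:

```python
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

# Fully qualified node pool name; every segment is a placeholder.
pool = (
    "projects/my-project/locations/us-east1"
    "/clusters/standby-cluster/nodePools/default-pool"
)
# Resize the standby pool from zero to failover capacity.
op = client.set_node_pool_size(request={"name": pool, "node_count": 6})
print(op.status)
```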
Tiered DR Strategy¶
Tier 1 (payment service): Spanner multi-region (RPO = 0, RTO = 0), Global Load Balancer active/active across regions, Cloud Storage dual-region with turbo replication. Tier 2 (order service): Cloud SQL with cross-region read replica (RPO < 30s, RTO < 15 min), GKE standby cluster in DR region with pilot light configuration. Tier 3 (reporting): Cloud SQL automated backups with cross-region storage (RPO < 24 hours), restore from backup on demand (RTO < 4 hours), no standby infrastructure. All tiers: Terraform for infrastructure provisioning, Cloud Build for automated DR deployment, quarterly DR drill with documented results.
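To make the quarterly drill's "documented results" concrete, a small sketch that checks measured drill numbers against the tier targets defined in the checklist at the top of this section; the measured values in the example are illustrative:

```python
from datetime import timedelta

# RPO/RTO targets per tier, from the checklist above.
TARGETS = {
    "tier1": {"rpo": timedelta(minutes=5), "rto": timedelta(minutes=15)},
    "tier2": {"rpo": timedelta(hours=1), "rto": timedelta(hours=4)},
    "tier3": {"rpo": timedelta(hours=24), "rto": timedelta(hours=24)},
}

def drill_met_targets(tier: str, rpo: timedelta, rto: timedelta) -> bool:
    """True when a drill's measured RPO/RTO are within the tier's targets."""
    target = TARGETS[tier]
    return rpo < target["rpo"] and rto < target["rto"]

# Illustrative tier-2 drill: replica promoted with 12 s of replication
# lag, full cutover completed in 11 minutes.
print(drill_met_targets("tier2", timedelta(seconds=12), timedelta(minutes=11)))
```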
Reference Links¶
- GCP disaster recovery planning guide -- DR patterns, architecture scenarios, and RPO/RTO planning
- Cloud SQL high availability documentation -- regional HA, cross-region read replicas, and failover configuration
- Backup and DR Service documentation -- policy-driven backup, instant mount recovery, and cross-region vaults
- Cloud Storage turbo replication -- RPO guarantees for dual-region buckets
See Also¶
- general/disaster-recovery.md -- general DR planning (RPO/RTO, tiering, testing)
- providers/gcp/data.md -- GCP database services and HA configuration
- providers/gcp/storage.md -- GCP Cloud Storage classes and replication
- providers/gcp/networking.md -- GCP global load balancing and Cloud DNS failover routing