OpenShift Data Protection¶
Scope¶
OpenShift data protection: OADP operator (Velero), backup storage locations, volume snapshots (CSI, Restic/Kopia), scheduled backups, etcd backup/restore, namespace-level backup strategies, cross-cluster restore, and DR runbooks.
Checklist¶
- [Critical] Deploy OADP operator (OpenShift API for Data Protection) with Velero and appropriate plugins (AWS, Azure, GCP, CSI)
- [Critical] Configure backup storage location (BSL): S3-compatible object storage (AWS S3, MinIO, ODF MCG)
- [Critical] Configure volume snapshot location (VSL): CSI snapshots for block storage, Restic/Kopia for filesystem-level backup
- [Critical] Define scheduled backup policies via
ScheduleCRDs: daily application backups, weekly full-cluster backups - [Critical] Implement etcd backup and restore procedures: automated CronJob or manual
etcdctl snapshot savevia debug pod - [Recommended] Design namespace-level backup strategy: label selectors for application grouping, include/exclude resource filters
- [Critical] Test restore procedures regularly: full namespace restore, single resource restore, cross-cluster restore
- [Recommended] Plan PV backup approach: CSI VolumeSnapshot (fast, storage-level) vs Restic/Kopia (portable, file-level, slower)
- [Critical] Configure backup encryption: Velero supports server-side encryption (SSE-S3, SSE-KMS) for backup artifacts
- [Recommended] Set retention policies: backup TTL per schedule, snapshot lifecycle, storage cost management
- [Critical] Document disaster recovery runbook: cluster rebuild, etcd restore, application restore priorities (RPO/RTO targets)
- [Optional] Evaluate RHACM for multi-cluster failover: managed cluster backup, hub recovery, application DR policies
- [Recommended] Plan GitOps-based DR: rebuild cluster from Git (infrastructure-as-code + GitOps manifests) vs restore from backup
- [Critical] Back up cluster configuration: OAuth, certificates, custom CRDs, operator configurations, cluster-scoped resources
Why This Matters¶
Data protection in OpenShift spans two distinct domains: cluster-level protection (etcd, certificates, cluster configuration) and application-level protection (namespaces, deployments, PVs, secrets). Losing etcd means losing the entire cluster state -- every API object, every secret, every certificate. OADP handles application-level backup and restore but does not back up etcd or cluster-scoped configuration. These are complementary, not substitutes.
OADP is built on Velero and provides Backup, Restore, and Schedule CRDs. A backup captures Kubernetes resources (stored as JSON in object storage) and optionally snapshots persistent volumes. The CSI plugin uses VolumeSnapshot objects for fast, storage-level snapshots (seconds for block storage). The Restic/Kopia integration provides file-level backup for storage backends that do not support CSI snapshots -- it is portable across storage providers but significantly slower for large volumes.
etcd backup is a separate concern. The recommended approach is to run etcdctl snapshot save from a control plane node debug pod or automate it via a CronJob that mounts the etcd data directory. The backup produces a single file that can restore the full cluster state. etcd backups should be stored off-cluster (S3, NFS) and tested regularly.
Disaster recovery strategy depends on RPO (how much data loss is acceptable) and RTO (how quickly recovery must complete). GitOps-based DR (rebuild from Git + restore PV data) is slower but more reliable for complex environments. Backup-based DR (OADP restore to a standby cluster) is faster but requires consistent backup schedules and tested restore procedures. RHACM adds managed cluster backup and application failover policies for multi-cluster architectures.
A common failure mode is backing up everything but never testing restores. Backup validation -- including periodic restore drills to a non-production cluster -- is as important as the backup itself. OADP supports backup validation via the Restore CRD with namespace mapping (restore to a different namespace for validation).
Common Decisions (ADR Triggers)¶
- CSI snapshots vs Restic/Kopia: CSI snapshots are fast and atomic but tied to the storage provider (cannot restore to a different storage backend). Restic/Kopia is slower but portable. Many organizations use CSI snapshots for operational recovery (fast rollback) and Restic/Kopia for disaster recovery (cross-cluster portability).
- Backup scope: Namespace-level (fine-grained, team-managed) vs cluster-level (coarse, platform-managed). Namespace-level backups are easier to test and restore. Cluster-level backups capture cross-namespace dependencies but are harder to partially restore.
- DR strategy: Active-passive (standby cluster with OADP restore) vs active-active (application-level replication) vs rebuild-from-Git (GitOps + PV restore). Active-passive is simplest. Active-active requires application-level replication (database replication, event streaming). Rebuild-from-Git is slowest but most resilient to infrastructure-level failures.
- etcd backup frequency: Hourly (low RPO, more storage) vs daily (higher RPO, less storage). etcd backups are typically small (hundreds of MB) so hourly is feasible. Critical to take a backup before cluster upgrades.
- RHACM for DR: RHACM provides
ManagedClusterBackupfor hub cluster DR andApplicationDRPlacementfor workload failover. Evaluate whether the RHACM subscription cost is justified vs manual DR procedures. - Backup storage location: Same region (fast backup, shared failure domain) vs cross-region (slower backup, independent failure domain). Cross-region is recommended for true DR. Consider multi-cloud backup (e.g., on-prem cluster backing up to AWS S3) for infrastructure independence.
Reference Architectures¶
- Standard production DR: OADP with daily scheduled backups (CSI snapshots for PVs, 30-day retention), etcd backup CronJob every 4 hours to S3, GitOps repo as source of truth for cluster configuration, quarterly restore drills, RPO 4 hours / RTO 4 hours.
- Regulated environment (finance, healthcare): OADP with hourly backups, CSI snapshots with cross-region replication (ODF Regional-DR), etcd backup every hour, encrypted backup storage (SSE-KMS), documented restore runbook with compliance attestation, RPO 1 hour / RTO 2 hours.
- Multi-cluster DR with RHACM: Hub cluster with ManagedClusterBackup, spoke clusters with OADP, ApplicationDRPlacement policies for automated failover, ODF Metro-DR for synchronous PV replication (RPO ~0), application placement rules for active-passive failover.
- GitOps-first DR: Cluster infrastructure defined in Terraform/Ansible, all application manifests in Git (ArgoCD), OADP only for stateful PV data, restore procedure: provision new cluster -> bootstrap ArgoCD -> sync from Git -> restore PV snapshots from S3. RTO 2-4 hours depending on cluster size.
- Edge / remote site DR: SNO (Single Node OpenShift) clusters managed by RHACM, OADP with Restic to central S3 (no CSI snapshots on some edge storage), application state replicated to hub via Kafka/MQTT, edge rebuild via ZTP (Zero Touch Provisioning) from hub cluster, RPO 24 hours / RTO 4 hours.
Reference Links¶
- OADP (OpenShift API for Data Protection) documentation -- Velero-based application backup, restore, and migration on OpenShift
- OpenShift backup and restore -- etcd backup, cluster state recovery, and disaster recovery procedures
- OpenShift disaster recovery -- control plane recovery scenarios and procedures
See Also¶
general/disaster-recovery.md-- general DR planning (RPO/RTO, tiering)general/enterprise-backup.md-- enterprise backup architecture patternsproviders/openshift/storage.md-- ODF and CSI storage backends for snapshotsproviders/kubernetes/operations.md-- etcd backup and cluster operations