OpenShift Data Protection¶

Scope¶

OpenShift data protection: OADP operator (Velero), backup storage locations, volume snapshots (CSI, Restic/Kopia), scheduled backups, etcd backup/restore, namespace-level backup strategies, cross-cluster restore, and DR runbooks.

Checklist¶

Why This Matters¶

Data protection in OpenShift spans two distinct domains: cluster-level protection (etcd, certificates, cluster configuration) and application-level protection (namespaces, deployments, PVs, secrets). Losing etcd means losing the entire cluster state -- every API object, every secret, every certificate. OADP handles application-level backup and restore but does not back up etcd or cluster-scoped configuration. These are complementary, not substitutes.

OADP is built on Velero and provides Backup, Restore, and Schedule CRDs. A backup captures Kubernetes resources (stored as JSON in object storage) and optionally snapshots persistent volumes. The CSI plugin uses VolumeSnapshot objects for fast, storage-level snapshots (seconds for block storage). The Restic/Kopia integration provides file-level backup for storage backends that do not support CSI snapshots -- it is portable across storage providers but significantly slower for large volumes.

etcd backup is a separate concern. The recommended approach is to run etcdctl snapshot save from a control plane node debug pod or automate it via a CronJob that mounts the etcd data directory. The backup produces a single file that can restore the full cluster state. etcd backups should be stored off-cluster (S3, NFS) and tested regularly.

Disaster recovery strategy depends on RPO (how much data loss is acceptable) and RTO (how quickly recovery must complete). GitOps-based DR (rebuild from Git + restore PV data) is slower but more reliable for complex environments. Backup-based DR (OADP restore to a standby cluster) is faster but requires consistent backup schedules and tested restore procedures. RHACM adds managed cluster backup and application failover policies for multi-cluster architectures.

A common failure mode is backing up everything but never testing restores. Backup validation -- including periodic restore drills to a non-production cluster -- is as important as the backup itself. OADP supports backup validation via the Restore CRD with namespace mapping (restore to a different namespace for validation).

Common Decisions (ADR Triggers)¶

CSI snapshots vs Restic/Kopia: CSI snapshots are fast and atomic but tied to the storage provider (cannot restore to a different storage backend). Restic/Kopia is slower but portable. Many organizations use CSI snapshots for operational recovery (fast rollback) and Restic/Kopia for disaster recovery (cross-cluster portability).
Backup scope: Namespace-level (fine-grained, team-managed) vs cluster-level (coarse, platform-managed). Namespace-level backups are easier to test and restore. Cluster-level backups capture cross-namespace dependencies but are harder to partially restore.
DR strategy: Active-passive (standby cluster with OADP restore) vs active-active (application-level replication) vs rebuild-from-Git (GitOps + PV restore). Active-passive is simplest. Active-active requires application-level replication (database replication, event streaming). Rebuild-from-Git is slowest but most resilient to infrastructure-level failures.
etcd backup frequency: Hourly (low RPO, more storage) vs daily (higher RPO, less storage). etcd backups are typically small (hundreds of MB) so hourly is feasible. Critical to take a backup before cluster upgrades.
RHACM for DR: RHACM provides ManagedClusterBackup for hub cluster DR and ApplicationDRPlacement for workload failover. Evaluate whether the RHACM subscription cost is justified vs manual DR procedures.
Backup storage location: Same region (fast backup, shared failure domain) vs cross-region (slower backup, independent failure domain). Cross-region is recommended for true DR. Consider multi-cloud backup (e.g., on-prem cluster backing up to AWS S3) for infrastructure independence.

Reference Architectures¶

Standard production DR: OADP with daily scheduled backups (CSI snapshots for PVs, 30-day retention), etcd backup CronJob every 4 hours to S3, GitOps repo as source of truth for cluster configuration, quarterly restore drills, RPO 4 hours / RTO 4 hours.
Regulated environment (finance, healthcare): OADP with hourly backups, CSI snapshots with cross-region replication (ODF Regional-DR), etcd backup every hour, encrypted backup storage (SSE-KMS), documented restore runbook with compliance attestation, RPO 1 hour / RTO 2 hours.
Multi-cluster DR with RHACM: Hub cluster with ManagedClusterBackup, spoke clusters with OADP, ApplicationDRPlacement policies for automated failover, ODF Metro-DR for synchronous PV replication (RPO ~0), application placement rules for active-passive failover.
GitOps-first DR: Cluster infrastructure defined in Terraform/Ansible, all application manifests in Git (ArgoCD), OADP only for stateful PV data, restore procedure: provision new cluster -> bootstrap ArgoCD -> sync from Git -> restore PV snapshots from S3. RTO 2-4 hours depending on cluster size.
Edge / remote site DR: SNO (Single Node OpenShift) clusters managed by RHACM, OADP with Restic to central S3 (no CSI snapshots on some edge storage), application state replicated to hub via Kafka/MQTT, edge rebuild via ZTP (Zero Touch Provisioning) from hub cluster, RPO 24 hours / RTO 4 hours.

Reference Links¶

OADP (OpenShift API for Data Protection) documentation -- Velero-based application backup, restore, and migration on OpenShift
OpenShift backup and restore -- etcd backup, cluster state recovery, and disaster recovery procedures
OpenShift disaster recovery -- control plane recovery scenarios and procedures