OpenStack Data Protection and Disaster Recovery

Scope

Covers OpenStack data protection and disaster recovery: Cinder volume snapshots and backups, Swift replication, Masakari instance HA, Nova evacuate and live migration, Freezer backup (retired), Trove database backups, multi-site DR strategies, and boot-from-volume considerations.

Checklist

Why This Matters

OpenStack does not provide data protection by default -- it provides the building blocks (snapshots, backups, replication) that must be deliberately configured into a protection strategy. Cinder snapshots are commonly mistaken for backups, but they typically share the same storage backend and failure domain as the parent volume (a Ceph pool failure loses both volumes and snapshots). Masakari provides instance HA similar to VMware HA, but it preserves disk state only for boot-from-volume instances or instances on shared storage -- ephemeral instances on local disk are rebuilt from their original image and lose everything written to that disk. Nova evacuate rebuilds the instance on another host and must only be used when the source host is confirmed down; running it against a host that is still alive risks two copies of the instance writing to the same storage. Without an external backup tool, there is no built-in way to perform application-consistent, scheduled, retained backups of instance filesystems (Freezer is retired). Multi-site DR requires explicit configuration of every layer (Keystone, Glance, Cinder, Neutron) and is not an out-of-the-box capability.

Snapshot and Image Lifecycle: Dependent vs Independent Artifacts

OpenStack creates several kinds of protection artifact, and the single most useful axis for reasoning about their lifecycle is dependent vs independent. It determines deletion order, which failure mode an estate suffers, and how reclamation must behave. This is the OpenStack-native half of the lifecycle that the third-party-backup pattern (patterns/backup-lifecycle-synchronization.md) does not cover: the snapshots and images OpenStack itself produces.

Artifact	Class	Storage	On source delete
Cinder volume snapshot	Dependent	Same backend as parent volume	Blocks parent volume deletion until released
Array-managed snap copy (vendor integration)	Dependent	On the array, referenced by Cinder	Blocks until released through Cinder
Cinder backup	Independent	Separate target (Swift/NFS/S3/secondary Ceph)	Survives → orphan
Nova→Glance image snapshot	Independent	Glance image store	Survives → orphan

Dependent artifacts -- release first, they block deletion

A Cinder volume snapshot lives on the parent volume's storage and is dependent on it; the volume cannot be deleted while it has snapshots (Cannot delete volume ...: has dependent snapshots). The lifecycle rule is therefore release dependents before deleting the parent: enumerate and delete (or detach into independence) the snapshots first.

Snapshot trees make this non-trivial. A snapshot can be the parent of a clone (openstack volume create --snapshot <snap>), and on copy-on-write backends the clone depends on the snapshot's blocks. The dependency graph is backend-specific:

- Ceph RBD -- snapshots and clones are COW; a snapshot with children cannot be removed until the children are flattened (copied to independence) or deleted. --cascade orchestrates this but can be expensive (flatten copies data).
- LVM -- thin-pool dependencies; deleting in the wrong order can fail or leave dangling thin devices.
- Vendor drivers -- each implements snapshot/clone dependency differently; --cascade and delete-with-snapshots semantics vary, so verify against the deployed driver rather than assuming.
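Mapping the tree before tearing it down can be sketched as a post-order walk of the dependency graph. The following is a minimal illustration only: the `teardown_order` helper, the `(parent, child)` edge list, and the IDs are hypothetical, and it assumes the edges have already been collected from Cinder (volume to snapshot, snapshot to clone). It models clones as artifacts to be deleted; on a COW backend a clone could instead be flattened, which removes its edge from the graph.

```python
from collections import defaultdict


def teardown_order(edges, root):
    """Return artifact IDs so that every dependent comes before its parent.

    edges: iterable of (parent_id, child_id) pairs, for example
           volume -> snapshot or snapshot -> clone volume (hypothetical IDs).
    root:  ID of the volume whose tree is being torn down.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for child in children.get(node, ()):
            visit(child)
        order.append(node)  # post-order: emitted only after all dependents

    visit(root)
    return order


edges = [
    ("vol-A", "snap-1"),
    ("vol-A", "snap-2"),
    ("snap-1", "clone-B"),
    ("clone-B", "snap-3"),
]
print(teardown_order(edges, "vol-A"))
# ['snap-3', 'clone-B', 'snap-1', 'snap-2', 'vol-A']
```

The resulting list is a review artifact rather than something to execute blindly: an operator can see that deleting vol-A implies deleting clone-B and decide whether that clone should be flattened and kept instead.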

The practical guidance: never issue a naive --cascade against a tree you have not mapped, and never delete an array snapshot out-of-band on the array -- release it through Cinder so the Cinder catalog stays consistent (an out-of-band delete leaves a phantom snapshot record needing cinder reset-state to repair).

Independent artifacts -- they become orphans

A Nova instance snapshot uploads a full, independent image to Glance. It does not depend on the instance and is not freed when the instance is deleted -- it simply persists in the image store, accumulating cost. Orphan reclamation means identifying Glance images with no referencing instance and no role as a base image, with caveats: protected images cannot be deleted until unprotected; public/community/shared images may serve other projects; and base images backing running instances must not be touched. Key the reclamation on the image and instance UUID, never the display name (a reused name mis-correlates -- the same join-key discipline as the backup-lifecycle pattern).
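The orphan test and its caveats can be expressed as a filter over image and server records. This is a sketch under stated assumptions, not a reclamation tool: `orphan_image_candidates` is a hypothetical helper, and the input dicts are assumed to have been flattened from the Glance and Nova APIs beforehand, using the `image_type` and `instance_uuid` properties that Nova records on instance-snapshot images. It returns candidates for review and deletes nothing.

```python
def orphan_image_candidates(images, servers):
    """Return UUIDs of instance-snapshot images worth reviewing for deletion.

    images:  list of dicts with 'id', 'protected', 'visibility', plus the
             'image_type' and 'instance_uuid' properties set on snapshots.
    servers: list of dicts with 'id' and 'image_id' (None when the server
             was booted from a volume).
    """
    live_server_ids = {s["id"] for s in servers}
    images_in_use = {s["image_id"] for s in servers if s.get("image_id")}

    candidates = []
    for img in images:
        if img.get("image_type") != "snapshot":
            continue  # not an instance snapshot; treat as a base image
        if img["id"] in images_in_use:
            continue  # backs an existing instance
        if img.get("protected"):
            continue  # must be unprotected deliberately, not by automation
        if img.get("visibility") != "private":
            continue  # public/community/shared images may serve other projects
        if img.get("instance_uuid") in live_server_ids:
            continue  # source instance still exists, so not an orphan
        candidates.append(img["id"])
    return candidates
```

Every comparison is on UUIDs; display names never enter the join, which is the discipline the paragraph above calls for.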

A Cinder backup is likewise independent (its whole purpose is a separate-failure-domain copy) and likewise orphans when its source volume is deleted, since Cinder has no native retention engine (see the retention checklist item) -- the backup persists until something explicitly prunes it.
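A minimal sketch of the pruning pass that paragraph implies, assuming backup records have already been listed from Cinder as dicts carrying `id`, `volume_id`, and the `has_dependent_backups` flag that marks a backup with incrementals chained to it. The `classify_orphan_backups` helper and its two-list return shape are hypothetical.

```python
def classify_orphan_backups(backups, live_volume_ids):
    """Split orphaned backups into those deletable now and those blocked.

    backups:         list of dicts with 'id', 'volume_id', and optionally
                     'has_dependent_backups'.
    live_volume_ids: iterable of UUIDs of volumes that still exist.
    """
    live = set(live_volume_ids)
    prune_now, prune_later = [], []
    for backup in backups:
        if backup["volume_id"] in live:
            continue  # source volume still exists; not an orphan
        if backup.get("has_dependent_backups"):
            prune_later.append(backup["id"])  # incrementals must go first
        else:
            prune_now.append(backup["id"])
    return prune_now, prune_later
```

Whether an orphaned backup should actually be pruned is a retention and legal-hold decision; the function only identifies which backups have lost their source volume.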

Boot-disk strategy is the hinge

Which failure mode dominates is selected per instance by its boot disk:

- Boot-from-volume → a Cinder volume (often with a snapshot tree) → the dependent/blocking path dominates: snapshots must be released before the VM/volume can be deleted, and cascade order matters.
- Ephemeral boot disk → no Cinder volume to snapshot → protection is Nova→Glance images plus external backups → the independent/orphan path dominates: nothing blocks deletion, but images and backups linger.

A real estate runs both, so it carries both failure modes simultaneously, keyed off each VM's boot-disk type -- the deletion runbook and any reclamation automation must branch on boot-disk type rather than assuming one model.
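The branch can be sketched as follows. It relies on the Nova server representation reporting an empty `image` field for boot-from-volume instances; the function names and the wording of the runbook steps are hypothetical and only illustrate the shape of the decision.

```python
def boot_disk_type(server):
    """Classify a Nova server dict as 'volume' or 'ephemeral' boot.

    Assumes 'image' is empty (or missing) for boot-from-volume servers and
    a dict such as {'id': '<image uuid>'} for image-booted servers.
    """
    return "ephemeral" if server.get("image") else "volume"


def deletion_steps(server):
    """Return ordered, human-readable runbook steps for one server."""
    if boot_disk_type(server) == "volume":
        # Dependent/blocking path: release snapshots before the volume.
        return [
            "map the snapshot/clone tree of the boot volume",
            "release dependents (delete or flatten) leaf-first",
            "delete the server",
            "delete the boot volume",
        ]
    # Independent/orphan path: nothing blocks, so clean up afterwards.
    return [
        "delete the server",
        "review Glance instance-snapshot images keyed on the server UUID",
        "review backups of the server for pruning",
    ]
```

For example, `deletion_steps({"id": "srv-1", "image": ""})` yields the volume-path steps, while a server whose `image` is `{"id": "img-9"}` yields the orphan-cleanup steps.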

OpenStack-specific immutability

The immutability concepts in general/ransomware-resilience.md map onto OpenStack targets: Swift object versioning (X-Versions-Location / X-History-Location) and Ceph RGW S3 Object Lock provide WORM for Cinder backup containers; Barbican supplies managed keys for encrypted backup targets; and the backup target must sit in a separate failure domain from primary Cinder storage. For preservation (not just ransomware), an immutable backup target in legal-hold mode is the storage-enforced implementation of the hold gate described in general/legal-hold.md.

Common Decisions (ADR Triggers)

Snapshot vs backup -- Cinder snapshots (fast, same storage, dependent on parent) vs Cinder backups (slower, independent copy, separate storage) vs external backup tools (Veeam, Commvault) -- Freezer is retired; RPO requirements and failure domain isolation drive this
Dependent-snapshot deletion order -- release snapshots before parent volume; map snapshot trees and choose flatten-to-independence vs cascade-delete (backend-specific: Ceph RBD COW + flatten, LVM thin-pool, vendor drivers) -- naive --cascade risk vs explicit ordered teardown
Image-orphan reclamation policy -- automatically reclaim unreferenced Glance instance-snapshot images vs retain (protected/public/base-image caveats); UUID-keyed identification; whether reclamation is gated by approval and legal hold
Boot-disk strategy as lifecycle driver -- boot-from-volume (dependent/blocking path, snapshot release before delete) vs ephemeral (independent/orphan path, image+backup cleanup) -- HA needs and lifecycle-cleanup model both follow from this single choice
Instance HA strategy -- Masakari auto-evacuate (automated, requires shared storage) vs manual evacuate procedures (simpler, operator-driven) vs application-level HA (no platform dependency, workload manages own failover) -- SLA requirements and workload architecture
Backup scope -- volume-level only (Cinder backup) vs file-level (external agent inside instance) vs application-level (database dump + Trove backups) -- granularity of recovery requirements
Multi-site DR model -- active-passive (Cinder replication, DNS failover, longer RTO) vs active-active (stretched Ceph cluster, complex, low RTO/RPO) vs pilot light (minimal standby infrastructure, Heat re-creation, longest RTO) -- cost vs recovery time
Boot disk strategy -- boot-from-volume (survives host failure, Cinder manages lifecycle) vs ephemeral boot disk (faster provisioning, data lost on evacuate, simpler) -- instance HA requirements determine this
Backup retention -- time-based (keep 30 days) vs count-based (keep last 10) vs GFS rotation (daily/weekly/monthly) -- compliance and storage cost constraints
Backup orchestration -- external tools (Veeam, Commvault, Rubrik with OpenStack plugins) vs custom scripts with cron -- Freezer is retired; existing tooling and support requirements drive this
Database protection -- Trove managed backups (automated, integrated) vs application-managed backups (mysqldump/pg_dump in cron) vs Cinder snapshot of database volume (crash-consistent only) -- recovery granularity and consistency requirements
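Because Cinder has no native retention engine, whichever retention model is chosen has to be implemented by the orchestration layer. The three models named in the backup retention item can be compared with a small sketch. All function names are hypothetical, and backups are assumed to be `(backup_id, created_at)` tuples with `datetime` timestamps; each function returns the set of IDs to keep, leaving deletion to the caller.

```python
from datetime import timedelta


def keep_time_based(backups, now, days):
    """Keep every backup created within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return {b_id for b_id, ts in backups if ts >= cutoff}


def keep_count_based(backups, count):
    """Keep the `count` most recent backups."""
    newest = sorted(backups, key=lambda b: b[1], reverse=True)[:count]
    return {b_id for b_id, _ in newest}


def keep_gfs(backups, daily, weekly, monthly):
    """Keep the newest backup in each of the most recent day, ISO-week,
    and month buckets, up to the given number of buckets per tier."""
    newest_first = sorted(backups, key=lambda b: b[1], reverse=True)
    tiers = (
        (lambda ts: ts.date(), daily),
        (lambda ts: tuple(ts.isocalendar())[:2], weekly),
        (lambda ts: (ts.year, ts.month), monthly),
    )
    keep = set()
    for bucket_of, limit in tiers:
        seen = []
        for b_id, ts in newest_first:
            key = bucket_of(ts)
            if key in seen:
                continue  # a newer backup already represents this bucket
            if len(seen) == limit:
                break  # tier is full; everything after this is older
            seen.append(key)
            keep.add(b_id)
    return keep
```

With one backup per day, `keep_gfs(backups, daily=7, weekly=4, monthly=3)` retains far fewer copies than a 30-day time-based policy while still reaching back roughly three months, which is the storage-cost versus compliance trade-off the decision turns on.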

Version Notes

Feature	Pike (16) Aug 2017	Queens (17) Feb 2018	Rocky (18) Aug 2018	Stein (19) Apr 2019	Train (20) Oct 2019	Ussuri (21) May 2020	Victoria (22) Oct 2020	Wallaby (23) Apr 2021	Xena (24) Oct 2021	Yoga (25) Mar 2022	Zed (26) Oct 2022	2023.1 Antelope (27)	2023.2 Bobcat (28)	2024.1 Caracal (29)	2024.2 Dalmatian (30)	2025.1 Epoxy (31)	2025.2 Flamingo (32)
Masakari (instance HA)	Introduced (incubated)	GA (basic host monitoring)	GA (process monitoring)	GA (recovery workflows)	GA (improved evacuation)	GA	GA (improved host monitor)	GA	GA	GA	GA	GA	GA	GA (improved notifications)	GA	GA	GA
Cinder backup (Swift driver)	GA	GA	GA	GA (incremental improvements)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Cinder backup (Ceph driver)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Cinder backup (S3 driver)	Not available	Not available	Not available	Not available	Not available	Not available	Not available	Introduced	GA	GA	GA	GA	GA	GA	GA	GA	GA
Cinder backup (GCS driver)	Not available	Not available	Not available	Not available	Not available	Introduced	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Cinder backup chunked improvements	Basic	Basic	Improved	Improved	Improved	Improved	Improved	Improved	Improved	Improved	Improved	GA (improved chunked)	GA	GA	GA	GA	GA
Cinder volume revert to snapshot	Not available	Introduced	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Freezer (backup/DR)	GA (incubated)	GA	GA	Maintenance mode	Maintenance mode	Maintenance mode	Maintenance mode	Maintenance mode	Retired discussion	Retired discussion	Effectively retired	Effectively retired	Effectively retired	Effectively retired	Effectively retired	Retired	Retired
Nova evacuate	GA	GA	GA	GA (improved error handling)	GA (force parameter removed, microversion 2.68)	GA (improved reporting)	GA	GA	GA	GA	GA	GA (improved rebuild)	GA	GA	GA	GA	GA
Nova live-migrate (TLS)	Not available	Not available	Not available	Introduced (QEMU native TLS)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Trove automated backups	GA	GA	GA	GA	GA	GA (redesigned Trove)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Cinder replication v2.1	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Boot-from-volume (evacuate support)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Swift geo-replication	GA (container sync)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA

Key changes across releases:

- Masakari evolution (Pike+): Masakari was incubated in Pike and became an official project in Queens. It provides instance HA by monitoring compute host failures (hostmonitor), instance failures (instancemonitor), and process failures (processmonitor). Recovery workflows improved in Stein with better orchestration of evacuate operations. Masakari preserves disk state only for boot-from-volume instances or instances on shared storage -- ephemeral instances on local disk are rebuilt from their image on evacuate and lose locally written data.
- Cinder backup driver improvements: The S3 backup driver was added in Wallaby, enabling backup to any S3-compatible object store. The Google Cloud Storage driver predates Pike. Chunked backup performance improvements in 2023.1 reduced backup time for large volumes. The backup service should always target storage in a different failure domain from primary Cinder backends.
- Freezer retirement: Freezer was an incubated backup and DR project that entered maintenance mode around Stein. Community activity declined significantly, and the project is effectively retired as of Zed/2023.1 and fully retired in Epoxy (2025.1). Organizations needing file-level backup should use external tools (Veeam, Commvault, Rubrik with OpenStack plugins) or custom solutions using Cinder backup APIs.
- Nova evacuate improvements: Evacuate error handling improved in Stein, reporting improved in Ussuri, and rebuild behavior improved in 2023.1. The force parameter was removed from the evacuate API in microversion 2.68 (Train), so a requested target host is always validated by the scheduler. Evacuate must only be used when the source host is confirmed down -- using it on a healthy host risks data corruption. Boot-from-volume instances survive evacuate because the volume is on shared storage.
- Nova live-migrate with TLS (Stein+): QEMU native TLS for live migration was introduced in Stein, encrypting the migration data stream. This removes the need for libvirt-tunnelled migration (live_migration_tunnelled) and provides better performance for encrypted live migration.
- Trove redesign (Ussuri): Trove was significantly redesigned in Ussuri with a simplified architecture, improved guest agent, and better integration with modern OpenStack services. Automated backups to Swift with configurable retention continue to be the primary database protection mechanism.
- Cinder replication v2.1: Volume replication has been stable across all releases from Pike onward. It enables asynchronous replication between Cinder backends for DR scenarios. Pairing Masakari for compute HA with Cinder replication for storage DR gives the basis for an active-passive DR strategy.
- Epoxy (2025.1) data protection changes: Freezer fully retired. Continued Cinder backup performance improvements. Masakari stability improvements.
- Flamingo (2025.2) data protection changes: Continued improvements to Cinder replication and backup reliability. No major new data protection features.

Reference Links

OpenStack backup and recovery guide -- control plane and data backup strategies for OpenStack
Cinder backup service -- volume backup to Swift, NFS, or Ceph and restore procedures