Backup Lifecycle Synchronization

Scope

This file covers the cross-system process that synchronizes a source system's resource lifecycle with a backup tool's data lifecycle -- the integration glue that ensures backups are reclaimed (or deliberately retained) when the resource they protect is deleted. It is provider-agnostic: the "source system" is any platform that creates and destroys protected resources (an OpenStack project deleting a VM, a vCenter deleting a guest, a cloud account terminating an instance, a database platform dropping an instance), and the "backup tool" is any data-protection product (Commvault, Cohesity, Rubrik, Veeam, Zerto, cloud-native snapshot services). The pattern resolves a three-way tension between cost governance, the GDPR Article 17 right-to-erasure, and legal hold.

This file is the end-to-end pattern. It does not duplicate the failure mode (orphaned snapshots/backups accumulating -- see failures/data.md), the regulatory rule it must satisfy (compliance/gdpr.md), the inventory-side disposition of orphans (general/inventory-analysis.md), the CMDB reconciliation mechanics (providers/servicenow/cmdb.md), or the per-vendor delete/retention mechanics that implement it (providers/commvault/backup.md, providers/cohesity/backup.md, providers/rubrik/backup.md, providers/veeam/backup.md). It is the layer that ties those together.

Overview

A source system and a backup tool each maintain their own lifecycle for the same logical object. The source system knows when a VM is created, renamed, and deleted. The backup tool knows when that VM was last protected, how many recovery points exist, and when they age out. Nothing inherently connects the two: deleting the VM does not delete its backups, and deleting the backups does not stop the source system from re-presenting the VM. The space between these two lifecycles is where orphaned backups accumulate -- driving storage cost, retention-policy drift, and right-to-erasure exposure -- and, in the opposite failure, where a backup is deleted while a legal hold still requires it.

The pattern has three moving parts: a discovery/diff mechanism that detects when a protected resource no longer exists at the source, a set of governance gates that decide whether and when reclamation is allowed, and an action path that executes the soft or hard reclamation against the backup tool. The hard design choices are how the diff is detected (event-driven vs reconciliation vs hybrid), which stable join key correlates the two systems, and which gates must pass before a backup is destroyed.
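The discovery/diff step can be sketched as a set difference keyed on the source identifier. This is a minimal illustration, not any vendor's API: `BackupObject`, `find_orphans`, and the sample identifiers are hypothetical names introduced here, and the sketch assumes the backup tool already records the source identifier for each protected object.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BackupObject:
    backup_id: str    # backup tool's own identifier (hypothetical field)
    source_uuid: str  # immutable source identifier recorded at protection time


def find_orphans(source_uuids: set[str], backups: list[BackupObject]) -> list[BackupObject]:
    """Return backup objects whose source resource is absent from the source listing."""
    return [b for b in backups if b.source_uuid not in source_uuids]


# Illustrative data: one protected resource still exists, one does not.
source_uuids = {"uuid-a", "uuid-b"}
backups = [
    BackupObject("bk-1", "uuid-a"),
    BackupObject("bk-2", "uuid-c"),  # source resource no longer listed
]
orphans = find_orphans(source_uuids, backups)
```

The result is only a list of candidates. Whether any of them may be reclaimed is decided by the governance gates, not by the diff.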

Checklist

Why This Matters

The space between a source system's lifecycle and a backup tool's lifecycle is invisible until it is expensive. No single component is broken: the source platform correctly deletes the VM, the backup tool correctly retains the data it was told to protect, and neither was ever told that the deletion of the former should affect the latter. The result is a steady, quiet accumulation -- a few orphaned recovery points per week, never enough to notice in any one month, cumulatively enough that after a year a meaningful fraction of backup storage protects resources that no longer exist. This is the orphaned-backup failure mode documented in failures/data.md, seen from the process side rather than the symptom side.

The naive fix -- "delete the backups when the source emits a delete event" -- is worse than the disease, for two reasons. First, events are lossy: any deletion event dropped during a message-bus outage, a consumer crash, or a maintenance window leaves an orphan that the event-only design will never reclaim, because there is no second mechanism to re-check. Second, events are premature: a resource that briefly disappears from the source API during a control-plane outage, a live migration, or a rebuild looks identical to a deletion at the instant the event fires, and acting immediately destroys backups for a resource that was only transiently gone. This is why the durable design is the hybrid: events provide low-latency flagging of candidates, and a reconciliation loop -- which lists both sides and computes the actual set difference -- is the authority that enforces, with a grace period in between. The reconciliation loop is also self-healing: an orphan missed in one cycle is caught in the next, because the loop re-derives ground truth every time rather than depending on a one-shot event.
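A minimal sketch of the hybrid, under stated assumptions: `HybridTracker` and its method names are hypothetical, time is passed in as plain seconds for testability, and persistence, locking, and the actual API listings are left out. Events only flag a candidate; the reconciliation pass recomputes the set difference, clears flags for resources that reappeared, picks up deletions whose event was dropped, and returns only candidates older than the grace period.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HybridTracker:
    grace_seconds: float
    # source UUID -> time the resource was first observed absent
    first_absent: dict[str, float] = field(default_factory=dict)

    def on_delete_event(self, source_uuid: str, now: float) -> None:
        """Event path: flag a candidate only; never reclaim from here."""
        self.first_absent.setdefault(source_uuid, now)

    def reconcile(self, source_uuids: set[str], protected_uuids: set[str], now: float) -> set[str]:
        """Loop path: re-derive the set difference; return UUIDs past the grace period."""
        absent = protected_uuids - source_uuids
        # A resource that reappeared was only transiently gone: drop its flag.
        for uuid in list(self.first_absent):
            if uuid not in absent:
                del self.first_absent[uuid]
        # Self-healing: flag orphans whose delete event was never received.
        for uuid in absent:
            self.first_absent.setdefault(uuid, now)
        return {u for u, t in self.first_absent.items() if now - t >= self.grace_seconds}


tracker = HybridTracker(grace_seconds=3600)
protected = {"vm-1", "vm-2", "vm-3", "vm-4"}

tracker.on_delete_event("vm-1", now=0)  # vm-1 deleted, event received
# t=100: vm-2 is transiently missing; vm-3 was deleted but its event was dropped.
due_early = tracker.reconcile({"vm-4"}, protected, now=100)
# t=3800: vm-2 is back; vm-1 and vm-3 have been absent longer than the grace period.
due_late = tracker.reconcile({"vm-2", "vm-4"}, protected, now=3800)
```

In this run nothing is due at the first pass, and only `vm-1` and `vm-3` are due at the second; `vm-2` is never reclaimed because it reappeared before the grace period elapsed.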

The join key is where this pattern produces data-loss bugs instead of cost savings. Correlating source resources to backup objects by display name or IP address feels natural because that is what humans use, but both are mutable and reusable. A VM renamed after its backup was configured no longer matches; worse, a name reused for a brand-new resource matches the old resource's deletion record, and a name-based reconciliation loop will happily reclaim the new resource's backups in response to the old resource's deletion. The join key must be the stable, immutable identifier each side assigns (source UUID / instance-id / managed-object-reference), carried as the correlation token in both the backup tool's metadata and the audit trail. This is the same join-key discipline that the CMDB's Identification and Reconciliation Engine enforces for exactly the same reason (providers/servicenow/cmdb.md).
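The name-reuse collision can be shown in a few lines. This is an illustrative sketch with hypothetical names (`Backup`, `match_by_name`, `match_by_uuid`) and made-up identifiers: two backup objects share the display name `web-01` because a new resource reused it after the old one was deleted.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    backup_id: str
    name: str         # display name at protection time (mutable, reusable)
    source_uuid: str  # immutable identifier assigned by the source system


def match_by_name(deleted_name: str, backups: list[Backup]) -> list[str]:
    """Unsafe join: matches every backup that ever carried this name."""
    return [b.backup_id for b in backups if b.name == deleted_name]


def match_by_uuid(deleted_uuid: str, backups: list[Backup]) -> list[str]:
    """Safe join: matches only the backups of the resource that was deleted."""
    return [b.backup_id for b in backups if b.source_uuid == deleted_uuid]


backups = [
    Backup("bk-old", "web-01", "uuid-old"),  # backups of the deleted resource
    Backup("bk-new", "web-01", "uuid-new"),  # live resource that reused the name
]

by_name = match_by_name("web-01", backups)    # also selects the live resource's backups
by_uuid = match_by_uuid("uuid-old", backups)  # selects only the deleted resource's backups
```

The name-based join returns both objects, so a reclamation driven by it would destroy the live resource's backups; the identifier-based join returns only `bk-old`.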

The pattern implicitly resolves a three-way tension, and the governance gates are where that resolution lives. Cost governance pulls toward deleting orphaned backups as fast as possible. The GDPR Article 17 right-to-erasure pulls toward propagating deletions all the way into backups on a deadline, not leaving personal data in recovery points indefinitely (compliance/gdpr.md). Legal hold and regulatory retention pull the opposite way -- some data must not be deleted even though the source resource is gone and both cost and erasure logic want it removed. No single default satisfies all three. The gates encode the priority order: legal-hold and compliance-lock win over both cost and erasure (Article 17(3) explicitly exempts data retained for legal claims); a grace period protects against accidental and transient deletion; do-not-delete tags and risk-based approval keep a human in the loop for the consequential cases; and the soft-vs-hard action choice lets cost reclamation (soft, age-out) and erasure satisfaction (hard, explicit delete) be selected per object rather than globally. A design that hard-codes "always delete" optimizes cost and erasure at the price of a legal-hold violation; one that hard-codes "never delete" is safe but never satisfies an erasure deadline and never reclaims the storage. The gates exist so the choice is made per object, on evidence, and is auditable.
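The priority order described above can be sketched as an ordered gate evaluation. All names here (`Decision`, `Candidate`, `evaluate_gates`) are hypothetical, and treating a do-not-delete tag as "route to review unless approved" is an assumption of this sketch, not a rule stated by any backup product. The choice between soft and hard reclamation is deliberately left out; it applies only after the result is `RECLAIM`.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RETAIN = "retain"    # a hold forbids deletion
    WAIT = "wait"        # still inside the grace period
    REVIEW = "review"    # a human must decide
    RECLAIM = "reclaim"  # all gates passed


@dataclass(frozen=True)
class Candidate:
    legal_hold: bool       # legal hold or compliance lock is active
    seconds_absent: float  # time since the resource was first seen missing
    do_not_delete: bool    # explicit do-not-delete tag
    high_risk: bool        # e.g. production or regulated data
    approved: bool         # a human has approved reclamation


def evaluate_gates(c: Candidate, grace_seconds: float) -> Decision:
    # Gate 1: holds win over both cost and erasure pressure.
    if c.legal_hold:
        return Decision.RETAIN
    # Gate 2: the grace period guards against accidental and transient deletion.
    if c.seconds_absent < grace_seconds:
        return Decision.WAIT
    # Gate 3: tagged or high-risk objects need a human in the loop.
    if (c.do_not_delete or c.high_risk) and not c.approved:
        return Decision.REVIEW
    return Decision.RECLAIM


GRACE = 7 * 24 * 3600  # illustrative value only

held = Candidate(legal_hold=True, seconds_absent=10 * GRACE,
                 do_not_delete=False, high_risk=False, approved=True)
fresh = Candidate(legal_hold=False, seconds_absent=60,
                  do_not_delete=False, high_risk=False, approved=False)
risky = Candidate(legal_hold=False, seconds_absent=2 * GRACE,
                  do_not_delete=False, high_risk=True, approved=False)
routine = Candidate(legal_hold=False, seconds_absent=2 * GRACE,
                    do_not_delete=False, high_risk=False, approved=False)
```

Because each gate returns as soon as it applies, the order of the `if` statements is the priority order, and the returned `Decision` can be written to the audit trail alongside the inputs that produced it.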

Common Decisions (ADR Triggers)

ADR: Diff Mechanism -- Event-Driven vs Reconciliation vs Hybrid

Context: The synchronization must detect when a protected resource no longer exists at the source. The detection mechanism determines latency, completeness, and resilience to outages.

Options:

Criterion	Event-driven only	Reconciliation-loop only	Hybrid (events flag, loop enforces)
Latency to detect	Low (near-real-time)	High (poll interval)	Low to flag, poll to enforce
Resilience to dropped events	None -- missed event = permanent orphan	Full -- re-derives truth each cycle	Full -- loop backstops the events
Load on source/backup APIs	Low	Higher (periodic full listing)	Moderate
Handles transient absence	Poorly (acts immediately)	Well (grace period built in)	Well
Recommended for	Augmentation only	Small/static estates	Default for production

Decision factors: Whether the source system emits reliable deletion events; the acceptable orphan-detection latency; the cost of a missed deletion (cost-only vs erasure-deadline); estate size and API rate limits. Default to hybrid; use reconciliation-only when no event bus exists; never use event-only as the sole mechanism where erasure or cost-governance guarantees matter.

ADR: Soft vs Hard Reclamation Action

Context: Once a resource is confirmed deleted at the source and all gates pass, the backup data can be reclaimed by deconfiguring protection and letting recovery points age out (soft) or by explicitly deleting the backup data now (hard).

Decision factors: Whether a right-to-erasure deadline applies (forces hard within the deadline); the recoverability requirement (soft preserves the ability to restore until age-out; hard is irreversible); the urgency of cost reclamation (hard frees storage immediately, soft frees it gradually); data classification and approval risk. A common policy is soft-by-default with hard reserved for erasure requests and explicitly tagged data, selected per object rather than globally.
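The soft-by-default policy can be sketched as a per-object selection function. `choose_action` and its parameters are hypothetical names; the sketch assumes an erasure request is represented by its deadline and that explicitly tagged data is represented by a boolean flag.

```python
from __future__ import annotations

from datetime import datetime, timezone


def choose_action(erasure_deadline: datetime | None,
                  tagged_hard_delete: bool) -> tuple[str, datetime | None]:
    """Return (action, must_complete_by) for one object that has passed all gates."""
    if erasure_deadline is not None:
        # An erasure request forces explicit deletion within its deadline.
        return ("hard", erasure_deadline)
    if tagged_hard_delete:
        # Explicitly tagged data is deleted now; no external deadline applies.
        return ("hard", None)
    # Default: deconfigure protection and let recovery points age out.
    return ("soft", None)


deadline = datetime(2030, 1, 31, tzinfo=timezone.utc)  # illustrative date only
erasure_case = choose_action(deadline, tagged_hard_delete=False)
tagged_case = choose_action(None, tagged_hard_delete=True)
default_case = choose_action(None, tagged_hard_delete=False)
```

Returning the deadline with the action lets a scheduler order hard deletions by urgency, which a global soft/hard switch cannot express.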

ADR: Grace Period and Transient-State Settling

Context: A resource absent from the source API may be deleted or only transiently gone (migrating, rebuilding, control-plane outage). Reclaiming too early destroys backups for a resource that still exists; reclaiming too late leaves orphans that cost money and hold personal data past an erasure deadline.

Decision factors: The platform's transient-absence behavior (how long a healthy resource can be missing from the API); the accidental-deletion recovery window the business requires; any erasure-deadline ceiling that bounds how long the grace period can be. The grace period must be longer than the worst-case transient absence and shorter than any erasure deadline; if those two constraints conflict, the conflict itself is an ADR.
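The two bounds on the grace period can be checked mechanically. This is a sketch with a hypothetical function name (`validate_grace_period`) and illustrative durations; the real values come from the platform's observed behavior and the applicable erasure deadline.

```python
from __future__ import annotations

from datetime import timedelta


def validate_grace_period(grace: timedelta,
                          worst_transient_absence: timedelta,
                          erasure_deadline: timedelta | None = None) -> list[str]:
    """Return a list of violated constraints; an empty list means the value is acceptable."""
    problems = []
    if grace <= worst_transient_absence:
        problems.append("grace period does not exceed worst-case transient absence")
    if erasure_deadline is not None and grace >= erasure_deadline:
        problems.append("grace period is not shorter than the erasure deadline")
    if erasure_deadline is not None and worst_transient_absence >= erasure_deadline:
        # No grace period can satisfy both bounds: this conflict needs its own ADR.
        problems.append("constraints conflict: no valid grace period exists")
    return problems


# Illustrative values only.
ok = validate_grace_period(timedelta(days=7), timedelta(hours=4), timedelta(days=30))
too_short = validate_grace_period(timedelta(hours=1), timedelta(hours=4), timedelta(days=30))
too_long = validate_grace_period(timedelta(days=45), timedelta(hours=4), timedelta(days=30))
```

Running this check when the configuration is loaded, rather than at reclamation time, surfaces a conflicting pair of constraints before any backup is touched.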

ADR: Reclamation Authorization -- Automated vs Risk-Based Approval

Context: Hard deletion of backup data is irreversible. Requiring human approval for every object does not scale; approving none of them is reckless for production and regulated data.

Decision factors: Object risk tier (dev/test vs production vs regulated); size/value thresholds; recency of last activity; data classification tags. Typical resolution: automatic reclamation for low-risk objects after the grace period, mandatory approval for high-risk objects, and fail-safe-to-review on any gate ambiguity.
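The typical resolution can be sketched as a routing function. The tier names, classification labels, and the 500 GB threshold below are assumptions chosen for illustration, and `route` is a hypothetical name; the point is the shape, in which any missing or unrecognized input fails safe to review.

```python
from __future__ import annotations

from enum import Enum


class Route(Enum):
    AUTO = "auto"          # reclaim automatically after the grace period
    APPROVAL = "approval"  # mandatory human approval
    REVIEW = "review"      # ambiguous input: fail safe to manual review


# Hypothetical vocabularies; a real estate defines its own.
KNOWN_TIERS = {"dev", "test", "production", "regulated"}
HIGH_RISK_TIERS = {"production", "regulated"}
SENSITIVE_CLASSES = {"confidential", "personal-data"}
KNOWN_CLASSES = {"public", "internal"} | SENSITIVE_CLASSES


def route(tier: str | None, size_gb: float | None, classification: str | None,
          auto_max_gb: float = 500.0) -> Route:
    # Fail safe: a missing or unrecognized attribute is never treated as low risk.
    if tier not in KNOWN_TIERS or classification not in KNOWN_CLASSES or size_gb is None:
        return Route.REVIEW
    if tier in HIGH_RISK_TIERS or classification in SENSITIVE_CLASSES or size_gb > auto_max_gb:
        return Route.APPROVAL
    return Route.AUTO
```

For example, `route("dev", 20, "internal")` yields `Route.AUTO`, while the same object with an unknown tier such as `"staging"` yields `Route.REVIEW` rather than being silently auto-reclaimed.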

ADR: Join-Key Strategy

Context: Correlating source resources to backup-tool protected objects requires a key present and stable on both sides.

Decision factors: Availability of a stable immutable identifier on both sides (source UUID/instance-id/MoRef vs backup-tool client/subclient/object id); whether that identifier is carried into the backup tool's metadata at protection time; the consequences of a mismatch (name-reuse collision causing wrong-object deletion). Always use the immutable identifier; never the display name or IP address. If the backup tool only records names, the remediation (recording the UUID as a property/tag at protection time) is itself an ADR and a prerequisite for safe automation.
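Before automation is enabled, the backup inventory can be audited for objects that lack the identifier. This sketch assumes each backup object is exported as a dictionary with a `backup_id` and an optional `source_uuid` property; both field names and `partition_by_join_key` are hypothetical.

```python
from __future__ import annotations


def partition_by_join_key(objects: list[dict]) -> tuple[list[str], list[str]]:
    """Split backup objects into those safe to automate and those needing remediation."""
    automatable: list[str] = []
    needs_remediation: list[str] = []
    for obj in objects:
        key = (obj.get("source_uuid") or "").strip()
        if key:
            automatable.append(obj["backup_id"])
        else:
            # Only a name is recorded: exclude from automated reclamation.
            needs_remediation.append(obj["backup_id"])
    return automatable, needs_remediation


inventory = [
    {"backup_id": "bk-1", "name": "web-01", "source_uuid": "uuid-a"},
    {"backup_id": "bk-2", "name": "db-01"},                       # property never set
    {"backup_id": "bk-3", "name": "app-01", "source_uuid": "  "},  # blank property
]
automatable, needs_remediation = partition_by_join_key(inventory)
```

Objects in the second list stay outside the reconciliation loop until the identifier has been recorded, so a missing join key results in manual handling rather than a name-based guess.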

Reference Links

GDPR Article 17 -- Right to erasure -- the erasure obligation and its 17(3) exemptions (legal claims, regulatory retention) that the legal-hold gate encodes
OpenStack Nova notifications -- example source-system lifecycle event bus (instance.delete.end) that drives the event-flagging side of the hybrid
AWS Backup -- delete and lifecycle -- cloud-native example of retention-age (soft) vs explicit recovery-point deletion (hard)
ITIL Service Asset and Configuration Management -- decommissioning and CMDB reconciliation framing for the source-of-truth gate