Data Management and Database Strategy

Scope

This file covers which data management decisions must be made during architecture design: database engine selection, replication, backup, encryption, compliance, schema management, and performance tuning. For provider-specific implementation details (managed service configuration, pricing tiers, region availability), see the provider data files listed in See Also.

Checklist

Why This Matters

Data is the most valuable and least replaceable component of any system. While compute and networking can be rebuilt in hours, lost or corrupted data may be unrecoverable. A database engine mismatch — such as forcing a graph traversal workload into a relational schema or cramming time-series telemetry into a document store — creates performance problems that no amount of hardware can solve, eventually requiring a costly re-architecture.

Replication and failover strategy directly determine how much data the business can lose (the recovery point objective, RPO) and how long the system stays down (the recovery time objective, RTO) during an incident. Synchronous replication avoids losing acknowledged writes when failing over to an in-sync replica, at the cost of added write latency; asynchronous replication keeps writes fast, but the replica always trails the primary slightly, so a failover can lose the most recent writes. Organizations that defer this decision until a failure occurs discover their actual RPO and RTO the hard way, and the answer is rarely acceptable to the business.

Encryption and compliance are not optional add-ons. Regulatory requirements like GDPR, PCI-DSS, and HIPAA impose specific controls on how data is stored, accessed, transmitted, and deleted. Retrofitting encryption or audit logging onto an existing database is significantly more disruptive than designing it in from the start. Key management decisions — especially who holds the keys and how rotation works — have long-term operational implications that are difficult to change later.

Common Decisions (ADR Triggers)

ADR: Database Engine Selection

Context: The application requires persistent data storage, and the team must choose an engine that matches the data model, query patterns, and operational requirements.

Options:

Criterion	Relational (PostgreSQL, MySQL)	Document (MongoDB, Cosmos DB)	Key-Value (Redis, DynamoDB)	Time-Series (TimescaleDB, InfluxDB)	Graph (Neo4j, Neptune)
Data Model	Structured, normalized tables	Flexible JSON/BSON documents	Simple key-to-value pairs	Timestamped metric data	Nodes and edges
Query Strength	Complex joins, aggregations, transactions	Nested document queries, flexible schema	Sub-millisecond lookups by key	Time-range queries, downsampling	Relationship traversal, path finding
ACID Support	Full	Document-level (multi-doc varies)	Limited (varies by engine)	Varies	Full (Neo4j), varies (others)
Scaling Model	Vertical + read replicas	Horizontal sharding native	Horizontal sharding native	Time-based partitioning	Vertical primarily
Best Fit	Business data, transactions, reporting	Content management, catalogs, user profiles	Session state, caching, feature flags	Monitoring, IoT, financial tick data	Social networks, fraud detection, knowledge graphs

Decision drivers: Data structure predictability, transaction requirements, query complexity, scale trajectory, and team expertise with the engine.
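The comparison table above can be read as a first-pass shortlisting rule. The following is a minimal sketch of that reading, not a selection algorithm: the `WorkloadProfile` fields and the `shortlist_engines` function are hypothetical names introduced for illustration, and the rules only restate the "Query Strength" and "Best Fit" rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical summary of a workload's dominant access patterns."""
    relationship_traversal: bool = False   # path finding, multi-hop queries
    timestamped_metrics: bool = False      # time-range queries, downsampling
    key_lookups_only: bool = False         # access is always by a single key
    flexible_schema: bool = False          # document shape varies per record
    multi_row_transactions: bool = False   # joins and multi-row ACID needed


def shortlist_engines(profile: WorkloadProfile) -> List[str]:
    """Return engine families worth evaluating, per the comparison table."""
    shortlist: List[str] = []
    if profile.relationship_traversal:
        shortlist.append("graph")
    if profile.timestamped_metrics:
        shortlist.append("time-series")
    if profile.key_lookups_only:
        shortlist.append("key-value")
    # Document stores fit flexible schemas, but multi-document ACID varies.
    if profile.flexible_schema and not profile.multi_row_transactions:
        shortlist.append("document")
    # Relational is the fallback and the default for transactional data.
    if profile.multi_row_transactions or not shortlist:
        shortlist.append("relational")
    return shortlist
```

A shortlist with more than one entry (for example, graph plus relational) suggests the workload may need more than one engine, or that the remaining decision drivers should break the tie.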

ADR: Replication and Failover Model

Context: The database must survive infrastructure failures while meeting the application's consistency and availability requirements.

Options:

- Single-region, multi-AZ synchronous replication: Zero RPO within a region, automatic failover in 1-2 minutes. Standard for production workloads. Does not protect against regional outages.
- Cross-region asynchronous replication: RPO of seconds to minutes depending on lag. Protects against regional disasters. Requires application-level handling of stale reads from replicas and promotion procedures for failover.
- Multi-region active-active (e.g., CockroachDB, Cosmos DB, Spanner): Writes accepted in any region with distributed consensus. Lowest latency for global users. Highest complexity and cost; requires conflict resolution strategy and careful partition design.
- Manual failover with cold standby: Lowest cost, highest RTO (hours). Acceptable for non-critical systems where extended downtime is tolerable.

Decision drivers: RPO/RTO requirements, geographic distribution of users, consistency model tolerance (strong vs. eventual), operational maturity, and budget.
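For the asynchronous option, two questions come up early: does observed replication lag fit within the RPO target, and when is a replica fresh enough to serve a read? The sketch below illustrates both under stated assumptions. The function names, the nearest-rank percentile, and the idea of feeding in lag samples in seconds are all illustrative choices; how lag is actually measured depends on the engine and provider.

```python
import math
from typing import Sequence


def lag_percentile(lag_samples_s: Sequence[float], percentile: float = 99.0) -> float:
    """Nearest-rank percentile of observed replication lag, in seconds."""
    if not lag_samples_s:
        raise ValueError("need at least one lag sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(lag_samples_s)
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[rank - 1]


def meets_rpo(lag_samples_s: Sequence[float], rpo_target_s: float,
              percentile: float = 99.0) -> bool:
    """Treat lag as the data-loss window: writes not yet on the replica
    are lost if the primary fails and the replica is promoted."""
    return lag_percentile(lag_samples_s, percentile) <= rpo_target_s


def choose_read_endpoint(replica_lag_s: float, max_staleness_s: float,
                         needs_own_writes: bool = False) -> str:
    """Route a read to the replica only when its staleness is tolerable."""
    if needs_own_writes or replica_lag_s > max_staleness_s:
        return "primary"
    return "replica"
```

With samples `[0.2, 0.5, 1.1, 0.3, 4.0]`, a 5-second RPO target is met at the 99th percentile and a 2-second target is not. Judging by a high percentile rather than the mean is a deliberate choice here, since failures tend to coincide with the load spikes that also inflate lag.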

ADR: Encryption Key Management

Context: Data at rest must be encrypted, and the organization must decide who manages the encryption keys.

Options:

- Provider-managed keys (default): Cloud provider generates, stores, and rotates keys automatically. Zero operational burden. No customer control over key lifecycle; provider has theoretical access.
- Customer-managed keys (CMK) in cloud KMS: Customer controls key creation, rotation schedule, and revocation via AWS KMS, Azure Key Vault, or GCP Cloud KMS. Audit trail for key usage. Requires IAM policy management; accidental key deletion causes permanent data loss.
- External HSM (CloudHSM, on-prem HSM): Keys never leave FIPS 140-2 Level 3 validated hardware. Required by some financial and government regulations. Highest cost and operational complexity; HSM cluster must be highly available.

Recommendation: Use customer-managed keys in cloud KMS for most production workloads. Reserve HSM for workloads with explicit regulatory mandates. Provider-managed keys are acceptable for non-sensitive or development environments.
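The recommendation above can be encoded as a small policy check, for example in a review script that flags databases whose key management falls below the expected tier. This is a sketch under assumptions: `KeyTier`, `minimum_key_tier`, and `meets_minimum` are hypothetical names, and ordering the tiers by degree of customer control is an interpretation of the options list, not something a provider API defines.

```python
from enum import IntEnum


class KeyTier(IntEnum):
    """Key management options, ordered by degree of customer control."""
    PROVIDER_MANAGED = 1
    CUSTOMER_MANAGED_KMS = 2
    EXTERNAL_HSM = 3


def minimum_key_tier(*, production: bool, sensitive: bool,
                     hsm_mandated: bool) -> KeyTier:
    """Lowest acceptable tier according to the recommendation."""
    if hsm_mandated:
        # Explicit regulatory mandate: keys must stay in validated hardware.
        return KeyTier.EXTERNAL_HSM
    if production and sensitive:
        return KeyTier.CUSTOMER_MANAGED_KMS
    # Non-sensitive or development environments.
    return KeyTier.PROVIDER_MANAGED


def meets_minimum(actual: KeyTier, *, production: bool, sensitive: bool,
                  hsm_mandated: bool) -> bool:
    """True when the configured tier is at or above the required one."""
    required = minimum_key_tier(production=production, sensitive=sensitive,
                                hsm_mandated=hsm_mandated)
    return actual >= required
```

A check like this only verifies the tier. It says nothing about rotation schedules or deletion protection, which still need their own review.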

ADR: Schema Migration Strategy

Context: The database schema will evolve over the application lifecycle, and changes must be applied without data loss or extended downtime.

Options:

- Sequential versioned migrations (Flyway, Alembic, Liquibase): Each change is a numbered, version-controlled script. Applied in order. Supports rollback scripts. Standard approach for most teams.
- Blue-green database deployment: Run old and new schema versions in parallel during transition. Zero-downtime for schema changes. Requires double storage temporarily and backward-compatible application code.
- Expand-contract pattern: Add new columns/tables first (expand), migrate data, update application, then remove old structures (contract). Safe for zero-downtime deployments. Requires multiple deployment cycles to complete a single change.

Decision drivers: Downtime tolerance, deployment frequency, team size, and whether the application supports running against multiple schema versions simultaneously.
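To make the sequential versioned approach concrete, here is a minimal sketch of what tools such as Flyway or Alembic do conceptually. It is not their actual implementation. It uses Python's built-in `sqlite3` module; the `schema_version` table, the `users` table, and the migration contents are hypothetical. Steps 2 and 3 also show the expand and backfill phases of expand-contract; the contract step (dropping `full_name`) would ship in a later release, once no application code reads the old column.

```python
import sqlite3
from typing import List

# Numbered migrations, applied strictly in order (illustrative contents).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN display_name TEXT"),               # expand
    (3, "UPDATE users SET display_name = full_name "
        "WHERE display_name IS NULL"),                                   # backfill
]


def current_version(conn: sqlite3.Connection) -> int:
    """Highest migration number recorded, or 0 for a fresh database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection) -> List[int]:
    """Apply pending migrations in order; return the versions applied."""
    applied: List[int] = []
    version = current_version(conn)
    for number, sql in sorted(MIGRATIONS):
        if number <= version:
            continue  # already applied on a previous run
        # SQLite supports transactional DDL, so a failed step rolls back
        # both the schema change and the version record. This does not
        # hold on every engine (MySQL, for example, commits DDL implicitly).
        with conn:
            conn.execute("BEGIN")
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (number,)
            )
        applied.append(number)
    return applied
```

Calling `migrate(sqlite3.connect(":memory:"))` applies versions 1 through 3. A second call on the same connection applies nothing, which is the property that makes it safe to run migrations on every deployment.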