Storage Architecture

Scope

This file covers storage tier selection, protocol decisions, capacity planning, performance requirements, data lifecycle management, and integration with compute and backup infrastructure. It describes which storage decisions need to be made and the trade-offs involved. For provider-specific implementation details, see the provider storage files. For backup strategy and tooling, see general/enterprise-backup.md. For database-specific storage considerations, see general/data.md.

Checklist

Why This Matters

Storage decisions are among the most persistent in any architecture. Migrating data between storage platforms, especially at scale, is time-consuming, risky, and often requires application downtime. A 100 TB dataset migrating at a sustained 1 Gbps takes more than 9 days of continuous transfer, assuming no errors or retransmissions. Choosing the wrong storage tier, protocol, or platform at design time creates technical debt that compounds as data volume grows, eventually forcing a disruptive migration under pressure.
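The transfer-time figure above can be reproduced with a short calculation. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the function name `transfer_days` and the `efficiency` parameter (the fraction of link speed actually achieved) are illustrative and not part of any tool referenced in this file.

```python
def transfer_days(dataset_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate days of continuous transfer for a dataset over a network link.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/second.
    `efficiency` is a hypothetical knob for protocol overhead and retransmissions.
    """
    if dataset_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("dataset_tb >= 0, link_gbps > 0, 0 < efficiency <= 1 required")
    total_bits = dataset_tb * 1e12 * 8               # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency     # usable bits per second
    return total_bits / effective_bps / 86_400       # seconds -> days


if __name__ == "__main__":
    # The example from the text: 100 TB at a sustained 1 Gbps.
    print(f"{transfer_days(100, 1):.2f} days")
    # The same transfer if only 70% of the link is usable (assumed figure).
    print(f"{transfer_days(100, 1, efficiency=0.7):.2f} days")
```

Running the estimate with a realistic efficiency below 1.0 is usually more informative than the ideal case, since the ideal figure is a lower bound.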

Performance mismatches between storage and compute are a common root cause of application latency, and one that teams struggle to diagnose. A database running on storage that delivers 1,000 IOPS when it needs 10,000 will exhibit query timeouts, connection pool exhaustion, and cascading failures. These symptoms look like application problems rather than infrastructure problems. Measuring and specifying storage performance requirements during architecture design, rather than discovering them during load testing, can prevent months of troubleshooting.
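To make the IOPS example concrete, the sketch below computes the shortfall and how quickly unserved I/O requests pile up. It is a deliberately simplified model: it assumes demand arrives at a constant rate regardless of queue depth, which real applications do not do once timeouts start. The function names are hypothetical.

```python
def iops_headroom(required_iops: float, delivered_iops: float) -> float:
    """Fraction of spare capacity relative to demand.

    Positive means headroom, negative means shortfall
    (for example -0.9 means the storage serves only 10% of demand).
    """
    if required_iops <= 0:
        raise ValueError("required_iops must be positive")
    return (delivered_iops - required_iops) / required_iops


def backlog_after(required_iops: float, delivered_iops: float, seconds: float) -> float:
    """I/O requests left queued after `seconds` of sustained demand.

    Simplified assumption: demand is constant and never backs off.
    """
    return max(0.0, required_iops - delivered_iops) * seconds


if __name__ == "__main__":
    # The example from the text: 10,000 IOPS needed, 1,000 delivered.
    print(iops_headroom(10_000, 1_000))
    print(backlog_after(10_000, 1_000, seconds=10))
```

Even under this crude model, the backlog grows by thousands of requests per second, which is why the failure surfaces as timeouts and pool exhaustion rather than as an obvious storage alert.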

Storage costs are frequently underestimated because capacity is only one dimension. Snapshot retention, replication overhead, retrieval fees from cold tiers, and data transfer charges between storage and compute can double the effective cost of stored data. Organizations that plan storage costs based on raw capacity alone routinely exceed their budgets. Lifecycle tiering policies that automatically move aging data to cheaper tiers are the most effective cost control mechanism, but they must be designed at architecture time: retrofitting tiering onto a flat storage model requires data reorganization and application changes.
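The cost dimensions listed above can be combined into a rough monthly estimate. The sketch below is illustrative only: every price and ratio in the example is a made-up placeholder, not a quote from any provider, and it assumes replicas and snapshots are billed at the same per-GB rate as primary data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageCostInputs:
    capacity_gb: float                    # primary data stored
    price_per_gb_month: float             # hypothetical storage price
    snapshot_overhead: float = 0.0        # snapshot space as a fraction of capacity
    replica_copies: int = 0               # additional full copies kept
    retrieval_gb_month: float = 0.0       # data read back from cold tiers
    retrieval_price_per_gb: float = 0.0   # hypothetical retrieval fee
    transfer_gb_month: float = 0.0        # data moved between storage and compute
    transfer_price_per_gb: float = 0.0    # hypothetical transfer fee


def raw_monthly_cost(c: StorageCostInputs) -> float:
    """Cost if only raw capacity is considered."""
    return c.capacity_gb * c.price_per_gb_month


def effective_monthly_cost(c: StorageCostInputs) -> float:
    """Cost including snapshots, replicas, retrieval, and transfer."""
    stored_gb = c.capacity_gb * (1 + c.snapshot_overhead) * (1 + c.replica_copies)
    return (stored_gb * c.price_per_gb_month
            + c.retrieval_gb_month * c.retrieval_price_per_gb
            + c.transfer_gb_month * c.transfer_price_per_gb)


if __name__ == "__main__":
    plan = StorageCostInputs(capacity_gb=10_000, price_per_gb_month=0.10,
                             snapshot_overhead=0.25, replica_copies=1,
                             transfer_gb_month=2_000, transfer_price_per_gb=0.05)
    ratio = effective_monthly_cost(plan) / raw_monthly_cost(plan)
    print(f"effective cost is {ratio:.1f}x the raw-capacity estimate")
```

Substituting real provider prices and measured snapshot growth into a model like this gives a more defensible budget than capacity multiplied by list price.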

Common Decisions (ADR Triggers)

ADR: Storage Tier Selection

Context: The architecture must store data with varying access patterns, performance requirements, and cost sensitivities.

Options:

Criterion	Block Storage	Object Storage	File Storage (NAS)	Archive Storage
Access pattern	Random read/write, single-host attach	HTTP/API-based, write-once-read-many	Shared read/write across multiple hosts	Write-once, read-rarely
Latency	Sub-millisecond (NVMe/SSD), low-ms (HDD)	Milliseconds to seconds	Low to mid milliseconds	Hours (retrieval request required)
Throughput	High, dedicated per volume	Very high aggregate, variable per request	Moderate, shared across clients	N/A (batch retrieval)
Scalability	Terabytes per volume	Petabytes per bucket, unlimited objects	Terabytes to low petabytes	Petabytes
Cost	Highest per GB (especially SSD tier)	Low per GB, pay for retrieval and API calls	Moderate per GB	Lowest per GB
Best fit	Databases, transactional apps, boot volumes	Backups, media, logs, data lake, static content	Shared home directories, CMS, legacy apps	Compliance archives, cold backups, audit logs

Decision drivers: Access pattern (random vs. sequential, read vs. write ratio), latency requirements, number of concurrent consumers, data volume and growth trajectory, and compliance requirements for immutability or retention.
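As a starting point for discussion, the access-pattern and latency rows of the table can be reduced to a first-pass rule of thumb. This is a sketch, not a substitute for the ADR: the function name, its parameters, and the one-hour threshold are assumptions chosen for illustration, and it ignores cost, scale, and compliance drivers entirely.

```python
from enum import Enum


class Tier(Enum):
    BLOCK = "block"
    OBJECT = "object"
    FILE = "file"
    ARCHIVE = "archive"


def suggest_tier(*, random_io: bool, shared_file_access: bool,
                 latency_tolerance_s: float) -> Tier:
    """First-pass tier suggestion from access pattern and latency tolerance.

    latency_tolerance_s is how long a consumer can wait for the first byte.
    The 3600-second cutoff is an assumed stand-in for "can wait hours".
    """
    if latency_tolerance_s >= 3600:
        return Tier.ARCHIVE        # write-once, read-rarely data
    if shared_file_access:
        return Tier.FILE           # shared read/write across multiple hosts
    if random_io:
        return Tier.BLOCK          # random read/write, single-host attach
    return Tier.OBJECT             # API-based, write-once-read-many


if __name__ == "__main__":
    print(suggest_tier(random_io=True, shared_file_access=False,
                       latency_tolerance_s=0.001))
    print(suggest_tier(random_io=False, shared_file_access=False,
                       latency_tolerance_s=7200))
```

The order of the checks encodes a priority (latency tolerance first, then sharing, then I/O pattern); a real decision would weigh these against cost and growth rather than short-circuiting.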

ADR: Storage Protocol Selection

Context: The architecture must connect compute workloads to storage with appropriate performance, compatibility, and network requirements.

Options:

Criterion	NFS	iSCSI	Fibre Channel	S3-Compatible API	SMB/CIFS
Transport	TCP/IP (Ethernet)	TCP/IP (Ethernet)	Dedicated FC fabric (or FCoE)	HTTP/HTTPS (Ethernet)	TCP/IP (Ethernet)
Access model	File-level, multi-host	Block-level, typically single-host	Block-level, typically single-host	Object-level, multi-client	File-level, multi-host
Performance	Good for general workloads; jumbo frames improve throughput	Near-FC performance on dedicated VLANs	Lowest latency, highest reliability	Depends on network; higher latency per request	Good for general workloads
Infrastructure cost	Uses existing Ethernet	Uses existing Ethernet; may need dedicated VLANs	Requires FC HBAs, switches, and cabling	Uses existing Ethernet	Uses existing Ethernet
Best fit	VMware datastores, Linux shared mounts, Kubernetes RWX	Hypervisor datastores, databases on Ethernet-only networks	Mission-critical databases, high-transaction systems	Cloud-native apps, backup targets, hybrid cloud	Windows file shares, Active Directory environments

Decision drivers: Existing network infrastructure (Ethernet-only vs. FC fabric), workload latency sensitivity, multi-host access requirements, team expertise with storage networking, and budget for dedicated storage networking.
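The transport and access-model rows of the protocol table can be encoded as data and filtered against an environment's constraints. The sketch below is one possible encoding: the `Protocol` tuple, the `candidates` function, and the boolean simplifications (for example, treating block protocols as single-host) are assumptions made for illustration and flatten the "typically" qualifiers in the table.

```python
from typing import List, NamedTuple


class Protocol(NamedTuple):
    name: str
    access_model: str       # "file", "block", or "object"
    multi_host: bool        # simplified from the table's "typically single-host"
    needs_fc_fabric: bool   # requires dedicated Fibre Channel infrastructure


# Encoded from the comparison table; values are simplifications.
PROTOCOLS = [
    Protocol("NFS", "file", True, False),
    Protocol("iSCSI", "block", False, False),
    Protocol("Fibre Channel", "block", False, True),
    Protocol("S3-compatible API", "object", True, False),
    Protocol("SMB/CIFS", "file", True, False),
]


def candidates(access_model: str, *, has_fc_fabric: bool,
               needs_multi_host: bool = False) -> List[str]:
    """Return protocol names that satisfy the given constraints."""
    return [
        p.name for p in PROTOCOLS
        if p.access_model == access_model
        and (has_fc_fabric or not p.needs_fc_fabric)
        and (p.multi_host or not needs_multi_host)
    ]


if __name__ == "__main__":
    # Block storage on an Ethernet-only network.
    print(candidates("block", has_fc_fabric=False))
    # Shared file access for multiple hosts.
    print(candidates("file", has_fc_fabric=False, needs_multi_host=True))
```

A filter like this only narrows the field; the remaining drivers (latency sensitivity, team expertise, budget) still have to be weighed by people.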

ADR: Shared Storage vs. Local Storage

Context: The architecture must balance storage performance against operational flexibility for high availability and workload mobility.

Options:

- Centralized shared storage (SAN/NAS): All compute nodes access a shared storage pool. Enables live migration, simplified backup, and centralized management. Introduces a network dependency and potential single point of contention. Higher cost for enterprise arrays with redundant controllers. Best for: virtualized environments, workloads requiring live migration, and centralized data management.
- Local storage (DAS/NVMe): Each compute node uses directly attached disks. Lowest latency, highest IOPS, no network dependency. Workloads are tied to specific hosts; HA requires application-level replication or distributed storage. Lower upfront cost per IOPS. Best for: distributed databases (Cassandra, MongoDB), high-performance caching, Kubernetes local persistent volumes for latency-sensitive pods.
- Distributed software-defined storage (Ceph, vSAN, MinIO): Aggregates local disks across nodes into a shared, replicated pool. Combines local-disk performance with shared-storage flexibility. Adds CPU and network overhead on compute nodes. Requires minimum node counts (typically 3+) and careful network design. Best for: hyperconverged infrastructure, private cloud, and environments seeking shared storage without dedicated array hardware.

Decision drivers: Workload latency requirements, HA and live migration needs, infrastructure budget, team operational expertise with storage platforms, and whether the environment is hyperconverged or uses dedicated storage hardware.
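For the distributed software-defined option, replication reduces usable capacity well below the sum of the local disks, which affects the budget comparison against a dedicated array. The sketch below assumes simple full-copy replication with a replica count of 3 and a 3-node minimum; both defaults are illustrative assumptions, and platforms that use erasure coding or reserve rebuild space will yield different numbers.

```python
def usable_capacity_tb(nodes: int, raw_tb_per_node: float,
                       replicas: int = 3, min_nodes: int = 3) -> float:
    """Usable capacity of a replicated pool built from local disks.

    Assumes every object is stored as `replicas` full copies, each on a
    different node. Ignores filesystem overhead and rebuild headroom.
    """
    if nodes < min_nodes:
        raise ValueError(f"need at least {min_nodes} nodes, got {nodes}")
    if not 1 <= replicas <= nodes:
        raise ValueError("replicas must be between 1 and the node count")
    return nodes * raw_tb_per_node / replicas


if __name__ == "__main__":
    # Four nodes with 20 TB of raw disk each, three copies of everything.
    print(f"{usable_capacity_tb(4, 20):.1f} TB usable of {4 * 20} TB raw")
```

Comparing cost per usable terabyte, rather than per raw terabyte, keeps the three options on the same footing.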

ADR: Data Lifecycle and Tiering Strategy

Context: Stored data ages over time, and retaining all data on the highest-performance tier wastes budget without improving application performance.

Options:

- Manual lifecycle management: Administrators periodically review and migrate data between tiers. Low automation investment. Requires ongoing operational effort and discipline; commonly neglected, resulting in hot-tier bloat and overspend.
- Policy-based automatic tiering: Storage platform or cloud service moves data between tiers based on rules (last access time, creation date, size). Examples: S3 Lifecycle Rules, Azure Blob Lifecycle Management, NetApp FabricPool. Low operational effort once configured. Requires careful policy design to avoid prematurely tiering frequently accessed data.
- Application-driven tiering: Application code writes to the appropriate tier based on data type at creation time (e.g., thumbnails to object storage, metadata to database, raw media to archive). Most efficient but requires application awareness and developer cooperation. Best for greenfield applications designed around multi-tier storage.
- No tiering (single tier): All data remains on one performance tier. Simplest operationally. Cost-effective only when total data volume is small (under 1 TB) or access patterns are uniformly hot. Becomes prohibitively expensive as data grows.

Decision drivers: Total data volume and growth rate, percentage of data that is active vs. dormant, retrieval latency tolerance for aged data, regulatory retention requirements, and operational capacity for lifecycle policy management.
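The policy-based option boils down to evaluating age rules against each object. The sketch below shows that logic in isolation so a policy can be tested against sample data before it is configured on a real platform. The tier names and day thresholds are hypothetical and do not correspond to any specific provider's lifecycle rule syntax.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# Hypothetical policy: (minimum days since last access, target tier),
# ordered from coldest to hottest so the first match wins.
DEFAULT_RULES: List[Tuple[int, str]] = [
    (365, "archive"),
    (90, "cool"),
    (0, "hot"),
]


def target_tier(last_access: datetime, now: datetime,
                rules: List[Tuple[int, str]] = DEFAULT_RULES) -> str:
    """Return the tier an object should live on, based on last access age."""
    age_days = (now - last_access).days
    for min_days, tier in rules:
        if age_days >= min_days:
            return tier
    return rules[-1][1]  # fall back to the hottest tier (e.g. future timestamps)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for days in (10, 120, 400):
        print(days, target_tier(now - timedelta(days=days), now))
```

Replaying a sample of real access logs through a function like this is one way to check whether a proposed threshold would prematurely tier data that is still read frequently.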