MongoDB

Scope

This file covers MongoDB architecture decisions: Atlas vs self-hosted deployment, replica set configuration, sharding strategy and shard key design, schema design patterns (embedding vs referencing), aggregation pipeline optimization, change streams for event-driven architectures, Atlas Search for full-text and vector search, read/write concern tuning, connection management, and migration from relational databases. For general database strategy (engine selection, replication patterns, encryption), see general/data.md. For migration methodology and cutover planning, see general/database-migration.md.

Checklist

Why This Matters

MongoDB's document model provides flexibility that relational databases do not, but this flexibility creates its own class of architectural mistakes. The most common failure pattern is treating MongoDB like a relational database — normalizing data across collections and relying on $lookup for joins, which eliminates the performance advantages of the document model. The second most common failure is schema-less development without schema validation, which leads to inconsistent documents that break application logic and make queries unpredictable.
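As a minimal sketch of the schema validation mentioned above, the following builds a `collMod` command document that attaches a `$jsonSchema` validator to a collection. The collection name `orders` and its fields are hypothetical. The result is a plain Python dict that would be passed to a driver call such as PyMongo's `db.command(...)` against a live deployment.

```python
def build_validator_command(collection: str) -> dict:
    """Build a collMod command that attaches a $jsonSchema validator.

    The field names below (customer_id, status, total) are illustrative
    assumptions, not taken from any real schema.
    """
    schema = {
        "bsonType": "object",
        "required": ["customer_id", "status", "total"],
        "properties": {
            "customer_id": {"bsonType": "objectId"},
            "status": {"enum": ["pending", "paid", "shipped"]},
            "total": {"bsonType": ["double", "decimal"], "minimum": 0},
        },
    }
    # The command name must be the first key; dicts keep insertion order.
    return {
        "collMod": collection,
        "validator": {"$jsonSchema": schema},
        "validationLevel": "moderate",   # skip existing documents that are already invalid
        "validationAction": "error",     # reject writes that fail validation
    }

command = build_validator_command("orders")
```

Choosing `validationLevel: "moderate"` is one way to introduce validation on a collection that already contains inconsistent documents; `"strict"` applies the rules to every insert and update.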

Shard key selection is the single most consequential MongoDB architecture decision. A poorly chosen shard key, typically one with low cardinality or heavily skewed values, creates "jumbo chunks" that cannot be split or migrated, leading to unbalanced shards where one shard handles disproportionate load while others sit idle. In pre-5.0 versions, the shard key cannot be replaced in place (MongoDB 4.4 can only refine a key by appending suffix fields), so the only remedy is to dump and reload the entire collection with a new shard key. Even with the resharding support in MongoDB 5.0+, the process is I/O intensive and can impact production performance for hours on large collections.
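To illustrate why a low-cardinality shard key produces chunks that cannot be split, the sketch below counts documents per shard key value and flags values whose count exceeds a chunk budget. It is a simplified model: real chunks are bounded by size in bytes rather than document count, and the `max_docs_per_chunk` threshold and the sample field names are assumptions for illustration.

```python
from collections import Counter

def unsplittable_key_values(docs, shard_key, max_docs_per_chunk):
    """Return shard key values that alone exceed the chunk budget.

    A chunk can only be split between distinct shard key values, so a
    single value holding more documents than the budget cannot be split.
    """
    counts = Counter(doc[shard_key] for doc in docs)
    return {value: n for value, n in counts.items() if n > max_docs_per_chunk}

# Hypothetical data: "country" has few distinct values, "user_id" has many.
docs = [{"country": "US" if i % 10 else "NZ", "user_id": i} for i in range(1000)]

low_cardinality = unsplittable_key_values(docs, "country", max_docs_per_chunk=200)
high_cardinality = unsplittable_key_values(docs, "user_id", max_docs_per_chunk=200)
```

Here the `country` key concentrates 900 documents under a single value, which no split point can divide, while `user_id` leaves no value over budget.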

Connection management is another area where MongoDB deployments fail at scale. Unlike connection-pooled relational databases with lightweight per-connection overhead, each MongoDB connection consumes approximately 1 MB of RAM on the server. In microservices architectures with many small services, the aggregate connection count across all service instances can exhaust server memory before CPU or disk becomes the bottleneck. Atlas tier connection limits compound this problem: an M30 instance supports a maximum of 2,000 connections, which a dozen services each running 20 pods with a pool size of 10 would exceed (12 × 20 × 10 = 2,400 potential connections).
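The arithmetic in this paragraph can be sketched as a small budget check. The figures (12 services, 20 pods, a pool size of 10, a 2,000-connection limit, roughly 1 MB per connection) come from the text above; the function name is hypothetical, and the model ignores the extra monitoring connections that drivers open, so treat the result as a lower bound.

```python
def connection_budget(services, pods_per_service, pool_size,
                      server_limit, mb_per_connection=1.0):
    """Estimate worst-case connections and the server RAM they consume."""
    pods = services * pods_per_service
    max_connections = pods * pool_size
    return {
        "max_connections": max_connections,
        "estimated_ram_mb": max_connections * mb_per_connection,
        "over_limit": max_connections > server_limit,
        # Largest per-pod pool size that stays within the server limit.
        "max_safe_pool_size": server_limit // pods,
    }

budget = connection_budget(services=12, pods_per_service=20,
                           pool_size=10, server_limit=2000)
```

With these inputs the worst case is 2,400 connections, so the per-pod pool (the driver's `maxPoolSize` setting) would need to drop to 8 to fit under the limit.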

Common Decisions (ADR Triggers)

ADR: Atlas vs Self-Hosted MongoDB

Context: The organization must decide between MongoDB Atlas (fully managed) and self-hosted MongoDB (on VMs or Kubernetes).

Options:

| Criterion | MongoDB Atlas | Self-Hosted (VM) | Self-Hosted (Kubernetes) |
| --- | --- | --- | --- |
| Operational Overhead | Lowest (fully managed) | High (manual patching, backups, HA) | Moderate (operator-managed, but K8s complexity) |
| Cost Model | Per-hour cluster + data transfer + backup | Infrastructure + DBA time | Infrastructure + K8s overhead + DBA time |
| Customization | Limited (managed configuration) | Full control | Full control |
| Built-in Search | Atlas Search and Vector Search | Requires separate Elasticsearch | Requires separate Elasticsearch |
| Multi-Region | Built-in Global Clusters | Manual replica set across regions | Complex cross-cluster replication |
| Security | SOC 2, HIPAA, PCI DSS compliant | Customer-managed | Customer-managed |

Decision drivers: Operational team MongoDB expertise, total cost of ownership including personnel, multi-region requirements, search feature needs, and compliance certification requirements.

ADR: Embedding vs Referencing Data Model

Context: MongoDB schema design requires deciding which related data to embed within documents versus reference across collections.

Options:

- Embed (denormalize): Store related data in nested subdocuments or arrays within the parent document. Optimal when related data is always accessed with the parent, has a bounded size, and does not need independent access. Single-document reads and writes are atomic in MongoDB. Risk of exceeding 16 MB document limit with unbounded arrays.
- Reference (normalize): Store related data in separate collections with ObjectId references. Requires $lookup or multiple queries to assemble related data. Optimal for many-to-many relationships, independently accessed entities, or data that would create unbounded document growth. Provides flexibility for evolving access patterns.
- Hybrid (Extended Reference): Embed frequently accessed fields from the referenced document while maintaining the reference. Reduces $lookup frequency for common queries. Requires application-level management to keep embedded copies consistent with the source document.

Decision drivers: Query access patterns, data relationship cardinality (1:1, 1:few, 1:many, many:many), document growth rate, data consistency requirements, and whether related data is read independently.
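The three options in this ADR can be compared with a sketch of the document shapes involved. The blog-style collections (`posts`, `authors`) and all field names are hypothetical, and plain strings stand in for `ObjectId` values so the example runs without a driver.

```python
# Embed: comments live inside the post; one read returns everything.
embedded_post = {
    "_id": "post1",
    "title": "Shard keys",
    "comments": [{"user": "ana", "text": "Useful"}],  # array must stay bounded
}

# Reference: the post stores only the author's id.
author = {"_id": "author1", "name": "Ana Silva", "bio": "Writes about databases"}
referenced_post = {"_id": "post2", "title": "Indexes", "author_id": "author1"}

# Aggregation stages that would join the author at read time.
lookup_pipeline = [
    {"$match": {"_id": "post2"}},
    {"$lookup": {
        "from": "authors",
        "localField": "author_id",
        "foreignField": "_id",
        "as": "author",
    }},
]

def with_extended_reference(post, author, fields=("name",)):
    """Hybrid: copy frequently read author fields next to the reference."""
    copied = {f: author[f] for f in fields}
    base = {k: v for k, v in post.items() if k != "author_id"}
    return {**base, "author": {"_id": author["_id"], **copied}}

hybrid_post = with_extended_reference(referenced_post, author)
```

In the hybrid shape, rendering a post needs no `$lookup`, but a change to the author's name must be propagated by the application to every post that carries a copy.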

ADR: Sharding Strategy

Context: The collection has grown beyond what a single replica set can serve in terms of storage or throughput, requiring horizontal scaling via sharding.

Options:

- Hashed Shard Key: Hash of a single field (often _id). Provides even write distribution across shards. Does not support range queries on the shard key; all range queries become scatter-gather operations. Best for write-heavy workloads where queries filter on non-shard-key fields.
- Ranged Shard Key: Uses field value ranges to partition data. Supports range queries on the shard key that target a single shard (targeted queries). Risks hotspotting if the key is monotonically increasing. Best when range queries on the shard key are the primary access pattern.
- Compound Shard Key: Combines multiple fields (e.g., tenant_id + timestamp). First field provides coarse partitioning (e.g., per tenant), subsequent fields provide fine-grained distribution. Best for multi-tenant applications where queries always include the tenant identifier.
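For reference, the three options correspond to different `key` documents in the `shardCollection` admin command. The namespace `app.events` and the field name `created_at` are hypothetical, and the dicts are alternatives (a collection is sharded with exactly one of them), shown here as a sketch of what would be sent through a driver's admin command helper.

```python
NAMESPACE = "app.events"  # hypothetical "<database>.<collection>"

# Hashed: even write distribution, no targeted range queries.
hashed = {"shardCollection": NAMESPACE, "key": {"_id": "hashed"}}

# Ranged: targeted range queries, hotspot risk if values only increase.
ranged = {"shardCollection": NAMESPACE, "key": {"created_at": 1}}

# Compound: field order matters; the first field gives coarse partitioning.
compound = {"shardCollection": NAMESPACE, "key": {"tenant_id": 1, "timestamp": 1}}
```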

Decision drivers: Write distribution requirements, primary query patterns (point lookups vs range scans), data growth pattern (monotonic vs distributed), multi-tenancy isolation needs, and whether queries include the shard key.
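The hotspotting risk of a monotonically increasing ranged key can be illustrated with a toy simulation. This is a simplified model under stated assumptions: SHA-256 modulo the shard count stands in for MongoDB's hashed sharding (which actually assigns ranges of hash values to chunks), and the ranged case uses fixed boundaries chosen before the new writes arrive, with no balancer activity.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

SHARDS = 4

def hashed_shard(key: int) -> int:
    """Toy hashed placement: stable hash of the key, modulo shard count."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS

def ranged_shard(key: int, boundaries) -> int:
    """Toy ranged placement: index of the range the key falls into."""
    return bisect_right(boundaries, key)

# Ranges were drawn for existing keys 0..999; new keys keep increasing.
boundaries = [250, 500, 750]      # three split points -> shards 0..3
new_keys = range(1000, 2000)      # monotonically increasing inserts

ranged_load = Counter(ranged_shard(k, boundaries) for k in new_keys)
hashed_load = Counter(hashed_shard(k) for k in new_keys)
```

In this model every new write under the ranged key lands on the last shard, while the hashed placement spreads the same keys across all four shards, at the cost of losing targeted range queries.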

Reference Links

MongoDB Architecture Guide -- document model, replication, sharding, and storage engine internals
MongoDB Atlas Documentation -- managed service configuration, Atlas Search, Atlas Vector Search, and serverless instances
MongoDB Schema Design Best Practices -- embedding vs referencing patterns and common anti-patterns
MongoDB Sharding Documentation -- shard key selection, chunk migration, and balancer configuration
MongoDB University -- free courses on schema design, aggregation, and MongoDB administration