Elasticsearch¶

Scope¶

This file covers Elasticsearch (and OpenSearch) architecture decisions: index design (mappings, shards, replicas), cluster sizing and node roles, Index Lifecycle Management (ILM) policies, observability vs search use cases and their different design patterns, ELK/EFK stack deployment, security configuration (TLS, RBAC, field-level security), cross-cluster replication and search, snapshot and restore strategy, and managed service options (Elastic Cloud, Amazon OpenSearch Service, Azure AI Search). For general observability strategy, see general/observability.md. For application search alternatives, see providers/mongodb/database.md (Atlas Search) or providers/redis/database.md (RediSearch).

Checklist¶

Why This Matters¶

Elasticsearch is the dominant technology for both log analytics and application search, but these two use cases have fundamentally different architectural requirements. Organizations that run a single Elasticsearch cluster for both logging and application search often hit the same failure mode: a spike in log ingestion during a production incident consumes the cluster's indexing capacity, so user-facing search slows or fails at the exact moment when observability is most needed. Separating these workloads into different clusters (or, at minimum, different node pools with resource isolation) is essential for production reliability.

The most expensive Elasticsearch mistake is shard proliferation. Each index defaults to one primary shard (since 7.0), but time-based indexing patterns that create a new index per day for every log source (e.g., logs-2024.01.15) accumulate thousands of shards over months. Each shard maintains in-memory data structures that consume heap space regardless of the shard's data size. A cluster with 50,000 shards accumulated over 200 days of daily indexes may spend more heap memory on shard metadata than on actual data caching, resulting in frequent garbage collection pauses and eventual cluster instability. ILM with rollover-based indexing (create a new index when the current one reaches a size threshold) prevents this by controlling shard count independently of time granularity.
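The shard arithmetic above can be made concrete with a small sketch. The helper names, the stream count, and the data volume below are illustrative assumptions, not figures from a real cluster. The rollover estimate is a rough lower bound that takes the larger of the size-driven and age-driven index counts. The policy body follows the shape accepted by `PUT _ilm/policy/<name>`, with placeholder thresholds.

```python
import math


def daily_index_shards(streams, retention_days, primaries=1, replicas=1):
    """Shards held when every log stream creates one index per day."""
    return streams * retention_days * primaries * (1 + replicas)


def rollover_shards(total_primary_gb, max_primary_shard_gb, streams,
                    retention_days, max_age_days, replicas=1):
    """Rough lower bound on shards held with size/age-based rollover.

    Assumes one primary shard per index and data spread evenly across streams.
    """
    per_stream_gb = total_primary_gb / streams
    by_size = math.ceil(per_stream_gb / max_primary_shard_gb)
    by_age = math.ceil(retention_days / max_age_days)
    indices_per_stream = max(1, by_size, by_age)
    return streams * indices_per_stream * (1 + replicas)


def build_ilm_policy(max_primary_shard_size="50gb", max_age="30d",
                     delete_after="90d"):
    """Request body for PUT _ilm/policy/<name>; thresholds are placeholders."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": max_primary_shard_size,
                            "max_age": max_age,
                        }
                    }
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Hypothetical workload: 125 log streams, 200 days retained, 5 TB of primary data.
print(daily_index_shards(streams=125, retention_days=200))          # 50000
print(rollover_shards(5000, 50, streams=125, retention_days=200,
                      max_age_days=30))                              # 1750
```

Under these assumed numbers, the same data lands in a small fraction of the shards because the index count follows data volume and a coarse age limit rather than the calendar.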

Security is non-negotiable but historically overlooked. Before Elasticsearch 8.0, security was disabled by default and required explicit configuration. The result was thousands of internet-exposed Elasticsearch clusters containing sensitive data. Even in internal deployments, lack of RBAC means any application that can reach the cluster can read or delete any index. Field-level security prevents applications from accessing PII fields they do not need, which is a common compliance requirement for GDPR, HIPAA, and PCI DSS.
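As an illustration of field-level security, the sketch below builds a read-only role body in the shape used by the Elasticsearch role API (`PUT _security/role/<name>`). The function name, index pattern, and field names are hypothetical. The OpenSearch Security plugin uses a different role schema, so this body would need adapting there.

```python
import json


def build_read_only_role(index_patterns, hidden_fields):
    """Role body granting read access to all fields except `hidden_fields`."""
    return {
        "indices": [
            {
                "names": list(index_patterns),
                "privileges": ["read"],
                "field_security": {
                    "grant": ["*"],                  # start from every field
                    "except": list(hidden_fields),   # then withhold PII fields
                },
            }
        ]
    }


# Hypothetical example: an analytics app that must not see customer PII.
role = build_read_only_role(
    index_patterns=["orders-*"],
    hidden_fields=["customer.email", "customer.phone"],
)
print(json.dumps(role, indent=2))
```

Granting everything and listing exceptions keeps the role short, but newly added fields are then visible by default; an explicit allow-list in `grant` is the stricter alternative.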

Common Decisions (ADR Triggers)¶

ADR: Elasticsearch vs OpenSearch¶

Context: The organization needs a distributed search and analytics engine and must choose between the two major forks.

Options:

Criterion	Elasticsearch (Elastic)	OpenSearch (AWS)
License	AGPLv3, SSPL, or Elastic License (AGPL option from 8.16) / SSPL + Elastic License (7.11-8.15)	Apache 2.0
Managed Service	Elastic Cloud (multi-cloud)	Amazon OpenSearch Service (AWS)
Security Plugin	Built-in (enabled by default from 8.0)	OpenSearch Security plugin (derived from Open Distro)
ML Features	Anomaly detection, NLP inference, ES|QL	Anomaly detection, k-NN vector search
Community	Elastic-led, large ecosystem	AWS-initiated, Linux Foundation project (OpenSearch Software Foundation)
Migration From Other	Difficult from OpenSearch 2.x+	Difficult from Elasticsearch 8.x+

Decision drivers: Licensing requirements (AGPL vs Apache 2.0 implications), cloud provider alignment (AWS favors OpenSearch), feature requirements (ES|QL, Elastic ML vs OpenSearch plugins), existing team expertise, and long-term vendor strategy.

ADR: Observability vs Search Cluster Architecture¶

Context: The organization uses Elasticsearch for both log/metrics observability and user-facing application search.

Options:

- Single cluster with namespace isolation: Shared infrastructure, index-level RBAC. Lower cost. Risk of resource contention between logging spikes and search queries. Requires careful capacity planning for combined workloads.
- Separate dedicated clusters: Independent clusters for observability and search. Complete resource isolation. Higher infrastructure cost. Independent scaling, tuning, and upgrade schedules. Different ILM policies per use case.
- Hybrid with dedicated node pools: Single cluster with hot/warm/cold tiers for logging and dedicated search nodes. Partial resource isolation through node allocation awareness. Moderate cost. Complex configuration.

Decision drivers: SLA requirements for application search latency, log ingest volume variability, budget for infrastructure, operational team capacity to manage multiple clusters, and whether observability and search have different retention requirements.
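For the hybrid option, one way to pin indexes to a node pool is index-level shard allocation filtering on a custom node attribute. The sketch below assumes nodes are tagged with a hypothetical attribute named `workload` (for example `node.attr.workload: search` in elasticsearch.yml). The `eligible_nodes` helper is only a simplified model of how the `require` filter narrows placement, not the cluster's real allocation logic.

```python
def index_settings_for_pool(pool, shards=1, replicas=1):
    """Body for PUT /<index>: pin the index to nodes tagged with `pool`."""
    return {
        "settings": {
            "index.number_of_shards": shards,
            "index.number_of_replicas": replicas,
            # Shards may only be placed on nodes whose attribute matches.
            "index.routing.allocation.require.workload": pool,
        }
    }


def eligible_nodes(nodes, index_body):
    """Simplified model of `require`: keep nodes whose attribute matches."""
    required = index_body["settings"]["index.routing.allocation.require.workload"]
    return [name for name, attrs in nodes.items()
            if attrs.get("workload") == required]


# Hypothetical cluster with two search nodes and one logging node.
nodes = {
    "search-1": {"workload": "search"},
    "search-2": {"workload": "search"},
    "logs-hot-1": {"workload": "logging"},
}
catalog = index_settings_for_pool("search", shards=2)
print(eligible_nodes(nodes, catalog))  # ['search-1', 'search-2']
```

Allocation filtering isolates data placement and the resources tied to it, but coordinating and master-node load is still shared, which is why the isolation in this option is only partial.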

ADR: Log Pipeline Architecture¶

Context: The organization needs to ship logs and metrics from applications and infrastructure to Elasticsearch.

Options:

- Direct shipping (Beats/Agent to Elasticsearch): Simplest architecture. Beats ship directly to Elasticsearch. No intermediate buffering. Elasticsearch backpressure directly affects log shippers. Risk of data loss during Elasticsearch maintenance or outages.
- Buffered pipeline (Beats to Kafka to Logstash to Elasticsearch): Kafka provides durable buffering between shippers and Elasticsearch. Absorbs ingest spikes. Enables replay from Kafka if Elasticsearch is unavailable. Higher infrastructure complexity and cost. Recommended for high-volume production deployments.
- Kubernetes-native (Fluent Bit to Fluentd to Elasticsearch): Fluent Bit as lightweight DaemonSet agent on each node. Fluentd as aggregator with buffering and enrichment. Cloud-native with Kubernetes-aware metadata enrichment. Well-suited for containerized environments.

Decision drivers: Log volume (below or above 10 GB/day), tolerance for data loss during outages, Kubernetes vs VM-based infrastructure, need for log enrichment and transformation, and team familiarity with each component.
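The difference between direct and buffered shipping can be illustrated with a toy simulation. This is a deliberately simplified model with made-up numbers: events arrive per tick, the sink (Elasticsearch) indexes at most `sink_capacity` events per tick, and anything that exceeds `buffer_capacity` is counted as lost. Real shippers retry and apply backpressure, so treat the drop count as a stand-in for the risk rather than a prediction.

```python
def simulate_pipeline(arrivals, sink_capacity, buffer_capacity):
    """Return delivered, dropped, and still-pending event counts."""
    backlog = delivered = dropped = 0
    for arriving in arrivals:
        backlog += arriving
        sent = min(backlog, sink_capacity)   # sink indexes what it can
        backlog -= sent
        delivered += sent
        if backlog > buffer_capacity:        # overflow beyond the buffer is lost
            dropped += backlog - buffer_capacity
            backlog = buffer_capacity
    return {"delivered": delivered, "dropped": dropped, "pending": backlog}


# Hypothetical incident: ingest spikes to 5x normal for two ticks.
arrivals = [100, 100, 500, 500, 100, 100]

# Direct shipping: only a small in-memory queue on the shipper.
print(simulate_pipeline(arrivals, sink_capacity=200, buffer_capacity=50))
# {'delivered': 850, 'dropped': 550, 'pending': 0}

# Buffered pipeline: a large durable buffer (such as Kafka) absorbs the spike.
print(simulate_pipeline(arrivals, sink_capacity=200, buffer_capacity=10_000))
# {'delivered': 1000, 'dropped': 0, 'pending': 400}
```

In the buffered run nothing is lost; the spike shows up as a backlog that drains once ingest returns to normal, at the cost of delayed visibility.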

Reference Links¶

Elasticsearch Reference -- index management, mappings, queries, aggregations, and cluster administration
OpenSearch Documentation -- OpenSearch-specific features, plugins, security configuration, and migration guides
Elasticsearch Sizing Guide -- shard sizing, heap sizing, and node configuration best practices
Elastic Observability Guide -- ELK stack deployment for logs, metrics, APM, and uptime monitoring
OpenSearch Benchmark -- performance testing and cluster sizing methodology