# Open-Source Observability Stack (Prometheus, Grafana, Loki, AlertManager)

## Scope
Open-source observability stack: Prometheus (retention, sizing, recording rules, federation), AlertManager (routing, grouping, silencing), Grafana (provisioning, authentication, dashboards-as-code), Loki (label design, retention), Promtail/Alloy agents, node_exporter, blackbox_exporter, and Ceph monitoring integration.
## Checklist
- [Critical] Is Prometheus retention configured appropriately for the environment (15d default is often insufficient for capacity planning; extend to 30-90d for local, or configure remote write to Thanos/Cortex/Mimir for long-term storage)?
- [Critical] Is Prometheus sized correctly for the active time series count -- each active series consumes ~1-2 KB of RAM, so 1M active series requires ~2-4 GB RAM for ingestion alone, plus query overhead?
- [Critical] Are alerting rules defined for infrastructure essentials (node down, disk >85%, memory >90%, certificate expiry <30d) and routed through AlertManager to appropriate on-call channels (PagerDuty, Slack, email)?
- [Critical] Is AlertManager configured with proper routing tree, grouping (group_by: [alertname, cluster]), group_wait (30s), group_interval (5m), and repeat_interval (4h) to prevent alert storms?
- [Recommended] Are recording rules created for frequently queried expensive expressions (e.g., pre-compute `rate(http_requests_total[5m])` into `job:http_requests:rate5m`) to reduce query-time CPU load?
- [Recommended] Is Grafana provisioning configured for dashboards-as-code (JSON/YAML in Git, deployed via provisioning directory or Grafana API) to prevent dashboard drift and enable version control?
- [Recommended] Is Grafana authentication integrated with the organization's identity provider (LDAP, OIDC via Keycloak/Okta/Entra ID, or SAML) rather than relying on local accounts?
- [Recommended] Is Loki label design reviewed to avoid high-cardinality labels (never use user_id, request_id, or IP as labels -- these should be structured log fields queried with LogQL filters)?
- [Recommended] Are Promtail or Grafana Alloy agents deployed on all hosts with appropriate pipeline stages to parse log formats, extract structured fields, and attach environment/service labels?
- [Optional] Is Prometheus federation configured for multi-cluster environments, with a global Prometheus scraping aggregated metrics from cluster-level Prometheus instances?
- [Recommended] If Ceph storage is in scope, is the Ceph Prometheus module scraped by this Prometheus instance? (cephadm deploys its own Prometheus/Grafana by default — decide whether to use the central stack instead to avoid duplicate infrastructure. Scrape `ceph-exporter` at port 9283, import Ceph Grafana dashboards from ceph-mixins. See Ceph storage.)
- [Optional] Is Loki retention configured with table_manager or compactor retention (e.g., 30d for application logs, 90d for audit/security logs) to manage storage growth?
- [Optional] Are Grafana plugins installed for specialized data sources (e.g., Elasticsearch, Zabbix, SNMP) or visualization needs (flowchart, diagram panels)?
- [Recommended] Is a node_exporter deployed on every Linux host and windows_exporter on every Windows host, with blackbox_exporter probing external endpoints (HTTP, TCP, ICMP, DNS)?
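The recording-rule and infrastructure-alert items above can be sketched as a single Prometheus rules file. This is a minimal illustration, not a drop-in configuration: the job names, thresholds, and alert names are assumptions to be adapted to the environment.

```yaml
# rules.yml -- illustrative sketch; job names, thresholds, and alert names
# are assumptions. Load via rule_files in prometheus.yml.
groups:
  - name: recording
    rules:
      # Pre-compute an expensive expression so dashboards query the cached series
      - record: job:http_requests:rate5m
        expr: rate(http_requests_total[5m])
  - name: infrastructure
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
      - alert: DiskSpaceLow
        # Less than 15% free space == more than 85% full (checklist threshold)
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
           / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is over 85% full"
```

Validate rule files with `promtool check rules rules.yml` before reloading Prometheus.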
## Why This Matters
Commercial monitoring solutions (Datadog, New Relic, Splunk) carry per-host or per-GB pricing that becomes prohibitive at scale in on-prem environments -- a 200-node deployment with Datadog can easily exceed $100K/yr. The Prometheus/Grafana/Loki stack provides equivalent capabilities (metrics, dashboards, alerting, log aggregation) at zero licensing cost, with the trade-off of operational responsibility.

However, this stack requires deliberate sizing and configuration. Prometheus is a single-node, in-memory time-series database by design -- it does not cluster natively, and running out of RAM causes OOM kills that create monitoring blackouts during the exact incidents you need visibility into.

Loki's label-based architecture is fundamentally different from Elasticsearch's full-text indexing; misunderstanding this leads to either massive index bloat (too many labels) or unusable query performance (no labels, grep through everything).

AlertManager's routing tree is the single most impactful configuration for on-call experience -- misconfigured grouping causes either alert floods (hundreds of individual alerts during an outage) or silently dropped alerts.
## Common Decisions (ADR Triggers)
- Long-term storage backend -- Thanos (sidecar pattern, object storage, globally queryable, battle-tested at scale), Cortex (multi-tenant, horizontally scalable, complex to operate), Grafana Mimir (Cortex successor by Grafana Labs, simplified deployment, enterprise features), or VictoriaMetrics (single binary, high performance, drop-in Prometheus replacement with built-in long-term storage). Thanos is the most widely deployed; Mimir is gaining adoption rapidly. VictoriaMetrics is simplest operationally if you do not need multi-tenancy.
- Grafana OSS vs Grafana Cloud -- Self-hosted Grafana is free but requires infrastructure and operational effort. Grafana Cloud provides managed Prometheus (Mimir), Loki, and Grafana with a generous free tier (10K metrics series, 50GB logs/mo) and per-usage pricing. For teams without deep observability expertise, Grafana Cloud eliminates the operational burden at ~$8/user/mo plus usage. On-prem agents (Grafana Alloy) can remote-write to Grafana Cloud.
- Log aggregation: Loki vs Elasticsearch/OpenSearch -- Loki is cheaper to operate (indexes only labels, not log content) and integrates natively with Grafana, but LogQL is less powerful than Elasticsearch KQL for complex full-text search. Elasticsearch is better for security/SIEM use cases (correlating across diverse log sources) but requires significant memory and storage (plan 1 GB RAM per 1 TB indexed data). Choose Loki for operational logs, Elasticsearch for security analytics.
- Agent: Promtail vs Grafana Alloy vs OpenTelemetry Collector -- Promtail is Loki-specific (simple, reliable). Grafana Alloy (formerly Grafana Agent) is a unified agent that can scrape Prometheus metrics, collect logs (replacing Promtail), and receive OpenTelemetry traces. OpenTelemetry Collector is vendor-neutral and supports multiple backends. If using only the Grafana stack, Alloy simplifies to one agent per host.
- Deployment: VMs vs containers -- Prometheus and Grafana run well on VMs (systemd services) or in containers (Docker Compose, Kubernetes). For on-prem without Kubernetes, VM deployment with Ansible/Puppet is simpler. For Kubernetes environments, use the kube-prometheus-stack Helm chart, which deploys Prometheus Operator, Grafana, AlertManager, and node_exporter with sensible defaults.
- Ceph monitoring integration -- cephadm-managed Ceph clusters deploy their own Prometheus, Grafana, Alertmanager, and Node Exporter by default (skip with `--skip-monitoring-stack`). When a centralized Prometheus/Grafana stack exists, this creates duplicate infrastructure and split dashboards. Options: (1) disable Ceph's built-in monitoring and scrape `ceph-exporter` from the central Prometheus using cephadm's service discovery endpoint (`https://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter`), then import Ceph dashboards from ceph-mixin into the central Grafana; (2) federate Ceph's Prometheus into the central instance; (3) keep separate stacks for storage team autonomy. Rook-managed Ceph (Kubernetes) does not deploy its own stack — it exposes `ServiceMonitor` CRDs for Prometheus Operator. See Ceph storage for full configuration details.
- AlertManager receivers -- PagerDuty for critical/P1 (phone call escalation), Slack/Teams for warning/P2-P3 (chat notification), email for informational. Define escalation paths: if P1 is not acknowledged within 15 minutes, escalate to secondary on-call. Use inhibition rules so that "host down" suppresses all service alerts on that host.
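The receiver and inhibition decisions above can be sketched as an `alertmanager.yml`, using the grouping parameters from the checklist. Receiver names, channel names, and integration keys are placeholders, not values from the source.

```yaml
# alertmanager.yml -- illustrative routing tree; receiver names, channels,
# and integration keys are placeholders.
route:
  receiver: slack-warnings          # default for anything unmatched
  group_by: [alertname, cluster]
  group_wait: 30s                   # wait for related alerts before first notification
  group_interval: 5m                # batch new alerts joining an existing group
  repeat_interval: 4h               # re-notify while an alert keeps firing
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
    - matchers:
        - severity="info"
      receiver: email-info

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: "#alerts"
  - name: email-info
    email_configs:
      - to: ops@example.com   # requires global SMTP settings, omitted here

# "host down" suppresses all service alerts on the same instance
inhibit_rules:
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity=~"warning|info"
    equal: [instance]
```

Check the file with `amtool check-config alertmanager.yml` before deploying.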
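Option (1) from the Ceph bullet above can be sketched as a scrape job in the central Prometheus, pointed at cephadm's HTTP service-discovery endpoint. The `<mgr-ip>` placeholder comes from the source; the TLS handling shown is an assumption to be replaced with the mgr's CA certificate.

```yaml
# prometheus.yml fragment -- illustrative; <mgr-ip> and TLS settings are placeholders.
scrape_configs:
  - job_name: ceph-exporter
    http_sd_configs:
      - url: https://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter
        tls_config:
          insecure_skip_verify: true   # replace with the mgr CA in production
```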
## Stack Sizing Guidelines

### Small Environment (<50 VMs)
| Component | CPU | RAM | Storage | Notes |
|---|---|---|---|---|
| Prometheus | 2 vCPU | 4 GB | 100 GB SSD | ~50K active series, 30d retention |
| Grafana | 1 vCPU | 2 GB | 10 GB | SQLite backend sufficient |
| Loki | 2 vCPU | 4 GB | 200 GB | Filesystem chunk store |
| AlertManager | 1 vCPU | 512 MB | 1 GB | Co-locate with Prometheus |
| Total | 6 vCPU | 10.5 GB | 311 GB | Can co-locate on 1-2 VMs |
### Medium Environment (50-500 VMs)
| Component | CPU | RAM | Storage | Notes |
|---|---|---|---|---|
| Prometheus | 4 vCPU | 16 GB | 500 GB SSD | ~500K active series, 30d retention |
| Grafana | 2 vCPU | 4 GB | 20 GB | PostgreSQL backend for HA |
| Loki | 4 vCPU | 8 GB | 1 TB | S3/MinIO chunk store recommended |
| AlertManager | 1 vCPU | 1 GB | 5 GB | Clustered (2-3 instances) |
| Total | 11 vCPU | 29 GB | 1.5 TB | Separate VMs per component |
### Large Environment (500+ VMs)
| Component | CPU | RAM | Storage | Notes |
|---|---|---|---|---|
| Prometheus | 8+ vCPU | 32-64 GB | 1+ TB NVMe | ~2M+ active series, remote write to Thanos/Mimir |
| Thanos/Mimir | 8+ vCPU | 16-32 GB | Object storage (S3/MinIO) | Handles long-term queries |
| Grafana | 4 vCPU | 8 GB | 50 GB | HA pair with PostgreSQL, caching proxy |
| Loki | 8+ vCPU | 16-32 GB | Object storage (S3/MinIO) | Distributed mode (read/write/backend) |
| AlertManager | 2 vCPU | 2 GB | 10 GB | 3-node cluster |
| Total | 30+ vCPU | 74-138 GB | 1+ TB local + object storage | Consider dedicated monitoring cluster |
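The Prometheus RAM rows in these tables follow from the rule of thumb in the checklist (~1-2 KB of head memory per active series, before query overhead). A minimal sketch of that arithmetic, assuming the 2 KB upper bound:

```python
def ingestion_ram_gb(active_series: int, bytes_per_series: int = 2048) -> float:
    """TSDB head memory for active series alone.

    Excludes query execution, WAL replay, and page cache -- the reason the
    sizing tables above provision several times this baseline.
    """
    return active_series * bytes_per_series / 1024**3

# 1M series lands near the checklist's ~2-4 GB ingestion band; production
# deployments add headroom for queries and restart/WAL-replay spikes.
for series in (50_000, 500_000, 2_000_000):
    print(f"{series:>9,} series -> {ingestion_ram_gb(series):5.2f} GB head memory")
```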
## Key Configuration Patterns

### Prometheus Cardinality Management
High cardinality is the primary cause of Prometheus performance issues. Monitor with:
- `prometheus_tsdb_head_series` -- total active series (alert if growing unexpectedly)
- `topk(10, count by (__name__)({__name__=~".+"}))` -- top 10 metric names by series count
- Use `metric_relabel_configs` to drop unused labels or entire metrics at scrape time
- Set `sample_limit` per scrape job to prevent a single target from exploding cardinality
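The last two bullets can be sketched as a scrape-job fragment. The metric family and label being dropped are illustrative assumptions; pick candidates from the `topk` cardinality query above.

```yaml
# prometheus.yml fragment -- illustrative; dropped metric/label names are assumptions.
scrape_configs:
  - job_name: app
    sample_limit: 50000              # the whole scrape fails if a target exceeds this
    static_configs:
      - targets: ["app01:9100"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*"            # drop an unused metric family entirely
        action: drop
      - regex: "pod_template_hash"   # strip a high-cardinality label from all series
        action: labeldrop
```

`metric_relabel_configs` runs after the scrape but before ingestion, so dropped series never consume head memory.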
### Loki Label Design
Good labels (low cardinality):

```logql
{job="nginx", env="production", cluster="dc1"}
```

Bad labels (high cardinality -- DO NOT USE):

```logql
{job="nginx", user_id="12345", request_id="abc-def-ghi"}
```

Query high-cardinality fields with LogQL filter expressions instead:

```logql
{job="nginx"} |= "user_id=12345" | json | line_format "{{.message}}"
```
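A Promtail scrape config matching this label design would attach only the static, low-cardinality labels and leave per-request fields in the log line for query-time filtering. This is a sketch; the log path, label values, and field names are assumptions.

```yaml
# promtail-config.yml fragment -- illustrative; paths and label values are assumptions.
scrape_configs:
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          env: production
          cluster: dc1
          __path__: /var/log/nginx/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
      - labels:
          level:   # promote only low-cardinality extracted fields to labels
      # user_id / request_id stay inside the log line; query them with
      # |= filters or | json at read time, as shown above
```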
## Reference Architectures
- Prometheus documentation: prometheus.io/docs -- scrape configuration, recording rules, remote write, federation, and operational best practices
- Thanos project: thanos.io -- sidecar, store gateway, compactor, and querier architecture for global view across Prometheus instances
- Grafana Mimir: grafana.com/docs/mimir -- horizontally scalable long-term storage, drop-in Prometheus remote write target
- Grafana Loki: grafana.com/docs/loki -- architecture overview, deployment modes (monolithic, simple scalable, microservices), and LogQL reference
- kube-prometheus-stack Helm chart: github.com/prometheus-community/helm-charts -- production-ready Kubernetes deployment with Prometheus Operator
- Awesome Prometheus alerts: samber/awesome-prometheus-alerts -- community-curated alerting rules for common infrastructure and applications
- Grafana dashboards: grafana.com/grafana/dashboards -- community dashboards (Node Exporter Full: ID 1860, Kubernetes: ID 315)
## See Also
- `general/observability.md` -- general observability patterns
- `providers/kubernetes/observability.md` -- Kubernetes-specific Prometheus deployment
- `providers/ceph/storage.md` -- Ceph Prometheus module and exporter configuration