Kubernetes Observability

Scope

Kubernetes observability: Prometheus deployment, metrics-server, kube-state-metrics, logging stack selection (Fluent Bit, Loki, Elasticsearch), distributed tracing (OpenTelemetry), event collection, Grafana dashboards, Alertmanager, and long-term metric storage (Thanos, Mimir).

Checklist

Why This Matters

Kubernetes adds layers of abstraction (pods, services, controllers, operators) that obscure what is happening at the infrastructure and application level. Without purpose-built observability, common failure modes are invisible: pods stuck in CrashLoopBackOff on a Friday night, PVCs pending due to zone mismatch, memory leaks causing OOM kills, DNS resolution failures causing intermittent 5xx errors. The three pillars (metrics, logs, traces) each address different debugging needs. Metrics detect problems (CPU spike, error rate increase). Logs provide context (stack traces, error messages). Traces identify bottlenecks across distributed services. Kubernetes-specific metrics (kube-state-metrics, etcd metrics, kubelet metrics) are essential for understanding cluster health independently of application health.

Common Decisions (ADR Triggers)

Prometheus vs cloud-native monitoring (CloudWatch, Cloud Monitoring, Azure Monitor): Prometheus is the Kubernetes standard with deep integration (ServiceMonitor CRDs, native metric format). Cloud-native monitoring is simpler to operate but often has Kubernetes metric gaps, higher cost at scale, and weaker PromQL support. Use Prometheus for Kubernetes-native monitoring; cloud-native for infrastructure outside Kubernetes. Many teams run both.
Loki vs Elasticsearch for log storage: Loki indexes only labels (cheaper, simpler), so every query must start with a label selector and any filtering on log content is a scan over the selected streams. Elasticsearch indexes full text (richer search) but is expensive to operate and scale. Loki is the default choice for Kubernetes-native logging (pairs with Grafana). Elasticsearch is better when full-text search across unstructured logs is a primary use case or when the team already has an ELK investment.
Fluent Bit vs Fluentd vs Vector: Fluent Bit is lightweight (C, ~15MB memory) and optimized for Kubernetes log collection. Fluentd (Ruby) has a richer plugin ecosystem but higher resource usage. Vector (Rust) offers high performance and a unified pipeline for logs and metrics. Use Fluent Bit as the default Kubernetes log collector; Vector for high-throughput or unified pipelines.
Jaeger vs Tempo for trace storage: Jaeger is mature, ships a full UI, and supports multiple storage backends (Elasticsearch, Cassandra), with Kafka available as an ingestion buffer in front of storage. Tempo is Grafana's trace backend: it pairs with Grafana for visualization and uses object storage for cost-effective retention. Choose Tempo for Grafana-centric stacks; Jaeger for teams that need the Jaeger UI or already run Elasticsearch/Cassandra.
Prometheus local storage vs Thanos/Mimir: Local Prometheus retention is limited by disk, and the data survives pod restarts only when it sits on a persistent volume; even then it is tied to a single replica. Thanos adds long-term storage (S3/GCS), multi-cluster querying, and downsampling. Mimir (from Grafana Labs) provides similar capabilities with a different architecture, ingesting through Prometheus remote write rather than sidecars. Use local Prometheus for single small clusters; Thanos/Mimir for multi-cluster, long-term retention, or high-availability Prometheus.
OpenTelemetry Collector vs purpose-built agents: OTel Collector handles metrics, logs, and traces in a single agent, reducing DaemonSet proliferation, but it is more complex to configure and less mature than purpose-built tools (Fluent Bit for logs, Prometheus for metrics). Use OTel Collector when standardizing on OpenTelemetry across the stack; purpose-built agents for simpler deployments.
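
The Loki vs Elasticsearch trade-off above comes down to what gets indexed. The following toy sketch contrasts a label-only index with a full-text index over the same records. It is an illustration of the idea only, not how either system is implemented; the `LOGS` records and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical log records: a label set plus a message body.
LOGS = [
    ({"namespace": "shop", "app": "cart"}, "timeout calling payments"),
    ({"namespace": "shop", "app": "cart"}, "request served in 12ms"),
    ({"namespace": "shop", "app": "payments"}, "card declined: timeout from bank"),
]


def build_label_index(logs):
    """Label-only index: message bodies are stored but never indexed."""
    index = defaultdict(set)
    for i, (labels, _line) in enumerate(logs):
        for pair in labels.items():
            index[pair].add(i)
    return index


def build_fulltext_index(logs):
    """Full-text index: every token of every message becomes an index key."""
    index = defaultdict(set)
    for i, (_labels, line) in enumerate(logs):
        for token in line.lower().replace(":", " ").split():
            index[token].add(i)
    return index


def label_query(logs, index, selector, needle):
    """Select records by label (selector must be non-empty), then scan
    only the selected lines for the substring."""
    ids = set.intersection(*(index[pair] for pair in selector.items()))
    return sorted(i for i in ids if needle in logs[i][1])


label_index = build_label_index(LOGS)
text_index = build_fulltext_index(LOGS)
```

In this sketch the label index stays small (one key per distinct label pair) while the full-text index grows with the vocabulary of the logs. A call such as `label_query(LOGS, label_index, {"app": "cart"}, "timeout")` has to scan the selected lines, whereas `text_index["timeout"]` answers directly from the index.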

Reference Architectures

Full Observability Stack (Grafana-Centric)

[Application Pods]
  - OTel SDK (traces + metrics)          [kube-state-metrics]  [metrics-server]
  - stdout/stderr (logs)                        |                     |
        |                                       |                     |
  [Fluent Bit DaemonSet] --> [Loki]       [Prometheus]          [HPA/VPA/kubectl top]
  (log collection)           (log store)   (scrape targets)
        |                       |               |
        +-------+-------+-------+-------+-------+
                |                               |
          [Grafana]                       [Alertmanager]
          - Log dashboards (Loki)         - PagerDuty
          - Metric dashboards (Prometheus) - Slack
          - Trace visualization (Tempo)    - OpsGenie
          - SLO dashboards (Pyrra)
                |
          [OTel Collector] --> [Tempo]
          (receive traces)     (trace store, S3 backend)

Fluent Bit runs as a DaemonSet, collects container logs on each node, and forwards them to Loki. Prometheus scrapes metrics from pods (discovered via ServiceMonitor resources), kube-state-metrics, and node-exporter. metrics-server sits outside this pipeline and feeds HPA, VPA, and kubectl top. OTel Collector receives traces from application SDKs and forwards them to Tempo. Grafana provides unified dashboards across all three signals. Alertmanager routes alerts based on severity and team ownership.
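
Because the log path in this architecture starts at container stdout/stderr, the application side only needs to write one parseable line per event. A minimal sketch using the Python standard library follows; the field names (`level`, `logger`, `message`, `trace_id`) and the helper `make_logger` are assumptions chosen for illustration, not a schema required by Fluent Bit or Loki.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical correlation field, supplied via extra={"trace_id": ...}.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["trace_id"] = trace_id
        return json.dumps(payload)


def make_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stdout by default)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any handlers left from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate lines via the root logger
    return logger
```

Usage might look like `make_logger("checkout").info("order placed", extra={"trace_id": "abc123"})`. Carrying a trace identifier in the log line is one way to link a log entry to its trace in Grafana, assuming the application has access to the current trace ID.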

Multi-Cluster Monitoring with Thanos

[Cluster A]                    [Cluster B]
+-------------------+          +-------------------+
| Prometheus        |          | Prometheus        |
| + Thanos Sidecar  |          | + Thanos Sidecar  |
| (upload to S3)    |          | (upload to S3)    |
+-------------------+          +-------------------+
         |                              |
         +---------- [S3 Bucket] -------+
                         |
              +----------+-----------------+
              |                            |
       [Thanos Store Gateway]       [Thanos Compactor]
       (reads from S3)              (downsampling, retention)
              |
       [Thanos Querier]
       (federated PromQL)
              |
       [Grafana]
       (multi-cluster dashboards)

Each cluster runs Prometheus with a Thanos sidecar that uploads TSDB blocks to S3. Thanos Store Gateway serves historical data from S3. Thanos Querier provides a unified PromQL endpoint across all clusters and time ranges, reading recent data from the sidecars and older data from the Store Gateway. Thanos Compactor works directly on the bucket rather than in the query path, handling downsampling (5m and 1h resolution for old data) and retention enforcement. Grafana connects to Thanos Querier for multi-cluster visibility.
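
To make the downsampling step concrete, the sketch below groups raw samples into fixed 5-minute windows and keeps a few aggregates per window. It is a simplified illustration of the idea, not Thanos's block format or its actual aggregation logic; the `Aggregate` type and the `downsample` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Aggregate:
    start: int       # window start, in seconds
    count: int       # number of raw samples in the window
    total: float     # sum of sample values
    minimum: float
    maximum: float


def downsample(samples, window_seconds=300):
    """Group (timestamp_seconds, value) samples into fixed-size windows."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % window_seconds  # align to the window boundary
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = Aggregate(start, 1, value, value, value)
        else:
            agg.count += 1
            agg.total += value
            agg.minimum = min(agg.minimum, value)
            agg.maximum = max(agg.maximum, value)
    # Return windows in time order.
    return [buckets[start] for start in sorted(buckets)]
```

Keeping count, sum, min, and max per window means averages and extremes over long ranges can still be derived from far fewer points than the raw series holds, which is what makes year-long dashboard queries affordable.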

SLO-Based Alerting

[SLI Definition]
  - Request latency p99 < 500ms
  - Error rate < 0.1%
  - Availability > 99.9%
        |
  [Pyrra / Sloth]
  - Generates PrometheusRule CRs
  - Multi-window burn-rate alerts
  - 1h/6h windows for fast burn
  - 3d window for slow burn (30d SLO period)
        |
  [Alertmanager]
  - Fast burn (>14.4x) --> Page (PagerDuty)
  - Slow burn (>1x)    --> Ticket (Jira/Slack)
        |
  [Grafana SLO Dashboard]
  - Error budget remaining
  - Burn rate trend
  - Time to budget exhaustion

SLO-based alerting replaces threshold-based alerts (CPU > 80%) with business-meaningful alerts (error budget burning too fast). Multi-window burn rates detect both sudden spikes (fast burn: alert in minutes) and gradual degradation (slow burn: alert in hours). This reduces alert fatigue by tying alerts to user-facing impact rather than infrastructure metrics.
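
The burn-rate thresholds above follow from simple arithmetic. A burn rate of 1x spends the error budget exactly over the SLO period; 14.4x corresponds to spending 2% of a 30-day budget within one hour (0.02 x 720 h / 1 h). The sketch below works through that calculation. The 2% figure and the 30-day period are assumptions taken from common burn-rate practice rather than from the text above, and the function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def burn_rate_threshold(budget_fraction, window_hours, period_hours=30 * 24):
    """Burn rate that spends `budget_fraction` of the budget within the window."""
    return budget_fraction * period_hours / window_hours


def should_alert(long_window_ratio, short_window_ratio, slo_target, threshold):
    """Multi-window check: both the long and the short window must exceed
    the threshold, so the alert clears soon after the errors stop."""
    return (
        burn_rate(long_window_ratio, slo_target) > threshold
        and burn_rate(short_window_ratio, slo_target) > threshold
    )


def hours_to_exhaustion(rate, budget_remaining=1.0, period_hours=30 * 24):
    """Hours until the remaining budget is gone at a constant burn rate."""
    return budget_remaining * period_hours / rate
```

For a 99.9% SLO, a sustained 2% error ratio gives a burn rate of about 20x, which exceeds the 14.4x fast-burn threshold and would exhaust a full 30-day budget in roughly 36 hours. A 0.1% error ratio gives about 1x, which is the slow-burn boundary.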

Reference Links

Kubernetes Monitoring Architecture -- metrics-server, resource metrics pipeline, and custom metrics
Logging Architecture -- container logging patterns, node-level logging agents, and sidecar approaches
Kubernetes Dashboard -- web-based cluster monitoring and management UI