OpenStack Observability (Monitoring, Metering, Logging, Testing)

Scope

Covers the OpenStack observability stack: Ceilometer telemetry, the Gnocchi metrics backend, Aodh alarming, Monasca monitoring-as-a-service, centralized logging (ELK/Loki), Prometheus integration, CloudKitty rating, OSProfiler/OpenTelemetry tracing, Rally benchmarking, and Tempest functional testing.

Checklist

Why This Matters

OpenStack is composed of dozens of services communicating via REST APIs and message queues -- without comprehensive observability, failures cascade silently and root cause analysis becomes guesswork. Ceilometer polling intervals directly affect autoscaling responsiveness: a 600-second polling interval adds up to 10 minutes of detection latency to every scaling event, before metric aggregation and alarm evaluation are counted. Gnocchi archive policy misconfiguration causes either excessive storage consumption or loss of granular historical data. RabbitMQ is the nervous system of OpenStack -- queue congestion in one service (e.g., Neutron agent heartbeats) can starve other services of message delivery. Log volume from a medium OpenStack deployment (50 compute nodes) can exceed 10 GB/day, requiring a deliberate retention and aggregation strategy. Rally and Tempest are not optional extras -- they are the only reliable way to validate that an upgrade or configuration change has not introduced a performance regression or functional breakage.
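The effect of the polling interval on scaling latency can be estimated with simple arithmetic. The sketch below uses a deliberately simplified model, which is an assumption and not Aodh's exact evaluation algorithm: the load change just misses a poll, the alarm needs a number of full aggregation windows, and the evaluator runs one interval later. The function and parameter names are hypothetical.

```python
def worst_case_detection_latency(polling_interval_s, granularity_s,
                                 evaluation_periods, evaluator_interval_s):
    """Upper bound, in seconds, between a load change and the alarm firing.

    Simplified model (an assumption, not Aodh's exact algorithm):
      - the change happens right after a poll, so it waits one full polling interval
      - the alarm needs `evaluation_periods` complete aggregation windows
      - the alarm evaluator runs up to one evaluator interval after that
    """
    return (polling_interval_s
            + granularity_s * evaluation_periods
            + evaluator_interval_s)


# Compare a few polling intervals, with the aggregation granularity set equal
# to the polling interval and a 60-second evaluator interval (example values).
for polling in (600, 300, 60):
    total = worst_case_detection_latency(polling, granularity_s=polling,
                                         evaluation_periods=1,
                                         evaluator_interval_s=60)
    print(f"polling={polling:>3}s -> worst case about {total / 60:.1f} min")
```

Under these assumptions the polling interval is only one term in the sum, so shortening it helps only if the aggregation granularity and alarm evaluation settings are tightened to match.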

Common Decisions (ADR Triggers)

Metrics stack -- Ceilometer/Gnocchi/Aodh (native OpenStack, integrated with Heat autoscaling) vs Prometheus/Grafana (industry standard, large ecosystem, not OpenStack-native); Monasca is retired and no longer a candidate for new deployments -- integration depth vs ecosystem breadth
Logging stack -- ELK/OpenSearch (full-text search, powerful, resource-heavy) vs Loki/Grafana (label-indexed, lightweight, pairs with Prometheus) vs Graylog (structured logging, built-in alerting) -- query capabilities vs resource footprint
Log shipping -- Fluentd (flexible, plugin-rich, Ruby) vs Fluent Bit (lightweight C, low memory) vs Filebeat (Elastic ecosystem) -- footprint on compute nodes and destination compatibility
Alerting pipeline -- Aodh alarms to Heat webhooks (native autoscaling) vs Prometheus Alertmanager (routing, silencing, grouping) vs PagerDuty/OpsGenie integration (incident management) -- operational workflow integration
Performance testing -- Rally continuous benchmarking (proactive capacity insights) vs Rally only pre-upgrade (reactive validation) vs no performance testing (risk of undetected degradation) -- operational maturity level
Functional validation -- Tempest in CI/CD (every change validated) vs Tempest post-upgrade only (periodic validation) vs manual smoke tests (lowest coverage) -- deployment confidence vs pipeline complexity
Tracing -- OpenTelemetry with Jaeger (industry standard, recommended for new deployments) vs OSProfiler (native but maintenance mode) vs no tracing (log correlation only) -- debugging capability vs implementation effort
Metering for billing -- Ceilometer + CloudKitty (rating engine for chargeback/showback) vs custom metering from Gnocchi/Prometheus (flexible but build-it-yourself) vs external billing platforms -- internal chargeback requirements

OpenStack Exporter Configuration

The openstack-exporter is a Go-based Prometheus exporter that queries OpenStack APIs and exposes metrics for Nova, Neutron, Cinder, Glance, Keystone, Octavia, Heat, Swift, and 10+ other services.

Deployment options: Docker, Snap, Kolla-Ansible, Helm chart, or binary.

Authentication uses standard clouds.yaml:

clouds:
  mycloud:
    auth:
      auth_url: https://keystone.example.com:5000/v3
      project_name: admin
      username: monitoring
      password: <from-secret>
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

Prometheus scrape config for single-cloud mode (port 9180):

- job_name: 'openstack'
  scrape_interval: 120s  # OpenStack API calls are slow; avoid overloading
  scrape_timeout: 60s
  static_configs:
  - targets: ['openstack-exporter:9180']

Multi-cloud mode (scrape the /probe endpoint and select the cloud with the cloud parameter):

- job_name: 'openstack'
  scrape_interval: 120s
  metrics_path: /probe
  params:
    cloud: ['mycloud']
  static_configs:
  - targets: ['openstack-exporter:9180']
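Before wiring the exporter into Prometheus, it can help to request the endpoint by hand and check that metrics come back. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the sample text is illustrative rather than captured exporter output, and the parser assumes lines carry no trailing timestamp.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def probe_url(exporter, cloud):
    """Build the /probe URL that the multi-cloud scrape config resolves to."""
    return f"http://{exporter}/probe?{urlencode({'cloud': cloud})}"


def parse_samples(text):
    """Parse Prometheus text exposition into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Assumes no trailing timestamp on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_samples(url, timeout_s=60):
    """Fetch and parse one scrape; the timeout mirrors scrape_timeout."""
    with urlopen(url, timeout=timeout_s) as resp:
        return parse_samples(resp.read().decode("utf-8"))


# Illustrative sample only; real output depends on the cloud being queried.
SAMPLE = """\
# HELP openstack_nova_total_vms total_vms
# TYPE openstack_nova_total_vms gauge
openstack_nova_total_vms 42
openstack_nova_agent_state{hostname="compute 01",service="nova-compute"} 1
"""

print(probe_url("openstack-exporter:9180", "mycloud"))
print(parse_samples(SAMPLE))
```

Calling `fetch_samples(probe_url("openstack-exporter:9180", "mycloud"))` against a running exporter would return the same kind of dictionary. An empty result or a timeout is a hint to check clouds.yaml and the API endpoints before blaming Prometheus.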

Performance tuning for large deployments:
- Enable caching (--cache, --cache-ttl 300s) to reduce API calls
- Disable slow metrics that query per-instance diagnostics: --disable-metric nova_server_diagnostics
- Use domain filtering to limit scope when only monitoring specific projects
- Set scrape_interval to 120s+ (OpenStack API calls are expensive compared to typical exporters)
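A rough way to reason about these settings is to count how often the exporter walks the OpenStack APIs per hour. The sketch below rests on an assumed model that should be checked against the exporter version in use: without caching, every scrape triggers a full collection, and with caching the exporter refreshes about once per TTL no matter how many Prometheus servers scrape it. The function name is hypothetical.

```python
def collection_cycles_per_hour(scrape_interval_s, prometheus_replicas=1,
                               cache_ttl_s=None):
    """Estimate full API collection cycles per hour for one exporter.

    Assumed model (verify against your exporter version):
      - no cache: each scrape from each Prometheus replica hits the APIs
      - cache enabled: one refresh per TTL, independent of scrape traffic
    """
    if cache_ttl_s is not None:
        return 3600 / cache_ttl_s
    return prometheus_replicas * 3600 / scrape_interval_s


# Two Prometheus replicas scraping every 120 s, with and without a 300 s cache.
print(collection_cycles_per_hour(120, prometheus_replicas=2))
print(collection_cycles_per_hour(120, prometheus_replicas=2, cache_ttl_s=300))
```

Under this model, an HA pair of Prometheus servers doubles the API load unless caching is enabled, which is why caching matters more as the monitoring setup itself grows.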

Recommended exporter stack for a full OpenStack deployment:

Exporter	Port	Metrics
openstack-exporter	9180	API-level: servers, volumes, networks, images, quotas
node_exporter	9100	Host: CPU, memory, disk, network on all nodes
libvirt_exporter	9177	Hypervisor: per-VM CPU, memory, disk I/O, network I/O
ceph_exporter	9283	Storage: OSD latency, PG states, pool utilization (see Ceph storage)
ipmi_exporter	9290	Hardware: temperatures, fan speeds, power draw, hardware events
rabbitmq_exporter	9419	Messaging: queue depth, message rates, consumer counts
mysqld_exporter / postgres_exporter	9104/9187	Database: query latency, connections, replication lag
haproxy_exporter	9101	Load balancer: request rates, backend health, connection counts
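Per-host exporters from the table are easier to manage with file-based service discovery than with hand-written target lists. As a sketch, the snippet below generates target groups in the JSON shape that Prometheus file_sd accepts, using the node_exporter and libvirt_exporter ports from the table. The host names, the `exporter` label, and the split between controllers and computes are illustrative assumptions.

```python
import json

# Ports copied from the table above.
ALL_NODE_EXPORTERS = {"node_exporter": 9100}
COMPUTE_ONLY_EXPORTERS = {"libvirt_exporter": 9177}


def file_sd_groups(controllers, computes):
    """Return file_sd target groups: one group per exporter type."""
    groups = []
    for name, port in ALL_NODE_EXPORTERS.items():
        groups.append({
            "targets": [f"{host}:{port}" for host in controllers + computes],
            "labels": {"exporter": name},  # hypothetical label name
        })
    for name, port in COMPUTE_ONLY_EXPORTERS.items():
        groups.append({
            "targets": [f"{host}:{port}" for host in computes],
            "labels": {"exporter": name},
        })
    return groups


# Hypothetical inventory; in practice this would come from the deployment tool.
groups = file_sd_groups(["ctl-01"], ["cmp-01", "cmp-02"])
print(json.dumps(groups, indent=2))
```

Writing the output to a file referenced by a `file_sd_configs` entry lets Prometheus pick up added or removed compute nodes without a configuration reload.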

Version Notes

Feature	Pike (16) Aug 2017	Queens (17) Feb 2018	Rocky (18) Aug 2018	Stein (19) Apr 2019	Train (20) Oct 2019	Ussuri (21) May 2020	Victoria (22) Oct 2020	Wallaby (23) Apr 2021	Xena (24) Oct 2021	Yoga (25) Mar 2022	Zed (26) Oct 2022	2023.1 Antelope (27)	2023.2 Bobcat (28)	2024.1 Caracal (29)	2024.2 Dalmatian (30)	2025.1 Epoxy (31)	2025.2 Flamingo (32)
Ceilometer polling agent	GA	GA	GA (polling/notification split)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Ceilometer notification agent	GA	GA	GA (primary collection method)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Ceilometer pipeline refactoring	Monolithic	Polling + notification separated	GA (separate processes)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Gnocchi (metrics backend)	In-tree (integrated)	Independent project	Independent (GA)	GA (4.x)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Gnocchi storage drivers	File, Ceph, Swift	File, Ceph, Swift, S3	Same + Redis	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Panko (event storage)	GA (integrated)	GA	GA	Deprecated notice	Deprecated	Deprecated	Deprecated	Deprecated	Retired	Retired	Retired	Retired	Retired	Retired	Retired	Retired	Retired
Aodh (alarming)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Aodh composite alarms	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Monasca (monitoring-as-a-service)	GA	GA	GA	GA	Maintenance mode	Maintenance mode	Maintenance mode	Maintenance mode	Maintenance mode	Effectively retired	Effectively retired	Effectively retired	Effectively retired	Effectively retired	Effectively retired	Retired	Retired
CloudKitty (rating/billing)	GA	GA	GA (v2 API)	GA (v2 storage)	GA	GA	GA	GA	GA (Prometheus collector)	GA	GA	GA	GA	GA	GA	GA	GA
CloudKitty storage backends	SQL	SQL	SQL, InfluxDB	SQL, InfluxDB, OpenSearch	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
CloudKitty collectors	Ceilometer, Gnocchi	Same	Same + Monasca	Same	Same	Same	Same	Same	Prometheus added	GA	GA	GA	GA	GA	GA	GA	GA
OSProfiler (tracing)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA (maintenance)	GA (maintenance)	Maintenance mode	Maintenance mode (use OpenTelemetry)
Rally (benchmarking)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Tempest (functional testing)	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA	GA
Prometheus integration	Community	Community	Community (exporters)	Community (exporters)	Community	Community	Community	Community	Community	Community (openstack-exporter)	GA (recommended)	GA	GA	GA	GA	GA	GA

Key changes across releases:
- Ceilometer polling vs notification agents: In Pike, Ceilometer was a monolithic service. Rocky separated it into distinct polling and notification agents. The notification agent (consuming oslo.messaging notifications) became the primary and recommended collection method, as it is event-driven and lower overhead. The polling agent is still needed for metrics not exposed via notifications (e.g., libvirt VM metrics on compute nodes) but should have its polling interval tuned to reduce load.
- Gnocchi becoming independent: Gnocchi was part of the Ceilometer project through Pike. In Queens it became an independent project with its own release cycle. This decoupling allowed Gnocchi to evolve faster and be used as a general-purpose time-series database beyond OpenStack telemetry. Gnocchi 4.x (Stein+) added performance improvements and new archive policy features.
- Panko deprecation and retirement: Panko (event storage for Ceilometer events) was deprecated in Stein and retired in Xena. Events should be forwarded to external systems (Elasticsearch/OpenSearch, SIEM platforms) via oslo.messaging notification listeners rather than stored in Panko. Organizations still using Panko should migrate event storage to their centralized logging stack.
- Monasca status: Monasca (monitoring-as-a-service with Kafka pipeline, multi-tenant metrics, and built-in alarming) entered maintenance mode in Train due to declining community activity. It is effectively retired as of Yoga. Organizations that adopted Monasca should plan migration to Prometheus/Grafana or Ceilometer/Gnocchi/Aodh stacks.
- CloudKitty evolution: CloudKitty (rating engine for chargeback/showback) introduced its v2 API and storage backend in Rocky/Stein, adding InfluxDB and OpenSearch as storage options. A Prometheus collector was added in Xena, enabling CloudKitty to rate metrics from Prometheus in addition to Gnocchi and Ceilometer. This makes CloudKitty compatible with both native OpenStack telemetry and Prometheus-based monitoring stacks.
- Prometheus as recommended monitoring: While not an official OpenStack project, Prometheus with the openstack-exporter has become the de facto standard for infrastructure monitoring in OpenStack deployments. Starting with Zed, many deployment tools (Kolla-Ansible, TripleO/Director) include Prometheus integration out of the box. Combined with Grafana dashboards, it provides richer visualization than native OpenStack telemetry.
- OSProfiler to OpenTelemetry transition: OSProfiler remains functional for per-request distributed tracing across OpenStack services but entered maintenance mode in 2024.1 (Caracal). For new deployments, OpenTelemetry with Jaeger is the recommended tracing approach. Existing OSProfiler users should plan migration to OpenTelemetry instrumentation.
- Epoxy (2025.1) observability changes: OSProfiler formally in maintenance mode -- OpenTelemetry recommended for all new tracing instrumentation. Monasca fully retired. Continued Prometheus integration improvements across deployment tools.
- Flamingo (2025.2) observability changes: Continued improvements to OpenTelemetry integration across OpenStack services. OSProfiler remains available but deprecated in favor of OpenTelemetry.
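The status changes listed above can be turned into a simple upgrade pre-check. The sketch below is a hypothetical helper, not part of any OpenStack tool: it transcribes the release milestones from these notes and reports which observability components need a migration plan for a given target release.

```python
RELEASES = ["Pike", "Queens", "Rocky", "Stein", "Train", "Ussuri", "Victoria",
            "Wallaby", "Xena", "Yoga", "Zed", "2023.1", "2023.2", "2024.1",
            "2024.2", "2025.1", "2025.2"]

# (component, first release where the status applies, status), transcribed
# from the notes above and kept in chronological order per component.
STATUS_CHANGES = [
    ("Panko", "Stein", "deprecated"),
    ("Panko", "Xena", "retired"),
    ("Monasca", "Train", "maintenance mode"),
    ("Monasca", "Yoga", "effectively retired"),
    ("Monasca", "2025.1", "retired"),
    ("OSProfiler", "2024.1", "maintenance mode"),
]


def flagged_components(target_release):
    """Return {component: status} for components no longer fully supported."""
    target_index = RELEASES.index(target_release)
    current = {}
    for component, since, status in STATUS_CHANGES:
        if RELEASES.index(since) <= target_index:
            current[component] = status  # later entries override earlier ones
    return current


for release in ("Wallaby", "2025.1"):
    print(release, flagged_components(release))
```

A deployment still running any flagged component at the target release is a candidate for the migrations described above, such as moving Panko events to the logging stack or replacing Monasca with Prometheus/Grafana.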