OpenStack Operations

Scope

This file covers OpenStack operational depth -- the concrete commands, diagnostic-capture flows, and incident-response branching that operators execute during control-plane and data-plane events. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/openstack/control-plane-ha.md (HA topology, quorum, fencing) and providers/openstack/networking.md (Neutron design). Topics: control-plane outage triage (keystone/nova/neutron), Neutron L2/L3/DHCP-agent flapping, RabbitMQ split-brain and queue mirroring failures, MariaDB/Galera state recovery, Keystone token store recovery (UUID vs Fernet), and information-only vs change-control command boundaries. For deployment-tool internals (kolla-ansible, OpenStack-Helm, OpenStack-Ansible, TripleO), see providers/openstack/deployment-tools.md. For Cinder/Ceph integration symptoms, see providers/ceph/operations.md.

Checklist

Lifecycle Notifications for Governance Automation

OpenStack publishes resource-lifecycle events on the same oslo.messaging bus that carries control-plane RPC. Beyond the incident-triage use of that bus (above), these events are the integration point for downstream governance automation -- the platform telling external systems when a resource was created, updated, or destroyed so they can react without polling.

Enabling the stream. Versioned notifications are configured in nova.conf under [notifications] (notify_on_state_change = vm_and_task_state, notification_format = versioned). Nova then publishes to the versioned_notifications topic on the configured transport: [oslo_messaging_notifications] driver = messagingv2, with transport_url defaulting to the control-plane RabbitMQ unless a dedicated notification bus is split out. Neutron, Cinder, and Keystone publish their own notifications (port.create.end, volume.delete.end, identity.project.deleted) through the same oslo.messaging notifier mechanism, by default on the notifications topic rather than versioned_notifications.
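As a rough illustration, the settings named above can be checked mechanically before a consumer is deployed. The sketch below parses nova.conf text with Python's standard configparser and reports any of the three options that are missing or different. The function name and the EXPECTED table are illustrative, not part of any OpenStack tool, and oslo.config accepts syntax that configparser may not handle in every case.

```python
import configparser

# Settings named in the paragraph above: (section, option, expected value).
EXPECTED = [
    ("notifications", "notify_on_state_change", "vm_and_task_state"),
    ("notifications", "notification_format", "versioned"),
    ("oslo_messaging_notifications", "driver", "messagingv2"),
]


def missing_notification_settings(conf_text):
    """Return (section, option, expected, found) for every mismatch.

    `found` is None when the section or option is absent.
    """
    # interpolation=None keeps '%' in passwords literal; strict=False
    # tolerates the repeated options that oslo.config files may contain.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    problems = []
    for section, option, expected in EXPECTED:
        found = parser.get(section, option, fallback=None)
        if found != expected:
            problems.append((section, option, expected, found))
    return problems
```

An empty list means all three options carry the values described above. It does not prove that the transport is reachable or that a consumer is bound to the topic.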

The deletion event that matters for lifecycle sync. instance.delete.end fires when Nova has finished tearing down an instance. A consumer on the versioned_notifications topic receives the payload (instance UUID, project ID, timestamps) and can use it as the low-latency flag that a protected resource is gone -- the event-driven half of the hybrid described in patterns/backup-lifecycle-synchronization.md. The discipline from that pattern applies here: the event is a flag, not an authority. Build the consumer to be idempotent, because the same delete may be redelivered under oslo.messaging at-least-once semantics. Build it to be fail-safe, because an event dropped during a RabbitMQ partition must be backstopped by a reconciliation loop that lists Nova instances and diffs them against the downstream system's records. Key everything on the immutable instance UUID, never the display name.
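The two properties above, idempotent handling and a reconciliation backstop, can be sketched without any messaging library. This example makes several assumptions: the payload has already been unwrapped to a flat dict with a uuid key (real versioned payloads are nested inside an object envelope), release_protection is a hypothetical downstream callback, and the in-memory set stands in for whatever durable store a real consumer would use.

```python
class DeleteEventConsumer:
    """Handles instance.delete.end at most once per instance UUID."""

    def __init__(self, release_protection):
        self._release = release_protection  # hypothetical downstream callback
        self._handled = set()               # UUIDs already processed

    def on_notification(self, event_type, payload):
        """Return True if the event caused downstream work."""
        if event_type != "instance.delete.end":
            return False
        uuid = payload["uuid"]              # assumed flat payload shape
        if uuid in self._handled:
            return False                    # redelivery: safe no-op
        self._release(uuid)
        self._handled.add(uuid)             # record only after success
        return True


def reconcile(nova_instance_uuids, downstream_uuids):
    """UUIDs the downstream system still tracks that Nova no longer has.

    This is the backstop for events that were never delivered.
    """
    return sorted(set(downstream_uuids) - set(nova_instance_uuids))
```

Marking a UUID as handled only after the callback returns means a crash mid-call leads to a retry rather than a silent skip, so the callback itself should also tolerate being repeated.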

Operational caveat -- the bus is shared. Because notifications ride the control-plane RabbitMQ by default, the same partition or queue-overflow event that the triage order hunts for (Keystone/RabbitMQ/Galera) also silently halts the notification stream. A consumer that has gone quiet is therefore ambiguous: it can mean "no resources changed" or "the message bus is down and events are being lost." For any automation where missed events have a cost or compliance consequence, either split notifications onto a dedicated transport_url so control-plane pressure does not drop governance events, or -- more robustly -- treat the event stream as best-effort and let a reconciliation loop own correctness. This is the same hybrid conclusion the cross-system pattern reaches, here grounded in OpenStack's specific message-bus coupling.

Why This Matters

OpenStack operational symptoms are routinely misleading at the surface. A user reports "instances are not booting"; the actual cause might be Nova, Glance, Neutron, Cinder, Keystone, RabbitMQ, MariaDB, or any combination. Each of those services has its own logs, its own health-check commands, and its own failure modes. An operator who jumps straight to nova-compute because the symptom is "boot failure" will spend hours chasing the wrong service when the actual cause is a RabbitMQ partition or Keystone token-key drift. The triage-order discipline -- Keystone first, message bus second, database third, services last -- exists because dependencies flow downward and checking the lower layers first eliminates entire classes of false positives.
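The triage order can be encoded so that it is followed under pressure rather than remembered. The sketch below walks the four layers in order and stops at the first failing probe. The commands are common health checks chosen for illustration, not a prescribed set, and a zero exit code is only a coarse signal: the MariaDB probe, for instance, succeeds even when wsrep_cluster_status is not Primary, so a real runbook would inspect the output as well.

```python
import subprocess

# Lowest dependency first, matching the triage discipline described above.
# Probe commands are illustrative; adjust them to the deployment.
TRIAGE_ORDER = [
    ("keystone", ["openstack", "token", "issue"]),
    ("message bus", ["rabbitmqctl", "cluster_status"]),
    ("database", ["mysql", "-e", "SHOW STATUS LIKE 'wsrep_cluster_status'"]),
    ("services", ["openstack", "compute", "service", "list"]),
]


def run_command(argv):
    """Return True when the command exits 0 within the timeout."""
    try:
        result = subprocess.run(argv, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def first_failing_layer(probe=run_command):
    """Return the name of the first layer whose probe fails, or None."""
    for layer, argv in TRIAGE_ORDER:
        if not probe(argv):
            return layer
    return None
```

Stopping at the first failure is deliberate: a Keystone outage makes every later openstack CLI probe fail too, so reporting those failures would only add noise.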

The Galera quorum-loss recovery is OpenStack's most consequential operational foot-gun. Galera's design is "automatic everything as long as a majority is alive"; the moment all nodes are simultaneously down (data center power event, simultaneous reboot, network partition that drops every node from quorum), the cluster does not self-recover. One node must be manually designated as the bootstrap node based on the highest seqno in grastate.dat. After an unclean shutdown that file typically records seqno: -1, in which case the real position has to be recovered first (for example with mysqld --wsrep-recover) before the nodes can be compared. Bootstrapping the wrong node makes that node's state authoritative and silently discards transactions from the others. The runbook needs the seqno-comparison step to be explicit and unskippable, because under pressure the temptation is to bootstrap whichever node responds to SSH first. The same care applies to etcd in OpenStack-Helm clusters and to PostgreSQL with Patroni in some deployments.
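One way to make the seqno comparison unskippable is to have a script do it. The sketch below is illustrative only: it takes the grastate.dat contents collected from each node, refuses to choose when any node reports a negative (unknown) seqno or when the cluster UUIDs disagree, and otherwise returns the node or nodes with the highest seqno. The function names and the input shape (a dict of hostname to file text) are assumptions.

```python
import re


def parse_grastate(text):
    """Extract uuid and seqno from the text of a grastate.dat file."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\w+):\s*(\S+)", line)  # skips '#' comments
        if match:
            fields[match.group(1)] = match.group(2)
    return {"uuid": fields.get("uuid"), "seqno": int(fields["seqno"])}


def choose_bootstrap_nodes(grastate_by_host):
    """Return the hosts tied for the highest seqno.

    Raises ValueError instead of guessing when the comparison is unsafe.
    """
    parsed = {h: parse_grastate(t) for h, t in grastate_by_host.items()}
    unknown = sorted(h for h, s in parsed.items() if s["seqno"] < 0)
    if unknown:
        # seqno -1 means the position was not saved; recover it first.
        raise ValueError(f"seqno unknown on {unknown}; recover position first")
    if len({s["uuid"] for s in parsed.values()}) != 1:
        raise ValueError("nodes report different cluster UUIDs")
    best = max(s["seqno"] for s in parsed.values())
    return sorted(h for h, s in parsed.items() if s["seqno"] == best)
```

Raising on an unknown seqno is the point of the design: the script should stop the operator at exactly the step where bootstrapping the first responsive node is most tempting.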

RabbitMQ split-brain is the runner-up foot-gun. Classic mirrored queues (the historical default for OpenStack) are explicitly documented as not recommended for partition tolerance; quorum queues (Raft-based, default in newer OpenStack releases) tolerate partitions correctly but require all queues to be re-declared as quorum type, which is a non-trivial migration. After a partition, OpenStack agents do not automatically reconnect to the surviving cluster -- they keep their stale connection state and miss heartbeats until the agent process restarts. The runbook needs a "rolling restart of all OpenStack services after RabbitMQ partition recovery" step that is easy to forget but high-impact.
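Before scheduling the rolling restart, it helps to confirm that the partition has actually healed. The sketch below inspects JSON of the kind produced by rabbitmqctl cluster_status --formatter json. The key names used here (partitions, running_nodes, disk_nodes, ram_nodes) are assumptions about that output and should be verified against the RabbitMQ version in use.

```python
import json


def partition_report(cluster_status_json):
    """Summarise partition state from cluster_status JSON text.

    Key names are assumed, not guaranteed; check them against real output.
    """
    status = json.loads(cluster_status_json)
    members = set(status.get("disk_nodes", [])) | set(status.get("ram_nodes", []))
    running = set(status.get("running_nodes", []))
    partitions = status.get("partitions") or {}
    return {
        "partitioned": bool(partitions),
        "partitions": partitions,
        # Cluster members that this node does not currently see running.
        "stopped_nodes": sorted(members - running),
    }
```

Only once partitioned is false and stopped_nodes is empty on every RabbitMQ node is it worth starting the rolling restart of the OpenStack services. Restarting earlier may reconnect agents to a cluster that is still split.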

Neutron agent flapping is the third common symptom that gets misdiagnosed. The flapping appears in openstack network agent list as the Alive column toggling between :-) and XXX (xxx in the legacy neutron client). The naive response is "restart the agent". The correct response is to check why the agent's heartbeat is failing: (a) the RPC layer (RabbitMQ partition or queue overflow -- not an agent problem), (b) host load (the agent process being scheduled out long enough to miss the heartbeat window -- a host problem, not an agent problem), or (c) clock drift between agent and controller (an NTP issue). Restarting the agent makes the symptom go away briefly because the agent reconnects, but the flapping returns until the underlying cause is identified and fixed. The agent log is rarely the right place to look for these; the controller's neutron-server.log and the RabbitMQ management UI are.
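A first cut at separating cause (a) from causes (b) and (c) is to look at how widely the dead agents are spread. The sketch below is a heuristic assumed for illustration, not an established Neutron diagnostic: if agents on a large fraction of hosts are down at once, a shared dependency such as the RPC layer is the likelier suspect, while a single affected host points at load or clock sync on that host. The input is assumed to resemble openstack network agent list -f json, with Host and Alive keys.

```python
def suspect_layer(agents, widespread_fraction=0.5):
    """Classify an agent-list snapshot as a coarse triage hint.

    agents: list of dicts with "Host" (str) and "Alive" (bool) keys.
    widespread_fraction: hypothetical threshold, tune per environment.
    """
    hosts = {agent["Host"] for agent in agents}
    down_hosts = {agent["Host"] for agent in agents if not agent["Alive"]}
    if not down_hosts:
        return "no-agents-down"
    if len(down_hosts) / len(hosts) >= widespread_fraction:
        # Many hosts at once: check RabbitMQ and neutron-server.log first.
        return "shared-dependency"
    # Few hosts: check load and time sync on the affected hosts.
    return "host-local"
```

Because flapping is intermittent, one snapshot can miss it. Running the check several times across a few heartbeat intervals gives a more trustworthy hint.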

The nova evacuate versus nova migrate distinction is where operators destroy data. nova migrate (like live migration) is coordinated with a source host that is still up, so the instance runs in only one place at a time. nova evacuate instead rebuilds an instance on a new host from shared storage on the assumption that the original host is dead; if the original host is actually still running and writing to the same shared volume (Cinder, NFS, or Ceph RBD), both copies of the instance write to the same backend and corrupt it. The runbook must require source-host-dead confirmation at the BMC level (IPMI/iDRAC/iLO power off, switchport down) before evacuate, not just "the heartbeat is missed." This is one of the few OpenStack commands where the wrong call corrupts data in a way that cannot be recovered.
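The source-host-dead confirmation can be expressed as a gate that must pass before any evacuate command is issued. The sketch below assumes that the power state comes from the text printed by ipmitool chassis power status and that the switchport state has been checked separately and passed in as a boolean. The function names are hypothetical, and treating both signals as mandatory is one reading of the requirement above.

```python
def bmc_reports_power_off(ipmi_output):
    """True only for the exact 'off' status line; anything else is unsafe."""
    return ipmi_output.strip().lower() == "chassis power is off"


def evacuate_allowed(ipmi_output, switchport_down):
    """Return (allowed, reasons); allowed only when every signal agrees."""
    reasons = []
    if not bmc_reports_power_off(ipmi_output):
        reasons.append("BMC does not report chassis power off")
    if not switchport_down:
        reasons.append("switchport is not confirmed down")
    return (not reasons, reasons)
```

An empty, garbled, or timed-out BMC response fails the gate. That is the intended behaviour: an unreachable BMC is not evidence that the host is powered off.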

Common Decisions (ADR Triggers)

Token format: Fernet vs UUID -- Fernet is the default and recommended choice (stateless, no DB pressure, key rotation built-in). UUID tokens persist in the database and require keystone-manage token_flush cron jobs to prevent unbounded table growth; the UUID provider was removed from Keystone in the Rocky release, so it is only a live choice on older deployments. Fernet's operational concern is key distribution: all Keystone instances must have identical key repositories or token validation fails. Use Fernet; document the key-distribution mechanism explicitly (config-management push, shared filesystem, custom replication).
RabbitMQ queue type: classic-mirrored vs quorum -- Classic mirrored queues are deprecated and partition-intolerant; quorum queues are the supported choice from RabbitMQ 3.8+ but require explicit per-queue migration in OpenStack and have higher resource usage. New deployments should default to quorum; existing classic-mirrored deployments should plan a migration during a maintenance window.
Galera vs PostgreSQL/Patroni for OpenStack DB -- Galera (MariaDB) is the historical OpenStack default; PostgreSQL with Patroni is supported and used in some deployments. Galera's strength is multi-write nodes; its weakness is the manual-bootstrap requirement after total-cluster outage. Patroni's strength is a clear primary/replica model with automatic failover; its weakness is single-write topology. Either works; the choice is mostly about operational familiarity.
nova evacuate automation: enabled vs manual-only -- Some deployments enable masakari (instance HA) for automatic evacuation on host failure. The trade-off is faster recovery vs the risk of double-mount data corruption if the host-down detection has a false positive. Manual-only is safer; automatic is appropriate only when host-down detection is reliable (BMC-confirmed, fenced, multi-signal).
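For the Fernet decision, the key-distribution requirement (identical key repositories on every Keystone instance) lends itself to a simple drift check. The sketch below fingerprints each node's repository and lists the nodes that differ from the most common fingerprint. How the key files are collected is left out, and the input shape (node name to a dict of key file name to bytes) is an assumption made for illustration.

```python
import hashlib
from collections import Counter


def repo_fingerprint(keys):
    """SHA-256 over file names and contents, in a stable order."""
    digest = hashlib.sha256()
    for name in sorted(keys):
        digest.update(name.encode() + b"\0" + keys[name] + b"\0")
    return digest.hexdigest()


def divergent_nodes(repos):
    """Return nodes whose key repository differs from the most common one."""
    fingerprints = {node: repo_fingerprint(keys) for node, keys in repos.items()}
    if not fingerprints:
        return []
    most_common, _count = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(node for node, fp in fingerprints.items() if fp != most_common)
```

Comparing fingerprints rather than raw keys lets the check run from a monitoring host without copying key material around. A check that runs in the middle of a rotation can report transient drift, so it should be scheduled away from the rotation job.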

Reference Links

OpenStack troubleshooting -- official operations guide covering routine maintenance, common failures, and recovery procedures
Nova troubleshooting -- Nova-specific failure modes, log locations, evacuation procedures
Neutron troubleshooting -- Neutron operations including agent management, namespace inspection, OVN troubleshooting
Keystone Fernet token operations -- Fernet key setup, rotation, distribution
Galera Cluster recovery -- official Galera bootstrap and recovery procedures
RabbitMQ partition handling -- partition behavior, quorum queues, recovery procedures
Red Hat OpenStack Platform troubleshooting -- enterprise troubleshooting workflows