OpenStack Operations
Scope
This file covers OpenStack operational depth -- the concrete commands, diagnostic-capture flows, and incident-response branching that operators execute during control-plane and data-plane events. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/openstack/control-plane-ha.md (HA topology, quorum, fencing) and providers/openstack/networking.md (Neutron design). Topics: control-plane outage triage (keystone/nova/neutron), Neutron L2/L3/DHCP-agent flapping, RabbitMQ split-brain and queue mirroring failures, MariaDB/Galera state recovery, Keystone token store recovery (UUID vs Fernet), and information-only vs change-control command boundaries. For deployment-tool internals (kolla-ansible, OpenStack-Helm, OpenStack-Ansible, TripleO), see providers/openstack/deployment-tools.md. For Cinder/Ceph integration symptoms, see providers/ceph/operations.md.
Checklist
- [Critical] Is the boundary between information-only commands (openstack server list, openstack network agent list, openstack endpoint list, rabbitmqctl cluster_status, and mysql -e "SHOW STATUS LIKE 'wsrep_%'" on MariaDB/Galera nodes) and change-control commands (anything that restarts an agent, evacuates a hypervisor, forces a Galera bootstrap, or modifies the Keystone catalog) explicit in the runbook -- so on-call operators know which side of the line each step sits on?
- [Critical] Is the control-plane outage triage order documented -- check Keystone first (everything depends on token validation; if Keystone is down, every other API returns 401/503 and the surface symptom is misleading), then the message bus (RabbitMQ; if down, agents stop receiving RPC and services appear hung even though APIs respond), then the database (MariaDB/Galera; if quorum is lost, APIs return 500 on writes), then individual services (Nova, Neutron, Cinder, Glance) -- so operators do not chase Neutron symptoms when the actual cause is Keystone or RabbitMQ?
- [Critical] Is diagnostic capture performed before restarting any service -- journalctl -u <service> --since "30 minutes ago" for systemd-managed services (kolla-ansible, OpenStack-Ansible), kubectl logs -n openstack <pod> --previous for OpenStack-Helm, RabbitMQ logs (/var/log/rabbitmq/) for queue-related symptoms, the MariaDB error log for Galera state-transfer issues, and the Neutron agent logs (/var/log/neutron/<agent>.log) for agent-flapping events -- so the post-incident review has the evidence to identify root cause rather than just confirming the restart cleared the symptom?
- [Critical] Is Neutron agent flapping investigated by checking the cause, not just the symptom -- openstack network agent list shows :-) (alive) vs xxx (dead) per agent; flapping usually has one of three causes: the heartbeat (agent_down_time, default 75s) is lost to an RPC failure (a RabbitMQ issue, not an agent issue), host load makes the agent process miss its heartbeat window, or the agent's clock and the controller's clock have drifted apart (NTP drift); restarting the agent without identifying which class of cause applies masks the real issue?
- [Critical] Is the RabbitMQ split-brain recovery procedure documented -- rabbitmqctl cluster_status shows partitions; classic-queue mirroring (deprecated since RabbitMQ 3.9, removed in 4.0) and quorum queues behave differently under partition; recovery requires choosing the authoritative side and running rabbitmqctl forget_cluster_node <node> from the partition to keep, then rejoining the dropped nodes; OpenStack services typically need a rolling restart after partition resolution because agents hold stale RPC connections that stay half-open?
- [Critical] Is the Galera quorum-loss recovery procedure documented and reviewed -- SHOW STATUS LIKE 'wsrep_cluster_size' and wsrep_local_state_comment identify state; if all nodes restart simultaneously, the cluster will not auto-bootstrap and one node must be force-bootstrapped using galera_new_cluster or --wsrep-new-cluster after confirming via grastate.dat that it has the highest seqno (or safe_to_bootstrap: 1); bootstrapping the wrong node causes data loss and is the classic Galera foot-gun?
- [Critical] Is Keystone token store state understood -- Fernet tokens (default since Ocata, stateless, no DB rows) require the key repository (/etc/keystone/fernet-keys/) to be in sync across all Keystone instances; UUID tokens (legacy, removed in Rocky) require periodic keystone-manage token_flush or the token table grows without bound; if Fernet keys are out of sync after a config-management run, half the API calls fail with 401 because the token was issued by a key the validator does not have -- check the keystone-manage fernet_rotate cadence and the key distribution (keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone only initializes the repository)?
- [Recommended] Is the Nova evacuation procedure distinct from migration -- openstack server migrate (cold, requires source hypervisor up), openstack server migrate --live (live, requires source hypervisor up), nova evacuate (rebuild on a different host from shared storage, requires source hypervisor down and forced; do not run while the source is alive or instances will be corrupted by dual-mount); the runbook must require the source to be confirmed dead before evacuate, including hardware-level confirmation (BMC power state, switchport down), not just a missed heartbeat?
- [Recommended] Are Neutron L3-agent / DHCP-agent / OVN issues triaged before restarting -- L3 agent: check openstack router show <id> for external_gateway_info, ip netns on the agent host for the qrouter-<router-id> namespace, ip netns exec qrouter-<id> ip route for routing state; DHCP agent: the qdhcp-<network-id> namespace and the dnsmasq process running inside it; OVN-based Neutron has different troubleshooting (ovn-nbctl show, ovn-sbctl show, no per-router namespaces), and conflating the two leads to wrong commands?
- [Recommended] Are service-state inspection commands documented per service -- Nova: openstack hypervisor list, openstack compute service list, nova-manage cell_v2 list_cells for cell health; Neutron: openstack network agent list, openstack router list, openstack port list --device-owner network:dhcp; Cinder: openstack volume service list, cinder-manage service list; Glance: openstack image list --status active and check --property for image registration consistency -- so on-call has a quick health check per service?
- [Recommended] Is the nova service-disable procedure used for planned hypervisor maintenance -- openstack compute service set --disable --disable-reason "<reason>" <host> nova-compute prevents the scheduler from placing new instances on the host without affecting running instances; pair with live-migration evacuation (nova host-evacuate-live <host> or per-instance openstack server migrate --live) before maintenance, and --enable after; forgetting --enable causes capacity to silently leak as fewer hosts are eligible for new placement?
- [Recommended] Is the Cinder volume-state mismatch recovery understood -- cinder reset-state (changes the API's view of volume state without touching the backend, useful when a stuck volume's state is wrong but the backend is fine), cinder force-delete (skips the state machine and deletes the API record, leaves the backend orphaned, and is a last resort) -- and is the runbook clear that these are change-control commands that should not be the first response to a stuck volume?
- [Recommended] Is the OpenStack endpoint catalog verified during outage triage -- openstack endpoint list for the public/internal/admin URL of each service; if a service's endpoint URL points to a load balancer that is itself down (HAProxy on a controller that is part of the outage), every API call to that service fails regardless of the service's own health; the catalog is the single source of truth for "where is this service", and getting it wrong during DR or a controller swap creates cascading failures?
- [Optional] Is oslo.messaging RPC timeout behavior understood -- the default rpc_response_timeout (60s) means RPC calls fail before TCP keepalive detects a dead RabbitMQ connection (default 600s+); during a network partition or RabbitMQ pause, services pile up waiting for replies and exhaust their worker pool; tuning rpc_response_timeout lower (30s) and executor_thread_pool_size higher provides earlier failure detection and more capacity for retries?
- [Optional] Are deployment-tool-specific operations documented -- kolla-ansible: containers managed via Docker, restart with docker restart <container> or kolla-ansible -i inventory deploy --tags <component>; OpenStack-Helm: kubectl rollout restart deployment/<name> -n openstack, OpenStack-specific values via Helm chart overrides; OpenStack-Ansible: LXC containers on infra hosts, lxc-attach -n <container>, restart via systemctl restart <service> inside the container -- so operators do not run wrong-tool commands?
- [Optional] Is the Nova lifecycle notification bus consumed where downstream governance automation depends on resource-lifecycle events -- instance.create.end, instance.delete.end, and instance.update are published over oslo.messaging (the same RabbitMQ the control plane uses) when notifications.notify_on_state_change and notification_format = versioned are configured; a consumer subscribing to the versioned_notifications topic can drive CMDB updates, chargeback metering, and backup-reclamation workflows off real platform events rather than polling the Nova API -- and is it understood that this bus is for automation, not operability (it shares the message bus whose health the triage order checks, so a RabbitMQ partition both breaks the control plane and silently stops the notification stream)?
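The information-only versus change-control boundary from the first checklist item can be made mechanical. The following is a minimal sketch, not a complete policy: the prefix tables contain only commands named in this checklist, the `classify` helper is hypothetical, and treating every unrecognized command as change-control is an assumption chosen to fail safe.

```python
import shlex

# Read-only command prefixes taken from the checklist (illustrative, not exhaustive).
INFO_ONLY_PREFIXES = [
    ("openstack", "server", "list"),
    ("openstack", "network", "agent", "list"),
    ("openstack", "endpoint", "list"),
    ("rabbitmqctl", "cluster_status"),
]

# State-changing command prefixes named in the checklist.
CHANGE_CONTROL_PREFIXES = [
    ("systemctl", "restart"),
    ("docker", "restart"),
    ("nova", "evacuate"),
    ("galera_new_cluster",),
    ("rabbitmqctl", "forget_cluster_node"),
    ("cinder", "reset-state"),
    ("cinder", "force-delete"),
]


def classify(command: str) -> str:
    """Return 'information-only' or 'change-control' for a shell command line."""
    tokens = tuple(shlex.split(command))
    for prefix in CHANGE_CONTROL_PREFIXES:
        if tokens[:len(prefix)] == prefix:
            return "change-control"
    for prefix in INFO_ONLY_PREFIXES:
        if tokens[:len(prefix)] == prefix:
            return "information-only"
    # Assumption: anything unrecognized needs approval rather than a free pass.
    return "change-control"


if __name__ == "__main__":
    for cmd in ("openstack server list --all-projects",
                "rabbitmqctl forget_cluster_node rabbit@ctl2"):
        print(f"{cmd!r}: {classify(cmd)}")
```

A runbook linter could run a check like this over every documented step so that each one carries an explicit label.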
Lifecycle Notifications for Governance Automation
OpenStack publishes resource-lifecycle events on the same oslo.messaging bus that carries control-plane RPC. Beyond the incident-triage use of that bus (above), these events are the integration point for downstream governance automation -- the platform telling external systems when a resource was created, updated, or destroyed so they can react without polling.
Enabling the stream. Versioned notifications are configured in nova.conf under [notifications] (notify_on_state_change = vm_and_task_state, notification_format = versioned); Nova then publishes to the versioned_notifications topic on the configured transport ([oslo_messaging_notifications] driver = messagingv2, transport_url defaulting to the control-plane RabbitMQ unless a dedicated notification bus is split out). Neutron, Cinder, and Keystone publish their own notifications (port.create.end, volume.delete.end, identity.project.deleted) on the same mechanism.
The deletion event that matters for lifecycle sync. instance.delete.end fires when Nova has finished tearing down an instance. A consumer on the versioned_notifications topic receives the payload (instance UUID, project id, timestamps) and can use it as the low-latency flag that a protected resource is gone -- the event-driven half of the hybrid described in patterns/backup-lifecycle-synchronization.md. The discipline from that pattern applies here: the event is a flag, not an authority. Build the consumer to be idempotent (the same delete may be redelivered under oslo.messaging at-least-once semantics) and fail-safe (a dropped event during a RabbitMQ partition must be backstopped by a reconciliation loop that lists Nova instances and diffs against the downstream system's records), and key everything on the immutable instance UUID, never the display name.
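The idempotent, fail-safe consumer described above can be sketched without any messaging library. This is an illustrative model only: `BackupReclaimer` and its methods are hypothetical names, the message-bus subscription is omitted, and the reconciliation input is assumed to be a complete listing of live Nova instance UUIDs.

```python
from typing import Iterable, List, Set


class BackupReclaimer:
    """Hypothetical downstream system tracking protected instance UUIDs."""

    def __init__(self, protected: Iterable[str]) -> None:
        self.protected: Set[str] = set(protected)
        self.reclaimed: List[str] = []

    def on_delete_event(self, instance_uuid: str) -> bool:
        """Handle an instance.delete.end payload; repeat deliveries are no-ops."""
        if instance_uuid not in self.protected:
            return False  # redelivered or unknown UUID: nothing to do
        self.protected.discard(instance_uuid)
        self.reclaimed.append(instance_uuid)
        return True

    def reconcile(self, live_instance_uuids: Iterable[str]) -> Set[str]:
        """Backstop: reclaim every protected UUID that Nova no longer lists."""
        missed = self.protected - set(live_instance_uuids)
        for uuid in sorted(missed):
            self.on_delete_event(uuid)
        return missed


if __name__ == "__main__":
    r = BackupReclaimer(["a", "b", "c"])
    r.on_delete_event("a")      # event arrives
    r.on_delete_event("a")      # redelivery, ignored
    print(r.reconcile(["c"]))   # "b" was deleted while events were being lost
```

Because the event handler and the reconciliation loop share one code path keyed on the UUID, a delete that arrives twice, or only through reconciliation, produces the same end state. A real loop should also confirm that the Nova listing succeeded and was complete before acting on it, since a partial listing would look like mass deletion.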
Operational caveat -- the bus is shared. Because notifications ride the control-plane RabbitMQ by default, the same partition or queue-overflow event that the triage order hunts for (Keystone/RabbitMQ/Galera) also silently halts the notification stream. A consumer that has gone quiet is therefore ambiguous: it can mean "no resources changed" or "the message bus is down and events are being lost." For any automation where missed events have a cost or compliance consequence, either split notifications onto a dedicated transport_url so control-plane pressure does not drop governance events, or -- more robustly -- treat the event stream as best-effort and let a reconciliation loop own correctness. This is the same hybrid conclusion the cross-system pattern reaches, here grounded in OpenStack's specific message-bus coupling.
Why This Matters
OpenStack operational symptoms are routinely misleading at the surface. A user reports "instances are not booting"; the literal cause might be Nova, Glance, Neutron, Cinder, Keystone, RabbitMQ, MariaDB, or any combination. Each of those services has its own logs, its own health-check commands, and its own failure modes. An operator who jumps straight to nova-compute because the symptom is "boot failure" will spend hours chasing the wrong service when the actual cause is RabbitMQ partition or Keystone token-key drift. The triage-order discipline -- Keystone first, message bus second, database third, services last -- exists because dependencies flow downward and checking the lower layers first eliminates entire classes of false positives.
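The triage order can be expressed as a short function that stops at the first unhealthy layer. This is a sketch under assumptions: the `health` mapping is a hypothetical input that some other tooling would fill from real probes, and treating an unchecked layer as unhealthy is a design choice of this example, not something the runbook prescribes.

```python
from typing import Dict, Optional

# Layers in the order the runbook checks them: identity, bus, database, services.
TRIAGE_ORDER = ["keystone", "rabbitmq", "galera", "nova", "neutron", "cinder", "glance"]


def first_failing_layer(health: Dict[str, bool]) -> Optional[str]:
    """Return the first unhealthy layer in triage order, or None if all pass.

    A layer missing from `health` counts as unhealthy, because a dependency
    that was never checked cannot be ruled out.
    """
    for layer in TRIAGE_ORDER:
        if not health.get(layer, False):
            return layer
    return None


if __name__ == "__main__":
    observed = {layer: True for layer in TRIAGE_ORDER}
    observed["rabbitmq"] = False   # real cause
    observed["neutron"] = False    # surface symptom
    print(first_failing_layer(observed))
```

With both RabbitMQ and Neutron reported unhealthy, the function points at RabbitMQ, which is the point of the ordering: the Neutron failure is investigated only after the layers it depends on are cleared.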
The Galera quorum-loss recovery is OpenStack's most consequential operational foot-gun. Galera's design is "automatic everything as long as a majority is alive"; the moment all nodes are simultaneously down (data center power event, simultaneous reboot, network partition that drops every node from quorum), the cluster does not self-recover. One node must be manually designated as the bootstrap node based on the highest seqno in grastate.dat. Bootstrapping the wrong node makes that node's state authoritative and silently discards transactions from the others. The runbook needs the seqno-comparison step to be explicit and unskippable, because under pressure the temptation is to bootstrap whichever node responds to SSH first. The same care applies to etcd in OpenStack-Helm clusters and to PostgreSQL with Patroni in some deployments.
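The seqno-comparison step can be sketched as a pure function over the grastate.dat contents collected from each node. This is an illustration, not a recovery tool: the function names are hypothetical, the file is assumed to use the `key: value` layout Galera writes, and a node reporting seqno -1 (unclean shutdown) is rejected rather than guessed at, since its position has to be recovered before any comparison is meaningful.

```python
from typing import Dict


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines from grastate.dat, skipping comments."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def pick_bootstrap_node(grastate_by_node: Dict[str, str]) -> str:
    """Choose the node to bootstrap from, given each node's grastate.dat text."""
    states = {n: parse_grastate(t) for n, t in grastate_by_node.items()}
    # A single node flagged safe_to_bootstrap: 1 was the last to leave cleanly.
    safe = [n for n, s in states.items() if s.get("safe_to_bootstrap") == "1"]
    if len(safe) == 1:
        return safe[0]
    seqnos = {n: int(s.get("seqno", "-1")) for n, s in states.items()}
    if any(v < 0 for v in seqnos.values()):
        raise ValueError("seqno is -1 or missing on a node; recover its position first")
    best = max(seqnos.values())
    # Nodes tied on the highest seqno hold the same state; pick deterministically.
    return sorted(n for n, v in seqnos.items() if v == best)[0]


if __name__ == "__main__":
    template = "# GALERA saved state\nversion: 2.1\nuuid: abc\nseqno: {}\nsafe_to_bootstrap: 0\n"
    nodes = {"db1": template.format(100), "db2": template.format(105), "db3": template.format(103)}
    print(pick_bootstrap_node(nodes))
```

Refusing to answer when any seqno is unknown is deliberate: it forces the operator through the comparison instead of defaulting to whichever node happened to respond.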
RabbitMQ split-brain is the runner-up foot-gun. Classic mirrored queues (the historical default for OpenStack) are explicitly documented as not recommended for partition tolerance; quorum queues (Raft-based, default in newer OpenStack releases) tolerate partitions correctly but require all queues to be re-declared as quorum type, which is a non-trivial migration. After a partition, OpenStack agents do not automatically reconnect to the surviving cluster -- they keep their stale connection state and miss heartbeats until the agent process restarts. The runbook needs a "rolling restart of all OpenStack services after RabbitMQ partition recovery" step that is easy to forget but high-impact.
Neutron agent flapping is the third common symptom that gets misdiagnosed. The flapping appears in openstack network agent list as the Alive column toggling between :-) and xxx. The naive response is "restart the agent". The correct response is to check whether the agent's heartbeat is failing because of: (a) RPC layer (RabbitMQ partition or queue overflow -- not an agent problem), (b) host load (agent process being scheduled out long enough to miss the heartbeat window -- a host problem, not an agent problem), or (c) clock drift between agent and controller (NTP issue). Restarting the agent makes the symptom go away briefly because the agent reconnects, but the underlying cause keeps causing flaps until it is identified. The agent log is rarely the right place to look for these; the controller's neutron-server.log and the RabbitMQ management UI are.
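The three cause classes above can be turned into a first-pass classifier. This is a sketch only: `FlapSignals` and `classify_flap` are hypothetical names, the inputs are assumed to be gathered elsewhere (for example from rabbitmqctl cluster_status, host load averages, and an NTP offset check), and both thresholds are illustrative values, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class FlapSignals:
    rabbitmq_partitioned: bool    # partition or queue overflow seen on the bus
    host_load_per_cpu: float      # 1-minute load average divided by CPU count
    clock_offset_seconds: float   # agent host clock minus controller clock


def classify_flap(s: FlapSignals,
                  load_threshold: float = 2.0,
                  offset_threshold: float = 5.0) -> str:
    """Map observed signals to a cause class; thresholds are assumptions."""
    if s.rabbitmq_partitioned:
        return "rpc"            # (a) message-bus problem, not an agent problem
    if s.host_load_per_cpu > load_threshold:
        return "host-load"      # (b) agent is being starved of CPU time
    if abs(s.clock_offset_seconds) > offset_threshold:
        return "clock-drift"    # (c) NTP issue between agent and controller
    return "agent"              # nothing external found: inspect the agent itself


if __name__ == "__main__":
    print(classify_flap(FlapSignals(False, 0.4, 12.0)))
```

The RPC check comes first because a bus failure would make every agent on every host look unhealthy at once; only when all three external causes are ruled out does the agent itself become the suspect.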
The nova evacuate versus nova migrate distinction is where operators destroy data. nova evacuate rebuilds an instance on a new host from shared storage assuming the original host is dead; if the original host is actually still running and writing to the same shared volume (Cinder, NFS, or Ceph RBD), both copies of the instance write to the same backend and corrupt it. The runbook must require source-host-dead confirmation at the BMC level (IPMI/iDRAC/iLO power off, switchport down) before evacuate, not just "the heartbeat is missed." This is one of the few OpenStack commands where the wrong call corrupts data in a way that cannot be recovered.
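The source-host-dead requirement can be encoded as a gate that an evacuation wrapper consults before doing anything. This is a minimal sketch: `HostEvidence` and `evacuation_allowed` are hypothetical names, the evidence is assumed to be collected by other tooling (BMC query, switch query), and requiring both hardware signals together is the conservative reading of the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class HostEvidence:
    heartbeat_missed: bool   # nova-compute reported down by the control plane
    bmc_power_state: str     # "off", "on", or "unknown" as read from the BMC
    switchport_down: bool    # the host's switchport is confirmed down


def evacuation_allowed(e: HostEvidence) -> bool:
    """Allow evacuate only when hardware-level evidence confirms the host is dead."""
    return (
        e.heartbeat_missed
        and e.bmc_power_state == "off"   # "unknown" is not good enough
        and e.switchport_down
    )


if __name__ == "__main__":
    # A missed heartbeat alone must not open the gate.
    print(evacuation_allowed(HostEvidence(True, "unknown", False)))
```

An unreachable BMC yields "unknown", and the gate stays closed: the host may be alive and still writing to shared storage, which is exactly the dual-mount case the runbook is trying to prevent.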
Common Decisions (ADR Triggers)
- Token format: Fernet vs UUID -- Fernet is the default and recommended choice (stateless, no DB pressure, key rotation built-in). UUID tokens persist in the database and require token_flush cron jobs to prevent unbounded table growth. Fernet's operational concern is key distribution: all Keystone instances must have identical key repositories or token validation fails. Use Fernet; document the key-distribution mechanism explicitly (config-management push, shared filesystem, custom replication).
- RabbitMQ queue type: classic-mirrored vs quorum -- Classic mirrored queues are deprecated and partition-intolerant; quorum queues are the supported choice from RabbitMQ 3.8+ but require explicit per-queue migration in OpenStack and have higher resource usage. New deployments should default to quorum; existing classic-mirrored deployments should plan a migration during a maintenance window.
- Galera vs PostgreSQL/Patroni for OpenStack DB -- Galera (MariaDB) is the historical OpenStack default; PostgreSQL with Patroni is supported and used in some deployments. Galera's strength is multi-write nodes; its weakness is the manual-bootstrap requirement after total-cluster outage. Patroni's strength is a clear primary/replica model with automatic failover; its weakness is single-write topology. Either works; the choice is mostly about operational familiarity.
- nova evacuate automation: enabled vs manual-only -- Some deployments enable masakari (instance HA) for automatic evacuation on host failure. The trade-off is faster recovery vs the risk of double-mount data corruption if the host-down detection has a false positive. Manual-only is safer; automatic is appropriate only when host-down detection is reliable (BMC-confirmed, fenced, multi-signal).
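For the token-format decision, the requirement that every Keystone instance hold an identical key repository can be checked by comparing digests. This is a sketch under assumptions: the helper names are hypothetical, the key files are assumed to have been read from each host by other tooling (only their digests are compared here), and treating the most common digest as the reference is an illustrative choice.

```python
import hashlib
from typing import Dict, List


def repo_digest(keys: Dict[str, bytes]) -> str:
    """Digest a key repository given {filename: contents}."""
    h = hashlib.sha256()
    for name in sorted(keys):                       # order-independent
        h.update(name.encode())
        h.update(b"\0")
        h.update(hashlib.sha256(keys[name]).digest())
    return h.hexdigest()


def out_of_sync_hosts(repos: Dict[str, Dict[str, bytes]]) -> List[str]:
    """Return hosts whose repository differs from the most common digest."""
    digests = {host: repo_digest(keys) for host, keys in repos.items()}
    counts: Dict[str, int] = {}
    for d in digests.values():
        counts[d] = counts.get(d, 0) + 1
    majority = max(counts, key=lambda d: counts[d])
    return sorted(host for host, d in digests.items() if d != majority)


if __name__ == "__main__":
    good = {"0": b"staged-key", "1": b"primary-key"}
    stale = {"0": b"old-staged", "1": b"primary-key"}
    print(out_of_sync_hosts({"ctl1": good, "ctl2": dict(good), "ctl3": stale}))
```

Run after every rotation or config-management push, a check like this turns intermittent 401 errors from key drift into an explicit list of hosts to fix. Comparing digests rather than key contents keeps the secret material out of logs.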
Reference Links
- OpenStack troubleshooting -- official operations guide covering routine maintenance, common failures, and recovery procedures
- Nova troubleshooting -- Nova-specific failure modes, log locations, evacuation procedures
- Neutron troubleshooting -- Neutron operations including agent management, namespace inspection, OVN troubleshooting
- Keystone Fernet token operations -- Fernet key setup, rotation, distribution
- Galera Cluster recovery -- official Galera bootstrap and recovery procedures
- RabbitMQ partition handling -- partition behavior, quorum queues, recovery procedures
- Red Hat OpenStack Platform troubleshooting -- enterprise troubleshooting workflows
See Also
- providers/openstack/control-plane-ha.md -- OpenStack HA design: quorum, fencing, load balancers (this file is the operational counterpart)
- providers/openstack/controller-lifecycle.md -- planned day-2 changes: single-node hardware maintenance, Pacemaker standby/maintenance-mode, STONITH/Ironic BMC-IP-change handling (this file covers unplanned incident triage)
- providers/openstack/deployment-tools.md -- kolla-ansible, OpenStack-Helm, OpenStack-Ansible, TripleO operational specifics
- providers/openstack/networking.md -- Neutron design decisions, OVS vs OVN, ML2 plugins
- providers/openstack/storage.md -- Cinder, Glance, Manila design decisions
- providers/ceph/operations.md -- Ceph operational depth (Cinder/Glance/Manila symptoms often originate in Ceph)
- general/operational-runbooks.md -- runbook framework: structure, severity, automation decisions, postmortem process
- patterns/backup-lifecycle-synchronization.md -- consuming instance.delete.end to drive backup-reclamation and CMDB governance (this file is the OpenStack-side event source)
- patterns/change-window-alert-suppression.md -- suppressing the predictable alert storm these nova service-disable / live-migration / evacuation operations generate during planned change windows