OpenStack Controller Lifecycle & Day-2 Operations

Scope

This file covers the planned-change day-2 operations that take an OpenStack controller in and out of service safely: single-node hardware maintenance and chassis replacement, Pacemaker/Corosync + STONITH operation (for Pacemaker-managed estates such as director-based deployments), and TripleO/director + Ironic baremetal day-2 management. It is the lifecycle counterpart to the design content in providers/openstack/control-plane-ha.md (HA topology, quorum, fencing design) and the incident-triage content in providers/openstack/operations.md (unplanned outage diagnosis). The distinction: control-plane-ha.md answers "how should HA be built," operations.md answers "something broke, what do I check," and this file answers "I need to deliberately take a controller down, work on it, and bring it back without dropping the cluster below quorum."

Topics: clustered-controller standby/return runbook with quorum pre-checks; VIP/HAProxy migration impact; Galera IST/SST and RabbitMQ rejoin return-verification; pcs node standby vs cluster-wide maintenance-mode; STONITH/fence-agent operation and reconfiguration after a BMC/iDRAC IP change; the live-vs-persisted-config distinction; Ironic baremetal node driver-info and port (NIC MAC) management; the undercloud-as-power-manager relationship; director node-replacement flow; and the BMC-IP-change "triad" that spans Ironic, Pacemaker STONITH, and deployment templates.

For deployment-tool selection and the day-1 deploy flow, see providers/openstack/deployment-tools.md. For unplanned failure triage (split-brain, agent flapping, token-key drift), see providers/openstack/operations.md. For HA design decisions (keepalived vs Pacemaker, queue type, SST method), see providers/openstack/control-plane-ha.md.

Checklist

Why This Matters

The single most common way a planned maintenance window turns into an unplanned outage is taking a controller into standby without first proving the survivors are quorate. The whole premise of "I can pull one of three controllers for a chassis swap" rests on the other two forming a Galera/RabbitMQ majority. If a second controller is silently degraded — a Galera node stuck in Donor/Desynced from a prior incident, a RabbitMQ node that never fully rejoined, a failed Pacemaker resource — then standby on the third drops the cluster to one live datastore node, which loses quorum and stops accepting writes. Every API write begins to fail (500s), and the operator who initiated a "routine" maintenance is now in an outage they caused. The quorum pre-check is therefore not paperwork; it is the step that distinguishes a safe drain from a self-inflicted control-plane outage. It must be performed against the remaining nodes specifically, and recorded, before standby.

The return-to-service verification is the mirror-image trap. Pacemaker reporting a resource "Started" on the returned node, or RabbitMQ showing the process up, says nothing about whether the datastore re-synced. A Galera node can rejoin the Pacemaker cluster and present as healthy while it is still streaming a multi-gigabyte SST, or worse, stuck mid-transfer. During that window the cluster is at N-1 redundancy while the dashboard is green — so the next single failure becomes a quorum loss that nobody expected because "all three controllers are up." The return step must explicitly confirm wsrep_local_state_comment=Synced and full wsrep_cluster_size, and confirm RabbitMQ queue replication, not merely that processes and resources started.

The standby vs maintenance-mode confusion is a Pacemaker-specific foot-gun with opposite failure modes. pcs node standby <node> is "evict this node's resources to the survivors" — exactly what single-node hardware maintenance wants. pcs property set maintenance-mode=true is "stop monitoring everything, leave it running un-managed" — what you want when patching Pacemaker/Corosync itself, and catastrophic if you then power the node off (you just killed services Pacemaker was told to ignore). Operators reach for whichever they remember; the runbook must name the correct one for the task and explain the difference, because the wrong choice is an outage in one direction and unwanted failovers in the other.

The BMC-IP-change triad is where chassis swaps go wrong weeks after the fact. A new chassis often means a new BMC/iDRAC IP. The obvious update — point Pacemaker's fence_ipmilan at the new address — fixes fencing. But on a director-managed estate, Ironic also holds that BMC address in driver-info for power management, and both the STONITH config and the Ironic registration live in deployment templates that a future overcloud deploy will re-assert. Fix only the live Pacemaker config and: (a) the undercloud still can't power-manage the node, and (b) the next deploy reverts your fence-agent fix and re-breaks fencing. Fencing failures are insidious because they are silent until the moment you need a failover — the cluster cannot fence a node it can't reach, so it refuses to fail its resources over, and a routine node failure becomes a hung cluster. The triad — Ironic driver-info, Pacemaker STONITH, deployment templates — must be updated together, and the live-vs-persisted discipline applies to every one of them.

Common Decisions (ADR Triggers)

ADR: Single-Node Maintenance Drain Mechanism (Pacemaker standby vs service-level disable)

Trigger: Planning a hardware-maintenance window for one controller in an HA cluster.

Considerations:

- On a Pacemaker-managed estate (typically director/RHOSP), pcs node standby <node> is the correct drain: it relocates the VIP and managed resources to the survivors in one operation, and Pacemaker prevents resources from returning until unstandby.
- On a keepalived + service-level estate (typical Kolla-Ansible / OSA), there is no Pacemaker; drain by disabling the node's HAProxy backends and lowering its keepalived priority (or stopping keepalived to move the VIP), plus openstack compute service set --disable for any co-located compute role, then stopping services gracefully.
- Either way the quorum pre-check on survivors and the return-verification are mandatory and identical; only the drain mechanism differs.
- Decision driver: which HA stack the deployment actually runs (see the keepalived-vs-Pacemaker ADR in control-plane-ha.md). Do not assume Pacemaker: most community deployments use keepalived and have no pcs at all.
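Since the drain mechanism depends on which HA stack is actually running, a quick probe on the controller can settle the question before the runbook is chosen. The following is a rough sketch only: detect_ha_stack is a made-up helper name, and the heuristics (a pcs binary plus a running corosync process for Pacemaker, a running keepalived process otherwise) are assumptions that may not hold on every deployment. Treat an "unknown" result as a prompt to check by hand.

```shell
# detect_ha_stack (hypothetical helper): prints pacemaker, keepalived, or unknown.
# Heuristic only; process names and packaging are assumptions.
detect_ha_stack() {
  if command -v pcs >/dev/null 2>&1 && pgrep -x corosync >/dev/null 2>&1; then
    echo pacemaker
  elif pgrep -x keepalived >/dev/null 2>&1; then
    echo keepalived
  else
    echo unknown
  fi
}

echo "HA stack on this host: $(detect_ha_stack)"
```

The process check is used for keepalived because it often runs inside a container, where its binary is not on the host PATH even though the process is visible from the host.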

ADR: Director-Based Estate (Repair-in-Place vs Node Replacement vs Migrate to RHOSO/Community)

Trigger: A director/TripleO controller needs significant hardware work or has failed, on an estate already flagged by TripleO retirement.

Considerations:

- Repair-in-place (chassis swap, same node identity): lowest churn; requires the BMC-IP triad and NIC-MAC handling but no overcloud topology change. Preferred for a single failed component.
- Node replacement via the director replacement workflow: register new Ironic node + ports, update the node-registration inventory and pacemaker-fencing environment file, run the targeted deploy. Heavier; correct when the node identity/hardware generation changes.
- Migrate off director: since TripleO is retired (Epoxy 2025.1), a major hardware refresh is a natural decision point to evaluate moving to RHOSO (OpenShift-based, Red Hat) or a community tool (Kolla-Ansible) rather than investing further in director runbooks; weigh existing OpenShift investment and support posture (see deployment-tools.md).
- Decision driver: scope of the hardware change, remaining director support horizon, and whether an OpenShift platform already exists.

ADR: Fence-Agent Configuration Source of Truth (live pcs vs templates)

Trigger: A BMC IP/credential change must be applied to STONITH on a director-managed cluster.

Considerations:

- Applying via live pcs stonith update restores fencing immediately but is reverted by the next overcloud deploy.
- Applying via the fencing environment template then redeploying is durable but slower and runs a full deploy.
- Correct practice is both: live update to restore fencing now (fencing gaps are an active risk), then persist to the template so the fix survives. The ADR exists to make the "persist it too" step non-optional, because the live-only fix is the common regression.

Operational Runbooks

Runbook: Planned Single-Node Controller Hardware Maintenance / Chassis Replacement

Assumes a 3-controller HA cluster. Adjust node counts for larger clusters (the quorum rule is: survivors must remain a strict majority of the Galera/RabbitMQ membership throughout).
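The majority rule is simple arithmetic, and it can help to write it down before the window. The sketch below is illustrative only: majority_needed and safe_to_drain are names made up for this example, and "healthy survivors" means the nodes other than the target that are fully synced and joined.

```shell
# Hypothetical helpers (not part of any OpenStack tooling).

# majority_needed TOTAL
# Prints the smallest strict majority of TOTAL cluster members.
majority_needed() {
  echo $(( $1 / 2 + 1 ))
}

# safe_to_drain TOTAL HEALTHY_SURVIVORS
# Succeeds only if the healthy survivors alone still form a strict majority.
safe_to_drain() {
  [ "$2" -ge "$(majority_needed "$1")" ]
}

# 3 controllers, both survivors healthy: 2 >= 2, safe.
safe_to_drain 3 2 && echo "3 nodes, 2 healthy survivors: safe"

# 3 controllers, one survivor silently degraded: 1 < 2, not safe.
safe_to_drain 3 1 || echo "3 nodes, 1 healthy survivor: NOT safe"
```

The second case is the one the pre-check exists to catch: the cluster looks like three nodes, but only one survivor would actually count toward the majority.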

1. Pre-check — prove the survivors are quorate (load-bearing, do not skip):

# On EACH surviving controller (not the one going down):
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"          # = full node count (e.g. 3)
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"   # = Synced
rabbitmqctl cluster_status                                  # all members joined, no partitions
pcs status                                                  # (Pacemaker estates) no Failed/Stopped resources

Abort the window if any survivor is not fully synced/joined — fix that first. Taking a node down now would drop below quorum.

2. Drain the target node:

- Pacemaker estate: pcs node standby <node>, then verify with pcs status that resources (incl. the VIP) have moved to survivors.
- Keepalived estate: disable the node's HAProxy backends (or stop HAProxy on it), stop keepalived so the VIP moves, and openstack compute service set --disable --disable-reason "maintenance" <host> nova-compute if it is also a compute host.

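3. Confirm post-drain state: each survivor reports the expected wsrep_cluster_size (the full count while MariaDB is still running on the drained node, one fewer if the drain has already stopped it, as happens where Galera is a Pacemaker-managed resource) and wsrep_cluster_status=Primary; RabbitMQ is healthy; and the VIP responds via the survivors. Expect a brief API blip as the VIP relocates.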

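4. Graceful stop + shutdown: stop OpenStack services on the node, then stop MariaDB and RabbitMQ cleanly, then shut down the OS. A clean Galera stop records the node's final position (seqno) in grastate.dat, which is what allows a fast IST on return, provided the donor's gcache still covers the gap. A clean RabbitMQ stop is an orderly departure from the cluster, so the node can resync from the running members when it restarts.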

5. Physical maintenance / chassis swap. If the chassis (and thus BMC and/or NICs) changes, see the BMC-IP-change triad and NIC-MAC runbooks below before returning to service.

6. Power on, boot OS, start datastores first, then OpenStack services.

7. Return-to-service verification (resync-aware):

mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"   # Synced (not Donor/Desynced/Joining)
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"          # back to full count
rabbitmqctl cluster_status                                  # node rejoined, no partitions (check queue replication separately)
pcs status                                                  # (Pacemaker) resources Started AND monitored

Confirm whether Galera did IST (fast) or SST (full transfer) — a long SST means redundancy is reduced until it completes.

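8. Un-drain: pcs node unstandby <node> (Pacemaker), or re-enable HAProxy backends, restart keepalived, and run openstack compute service set --enable … nova-compute (keepalived). Where Galera and RabbitMQ are themselves Pacemaker-managed resources (as on director deployments), they start only after unstandby, so repeat the step 7 checks at this point. Verify the node takes traffic and the cluster is at full redundancy.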
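Steps 1 and 7 both reduce to reading two wsrep status values and refusing to continue unless they are right. A minimal sketch of that gate follows, under some assumptions: the helper names are invented for this example, the mysql client can authenticate without extra flags (as the commands in the runbook imply), and -N -B is used so each status row prints as a bare name<TAB>value line. The real commands appear only in the commented usage, so the helpers can be read and tried without a cluster.

```shell
# Hypothetical helpers for the Galera checks in steps 1 and 7.

# wsrep_value: reads one "name<TAB>value" row on stdin, prints the value.
wsrep_value() {
  cut -f2
}

# galera_node_ok EXPECTED_SIZE ACTUAL_SIZE STATE
# Succeeds only for a full-size cluster AND a Synced local state.
galera_node_ok() {
  [ "$2" = "$1" ] && [ "$3" = "Synced" ]
}

# wait_until_synced GET_STATE_CMD TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls GET_STATE_CMD until it prints "Synced". Returns 1 on timeout,
# which is the cue to investigate a stuck IST/SST rather than un-drain.
wait_until_synced() {
  get_state=$1
  timeout=$2
  interval=${3:-10}
  waited=0
  while :; do
    state=$($get_state)
    [ "$state" = "Synced" ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "still '$state' after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Usage on a controller (commented out: needs a live cluster):
#   get_state() {
#     mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment';" | wsrep_value
#   }
#   size=$(mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size';" | wsrep_value)
#   galera_node_ok 3 "$size" "$(get_state)" || echo "ABORT: not fully synced"
#   wait_until_synced get_state 3600 15
```

Polling with a timeout, rather than checking once, reflects the point made in step 7: a node that is mid-SST is not yet contributing redundancy, and the window should not be closed until it reports Synced.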

Runbook: Pacemaker / Corosync Day-2 Quick Reference

pcs status                          # cluster, node, and resource state
pcs node standby <node>             # evict this node's resources to survivors (single-node maintenance)
pcs node unstandby <node>           # return the node to eligibility
pcs property set maintenance-mode=true   # stop monitoring CLUSTER-WIDE, leave resources running un-managed
pcs property set maintenance-mode=false  # resume monitoring
pcs resource move <rsc> <node>      # relocate a resource (creates a location constraint!)
pcs resource clear <rsc>            # remove the constraint left by 'move' so it can rebalance
pcs stonith status                  # fence device state
pcs stonith config                  # inspect fence-agent parameters
pcs stonith fence <node>            # manually fence a node
pcs stonith history                 # fencing events

Key distinctions: standby moves resources off one node; maintenance-mode stops monitoring everywhere — never power a node off while it is only in maintenance-mode. After pcs resource move, always pcs resource clear or the resource is pinned and won't rebalance.
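One way to catch a forgotten pcs resource clear is to look for the constraints that a move leaves behind. The sketch below assumes those constraints carry ids beginning with cli-prefer- (and cli-ban- for bans), which is the naming Pacemaker's command-line tooling has used. The helper name and the sample line are illustrative, and the exact pcs subcommand and output layout for listing constraints differ between pcs releases.

```shell
# leftover_cli_constraints (hypothetical helper): reads a constraint listing
# on stdin and prints any constraint ids created by move/ban, one per line.
leftover_cli_constraints() {
  grep -Eo 'cli-(prefer|ban)-[^) ]+' | sort -u
}

# Illustrative input line, not captured from a real cluster:
sample='  Enabled on: controller-1 (score:INFINITY) (id:cli-prefer-haproxy-vip)'
printf '%s\n' "$sample" | leftover_cli_constraints

# Usage on a controller (commented out: needs a live cluster; the listing
# subcommand is an assumption and may be "pcs constraint config --full"
# on newer pcs releases):
#   pcs constraint --full | leftover_cli_constraints
```

Any id this prints names a resource that is still pinned; clearing that resource removes the constraint and lets it rebalance.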

Runbook: BMC/iDRAC IP-Change Triad (after a chassis swap on a director-managed estate)

When a controller's BMC IP changes, update all three and persist:

# 1. Ironic driver-info (undercloud power management) — run against the undercloud:
openstack baremetal node set <node> \
  --driver-info ipmi_address=<new-bmc-ip> \
  --driver-info ipmi_username=<user> --driver-info ipmi_password=<pass>
openstack baremetal node validate <node>     # power/management interfaces must pass

# 2. Pacemaker STONITH fence agent (live, restores fencing now) — on a controller:
pcs stonith update <fence-agent-for-node> ipaddr=<new-bmc-ip>   # param may be 'ip=' depending on agent
pcs stonith status

# 3. Deployment templates (persist, or the next deploy reverts 1 and 2):
#    - update the pacemaker-fencing environment file (fence_ipmilan ipaddr for this node)
#    - update the node-registration / instackenv inventory (ipmi address)
#    then re-run the targeted overcloud deploy.

Skipping (2) blocks failover (cluster can't fence an unreachable node). Skipping (1) leaves the undercloud unable to power-manage. Skipping (3) silently reverts (1) and (2) on the next deploy.
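Because the three records are updated by hand in three places, a quick comparison before closing the change can catch a missed one. The helper below is a hypothetical sketch: it only compares values passed to it, and how those values are collected (for example from openstack baremetal node show, pcs stonith config, and the fencing environment file) is left to the operator because output formats vary. The addresses are documentation examples.

```shell
# bmc_triad_check EXPECTED IRONIC STONITH TEMPLATE   (hypothetical helper)
# Compares the BMC address recorded in each of the three places against the
# expected new address. Prints one line per mismatch; returns 1 if any differ.
bmc_triad_check() {
  expected=$1
  rc=0
  for pair in "ironic=$2" "stonith=$3" "template=$4"; do
    source_name=${pair%%=*}   # text before the first "="
    value=${pair#*=}          # text after the first "="
    if [ "$value" != "$expected" ]; then
      echo "MISMATCH $source_name: $value (expected $expected)"
      rc=1
    fi
  done
  return "$rc"
}

# Example: Ironic and the live STONITH config were updated, the template was not.
bmc_triad_check 192.0.2.50 192.0.2.50 192.0.2.50 192.0.2.21 \
  || echo "triad incomplete: do not close the change"
```

The example models the live-only fix described earlier: everything works today, and the stale template value is the one that would reassert itself on the next deploy.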

Runbook: NIC-MAC Handling on Chassis Swap

If the replacement chassis has different NIC MAC addresses:

openstack baremetal port list --node <node>             # current ports/MACs
openstack baremetal port set <port-uuid> --address <new-mac>   # or create new ports for the new NICs

Also review os-net-config / predictable-interface-naming templates if interfaces are keyed by MAC. A swap that fixes the BMC IP but leaves stale NIC MACs typically fails to PXE/provision or comes up with mis-mapped interfaces.
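To find which registered ports are stale, it can help to compare the MACs Ironic holds against the MACs of the replacement chassis. The following is a sketch under assumptions: the helper names and the sample MACs are invented, the registered list is assumed to come from something like openstack baremetal port list --node <node> -f value -c Address, and the new chassis list is assumed to come from wherever the operator records it (vendor sheet, BMC inventory, or the booted OS).

```shell
# normalize_macs (hypothetical helper): lower-case, colon-separated, sorted, unique.
normalize_macs() {
  tr 'A-Z' 'a-z' | tr '-' ':' | sort -u
}

# stale_macs REGISTERED ACTUAL   (hypothetical helper)
# Each argument is a newline-separated MAC list. Prints MACs that Ironic still
# has registered but that are absent from the replacement chassis.
stale_macs() {
  reg=$(mktemp) || return 2
  act=$(mktemp) || { rm -f "$reg"; return 2; }
  printf '%s\n' "$1" | normalize_macs > "$reg"
  printf '%s\n' "$2" | normalize_macs > "$act"
  comm -23 "$reg" "$act"    # lines only in the registered list
  rm -f "$reg" "$act"
}

# Example MACs are made up for illustration.
registered='AA:BB:CC:00:00:01
aa:bb:cc:00:00:02'
actual='aa-bb-cc-00-00-02
aa-bb-cc-00-00-99'
stale_macs "$registered" "$actual"
```

Normalizing case and separators first matters because BMC inventories and vendor sheets often print MACs in a different style from Ironic, and a plain text comparison would report every port as stale.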

Reference Links

OpenStack Operations Guide — maintenance/availability — routine maintenance, controller/compute/storage node procedures
ClusterLabs Pacemaker documentation — Pacemaker concepts, pcs/crm operation, fencing/STONITH
Red Hat — Configuring and managing high availability clusters (RHEL HA Add-On) — pcs, standby vs maintenance-mode, fence_ipmilan
OpenStack Ironic documentation — baremetal node/port management, driver-info, validate, power management
Red Hat OpenStack Platform director docs — Director Installation and Usage; High Availability Deployment and Usage; Replacing Controller Nodes
Galera Cluster crash recovery — IST/SST, bootstrap, resync verification