OpenShift Operations

Scope

This file covers OpenShift operational depth -- the concrete commands and pre-flight branching that operators execute during runtime failures specific to OpenShift's operator-driven model. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/openshift/infrastructure.md (cluster topology, IPI/UPI/AI deployment) and the Kubernetes-layer incident-response in providers/kubernetes/incident-response.md. Topics: cluster-version-operator (CVO) stuck states, ClusterOperator degraded triage, MachineConfig Operator (MCO) degraded states and pool rendering, OperatorHub/OLM subscription failures, etcd-operator behavior on OpenShift, MachineSet/MachineHealthCheck remediation, and information-only vs change-control command boundaries. For Kubernetes-layer pod/node/etcd troubleshooting, see providers/kubernetes/incident-response.md. For OpenShift Data Foundation (ODF/Rook-Ceph) symptoms, see providers/ceph/operations.md. For OpenShift design (network, storage, identity, security context constraints), see the design files in providers/openshift/.

Checklist

  • [Critical] Is the boundary between information-only commands (oc get clusteroperators, oc get clusterversion, oc get nodes, oc adm must-gather for read-only diagnostics, oc describe co/<operator>, oc get machineconfigpool, oc get co <name> -o yaml) and change-control commands (anything that pauses an MCP, edits a MachineConfig, force-deletes a stuck operator pod, runs oc adm cordon / oc adm drain, or modifies clusterversion) explicit in the runbook -- so on-call engineers know which side of the line each step sits on, especially given OpenShift's operator-reconcile model where wrong edits get reverted automatically (or worse, fight the operator)?
  • [Critical] Is oc get clusteroperators treated as the canonical cluster-health surface -- every ClusterOperator (CO) reports Available, Progressing, Degraded conditions; Available=False or Degraded=True on any CO is a real incident and the CO's name identifies which subsystem (e.g., network, kube-apiserver, etcd, monitoring, machine-config); oc describe co/<name> shows the specific reason; do not start with oc get pods because OpenShift's operators may have pods running fine while the operator's reconcile loop is stuck for an upstream-resource reason?
  • [Critical] Is diagnostic capture done via oc adm must-gather before any mutating action -- oc adm must-gather collects logs and CRs across all operators into a structured archive that vendor support expects; targeted variants exist (for example, oc adm must-gather --image=quay.io/konveyor/must-gather:latest for the migration toolkit, --image=registry.redhat.io/odf4/odf-must-gather-rhel8 for ODF; the exact image name and tag depend on the installed product version, so confirm them against that product's documentation) and should be chosen to match the troubled subsystem; capture before restarting anything because operator-reconcile state is volatile?
  • [Critical] Is the MachineConfig Operator (MCO) degraded triage understood -- oc get machineconfigpool shows pool state per role (master, worker, custom); Updated=False, Updating=True is a normal in-progress render-and-roll; Degraded=True is the failure state; oc describe mcp/<name> shows the failed nodes and the rendered MC; common causes are: a MachineConfig with invalid Ignition (rejected by the node), a node that cannot drain due to PDBs, a node with a hardware/connectivity issue preventing reboot, or a stuck kubelet after the reboot; the MCO renders and rolls -- editing a MachineConfig triggers a new rendered MC, which the MCO applies via cordon/drain/reboot one node at a time by default (maxUnavailable: 1), so a single bad node blocks the whole pool?
  • [Critical] Is MCP pause discipline documented -- oc patch mcp/<name> --type merge -p '{"spec":{"paused":true}}' halts MCO rollouts on that pool; useful for: avoiding mid-incident rollouts that compound the issue, applying multiple MachineConfig changes atomically rather than each triggering a roll, or freezing the pool while debugging a stuck node; never leave a pool paused after the work is done, because a paused pool receives no new rendered config (including the OS and configuration changes delivered by cluster upgrades) and silently drifts from the rest of the cluster; pair with a calendar reminder and an explicit unpause step (the same patch with "paused":false)?
  • [Critical] Is the cluster-version-operator (CVO) stuck triage understood -- the ClusterVersion resource reports Available, Progressing, and Failing conditions (it uses Failing where ClusterOperators use Degraded); oc get clusterversion gives the summary; during an upgrade, Progressing=True is normal but should advance through the manifest list within hours; if stuck, oc describe clusterversion shows the specific manifest that failed to apply (usually a ClusterOperator that is itself degraded); the right response is to fix the underlying CO, not to force-edit clusterversion; oc adm upgrade --to-image=... and CVO overrides exist but are essentially "vendor-support escalation" tools, and using them without RH support direction can break upgrade integrity?
  • [Critical] Is the etcd-operator behavior on OpenShift distinct from upstream Kubernetes etcd recovery -- OpenShift runs etcd as a static-pod-based 3-node cluster managed by the etcd Operator; oc get pods -n openshift-etcd shows the etcd pods; oc rsh -n openshift-etcd <pod> and etcdctl endpoint status --cluster --write-out=table (with the env vars set inside the pod) shows member health; quorum-loss recovery on OpenShift uses the official disaster-recovery procedure (cluster-restore.sh from /usr/local/bin/ on a control-plane node, requires a recent etcd snapshot from cluster-backup.sh); manually running etcdctl snapshot restore outside this procedure breaks the operator-managed lifecycle?
  • [Critical] Is the ClusterOperator restart pattern understood -- when a specific CO is stuck (e.g., network, monitoring, authentication), the right intervention is usually to delete the operator's pod (oc delete pod -n openshift-<operator-name> -l <selector>) and let the deployment recreate it, not to delete the operator's CRs or the CO resource itself; operator namespaces do not follow a single naming pattern, so confirm the namespace first (status.relatedObjects in oc get co <name> -o yaml lists the namespaces involved); oc get co <name> -o yaml and the operator's deployment annotations show the current managed-state; deleting the CO resource is destructive: the CVO recreates it from the release payload, and the blast radius is wide?
  • [Recommended] Are OperatorHub/OLM subscription failures triaged via the subscription chain -- oc get subscription -A for all subscriptions, oc describe sub/<name> for the specific failure (catalog-source not available, install-plan rejected, dependency not satisfied), oc get installplan -A for pending install plans (some require manual approval if installPlanApproval: Manual), oc get csv -A for the resulting ClusterServiceVersion (Phase: Failed with a reason in Status.Conditions); subscription failures cascade -- a missing dependency operator blocks the dependent's CSV, which blocks the user's workload?
  • [Recommended] Are MachineSet / MachineHealthCheck remediation patterns documented -- oc get machinesets -n openshift-machine-api and oc get machines -n openshift-machine-api show the desired vs actual machine state; MachineHealthCheck (MHC) auto-replaces unhealthy machines based on configurable conditions; if MHC is over-aggressive, it can cause replacement-storms (machine fails, MHC deletes it, replacement fails the same way, MHC deletes that, ...); the runbook should call out how to pause MHC during investigation (oc -n openshift-machine-api annotate machinehealthcheck/<name> cluster.x-k8s.io/paused="") and how to resume it afterwards (the same command with cluster.x-k8s.io/paused- to remove the annotation)?
  • [Recommended] Is oc adm node-logs documented for cluster-managed log access -- oc adm node-logs <node-name> retrieves node journal logs via the API server (no SSH required), oc adm node-logs <node-name> --unit=kubelet for kubelet specifically, oc adm node-logs <node-name> --path= for /var/log files; this is OpenShift's preferred log-access path because IPI clusters often have no SSH access at all and UPI clusters often have SSH locked down?
  • [Recommended] Are oc debug node/<name> capabilities documented -- creates a privileged debug pod with chroot /host available, equivalent to SSH for diagnostics; use when oc adm node-logs is insufficient (need to run a command on the node, inspect runtime state, modify config when MCO is broken); pair with the standard "do not modify host filesystem outside MachineConfig" rule because MCO will revert direct edits on next reconcile?
  • [Recommended] Is rendered-MachineConfig inspection understood -- oc get mc shows individual MachineConfig resources, oc get mc rendered-<role>-<hash> shows the rendered MC for a pool (the actual config the MCO applies), oc describe mc rendered-<role>-<hash> for the contents; comparing the rendered MC across pools or before/after an incident reveals what actually changed (operator-controlled MCs are not meant to be edited directly; user MCs are layered into the render)?
  • [Recommended] Are NetworkPolicy / EgressFirewall / EgressIP debugging steps documented for OpenShift SDN/OVN-Kubernetes networking -- oc get networkpolicy -A for namespace policies, oc get egressfirewall -A (OpenShift-specific egress restriction), oc get egressip for source-NAT IP assignments; OVN-Kubernetes troubleshooting (oc rsh -n openshift-ovn-kubernetes <ovnkube-master-pod>, ovn-nbctl show, ovn-sbctl show) is the OVN-side counterpart, with the caveat that the pod layout differs by release (on recent releases the OVN databases run alongside each ovnkube-node pod rather than in ovnkube-master, so verify pod and container names with oc get pods -n openshift-ovn-kubernetes); OpenShift SDN is deprecated since 4.14 and OVN-Kubernetes is the default?
  • [Optional] Is oc adm inspect known as a lighter-weight alternative to must-gather for specific resources -- oc adm inspect ns/<namespace> collects logs and resources for a single namespace, oc adm inspect co/<operator> for a single ClusterOperator -- useful when full must-gather is too large or too slow and the issue is scoped to a known subsystem?
  • [Optional] Is OperatorHub catalog-source health understood -- oc get catalogsource -n openshift-marketplace for the default catalog sources (certified-operators, community-operators, redhat-marketplace, redhat-operators); a degraded catalog source manifests as new subscriptions failing with "no candidate operators found" and existing subscriptions failing to receive updates; check the catalog-source pod logs in openshift-marketplace namespace?
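
The "start from ClusterOperators, not pods" item above can be sketched as a small read-only helper. This is a minimal sketch, not a supported tool: the function names co_conditions and unhealthy_cluster_operators are hypothetical, and the jsonpath template assumes each ClusterOperator exposes its conditions under status.conditions with type and status fields.

```shell
# Information-only: nothing here mutates cluster state.

# Print one line per ClusterOperator: "<name> <Type>=<Status> <Type>=<Status> ..."
co_conditions() {
  oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .status.conditions[*]}{.type}{"="}{.status}{" "}{end}{"\n"}{end}'
}

# Print only the names of operators that are unavailable or degraded.
# The leading space in the pattern anchors on the whole condition name.
unhealthy_cluster_operators() {
  co_conditions | grep -E ' (Available=False|Degraded=True)( |$)' | awk '{ print $1 }'
}
```

Each name printed by unhealthy_cluster_operators is a candidate for oc describe co/<name>; an empty result means no ClusterOperator currently reports Available=False or Degraded=True.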

Why This Matters

OpenShift's operator-driven design changes the operational model in ways that catch engineers trained on upstream Kubernetes. In upstream Kubernetes, if a Deployment's pod is broken, you kubectl edit the Deployment and the change persists. In OpenShift, if you oc edit an operator-managed resource, the operator's reconcile loop will typically revert your change within seconds -- the resource is owned by an operator that has its own desired state, and the operator wins. This pattern repeats across the cluster: ClusterOperators own most cluster infrastructure (network, storage, monitoring, console, authentication), the MCO owns node configuration, the etcd Operator owns etcd, and the CVO owns all of them. An engineer who treats OpenShift as "Kubernetes plus more YAML" will fight the operators and lose.

The MCO degraded state is OpenShift's most distinctive failure mode and the one most likely to cause prolonged outages. The MCO renders user-supplied MachineConfig resources into a single rendered config per pool, then applies that config by cordoning, draining, rebooting, and uncordoning each node in turn. A bad MachineConfig (invalid Ignition, syntax error, conflicting file) is rejected at the node level, the node fails to come up healthy, the MCO marks the pool degraded, and the rollout halts. The whole pool is now stuck because: (a) the bad MC cannot be applied to subsequent nodes, (b) the previous good MC has already been replaced as the rendered target, and (c) the operator does not auto-revert. Recovery requires identifying the bad MC, removing or fixing it, and allowing the MCO to render a new target and roll it forward; a node left cordoned by the failed attempt may additionally need oc adm uncordon once the MCO has reconciled it. This is one of the most common causes of "the cluster is stuck" support cases on OpenShift.
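
The pool-level triage and the pause/unpause step described above could be wrapped roughly as follows. This is a sketch under stated assumptions: degraded_pools and set_pool_paused are hypothetical helper names, and the jsonpath template assumes MachineConfigPool conditions live under status.conditions. The patch itself is the one given in the checklist.

```shell
# Information-only: names of pools whose Degraded condition is True.
# The leading space keeps RenderDegraded/NodeDegraded from matching.
degraded_pools() {
  oc get machineconfigpool -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .status.conditions[*]}{.type}{"="}{.status}{" "}{end}{"\n"}{end}' |
    grep -E ' Degraded=True( |$)' | awk '{ print $1 }'
}

# Change-control: pause (true) or resume (false) rollouts on one pool.
set_pool_paused() {
  pool=${1:?pool name required}
  paused=${2:?true or false required}
  case "$paused" in
    true|false) ;;
    *) echo "paused must be true or false" >&2; return 2 ;;
  esac
  oc patch "mcp/$pool" --type merge -p "{\"spec\":{\"paused\":$paused}}"
}
```

Keeping pause and resume in one function makes the unpause step as easy to run as the pause step, for example set_pool_paused worker true before the investigation and set_pool_paused worker false when it ends.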

The CVO stuck-during-upgrade scenario is structurally similar but at a higher level. The CVO applies the cluster's manifest list (which includes ClusterOperator versions) in dependency order; if any CO fails to upgrade, the CVO halts and reports Progressing=True with no actual progress. The naive operator's response is to "force the upgrade" via CVO overrides, which propagates the broken state to a higher version and makes recovery harder. The right response is to diagnose the underlying CO, fix it, and let the CVO resume. Red Hat support cases involving "the cluster is stuck on 4.X.Y" are almost always actually "ClusterOperator Z is degraded and the CVO is correctly waiting for it."
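
A read-only way to see what the CVO is waiting on might look like the sketch below. It assumes the ClusterVersion object has its usual name, version, and that the upgrade blocker is surfaced through a condition of type Failing; cvo_conditions and cvo_failing are hypothetical helper names.

```shell
# Information-only: print each ClusterVersion condition as "<Type>=<Status>: <message>".
cvo_conditions() {
  oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'
}

# Print the Failing=True line, if any; exits non-zero when the CVO is not failing.
cvo_failing() {
  cvo_conditions | grep '^Failing=True'
}
```

When cvo_failing prints a message naming a ClusterOperator, the next step is that operator's own triage (oc describe co/<name>), not a change to clusterversion.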

The etcd disaster-recovery procedure on OpenShift is fundamentally different from upstream Kubernetes etcd recovery. OpenShift provides scripts (cluster-backup.sh, cluster-restore.sh) on every control-plane node that handle the operator-managed etcd lifecycle correctly. Running upstream etcdctl snapshot restore directly on an OpenShift cluster bypasses the etcd Operator's state machine and creates a cluster the operator does not know how to manage. The runbook must direct operators to the official scripts and call out that direct etcdctl manipulation is a vendor-support-required path.
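
The health check and the backup prerequisite can be sketched as two helpers. Several details here are assumptions rather than statements from this file: oc exec is used as the non-interactive counterpart of oc rsh, the pod's default container is assumed to carry the etcdctl environment variables, and /home/core/assets/backup is only an example destination directory. The helper names are hypothetical. Restore is deliberately left out; it belongs to the official disaster-recovery procedure.

```shell
# Information-only: member status table from inside an etcd pod.
etcd_member_status() {
  pod=${1:?etcd pod name required}
  oc exec -n openshift-etcd "$pod" -- etcdctl endpoint status --cluster --write-out=table
}

# Change-control: take a snapshot with the script shipped on control-plane nodes.
# The destination path is an assumed example; pass your own as the second argument.
etcd_backup() {
  node=${1:?control-plane node name required}
  dest=${2:-/home/core/assets/backup}
  oc debug "node/$node" -- chroot /host /usr/local/bin/cluster-backup.sh "$dest"
}
```

Pod names for the first helper come from oc get pods -n openshift-etcd. The snapshot written by etcd_backup is the input that cluster-restore.sh expects later.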

The oc adm node-logs / oc debug node path matters because IPI-installed OpenShift clusters often have no usable SSH access at all -- the installer creates RHCOS nodes that are managed via the MCO and the API server, SSH works only if a public key was supplied at install time and the nodes are network-reachable, and Red Hat discourages direct SSH in favor of the API-driven paths. An engineer trained on traditional Linux SSH-into-the-host troubleshooting will be blocked. The OpenShift-native paths are not optional alternatives; on many IPI clusters they are the only paths.
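
The two SSH-free access paths can be sketched as thin wrappers. The function names node_kubelet_logs and node_run are hypothetical; the oc invocations follow the forms given in the checklist, with oc debug run non-interactively by passing the command after a double dash.

```shell
# Information-only: kubelet journal for one node, fetched through the API server.
node_kubelet_logs() {
  node=${1:?node name required}
  oc adm node-logs "$node" --unit=kubelet
}

# Run one command on the node's host filesystem via a privileged debug pod.
# Whether this is information-only depends entirely on the command passed in.
node_run() {
  node=${1:?node name required}
  shift
  if [ $# -eq 0 ]; then
    echo "command required" >&2
    return 2
  fi
  oc debug "node/$node" -- chroot /host "$@"
}
```

For example, node_run <node-name> systemctl status kubelet inspects the kubelet unit without changing anything. Commands that write to the host filesystem fall under the "do not modify outside MachineConfig" rule.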

Common Decisions (ADR Triggers)

  • must-gather vs inspect vs targeted log collection -- Full must-gather is the supported diagnostic-capture path for vendor-support cases but produces large archives (gigabytes) and takes minutes to hours. oc adm inspect is faster and scoped. Targeted collection (specific operator pod logs, specific node logs) is fastest but requires the operator to know the right scope. Default to scope-down for self-service, full must-gather for any vendor-support escalation.
  • MCP pause: explicit-window vs automated -- Pausing an MCP during incidents is a useful tool but the unpause-and-resume step is forgettable. Some teams build automation (a controller that auto-unpauses after a TTL); others enforce a calendar-reminder discipline. The trade-off is operator workload vs failure mode (forgotten paused pool drifts from cluster version).
  • MachineHealthCheck aggressiveness -- MHCs auto-remediate by deleting and replacing unhealthy machines. Aggressive MHCs (short timeouts, broad conditions) recover from transient failures faster but can cause replacement storms during real incidents. Conservative MHCs (long timeouts, narrow conditions) avoid storms but leave failed machines in place. The right setting depends on workload tolerance to machine churn and the underlying infrastructure's reliability.
  • Operator subscription update strategy: Automatic vs Manual -- installPlanApproval: Automatic lets OLM auto-update operators on new versions; Manual requires an administrator to explicitly approve each install plan. Automatic is faster and matches managed-service expectations; Manual is the safer choice for production environments where operator updates have caused outages. A common practice is Manual on production and Automatic on non-production.
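
The first decision above (scoped capture for self-service, full must-gather for escalation) can be made explicit in a single entry point. This is a sketch: capture_diagnostics is a hypothetical name, and it assumes the --dest-dir option on both oc adm inspect and oc adm must-gather is used to control where the archive lands.

```shell
# Capture diagnostics at the chosen scope before any mutating action.
#   capture_diagnostics namespace <dest-dir> <namespace>
#   capture_diagnostics operator  <dest-dir> <clusteroperator>
#   capture_diagnostics full      <dest-dir>
capture_diagnostics() {
  scope=${1:?scope required: namespace, operator, or full}
  dest=${2:?destination directory required}
  target=${3:-}
  case "$scope" in
    namespace) oc adm inspect "ns/${target:?namespace name required}" --dest-dir="$dest" ;;
    operator)  oc adm inspect "co/${target:?ClusterOperator name required}" --dest-dir="$dest" ;;
    full)      oc adm must-gather --dest-dir="$dest" ;;
    *) echo "unknown scope: $scope" >&2; return 2 ;;
  esac
}
```

Making the scope a required argument forces the on-call engineer to choose deliberately; anything headed for a vendor-support case would use the full scope.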

See Also

  • providers/openshift/infrastructure.md -- OpenShift cluster topology, IPI/UPI/AI deployment, machine-API design
  • providers/openshift/networking.md -- OVN-Kubernetes vs SDN, NetworkPolicy, EgressFirewall design decisions
  • providers/openshift/security.md -- SCCs, RBAC, OAuth design (auth-related incidents intersect this)
  • providers/openshift/data-protection.md -- backup tooling design (the prerequisite for etcd disaster recovery)
  • providers/kubernetes/incident-response.md -- Kubernetes-layer pod/node/etcd troubleshooting (OpenShift inherits this and adds the operator layer)
  • providers/ceph/operations.md -- ODF/Rook-Ceph storage operations
  • general/operational-runbooks.md -- runbook framework: structure, severity, automation decisions, postmortem process