OpenShift Operations

Scope

This file covers OpenShift operational depth -- the concrete commands and pre-flight branching that operators execute during runtime failures specific to OpenShift's operator-driven model. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md. It is distinct from the architecture-decision content in providers/openshift/infrastructure.md (cluster topology; IPI, UPI, and Assisted Installer deployment) and from the Kubernetes-layer incident-response content in providers/kubernetes/incident-response.md. Topics: cluster-version-operator (CVO) stuck states, ClusterOperator degraded triage, MachineConfig Operator (MCO) degraded states and pool rendering, OperatorHub/OLM subscription failures, etcd-operator behavior on OpenShift, MachineSet/MachineHealthCheck remediation, and information-only vs change-control command boundaries. For Kubernetes-layer pod/node/etcd troubleshooting, see providers/kubernetes/incident-response.md. For OpenShift Data Foundation (ODF/Rook-Ceph) symptoms, see providers/ceph/operations.md. For OpenShift design (network, storage, identity, security context constraints), see the design files in providers/openshift/.

Checklist

Why This Matters

OpenShift's operator-driven design changes the operational model in ways that catch operators trained on upstream Kubernetes. In upstream Kubernetes, if a Deployment's pod is broken, you kubectl edit the Deployment and the change persists. In OpenShift, if you oc edit an operator-managed resource, the operator's reconcile loop will revert your change within seconds -- the resource is owned by an operator that has its own desired state, and the operator wins. This pattern repeats across the cluster: ClusterOperators own most cluster infrastructure (network, storage, monitoring, console, authentication), the MCO owns node configuration, the etcd Operator owns etcd, and the CVO owns all of them. An operator who treats OpenShift as "Kubernetes plus more YAML" will fight the operators and lose.

The MCO degraded state is OpenShift's most distinctive failure mode and the one most likely to cause prolonged outages. The MCO renders user-supplied MachineConfig resources into a single rendered config per pool, then applies that config by cordoning, draining, rebooting, and uncordoning each node in turn. A bad MachineConfig (invalid Ignition, syntax error, conflicting file) is rejected at the node level, the node fails to come up healthy, the MCO marks the pool degraded, and the rollout halts. The whole pool is now stuck for three reasons: (a) the bad MC cannot be applied to subsequent nodes, (b) the previous good MC has already been replaced as the rendered target, and (c) the operator does not auto-revert. Recovery requires identifying the bad MC, removing or fixing it, and allowing the MCO to render a new target. From there, either let the rollout proceed or, once the operator has caught up, return any node that is still cordoned to service with oc adm uncordon. Note that oc adm uncordon only re-enables scheduling on a node; it does not roll back configuration by itself. This is among the most common causes of "the cluster is stuck" support cases on OpenShift.
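The first triage step for a stuck pool, finding which pool is degraded and which rendered config it is trying to reach, can be sketched in Python over the JSON printed by `oc get mcp -o json`. This is a minimal sketch rather than an official tool: the function name `summarize_pools` and the sample data are illustrative, and the field names are assumed to follow the MachineConfigPool layout (`spec.paused`, `spec.configuration.name`, `status.configuration.name`, `status.conditions`).

```python
import json


def summarize_pools(mcp_list_json: str) -> list:
    """Summarize each MachineConfigPool found in `oc get mcp -o json` output."""
    summaries = []
    for pool in json.loads(mcp_list_json).get("items", []):
        spec = pool.get("spec", {})
        status = pool.get("status", {})
        # Conditions arrive as a list of {type, status, message}; index them by type.
        conditions = {c["type"]: c for c in status.get("conditions", [])}
        degraded = conditions.get("Degraded", {})
        summaries.append({
            "name": pool["metadata"]["name"],
            "paused": bool(spec.get("paused", False)),
            "degraded": degraded.get("status") == "True",
            "message": degraded.get("message", ""),
            # Rendered config the pool is rolling toward.
            "target": spec.get("configuration", {}).get("name", ""),
            # Rendered config the pool status reports as current.
            "current": status.get("configuration", {}).get("name", ""),
        })
    return summaries


# Illustrative sample only; real output carries many more fields.
SAMPLE_MCP = json.dumps({"items": [
    {"metadata": {"name": "master"},
     "spec": {"configuration": {"name": "rendered-master-aaa"}},
     "status": {"configuration": {"name": "rendered-master-aaa"},
                "conditions": [{"type": "Degraded", "status": "False"}]}},
    {"metadata": {"name": "worker"},
     "spec": {"paused": False, "configuration": {"name": "rendered-worker-new"}},
     "status": {"configuration": {"name": "rendered-worker-old"},
                "conditions": [{"type": "Degraded", "status": "True",
                                "message": "sample degraded message"}]}},
]})

for summary in summarize_pools(SAMPLE_MCP):
    if summary["degraded"]:
        print(summary["name"], summary["current"], "->", summary["target"])
```

A pool whose target and current rendered names differ while Degraded is True is the pattern described above: the new render was set as the target, but the rollout could not complete.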

The CVO stuck-during-upgrade scenario is structurally similar but sits one level higher. The CVO applies the cluster's manifest list (which includes ClusterOperator versions) in dependency order; if any ClusterOperator (CO) fails to upgrade, the CVO halts and reports Progressing=True with no actual progress. The naive response is to "force the upgrade" via CVO overrides, which carries the broken state to a higher version and makes recovery harder. The right response is to diagnose the underlying CO, fix it, and let the CVO resume. Red Hat support cases that open as "the cluster is stuck on 4.X.Y" usually turn out to be "ClusterOperator Z is degraded and the CVO is correctly waiting for it."
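Finding the ClusterOperator the CVO is waiting on can be sketched as a filter over `oc get clusteroperators -o json`. The sketch below assumes the standard ClusterOperator condition types (`Available`, `Degraded`) with string statuses; the function name `blocking_operators` and the sample data are illustrative.

```python
import json


def blocking_operators(co_list_json: str) -> list:
    """Return (name, condition, message) for operators that are Degraded=True or Available=False."""
    blocking = []
    for co in json.loads(co_list_json).get("items", []):
        name = co["metadata"]["name"]
        for cond in co.get("status", {}).get("conditions", []):
            pair = (cond.get("type"), cond.get("status"))
            # These two states are the usual reasons an upgrade stops making progress.
            if pair in (("Degraded", "True"), ("Available", "False")):
                blocking.append((name, f"{pair[0]}={pair[1]}", cond.get("message", "")))
    return blocking


# Illustrative sample only; real output carries many more fields.
SAMPLE_CO = json.dumps({"items": [
    {"metadata": {"name": "authentication"},
     "status": {"conditions": [
         {"type": "Available", "status": "True"},
         {"type": "Degraded", "status": "False"}]}},
    {"metadata": {"name": "machine-config"},
     "status": {"conditions": [
         {"type": "Available", "status": "True"},
         {"type": "Degraded", "status": "True",
          "message": "sample: pool worker is degraded"}]}},
]})

for name, condition, message in blocking_operators(SAMPLE_CO):
    print(name, condition, message)
```

An empty result while the CVO still reports no progress suggests looking at the ClusterVersion conditions themselves rather than at an individual operator.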

The etcd disaster-recovery procedure on OpenShift is fundamentally different from upstream Kubernetes etcd recovery. OpenShift provides scripts (cluster-backup.sh, cluster-restore.sh) on every control-plane node that handle the operator-managed etcd lifecycle correctly. Running upstream etcdctl snapshot restore directly on an OpenShift cluster bypasses the etcd Operator's state machine and creates a cluster the operator does not know how to manage. The runbook must direct operators to the official scripts and call out that direct etcdctl manipulation is a vendor-support-required path.

The oc adm node-logs / oc debug node path matters because IPI-installed OpenShift clusters often have no usable SSH access at all -- the installer creates RHCOS nodes that are managed via the MCO and the API server, and SSH works only if a key was supplied at install time and the nodes are reachable from the operator's network. An operator trained on traditional Linux SSH-into-the-host troubleshooting will be blocked. The OpenShift-native paths are not optional alternatives; on such clusters they are the only paths.
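The two SSH-free access paths named above can be sketched as small argv builders, which is one way a runbook tool might assemble the commands without shell-quoting mistakes. The helper names `node_logs_command` and `debug_node_command` and the node name `worker-0` are hypothetical; the `oc` subcommands and the `chroot /host` step are the standard ones, but verify flags against the `oc` version in use.

```python
import shlex


def node_logs_command(node: str, unit: str) -> list:
    """argv for reading one systemd unit's journal from a node through the API server."""
    return ["oc", "adm", "node-logs", node, "-u", unit]


def debug_node_command(node: str, host_command: list) -> list:
    """argv for running a command in the host filesystem via a debug pod on the node."""
    # The debug pod mounts the host root at /host, so chroot there before running anything.
    return ["oc", "debug", f"node/{node}", "--", "chroot", "/host", *host_command]


# Hypothetical node name, used only for illustration.
print(shlex.join(node_logs_command("worker-0", "kubelet")))
print(shlex.join(debug_node_command("worker-0", ["systemctl", "status", "kubelet"])))
```

The same debug path is how the control-plane scripts mentioned earlier are reached: the documented backup invocation in recent 4.x releases is `/usr/local/bin/cluster-backup.sh /home/core/assets/backup`, run from a debug session on a control-plane node. Check the disaster-recovery documentation for the cluster's version before relying on that path.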

Common Decisions (ADR Triggers)

must-gather vs inspect vs targeted log collection -- Full must-gather is the supported diagnostic-capture path for vendor-support cases but produces large archives (gigabytes) and takes minutes to hours. oc adm inspect is faster and scoped to the resources named on the command line. Targeted collection (specific operator pod logs, specific node logs) is fastest but requires the operator to know the right scope. Default to scoped collection for self-service triage and to full must-gather for any vendor-support escalation.
MCP pause: explicit-window vs automated -- Pausing a MachineConfigPool (MCP) during incidents is a useful tool, but the unpause step is easy to forget. Some teams build automation (a controller that auto-unpauses after a TTL); others enforce a calendar-reminder discipline. The trade-off is operator workload vs failure mode: a forgotten paused pool stops receiving MachineConfig updates and drifts from the cluster version.
MachineHealthCheck aggressiveness -- MHCs auto-remediate by deleting and replacing unhealthy machines. Aggressive MHCs (short timeouts, broad conditions) recover from transient failures faster but can cause replacement storms during real incidents. Conservative MHCs (long timeouts, narrow conditions) avoid storms but leave failed machines in place. The right setting depends on workload tolerance to machine churn and the underlying infrastructure's reliability.
Operator subscription update strategy: Automatic vs Manual -- installPlanApproval: Automatic lets OLM auto-update operators when new versions appear; Manual requires a person to explicitly approve each install plan. Automatic is faster and matches managed-service expectations; Manual is the safer choice for production environments where operator updates have caused outages. A common practice is Manual on production and Automatic on non-production.
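The TTL idea from the MCP pause decision can be sketched as a check over `oc get mcp -o json` that reports pools paused for longer than an agreed window. Everything about the bookkeeping here is an assumption: the annotation key `example.com/paused-at` is hypothetical (OpenShift does not record a pause time itself), and a team adopting this would have to write that annotation whenever it pauses a pool.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical annotation written by the team's own pause tooling; not an OpenShift API.
PAUSED_AT_ANNOTATION = "example.com/paused-at"


def overdue_paused_pools(mcp_list_json: str, now: datetime, ttl: timedelta) -> list:
    """Return names of paused pools whose recorded pause time is older than ttl."""
    overdue = []
    for pool in json.loads(mcp_list_json).get("items", []):
        if not pool.get("spec", {}).get("paused", False):
            continue
        annotations = pool["metadata"].get("annotations") or {}
        stamp = annotations.get(PAUSED_AT_ANNOTATION)
        if stamp is None:
            # Paused with no recorded time: report it, since its age is unknown.
            overdue.append(pool["metadata"]["name"])
            continue
        # Timestamps are expected in ISO 8601 with an explicit UTC offset.
        if now - datetime.fromisoformat(stamp) > ttl:
            overdue.append(pool["metadata"]["name"])
    return overdue


# Illustrative sample only.
SAMPLE_PAUSED = json.dumps({"items": [
    {"metadata": {"name": "master"}, "spec": {"paused": False}},
    {"metadata": {"name": "worker",
                  "annotations": {PAUSED_AT_ANNOTATION: "2024-01-01T12:00:00+00:00"}},
     "spec": {"paused": True}},
    {"metadata": {"name": "infra",
                  "annotations": {PAUSED_AT_ANNOTATION: "2024-01-01T22:00:00+00:00"}},
     "spec": {"paused": True}},
]})

now = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
print(overdue_paused_pools(SAMPLE_PAUSED, now, timedelta(hours=4)))
```

This sketch only reports overdue pools. Whether a tool should go further and unpause automatically is the trade-off described in the decision above.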

Reference Links

OpenShift troubleshooting -- official OpenShift troubleshooting guide
Gathering data about your cluster -- must-gather, inspect, and other diagnostic tools
Disaster recovery (etcd) -- official OpenShift etcd backup/restore procedure
Machine Config Operator -- MCO operations including pool pause and MachineConfig management
Operator Lifecycle Manager -- OLM concepts, subscriptions, install plans, CSVs
Cluster Version Operator -- CVO source repository with manifest/upgrade logic