Kubernetes Incident Response

Scope

This file covers Kubernetes incident-response operational depth -- the concrete commands, diagnostic-capture flows, and pre-flight branching that operators execute during runtime failures. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is the runtime-troubleshooting counterpart to the design/lifecycle content in providers/kubernetes/operations.md (Helm/Kustomize, GitOps, cluster upgrades, etcd backup strategy). Topics: pod stuck Pending triage, CrashLoopBackOff diagnostic capture, node NotReady branching, control-plane component failure (kube-apiserver, kube-controller-manager, kube-scheduler), etcd quorum loss recovery, kubelet/CRI-level investigation, and DNS/CNI failure modes. For cluster lifecycle, upgrades, and GitOps strategy, see providers/kubernetes/operations.md. For Rook-Ceph storage symptoms, see providers/ceph/operations.md. For OpenShift-specific operator and MCO troubleshooting, see providers/openshift/operations.md.

Checklist

Why This Matters

Kubernetes incident response is dominated by misdiagnosis. The top-of-funnel symptom -- "my pod is not running" -- maps to dozens of distinct failure modes (pending due to capacity, pending due to taints, pending due to PVC issues, crash-looping due to OOM, crash-looping due to config, image pull failure, CNI not ready, node NotReady, DNS broken, API server down). Each has a different log location, a different remediation, and a different escalation path. An operator who responds to every "pod not running" symptom by deleting and recreating the pod is correct in maybe 5% of cases and actively counterproductive in the remaining 95%: the root cause persists, the previous container's log is lost, and the restart count resets to zero, which hides how long the pod has been failing.

The discipline of running kubectl logs --previous before any delete is the Kubernetes-specific equivalent of "diagnostic capture before mutation". Once a pod is deleted, the previous container's stdout and stderr are gone unless an external log aggregator was running. For sustained crash loops, even --previous only goes one instance back -- earlier failures are overwritten. The runbook must place the previous-logs capture step unambiguously before any restart, recreate, or scale operation.
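The capture-before-mutation order could be sketched as a small shell helper. This is a minimal sketch, not a prescribed procedure: the function name `capture_crashloop`, the output directory layout, and the file names are assumptions made for illustration. It uses only standard kubectl subcommands and writes every result to disk so that a later delete cannot destroy the evidence.

```shell
# capture_crashloop NAMESPACE POD [OUTDIR]
# Hypothetical helper: saves diagnostics for a crash-looping pod BEFORE
# any delete, restart, or scale operation. Capture is best-effort, so a
# failing kubectl call is recorded in the output file instead of aborting.
capture_crashloop() {
    ns="$1"; pod="$2"; out="${3:-./capture-$2}"
    mkdir -p "$out" || return 1

    # Previous container instance first: it is the log most likely to hold
    # the crash reason, and it is lost once the pod is deleted.
    kubectl -n "$ns" logs "$pod" --previous --all-containers --timestamps \
        > "$out/logs-previous.txt" 2>&1

    # Current instance, pod status, full spec, and related events.
    kubectl -n "$ns" logs "$pod" --all-containers --timestamps \
        > "$out/logs-current.txt" 2>&1
    kubectl -n "$ns" describe pod "$pod" > "$out/describe.txt" 2>&1
    kubectl -n "$ns" get pod "$pod" -o yaml > "$out/pod.yaml" 2>&1
    kubectl -n "$ns" get events --field-selector "involvedObject.name=$pod" \
        > "$out/events.txt" 2>&1

    # Print the capture directory so the caller can attach it to the incident.
    echo "$out"
}
```

A call such as `capture_crashloop prod web-0 /var/tmp/incident-capture` (namespace, pod, and path are placeholders) would run before any `kubectl delete pod`. Because `--previous` reaches back only one instance, this complements an external log aggregator and does not replace it.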

The node NotReady branching matters because the kubelet status condition is a downstream symptom that can have many upstream causes. Ready=False (reason KubeletNotReady) with the message "PLEG is not healthy" is a CRI problem (containerd or CRI-O hung); MemoryPressure=True is a node-resource problem; DiskPressure=True is usually a log-volume or image-storage problem; NetworkUnavailable=True is a CNI problem. Restarting the kubelet "fixes" the symptom briefly because the conditions are re-evaluated, but the underlying cause remains and the node will usually return to NotReady within minutes. The right response is to read the conditions and branch.
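Reading the conditions and branching could look like the sketch below. The function names `node_conditions` and `branch_on_conditions` and the wording of the hints are assumptions for illustration. The branching itself is kept separate from the kubectl call, so it can be fed from a saved capture as well as from a live cluster. The node condition type reported by the API is `Ready`; a hung runtime typically shows up in that condition's message.

```shell
# Print each condition of a node as "TYPE=STATUS<TAB>MESSAGE".
# Requires cluster access; the jsonpath template uses kubectl's range syntax.
node_conditions() {
    kubectl get node "$1" -o \
      jsonpath='{range .status.conditions[*]}{.type}={.status}{"\t"}{.message}{"\n"}{end}'
}

# Read "TYPE=STATUS<TAB>MESSAGE" lines on stdin and print where to look next.
# Healthy conditions (for example MemoryPressure=False) produce no output.
branch_on_conditions() {
    while IFS="$(printf '\t')" read -r cond msg; do
        case "$cond" in
            Ready=False|Ready=Unknown)
                case "$msg" in
                    *PLEG*) echo "runtime: containerd/CRI-O may be hung; inspect with crictl and the runtime journal" ;;
                    *)      echo "kubelet: read the kubelet journal; message: $msg" ;;
                esac ;;
            MemoryPressure=True)
                echo "memory: node-resource problem; check per-pod memory use and eviction events" ;;
            DiskPressure=True)
                echo "disk: check log volume and image storage usage" ;;
            NetworkUnavailable=True)
                echo "network: CNI problem; check the CNI pods on this node" ;;
        esac
    done
}
```

On a live cluster the two compose as `node_conditions worker-1 | branch_on_conditions`, where `worker-1` is a placeholder node name. An empty result with a NotReady node suggests looking upstream at the control plane instead.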

The etcd quorum-loss recovery is the Kubernetes equivalent of Galera quorum-loss in OpenStack -- a single foot-gun that destroys cluster state if mishandled. Force-restarting individual etcd processes after quorum loss does not restore quorum; one member must be brought back as a new single-member cluster, either by restoring a snapshot (which writes a fresh data directory with a new cluster ID and member ID, not the previous ones) or by starting etcd on a surviving data directory with --force-new-cluster, and the others are then rejoined as new members with empty data directories. Without a recent etcd snapshot or an intact member data directory, the cluster's state is unrecoverable and the only path is rebuild + workload re-deployment from manifests. The "regular etcd snapshot cadence" requirement in the lifecycle file (providers/kubernetes/operations.md) is the prerequisite for this incident-response file's recovery procedure to even be possible.
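The snapshot-restore path for the first member could be sketched as follows. This is an illustration under stated assumptions, not a complete recovery procedure: the function name `restore_first_member`, the member name, peer URL, paths, and the generated cluster token are all hypothetical, and the sketch assumes etcd 3.5 or later, where `etcdutl` provides `snapshot status` and `snapshot restore`. It deliberately refuses to touch an existing data directory, and it covers only the first member; rejoining the remaining members is a separate step.

```shell
# restore_first_member SNAPSHOT NAME PEER_URL NEW_DATA_DIR
# Hypothetical helper: restores one etcd member from a snapshot as a new
# single-member cluster. Run with etcd stopped on the node.
restore_first_member() {
    snap="$1"; name="$2"; peer_url="$3"; data_dir="$4"

    # Guard rails: a missing snapshot or an existing target directory aborts.
    [ -s "$snap" ] || { echo "snapshot missing or empty: $snap" >&2; return 1; }
    [ ! -e "$data_dir" ] || { echo "refusing to overwrite: $data_dir" >&2; return 1; }

    # Inspect the snapshot (hash, revision, key count) before restoring it.
    etcdutl snapshot status "$snap" -w table || return 1

    # Restore into a fresh data directory. A new cluster token (assumed
    # naming scheme) keeps stale members from talking to the restored one.
    etcdutl snapshot restore "$snap" \
        --name "$name" \
        --initial-cluster "$name=$peer_url" \
        --initial-cluster-token "restored-$(date +%s)" \
        --initial-advertise-peer-urls "$peer_url" \
        --data-dir "$data_dir"
}
```

After the restored member is serving, the other members would be added one at a time with `etcdctl member add` and started with empty data directories. How etcd is pointed at the new data directory depends on the installation (static pod manifest, systemd unit) and is outside this sketch.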

The control-plane triage order matters because the symptom presentation is misleading. If the API server is down, every kubectl returns connection-refused and every kubelet's status update fails; once node leases expire, every node appears NotReady to anyone whose kubectl get nodes still reaches a surviving API server replica or a different endpoint. The naive operator concludes "all nodes are broken" and starts investigating nodes. The right move is to check the API server first (is it responding to a direct curl on its serving port? are the kube-apiserver pods running? are the etcd endpoints healthy?) and then work down. Managed Kubernetes hides this entirely -- if kubectl is failing on EKS/GKE/AKS, the right response is the cloud provider status page and a vendor-support case, not local triage.
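The three checks in that order could be sketched as one function. The function name `triage_control_plane` and the example URL are assumptions. The sketch further assumes a self-hosted cluster, that it runs on a control-plane node with `crictl` and `etcdctl` available, that the API server's `/readyz` endpoint is reachable without credentials (the default, unless anonymous access has been disabled), and that `etcdctl` picks up its endpoints and certificates from `ETCDCTL_*` environment variables. `curl -k` skips certificate verification and is used here only to answer "is anything listening and ready".

```shell
# triage_control_plane API_URL
# Hypothetical helper: checks the API server first, then works down.
triage_control_plane() {
    api="$1"   # for example https://10.0.0.10:6443 (placeholder address)

    # 1. Direct probe of the serving port, bypassing kubectl and kubeconfig.
    if curl -sk --max-time 5 "$api/readyz" | grep -q '^ok'; then
        echo "apiserver: ready; continue with node-level triage"
        return 0
    fi
    echo "apiserver: not ready at $api; checking the container and etcd"

    # 2. Is the kube-apiserver container running, exited, or restarting?
    crictl ps -a --name kube-apiserver

    # 3. Are the etcd endpoints healthy? An unhealthy etcd explains an
    #    API server that starts and then fails its readiness checks.
    etcdctl endpoint health --cluster
}
```

Only when step 1 reports ready does it make sense to move on to per-node investigation; on managed Kubernetes, steps 2 and 3 are not available to the operator at all.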

Common Decisions (ADR Triggers)

Log capture: kubectl logs --previous vs external log aggregator -- --previous is built-in but limited to one previous instance and bounded by kubelet log rotation. An external log aggregator (Fluent Bit/Vector to Loki/Elasticsearch/Splunk) preserves history but adds infrastructure. For any cluster expected to have crash-loops, the aggregator is required for forensic capability; --previous is best-effort.
Etcd snapshot strategy: cadence and storage -- Frequent snapshots (every 30 minutes) minimize state loss in disaster recovery but increase storage cost; infrequent snapshots (daily) stretch the recovery point objective to as much as 24 hours of churn. Snapshots must be stored off-cluster (object storage, NFS, or a different node) -- a snapshot kept on the same disk as etcd is lost when that disk fails. Managed Kubernetes hides this; self-hosted clusters must implement it.
kubectl debug enablement: cluster-wide vs gated -- The kubectl debug ephemeral-container feature lets operators inject containers into running pods without spec changes. Useful for triage but a security concern (privileged containers, namespace sharing). Some clusters disable it via admission policy; others gate it via RBAC. Decide explicitly; do not default to permissive without a security review.
Drain blast radius: PDB-strict vs PDB-flexible -- Strict PDBs (minAvailable set to the full replica count) block every voluntary eviction, so a drain cannot proceed until the workload is scaled up. Flexible PDBs (maxUnavailable: 1 across N replicas) allow rolling drains. The trade-off is availability guarantee vs operational flexibility. PDBs that block drain entirely are the right choice for SEV1-impact services; flexible is right for everything else.
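For the PDB decision, a pre-drain check can show which budgets currently allow no disruption at all. The following is a minimal sketch; the function names `list_pdb_budgets` and `blocking_pdbs` are assumptions, and it relies on the `status.disruptionsAllowed` field that the PodDisruptionBudget API reports. The filter is separate from the kubectl call so it can also run against saved output.

```shell
# Print every PodDisruptionBudget as "NAMESPACE/NAME DISRUPTIONS_ALLOWED".
# Requires cluster access.
list_pdb_budgets() {
    kubectl get pdb --all-namespaces -o \
      jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.disruptionsAllowed}{"\n"}{end}'
}

# Read "NAMESPACE/NAME N" lines on stdin and print the budgets where N is 0,
# meaning an eviction (and therefore a drain) would be refused right now.
# Lines without a reported value are skipped, not treated as blocking.
blocking_pdbs() {
    awk 'NF == 2 && $2 == "0" { print $1 }'
}
```

Running `list_pdb_budgets | blocking_pdbs` before a drain lists the workloads that would stall it. A budget can report zero either because it is strict by design or because the workload is already degraded, so each hit still needs a look before anyone overrides it.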

Reference Links

Debug pods -- official guide to pod-level troubleshooting including kubectl describe, kubectl logs, and kubectl debug
Debug nodes -- node and cluster-level troubleshooting, etcd debugging, control-plane component logs
Debugging running pods -- ephemeral debug containers, copying pods, executing in containers
crictl reference -- CRI-level container inspection on the node
etcd disaster recovery -- snapshot save and restore procedures
Troubleshooting kubeadm -- common kubeadm-managed cluster failures
Troubleshooting CoreDNS -- DNS-specific diagnostics