Kubernetes Incident Response
Scope
This file covers Kubernetes incident-response operational depth -- the concrete commands, diagnostic-capture flows, and pre-flight branching that operators execute during runtime failures. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is the runtime-troubleshooting counterpart to the design/lifecycle content in providers/kubernetes/operations.md (Helm/Kustomize, GitOps, cluster upgrades, etcd backup strategy). Topics: pod stuck Pending triage, CrashLoopBackOff diagnostic capture, node NotReady branching, control-plane component failure (kube-apiserver, kube-controller-manager, kube-scheduler), etcd quorum loss recovery, kubelet/CRI-level investigation, and DNS/CNI failure modes. For cluster lifecycle, upgrades, and GitOps strategy, see providers/kubernetes/operations.md. For Rook-Ceph storage symptoms, see providers/ceph/operations.md. For OpenShift-specific operator and MCO troubleshooting, see providers/openshift/operations.md.
Checklist
- [Critical] Is the boundary between information-only commands (`kubectl get`, `kubectl describe`, `kubectl logs`, `kubectl events`, `kubectl top`, `crictl ps` on the node, `etcdctl endpoint status`) and change-control commands (anything that deletes a pod, drains a node, edits a resource, restarts a control-plane component, or writes to etcd) explicit in the runbook -- so on-call engineers know which side of the line each step sits on?
- [Critical] Is diagnostic capture done before any pod is deleted -- `kubectl describe pod <name>` (events, conditions, container statuses, scheduler decisions), `kubectl logs <pod> --previous` for crashed-container logs (only available until the pod is deleted or rescheduled), `kubectl get events --sort-by='.lastTimestamp' -n <namespace>` for the timeline, `kubectl get pod <name> -o yaml` for the full spec, `crictl inspect <container>` on the node for runtime details -- so the post-incident review has the evidence rather than just confirming "deleting the pod fixed it"?
- [Critical] Is `Pending` pod triage branched correctly -- the `kubectl describe pod` Events section is authoritative: `FailedScheduling` with "Insufficient cpu/memory" means cluster capacity (check `kubectl describe nodes | grep -A 5 "Allocated resources"` and `kubectl top nodes`), "didn't tolerate" (worded "untolerated taint" in newer releases) means taints (check the Taints field in `kubectl describe node`), "nodes had volume node affinity conflict" means PV-zone mismatch, "0/N nodes are available" with no specific reason means the scheduler failed silently (check `kube-scheduler` logs), and `ImagePullBackOff` (a container waiting state on an already-scheduled pod, not a scheduler event) means image issues (registry credentials, image tag, network) -- and is the runbook clear that each branch has a different remediation?
- [Critical] Is `CrashLoopBackOff` triage done with the previous container's logs first -- `kubectl logs <pod> --previous -c <container>` shows what the container printed before exit (init failure, panic, OOM-killed), `kubectl describe pod` shows `Last State: Terminated` with `Reason: OOMKilled` (memory limit exceeded) vs `Error` (application exit code) vs `ContainerCannotRun` (config/permissions), `kubectl get events --field-selector involvedObject.name=<pod>` for the kubelet's view; the back-off interval is exponential (10s, 20s, 40s, ... up to 5m) and is not tunable in upstream Kubernetes -- and is the runbook clear that restarting the pod does nothing useful unless the root cause changes?
- [Critical] Is node `NotReady` branched by cause -- `kubectl describe node <name>` Conditions show: `Ready=False` with reason `KubeletNotReady` and message "PLEG is not healthy" (CRI/container-runtime issue, check `systemctl status containerd` or `systemctl status crio` and `crictl info` on the node), `MemoryPressure=True` (node is low on memory and evicting, check `dmesg` for OOM kills, check what is consuming memory), `DiskPressure=True` (image GC failing or logs filling disk, check `df -h /var/lib/containerd /var/lib/kubelet /var/log`), `PIDPressure=True` (PID exhaustion, usually a forking process), `NetworkUnavailable=True` (CNI not ready, check the CNI agent pod on the node) -- so operators do not blanket-restart kubelet when the cause lies elsewhere?
- [Critical] Is the etcd quorum-loss recovery procedure documented and reviewed -- `etcdctl endpoint status --cluster` shows the leader and per-member state and `etcdctl endpoint health --cluster` shows member health; if quorum is lost (fewer than floor(N/2)+1 members healthy), no writes succeed and the API server returns 5xx; recovery requires disaster recovery from snapshot (`etcdctl snapshot restore`, or `etcdutl snapshot restore` on etcd 3.5 and later, where the `etcdctl` form is deprecated) on a single member, then growing the cluster back to N members, not force-restarting individual etcd processes; managed Kubernetes (EKS, GKE, AKS) hides etcd entirely and recovery is a vendor-support escalation -- and is the runbook clear that `etcdctl snapshot save` must be running on a regular cadence so this recovery is even possible?
- [Critical] Is the control-plane component triage order documented -- API server first (everything depends on it; if down, `kubectl` returns connection-refused and `kubelet` cannot post status, making nodes appear `NotReady`); etcd second (API server depends on it for state); controller-manager and scheduler third (workloads continue running; new placement and reconciliation stop) -- so operators do not chase node-readiness symptoms when the actual cause is the API server or etcd?
- [Recommended] Are node-level diagnostic commands documented for kubelet-layer problems -- `journalctl -u kubelet --since "30 minutes ago"` for kubelet logs, `crictl ps` and `crictl ps -a` for container state (replaces `docker ps` in a container-runtime-aware way), `crictl logs <container-id>` for logs even when the pod is gone from the API, `crictl pods` for pod sandboxes, `crictl info` for runtime status and config, `nsenter` for entering a container's namespaces when `kubectl exec` is not viable -- so operators are not blocked when the API path is broken?
- [Recommended] Is DNS-failure triage branched correctly -- `kubectl get pods -n kube-system -l k8s-app=kube-dns` (or `coredns`) for DNS pod health, `kubectl logs -n kube-system -l k8s-app=kube-dns` for resolution errors, in-cluster test via `kubectl run -i --rm --tty test-dns --image=busybox --restart=Never -- nslookup kubernetes.default` to verify ClusterIP DNS works; common causes are a CoreDNS pod stuck on a `NotReady` node, a NetworkPolicy blocking pod-to-DNS traffic, or upstream resolver failure (check the CoreDNS `forward` config) -- and is the runbook clear that "everything is broken" symptoms across the cluster are usually DNS, not API server?
- [Recommended] Is CNI-failure triage documented -- `kubectl get pods -n kube-system -l <cni-label>` for the CNI agent pods (Calico: `k8s-app=calico-node`, Cilium: `k8s-app=cilium`, Flannel: `app=flannel`; namespace and labels vary by install method, e.g. operator-based Calico runs in `calico-system`); a pod stuck in `ContainerCreating` with "network plugin is not ready" means the CNI agent on its node has not initialized (check the agent pod's logs), "failed to set up sandbox" means the CRI-CNI handoff failed (often a CNI binary missing in `/opt/cni/bin/` or a CNI conf parsing error in `/etc/cni/net.d/`); the CNI agent runs as a DaemonSet and a single failed agent breaks all new pods on its node?
- [Recommended] Are PVC-stuck-in-Pending failures triaged via the storage chain -- `kubectl describe pvc` Events section, `kubectl get sc` for the StorageClass, `kubectl get pods -n <csi-driver-namespace>` for the CSI driver controller and node pods, the underlying provisioner's logs (cloud-provider logs for cloud-managed storage, Rook-Ceph or OpenShift Data Foundation logs for in-cluster storage); a PVC stuck Pending with "waiting for first consumer" means `volumeBindingMode: WaitForFirstConsumer` and the consuming pod is itself Pending (real cause is upstream); "failed to provision" with a specific provisioner error is the actual diagnosis path?
- [Recommended] Are `kubectl drain` failure modes documented -- `--ignore-daemonsets` is required because DaemonSet-managed pods are recreated on the node by their controller regardless of cordon, so drain refuses to proceed past them; `--delete-emptydir-data` is required if any pod uses `emptyDir` (data is lost on drain); `--force` is required to delete pods not managed by a controller (rare but possible) -- without these, drain aborts with an error instead of completing, and operators incorrectly conclude "the node is broken" when the actual issue is a drain command missing the right flags?
- [Recommended] Is the PodDisruptionBudget interaction with drain understood -- a PDB with `minAvailable` or `maxUnavailable` causes `kubectl drain` to retry and block indefinitely (unless `--timeout` is set) if eviction would violate the budget (e.g., a 3-replica deployment with `minAvailable: 3` cannot be drained at all); the runbook needs to call out PDB-blocking-drain as a distinct failure mode and the right response (relax the PDB, scale up first, or use `--disable-eviction --force` only in confirmed emergencies)?
- [Recommended] Are `kubectl logs` size limits understood -- the kubelet rotates log files per container (defaults: `containerLogMaxSize` 10Mi and `containerLogMaxFiles` 5, so roughly 50MB retained per container), and `kubectl logs --previous` only shows the immediately previous container instance, not the one before that; for sustained crash loops with rapid restarts, the original failure log is overwritten quickly and only an external log shipper (Fluent Bit, Vector, fluentd) preserves the history -- so a runbook that depends on `--previous` must time-bound itself to the post-failure window?
- [Optional] Is `kubectl debug` used for ephemeral container troubleshooting -- `kubectl debug -it <pod> --image=busybox --target=<container>` attaches an ephemeral container that shares the target container's process namespace to a running pod for diagnosis without restarting or recreating the pod; `kubectl debug node/<node-name> -it --image=busybox` creates a pod on that node in the host namespaces with the host filesystem mounted at `/host` (not privileged by default), useful for node-level investigation when SSH is unavailable; ephemeral containers are beta and enabled by default from Kubernetes 1.23 and GA in 1.25, while node debugging uses a regular pod and does not depend on them?
- [Optional] Is managed-Kubernetes specific guidance documented -- on EKS/GKE/AKS, the control plane (kube-apiserver, etcd, controller-manager, scheduler) is vendor-managed and not directly accessible; etcd snapshot/restore is not an operator operation; control-plane outages are always vendor-support escalations; node-level troubleshooting still applies normally -- so the runbook is honest about where it ends?
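The information-only versus change-control boundary from the first checklist item can be encoded so that tooling refuses to guess. The following is a minimal sketch, not part of any Kubernetes tooling: the function name `is_information_only` and the allow-lists are illustrative assumptions drawn from the commands named in this checklist, and the parser assumes the verb directly follows the tool name, so anything it does not recognize (including `kubectl -n <ns> get ...`) is treated as change-control.

```python
import shlex

# Allow-lists taken from the diagnostic commands named in the checklist.
# These sets are assumptions for illustration; extend them per runbook.
READ_ONLY_KUBECTL = {"get", "describe", "logs", "events", "top"}
READ_ONLY_CRICTL = {"ps", "pods", "info", "inspect", "logs"}


def is_information_only(command: str) -> bool:
    """Return True only for commands explicitly known to be read-only.

    Fails closed: unknown tools, unknown verbs, and flags placed before
    the verb all return False (treated as change-control).
    """
    argv = shlex.split(command)
    if len(argv) < 2:
        return False
    tool, verb = argv[0], argv[1]
    if tool == "kubectl":
        return verb in READ_ONLY_KUBECTL
    if tool == "crictl":
        return verb in READ_ONLY_CRICTL
    if tool == "etcdctl":
        # Only "etcdctl endpoint status ..." is listed as information-only.
        return argv[1:3] == ["endpoint", "status"]
    return False
```

A runbook renderer could use such a check to tag each step, leaving every untagged step on the change-control side by default.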
Why This Matters
Kubernetes incident response is dominated by misdiagnosis. The top-of-funnel symptom -- "my pod is not running" -- maps to dozens of distinct failure modes (pending due to capacity, pending due to taints, pending due to PVC issues, crash-looping due to OOM, crash-looping due to config, image pull failure, CNI not ready, node NotReady, DNS broken, API server down). Each has a different log location, a different remediation, and a different escalation path. An operator who responds to every "pod not running" symptom by deleting and recreating the pod fixes only the small minority of cases where the fault was transient, and is actively counterproductive in the rest (because the root cause persists, the previous container's log is now lost, and the restart count and back-off timer just reset).
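The branching described above can be sketched as a lookup from the text an operator sees to a triage branch. This is an illustrative sketch only: `classify_event`, the `RULES` table, and the branch labels are hypothetical names, and the substrings are the ones quoted in this document plus the "untolerated taint" wording that newer scheduler releases use. Real event text varies by Kubernetes version, so an unmatched message falls through to `unclassified` rather than being guessed.

```python
# (substring, branch) pairs; first match wins. The table is an assumption
# built from the messages quoted in this document, not an exhaustive list.
RULES = [
    ("Insufficient cpu", "capacity"),
    ("Insufficient memory", "capacity"),
    ("didn't tolerate", "taints"),
    ("untolerated taint", "taints"),
    ("volume node affinity conflict", "pv-zone-mismatch"),
    ("ImagePullBackOff", "image"),
    ("ErrImagePull", "image"),
]


def classify_event(message: str) -> str:
    """Map an event message or container waiting reason to a triage branch."""
    for needle, branch in RULES:
        if needle in message:
            return branch
    # No known pattern: do not guess, send the operator to scheduler logs.
    return "unclassified"
```

Each returned branch has its own remediation, which is the point: the classifier decides where to look next, not what to delete.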
The discipline of `kubectl logs --previous` before delete is the Kubernetes-specific equivalent of "diagnostic capture before mutation". Once a pod is deleted, the previous container's stdout and stderr are gone unless an external log aggregator was running. For sustained crash loops, even `--previous` only goes one instance back -- earlier failures are overwritten. The runbook needs the previous-logs capture step to be unambiguously before any restart, recreate, or scale operation.
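One way to make "capture before mutation" hard to skip is to generate the capture commands as a unit and save their output before anything else runs. The sketch below is an assumption-laden illustration: `capture_commands` and `run_capture` are hypothetical helper names, the output file naming is arbitrary, and `run_capture` assumes `kubectl` is on the PATH with a working kubeconfig.

```python
import subprocess
from pathlib import Path
from typing import List


def capture_commands(pod: str, namespace: str, container: str) -> List[List[str]]:
    """Read-only capture steps, in the order they should run before a delete."""
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace, "-c", container, "--previous"],
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
        ["kubectl", "get", "pod", pod, "-n", namespace, "-o", "yaml"],
    ]


def run_capture(pod: str, namespace: str, container: str, out_dir: str) -> None:
    """Run every capture step and write stdout and stderr to numbered files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, argv in enumerate(capture_commands(pod, namespace, container)):
        result = subprocess.run(argv, capture_output=True, text=True)
        # A failing step (e.g. no previous instance) is recorded, not fatal.
        (out / f"{index:02d}-{argv[1]}.txt").write_text(result.stdout + result.stderr)
```

Calling `run_capture("web-0", "prod", "app", "/tmp/incident")` (hypothetical names) would leave four files behind for the post-incident review, regardless of what happens to the pod afterwards.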
The node NotReady branching matters because the kubelet status condition is a downstream symptom that can have many upstream causes. `Ready=False` with reason `KubeletNotReady` and the message "PLEG is not healthy" is a CRI problem (containerd or CRI-O hung); `MemoryPressure=True` is a node-resource problem; `DiskPressure=True` is usually a log-volume or image-storage problem; `NetworkUnavailable=True` is a CNI problem. Restarting the kubelet "fixes" the symptom briefly because the conditions are re-evaluated, but the underlying cause remains and the node typically returns to NotReady shortly afterwards. The right response is to read the conditions and branch.
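Reading the conditions and branching can be sketched over the `status.conditions` list that `kubectl get node <name> -o json` returns. Note that the condition type reported by the API is `Ready` (with reason `KubeletNotReady` when it is false). The function name `next_steps` and the step strings are illustrative assumptions that restate the checks listed in this document; they are not output from any tool.

```python
from typing import Dict, List

# Next check per pressure-style condition; wording is illustrative.
PRESSURE_STEPS = {
    "MemoryPressure": "memory: check dmesg for OOM kills and top memory consumers",
    "DiskPressure": "disk: df -h /var/lib/containerd /var/lib/kubelet /var/log",
    "PIDPressure": "pids: look for a forking process",
    "NetworkUnavailable": "cni: check the CNI agent pod on the node",
}


def next_steps(conditions: List[Dict[str, str]]) -> List[str]:
    """Turn node conditions into the next diagnostic step for each problem."""
    steps = []
    for cond in conditions:
        ctype = cond.get("type", "")
        status = cond.get("status", "")
        message = cond.get("message", "")
        if ctype == "Ready" and status != "True":
            # Ready can be "False" or "Unknown"; both need investigation.
            if "PLEG is not healthy" in message:
                steps.append("runtime: systemctl status containerd (or crio); crictl info")
            else:
                steps.append("kubelet: journalctl -u kubelet")
        elif ctype in PRESSURE_STEPS and status == "True":
            steps.append(PRESSURE_STEPS[ctype])
    return steps
```

An empty result means no condition pointed anywhere, which is itself a signal to look at the control plane rather than the node.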
The etcd quorum-loss recovery is the Kubernetes equivalent of Galera quorum-loss in OpenStack -- a single foot-gun that destroys cluster state if mishandled. Force-restarting individual etcd processes after quorum loss does not restore quorum. One member must be brought back as a new single-member cluster, either by restoring a snapshot (`etcdctl snapshot restore`, which writes a new cluster ID and member ID rather than reusing the previous ones) or, when a surviving member's data directory is intact, by starting that member with `--force-new-cluster`; the others are then rejoined as new members. Without a recent etcd snapshot or an intact data directory, the cluster's state is unrecoverable and the only path is rebuild + workload re-deployment from manifests. The "regular etcd snapshot cadence" requirement in the lifecycle file (providers/kubernetes/operations.md) is the prerequisite for this incident-response file's recovery procedure to even be possible.
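The quorum arithmetic behind this is small enough to check by hand, and worth writing down because even-sized clusters are a common surprise. A minimal sketch, with function names chosen for illustration:

```python
def quorum_size(members: int) -> int:
    """Members that must be healthy for etcd to accept writes: floor(N/2) + 1."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True when enough members are healthy to form a majority."""
    return healthy >= quorum_size(members)


def failures_tolerated(members: int) -> int:
    """How many members can fail before writes stop."""
    return members - quorum_size(members)
```

A 3-member cluster needs 2 healthy members and tolerates 1 failure; a 4-member cluster needs 3 and still tolerates only 1, which is why odd member counts are the usual choice.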
The control-plane triage order matters because the symptom presentation is misleading. If the API server is down, every `kubectl` call returns connection-refused and every kubelet's status update fails, so once heartbeats lapse every node appears NotReady to anyone who can still run `kubectl get nodes` (for example through a surviving API server replica, or just after the API server recovers). The naive operator concludes "all nodes are broken" and starts investigating nodes. The right move is to check the API server first (is it responding to a direct curl on its serving port, for example at `/readyz`? are the kube-apiserver pods running? are the etcd endpoints healthy?) and then work down. Managed Kubernetes hides this entirely -- if kubectl is failing on EKS/GKE/AKS, the right response is the cloud provider status page and a vendor-support case, not local triage.
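The triage order can be expressed as a fixed walk that stops at the first component not confirmed healthy. This is a sketch under assumptions: how each health value is obtained (curl, pod status, `etcdctl`) is left to the operator, and `first_unhealthy` and `TRIAGE_ORDER` are illustrative names. A component with no recorded result is treated as unhealthy, since "not checked" is not the same as "fine".

```python
from typing import Dict, Optional

# Dependency order from this document: API server, then etcd, then the rest.
TRIAGE_ORDER = [
    "kube-apiserver",
    "etcd",
    "kube-controller-manager",
    "kube-scheduler",
]


def first_unhealthy(health: Dict[str, bool]) -> Optional[str]:
    """Return the first component in triage order not confirmed healthy."""
    for component in TRIAGE_ORDER:
        if not health.get(component, False):
            return component
    return None
```

Only when this returns `None` does it make sense to treat widespread NotReady nodes as a node-level problem.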
Common Decisions (ADR Triggers)
- Log capture: `kubectl logs --previous` vs external log aggregator -- `--previous` is built-in but limited to one previous instance and bounded by kubelet log rotation. An external log aggregator (Fluent Bit/Vector to Loki/Elasticsearch/Splunk) preserves history but adds infrastructure. For any cluster expected to have crash-loops, the aggregator is required for forensic capability; `--previous` is best-effort.
- Etcd snapshot strategy: cadence and storage -- Frequent snapshots (every 30 minutes) minimize state loss in disaster recovery but increase storage cost; infrequent snapshots (daily) limit recovery point objective to 24 hours of churn. Snapshots must be off-cluster (object storage, NFS, or a different node) -- a snapshot on the same etcd disk that fails is useless. Managed Kubernetes hides this; self-hosted clusters must implement it.
- `kubectl debug` enablement: cluster-wide vs gated -- The `kubectl debug` ephemeral-container feature lets operators inject containers into running pods without spec changes. Useful for triage but a security concern (privileged containers, namespace sharing). Some clusters disable it via admission policy; others gate it via RBAC. Decide explicitly; do not default to permissive without a security review.
- Drain blast radius: PDB-strict vs PDB-flexible -- Strict PDBs (`minAvailable` exactly equal to replica count) prevent any drain and force scale-up before maintenance. Flexible PDBs (`maxUnavailable: 1` for replicas of N) allow rolling drains. The trade-off is availability guarantee vs operational flexibility. PDBs that block drain entirely are the right choice for SEV1-impact services; flexible is right for everything else.
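Whether a given PDB will block a drain can be estimated before running it. The sketch below mirrors the usual disruption-budget arithmetic (allowed disruptions = currently healthy pods minus the required healthy count, floored at zero). It handles integer values only, not percentages, and `disruptions_allowed` is an illustrative name rather than an API.

```python
from typing import Optional


def disruptions_allowed(
    desired: int,
    healthy: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Evictions a PDB would currently permit (integer budgets only)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = min_available
    else:
        required_healthy = desired - max_unavailable
    # Zero means every eviction is refused and a drain keeps retrying.
    return max(healthy - required_healthy, 0)
```

For the strict case in the list above, `disruptions_allowed(3, 3, min_available=3)` is 0, so the drain cannot make progress until replicas are scaled up or the budget is relaxed; with `max_unavailable=1` the same deployment permits one eviction at a time.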
Reference Links
- Debug pods -- official guide to pod-level troubleshooting including `kubectl describe`, `kubectl logs`, and `kubectl debug`
- Debug nodes -- node and cluster-level troubleshooting, etcd debugging, control-plane component logs
- Debugging running pods -- ephemeral debug containers, copying pods, executing in containers
- crictl reference -- CRI-level container inspection on the node
- etcd disaster recovery -- snapshot save and restore procedures
- Troubleshooting kubeadm -- common kubeadm-managed cluster failures
- Troubleshooting CoreDNS -- DNS-specific diagnostics
See Also
- `providers/kubernetes/operations.md` -- Kubernetes lifecycle: Helm/Kustomize, GitOps, cluster upgrades, etcd backup strategy (this file is the runtime-incident counterpart)
- `providers/kubernetes/observability.md` -- monitoring and alerting that surface the symptoms this runbook responds to
- `providers/kubernetes/networking.md` -- CNI design decisions; CNI-level troubleshooting depends on the chosen CNI
- `providers/kubernetes/storage.md` -- storage class and CSI design; PVC-stuck failures depend on the chosen provisioner
- `providers/openshift/operations.md` -- OpenShift adds operators and MCO on top of Kubernetes; OpenShift-specific incident-response patterns
- `providers/ceph/operations.md` -- Ceph operational depth (Rook-Ceph storage symptoms manifest at the PVC layer)
- `general/operational-runbooks.md` -- runbook framework: structure, severity, automation decisions, postmortem process