Kubernetes Incident Response

Scope

This file covers Kubernetes incident-response operational depth -- the concrete commands, diagnostic-capture flows, and pre-flight branching that operators execute during runtime failures. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is the runtime-troubleshooting counterpart to the design/lifecycle content in providers/kubernetes/operations.md (Helm/Kustomize, GitOps, cluster upgrades, etcd backup strategy). Topics: pod stuck Pending triage, CrashLoopBackOff diagnostic capture, node NotReady branching, control-plane component failure (kube-apiserver, kube-controller-manager, kube-scheduler), etcd quorum loss recovery, kubelet/CRI-level investigation, and DNS/CNI failure modes. For cluster lifecycle, upgrades, and GitOps strategy, see providers/kubernetes/operations.md. For Rook-Ceph storage symptoms, see providers/ceph/operations.md. For OpenShift-specific operator and MCO troubleshooting, see providers/openshift/operations.md.

Checklist

  • [Critical] Is the boundary between information-only commands (kubectl get, kubectl describe, kubectl logs, kubectl events, kubectl top, crictl ps on the node, etcdctl endpoint status) and change-control commands (anything that deletes a pod, drains a node, edits a resource, restarts a control-plane component, or touches etcd) explicit in the runbook -- so on-call engineers know which side of the line each step sits on?
  • [Critical] Is diagnostic capture done before any pod is deleted -- kubectl describe pod <name> (events, conditions, container statuses, scheduler decisions), kubectl logs <pod> --previous for crashed-container logs (only available until the pod is deleted or rescheduled), kubectl get events --sort-by='.lastTimestamp' -n <namespace> for the timeline, kubectl get pod <name> -o yaml for the full spec, crictl inspect <container> on the node for runtime details -- so the post-incident review has the evidence rather than just confirming "deleting the pod fixed it"?
  • [Critical] Is Pending pod triage branched correctly -- the kubectl describe pod Events section is authoritative: FailedScheduling with "Insufficient cpu/memory" means cluster capacity (check kubectl describe nodes | grep -A 5 "Allocated resources" and kubectl top nodes), "didn't tolerate" means taints (check the Taints field in kubectl describe node), "nodes had volume node affinity conflict" means PV-zone mismatch, "0/N nodes are available" with no specific reason, or no scheduling events at all, points at the scheduler itself (check kube-scheduler logs), and "ImagePullBackOff" (reported after scheduling, as a container waiting reason) means image issues (registry credentials, image tag, network) -- and is the runbook clear that each branch has a different remediation?
  • [Critical] Is CrashLoopBackOff triage done with the previous container's logs first -- kubectl logs <pod> --previous -c <container> shows what the container printed before exit (init failure, panic, OOM-killed), kubectl describe pod shows Last State: Terminated with Reason: OOMKilled (memory limit too low) vs Error (application exit code) vs ContainerCannotRun (config/permissions), kubectl get events --field-selector involvedObject.name=<pod> for the kubelet's view; the back-off interval is exponential (10s, 20s, 40s, ... up to 5m) and is not tunable in upstream Kubernetes -- and is the runbook clear that restarting the pod does nothing useful unless the root cause changes?
  • [Critical] Is node NotReady branched by cause -- kubectl describe node <name> Conditions show: Ready=False (reason KubeletNotReady) with "PLEG is not healthy" (CRI/container-runtime issue, check systemctl status containerd / crio and crictl info on the node), MemoryPressure=True (node is low on memory and evicting pods, check dmesg for OOM, check what is consuming memory), DiskPressure=True (image GC failing or logs filling disk, check df -h /var/lib/containerd /var/lib/kubelet /var/log), PIDPressure=True (PID exhaustion, usually a forking process), NetworkUnavailable=True (CNI not ready, check the CNI agent pod on the node) -- so operators do not blanket-restart kubelet when the issue is downstream?
  • [Critical] Is the etcd quorum-loss recovery procedure documented and reviewed -- etcdctl endpoint status --cluster shows the leader and per-member status (etcdctl endpoint health --cluster shows member health); if quorum is lost (fewer than floor(N/2)+1 members healthy), no writes succeed and the API server returns 5xx; recovery requires disaster recovery from snapshot (etcdctl snapshot restore, or etcdutl snapshot restore on etcd 3.5+) on a single member, then growing the cluster back to N members, not force-restarting individual etcd processes; managed Kubernetes (EKS, GKE, AKS) hides etcd entirely and recovery is a vendor-support escalation -- and is the runbook clear that etcdctl snapshot save must be running on a regular cadence so this recovery is even possible?
  • [Critical] Is the control-plane component triage order documented -- API server first (everything depends on it; if down, kubectl returns connection-refused and kubelet cannot post status, so nodes show NotReady once the API is reachable again); etcd second (API server depends on it for state); controller-manager and scheduler third (workloads continue running; new placement and reconciliation stop) -- so operators do not chase node-readiness symptoms when the actual cause is the API server or etcd?
  • [Recommended] Are node-level diagnostic commands documented for kubelet-layer problems -- journalctl -u kubelet --since "30 minutes ago" for kubelet logs, crictl ps and crictl ps -a for container state (replaces docker ps in container-runtime-aware way), crictl logs <container-id> for logs even when the pod is gone from the API, crictl pods for pod sandboxes, crictl info for runtime version and config, nsenter for entering a container's namespaces when kubectl exec is not viable -- so operators are not blocked when the API path is broken?
  • [Recommended] Is DNS-failure triage branched correctly -- kubectl get pods -n kube-system -l k8s-app=kube-dns (or coredns) for DNS pod health, kubectl logs -n kube-system -l k8s-app=kube-dns for resolution errors, in-cluster test via kubectl run -i --rm --tty test-dns --image=busybox --restart=Never -- nslookup kubernetes.default to verify ClusterIP DNS works; common causes are CoreDNS pod stuck on a NotReady node, a NetworkPolicy blocking pod-to-DNS traffic, or upstream resolver failure (check the CoreDNS forward config) -- and is the runbook clear that "everything is broken" symptoms across the cluster are usually DNS, not API server?
  • [Recommended] Is CNI-failure triage documented -- kubectl get pods -n kube-system -l <cni-label> for the CNI agent pods (Calico: k8s-app=calico-node, Cilium: k8s-app=cilium, Flannel: app=flannel); a pod stuck in ContainerCreating with "network plugin is not ready" means the CNI agent on its node has not initialized (check the agent pod's logs), "failed to set up sandbox" means the CRI-CNI handoff failed (often a CNI binary missing in /opt/cni/bin/ or a CNI conf parsing error in /etc/cni/net.d/); the CNI agent runs as a DaemonSet and a single failed agent breaks all new pods on its node?
  • [Recommended] Are PVC-stuck-in-Pending failures triaged via the storage chain -- kubectl describe pvc Events section, kubectl get sc for the StorageClass, kubectl get pods -n <csi-driver-namespace> for the CSI driver controller and node pods, the underlying provisioner's logs (cloud-provider logs for cloud-managed storage, Rook-Ceph or OpenShift Data Foundation logs for in-cluster storage); a PVC stuck Pending with "waiting for first consumer" means volumeBindingMode: WaitForFirstConsumer and the consuming pod is itself Pending (real cause is upstream); "failed to provision" with a specific provisioner error is the actual diagnosis path?
  • [Recommended] Are kubectl drain failure modes documented -- --ignore-daemonsets is required because DaemonSet-managed pods cannot be evicted (they are bound to the node); --delete-emptydir-data is required if any pod uses emptyDir (data is lost on drain); --force deletes pods not managed by a controller (rare but possible) -- without these, drain refuses to proceed and exits with an error naming the blocking pods (leaving the node cordoned), and operators incorrectly conclude "the node is broken" when the actual issue is a drain command missing the right flags?
  • [Recommended] Is the PodDisruptionBudget interaction with drain understood -- a PDB with minAvailable or maxUnavailable causes kubectl drain to block indefinitely (it retries the eviction until --timeout, which defaults to no timeout) if eviction would violate the budget (e.g., with a 3-replica deployment and minAvailable: 3, no node running one of its pods can be drained); the runbook needs to call out PDB-blocking-drain as a distinct failure mode and the right response (relax the PDB, scale up first, or use --disable-eviction --force only in confirmed emergencies)?
  • [Recommended] Are kubectl logs size limits understood -- the kubelet keeps log file rotation per container (default 10MB, 5 files = 50MB max retained per container), and kubectl logs --previous only shows the immediately previous container instance, not the one before that; for sustained crash loops with rapid restarts, the original failure log is overwritten quickly and only an external log shipper (Fluent Bit, Vector, fluentd) preserves the history -- so a runbook that depends on --previous must time-bound itself to the post-failure window?
  • [Optional] Is kubectl debug used for ephemeral container troubleshooting -- kubectl debug -it <pod> --image=busybox --target=<container> attaches an ephemeral container that shares the target container's process namespace to a running pod, for diagnosis without redeploying the workload; kubectl debug node/<node-name> creates a pod on that node in the host namespaces with the node filesystem mounted at /host (not privileged by default), useful for node-level investigation when SSH is unavailable; ephemeral containers are enabled by default from Kubernetes 1.23 (beta) and stable from 1.25, while node debugging creates an ordinary pod and does not depend on that feature?
  • [Optional] Is managed-Kubernetes specific guidance documented -- on EKS/GKE/AKS, the control plane (kube-apiserver, etcd, controller-manager, scheduler) is vendor-managed and not directly accessible; etcd snapshot/restore is not an operator operation; control-plane outages are always vendor-support escalations; node-level troubleshooting still applies normally -- so the runbook is honest about where it ends?
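
The information-only vs change-control boundary from the first checklist item can be made mechanical. The following is a minimal sketch, not a complete policy: the function name classify and the allow-lists are illustrative assumptions, it assumes the subcommand is the first word after the tool name (as in the commands quoted above), and anything it does not recognize is treated as change-control.

```python
import shlex

# Read-only kubectl verbs named in the checklist (assumed allow-list).
READ_ONLY_KUBECTL = {"get", "describe", "logs", "events", "top"}
# Read-only crictl subcommands named in the checklist (assumed allow-list).
READ_ONLY_CRICTL = {"ps", "pods", "info", "inspect", "logs"}


def classify(command: str) -> str:
    """Return 'information-only' or 'change-control' for one runbook step."""
    words = shlex.split(command)
    if len(words) < 2:
        return "change-control"
    tool, sub = words[0], words[1]
    if tool == "kubectl" and sub in READ_ONLY_KUBECTL:
        return "information-only"
    if tool == "crictl" and sub in READ_ONLY_CRICTL:
        return "information-only"
    if tool == "etcdctl" and words[1:3] == ["endpoint", "status"]:
        return "information-only"
    # Fail closed: unknown tools and verbs require change control.
    return "change-control"


if __name__ == "__main__":
    for step in ("kubectl get pods -n kube-system", "kubectl delete pod web-0"):
        print(step, "->", classify(step))
```

Failing closed is the design choice here: a step that is missing from the allow-list costs the operator a confirmation, whereas a mutating step misfiled as read-only costs evidence.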

Why This Matters

Kubernetes incident response is dominated by misdiagnosis. The top-of-funnel symptom -- "my pod is not running" -- maps to dozens of distinct failure modes (pending due to capacity, pending due to taints, pending due to PVC issues, crash-looping due to OOM, crash-looping due to config, image pull failure, CNI not ready, node NotReady, DNS broken, API server down). Each has a different log location, a different remediation, and a different escalation path. An operator who responds to every "pod not running" symptom by deleting and recreating the pod is correct in a small minority of cases and actively counterproductive in the rest (because the root cause persists, the previous container's log is now lost, and the attempt count just resets).
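
To illustrate the point that one symptom fans out into different branches, here is a minimal sketch that maps a pod event message to the triage branch named in the checklist. The function name pending_branch and the branch labels are hypothetical, and the substrings are the ones quoted in the checklist; exact scheduler wording varies between Kubernetes versions, so treat the table as a starting point to adapt.

```python
# (substring, branch) pairs taken from the Pending triage checklist item.
PENDING_BRANCHES = [
    ("Insufficient cpu", "capacity"),
    ("Insufficient memory", "capacity"),
    ("didn't tolerate", "taints"),
    ("volume node affinity conflict", "pv-zone-mismatch"),
    ("ImagePullBackOff", "image"),
]


def pending_branch(event_message: str) -> str:
    """Return the triage branch for one event message from kubectl describe pod."""
    for needle, branch in PENDING_BRANCHES:
        if needle in event_message:
            return branch
    # No known reason: the checklist points at the kube-scheduler logs next.
    return "unclassified"


if __name__ == "__main__":
    print(pending_branch("0/3 nodes are available: 3 Insufficient cpu."))
```

Each returned branch corresponds to a different remediation, which is why the runbook should branch on the message rather than on the Pending phase alone.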

The discipline of kubectl logs --previous before delete is the Kubernetes-specific equivalent of "diagnostic capture before mutation". Once a pod is deleted, the previous container's stdout is gone unless an external log aggregator was running. For sustained crash loops, even --previous only goes one instance back -- earlier failures are overwritten. The runbook needs the previous-logs capture step to be unambiguously before any restart, recreate, or scale operation.
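
The capture-before-mutation ordering can be encoded so that it is hard to skip. This is a minimal sketch that only builds the command strings listed in the checklist; the function name capture_commands and the example pod, namespace, and container names are hypothetical, and nothing here runs kubectl.

```python
from typing import List


def capture_commands(pod: str, namespace: str, container: str) -> List[str]:
    """Read-only commands to run, in order, before any delete or restart."""
    return [
        f"kubectl describe pod {pod} -n {namespace}",
        # Only the immediately previous container instance is available.
        f"kubectl logs {pod} -n {namespace} -c {container} --previous",
        f"kubectl get events -n {namespace} --sort-by=.lastTimestamp",
        f"kubectl get pod {pod} -n {namespace} -o yaml",
    ]


if __name__ == "__main__":
    for command in capture_commands("web-0", "prod", "app"):
        print(command)
```

Redirecting each command's output to a file named after the incident keeps the evidence available for the post-incident review after the pod is gone.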

The node NotReady branching matters because the node's Ready condition is a downstream symptom that can have many upstream causes. Ready=False (reason KubeletNotReady) with "PLEG is not healthy" is a CRI problem (containerd or CRI-O hung); MemoryPressure=True is a node-resource problem; DiskPressure=True is usually a log-volume or image-storage problem; NetworkUnavailable=True is a CNI problem. Restarting the kubelet "fixes" the symptom briefly because the conditions are re-evaluated, but the underlying cause remains and the node will return to NotReady within minutes. The right response is to read the conditions and branch.
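
"Read the conditions and branch" can be sketched as a lookup from node condition to next check. This is a minimal, assumption-laden example: the function name node_branches is hypothetical, the input is a plain dict of condition type to status as shown under Conditions in kubectl describe node, and the suggested checks are the ones named in the checklist.

```python
from typing import Dict, List


def node_branches(statuses: Dict[str, str], ready_message: str = "") -> List[str]:
    """Map node condition statuses ("True"/"False"/"Unknown") to next checks."""
    checks = []
    if statuses.get("MemoryPressure") == "True":
        checks.append("memory: check dmesg for OOM kills and top memory consumers")
    if statuses.get("DiskPressure") == "True":
        checks.append("disk: df -h /var/lib/containerd /var/lib/kubelet /var/log")
    if statuses.get("PIDPressure") == "True":
        checks.append("pids: look for a forking process")
    if statuses.get("NetworkUnavailable") == "True":
        checks.append("cni: check the CNI agent pod on the node")
    if statuses.get("Ready") != "True":
        if "PLEG is not healthy" in ready_message:
            checks.append("runtime: systemctl status containerd (or crio), crictl info")
        elif not checks:
            # Nothing more specific was reported: fall back to the kubelet logs.
            checks.append("kubelet: journalctl -u kubelet --since '30 minutes ago'")
    return checks


if __name__ == "__main__":
    print(node_branches({"Ready": "False", "DiskPressure": "True"}))
```

Note that restarting the kubelet appears nowhere in the output; the sketch only points at where to look.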

The etcd quorum-loss recovery is the Kubernetes equivalent of Galera quorum-loss in OpenStack -- a single foot-gun that destroys cluster state if mishandled. Force-restarting individual etcd processes after quorum loss does not restore quorum. A single-member cluster must be re-established first, either by restoring a snapshot into a fresh data directory (etcdctl snapshot restore, which assigns new cluster and member IDs) or, if a surviving member's data directory is intact, by restarting that member with --force-new-cluster; the others are then rejoined as new members. Without a recent etcd snapshot or an intact member data directory, the cluster's state is unrecoverable and the only path is rebuild + workload re-deployment from manifests. The "regular etcd snapshot cadence" requirement in the lifecycle file (providers/kubernetes/operations.md) is the prerequisite for this incident-response file's recovery procedure to even be possible.
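
The quorum arithmetic behind this is small enough to state directly. A minimal sketch, with hypothetical function names, of the floor(N/2)+1 majority rule:

```python
def quorum(members: int) -> int:
    """Smallest number of healthy members that forms a majority."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the cluster can still accept writes."""
    return healthy >= quorum(members)


def failures_tolerated(members: int) -> int:
    """How many members can fail before quorum is lost."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, "members: quorum", quorum(n), "tolerates", failures_tolerated(n))
```

A 4-member cluster tolerates the same single failure as a 3-member cluster, which is why etcd clusters are normally sized with an odd member count.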

The control-plane triage order matters because the symptom presentation is misleading. If the API server is down, every kubectl call returns connection-refused and every kubelet's status update fails; once the API is reachable again (or when queried through a surviving API server replica), kubectl get nodes shows nodes as NotReady because their heartbeats were missed. The naive operator concludes "all nodes are broken" and starts investigating nodes. The right move is to check the API server first (is it responding to a direct curl on its serving port? are the kube-apiserver pods running? are the etcd endpoints healthy?) and then work down. Managed Kubernetes hides this entirely -- if kubectl is failing on EKS/GKE/AKS, the right response is the cloud provider status page and a vendor-support case, not local triage.
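
The triage order can be written down as data so that nobody starts in the middle. A minimal sketch, assuming the caller has already gathered a health result per component by whatever means the runbook prescribes; the function name first_unconfirmed is hypothetical, and the relative order of controller-manager and scheduler is arbitrary since both sit in the third tier.

```python
from typing import Dict, Optional

# Dependency order from the checklist: API server, then etcd, then the rest.
TRIAGE_ORDER = [
    "kube-apiserver",
    "etcd",
    "kube-controller-manager",
    "kube-scheduler",
]


def first_unconfirmed(health: Dict[str, bool]) -> Optional[str]:
    """Return the first component, in triage order, not confirmed healthy."""
    for component in TRIAGE_ORDER:
        # A component with no recorded result counts as not yet confirmed.
        if not health.get(component, False):
            return component
    return None


if __name__ == "__main__":
    print(first_unconfirmed({"kube-apiserver": True, "etcd": False}))
```

Treating a missing result as unconfirmed forces the API server check to happen before any node-level investigation begins.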

Common Decisions (ADR Triggers)

  • Log capture: kubectl logs --previous vs external log aggregator -- --previous is built-in but limited to one previous instance and bounded by kubelet log rotation. An external log aggregator (Fluent Bit/Vector to Loki/Elasticsearch/Splunk) preserves history but adds infrastructure. For any cluster expected to have crash-loops, the aggregator is required for forensic capability; --previous is best-effort.
  • Etcd snapshot strategy: cadence and storage -- Frequent snapshots (every 30 minutes) minimize state loss in disaster recovery but increase storage cost; infrequent snapshots (daily) stretch the recovery point objective to as much as 24 hours of churn. Snapshots must be off-cluster (object storage, NFS, or a different node) -- a snapshot stored on the same disk as the failed etcd member is useless. Managed Kubernetes hides this; self-hosted clusters must implement it.
  • kubectl debug enablement: cluster-wide vs gated -- The kubectl debug ephemeral-container feature lets operators inject containers into running pods without spec changes. Useful for triage but a security concern (privileged containers, namespace sharing). Some clusters disable it via admission policy; others gate it via RBAC. Decide explicitly; do not default to permissive without a security review.
  • Drain blast radius: PDB-strict vs PDB-flexible -- Strict PDBs (minAvailable exactly equal to replica count) prevent any drain and force scale-up before maintenance. Flexible PDBs (maxUnavailable: 1 across N replicas) allow rolling drains. The trade-off is availability guarantee vs operational flexibility. PDBs that block drain entirely are the right choice for SEV1-impact services; flexible is right for everything else.
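
The strict-vs-flexible PDB trade-off comes down to how many evictions the budget currently allows. The sketch below is a simplified model under stated assumptions: integer budgets only (percentage values are not handled), hypothetical function names, and replica counts supplied by the caller rather than read from the cluster.

```python
def allowed_with_min_available(healthy: int, min_available: int) -> int:
    """Evictions permitted by a PDB that sets minAvailable."""
    return max(0, healthy - min_available)


def allowed_with_max_unavailable(desired: int, healthy: int, max_unavailable: int) -> int:
    """Evictions permitted by a PDB that sets maxUnavailable."""
    already_unavailable = desired - healthy
    return max(0, max_unavailable - already_unavailable)


if __name__ == "__main__":
    # Strict: 3 healthy replicas with minAvailable: 3 leaves no room to evict.
    print(allowed_with_min_available(healthy=3, min_available=3))
    # Flexible: maxUnavailable: 1 with all 3 replicas healthy allows one eviction.
    print(allowed_with_max_unavailable(desired=3, healthy=3, max_unavailable=1))
```

A result of zero is the condition under which kubectl drain keeps retrying the eviction; scaling up first raises healthy and therefore the allowance.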

See Also

  • providers/kubernetes/operations.md -- Kubernetes lifecycle: Helm/Kustomize, GitOps, cluster upgrades, etcd backup strategy (this file is the runtime-incident counterpart)
  • providers/kubernetes/observability.md -- monitoring and alerting that surface the symptoms this runbook responds to
  • providers/kubernetes/networking.md -- CNI design decisions; CNI-level troubleshooting depends on the chosen CNI
  • providers/kubernetes/storage.md -- storage class and CSI design; PVC-stuck failures depend on the chosen provisioner
  • providers/openshift/operations.md -- OpenShift adds operators and MCO on top of Kubernetes; OpenShift-specific incident-response patterns
  • providers/ceph/operations.md -- Ceph operational depth (Rook-Ceph storage symptoms manifest at the PVC layer)
  • general/operational-runbooks.md -- runbook framework: structure, severity, automation decisions, postmortem process