Kubernetes · DevOps · Production · Jan 2025 · 16 min read

Kubernetes Troubleshooting: The Complete Production Guide

Kubernetes troubleshooting is pattern recognition. The cluster breaks in the same ways every time — CrashLoopBackOff, Pending pods, OOMKilled, service connectivity failures. This is the complete diagnostic sequence I run in production, built from real incidents across 100+ node clusters.

The mindset before the commands

Every production incident I've worked on — whether at Optum managing 100+ nodes or earlier in my career on smaller clusters — comes down to about a dozen failure modes. The cluster doesn't invent new ways to break. It breaks in the same ways, with the same symptoms, and responds to the same diagnostic sequence.

The engineers who resolve incidents fastest aren't the ones who know the most kubectl flags. They're the ones who have a mental model of how Kubernetes scheduling, networking, and the control plane actually work — and who run diagnostics in a logical sequence instead of randomly trying things.

This guide documents the exact sequence I use. Every command here has resolved a real production incident.

The five commands I run first on any broken cluster

Before diving into specific failure modes, these five commands give you the full picture in under two minutes. Run them in this order every time.

# 1. Overall cluster health — are nodes ready?
kubectl get nodes -o wide

# 2. What's broken right now across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# 3. Recent events — this is where the real error messages live
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# 4. Control plane component health (componentstatuses is deprecated since v1.19)
kubectl get componentstatuses
kubectl get --raw='/readyz?verbose'   # preferred: API server health endpoint

# 5. Resource pressure across nodes
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20

The events log is underused. Most engineers jump straight to pod logs. The events log shows you scheduler decisions, image pull failures, volume mount errors, and OOM kills — often with more context than the pod logs themselves.

CrashLoopBackOff — the complete diagnostic

CrashLoopBackOff means the container starts, crashes, and Kubernetes keeps restarting it with increasing backoff delays. The container is running — it's just dying immediately. The cause is almost always in the application, not Kubernetes itself.

# Step 1: Get the current state and restart count
kubectl describe pod <pod-name> -n <namespace>
# Look for: Last State, Exit Code, Reason

# Step 2: Current logs (may be empty if crashing immediately)
kubectl logs <pod-name> -n <namespace>

# Step 3: Previous container logs — CRITICAL for crash diagnosis
kubectl logs <pod-name> -n <namespace> --previous

# Step 4: Check if it's an init container crashing
kubectl logs <pod-name> -n <namespace> -c <init-container-name>

Exit code diagnosis

The exit code from kubectl describe pod tells you exactly what happened:

# Exit Code 0  → Container exited cleanly (wrong CMD/ENTRYPOINT, not a crash)
# Exit Code 1  → Application error — check app logs
# Exit Code 137 → OOMKilled (128 + signal 9) — container exceeded memory limit
# Exit Code 139 → Segmentation fault — usually a binary/library issue
# Exit Code 143 → SIGTERM not handled — graceful shutdown failing
# Exit Code 255 → Exit status out of range — often an entrypoint script returning -1

Exit code 137 (OOMKilled) masquerades as CrashLoopBackOff constantly. Always check memory limits before assuming it's an application bug.
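A quick way to list configured limits next to live usage before blaming the application. A minimal sketch — `<pod>` and `<namespace>` are placeholders, and `kubectl top` requires metrics-server:

```shell
# Configured memory limit per container (empty output = no limit set)
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'

# Live usage per container — compare against the limits above
kubectl top pod <pod> -n <namespace> --containers
```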

Common root causes and fixes

# Missing environment variable — check if required vars are set
kubectl exec <pod> -- env | grep <VAR_NAME>
# Or check the deployment spec
kubectl get deployment <name> -o yaml | grep -A5 env

# Config/secret not mounted — pod starts then crashes when it can't find config
kubectl describe pod <pod> | grep -A10 Mounts
kubectl get secret <secret-name> -n <namespace>

# Wrong image entrypoint — container exits with code 0
kubectl run debug --image=<your-image> --rm -it -- /bin/sh

Pending pods — why the scheduler won't place them

A pod stuck in Pending means the scheduler cannot find a node that satisfies all its requirements. The reason is always in kubectl describe pod under Events — specifically the scheduler message.

kubectl describe pod <pending-pod> | grep -A20 Events

# Most common scheduler messages and what they mean:
# "0/5 nodes are available: 5 Insufficient memory"
#   → Your resource requests exceed available capacity
#   → Either reduce requests, add nodes, or evict lower-priority pods

# "0/5 nodes are available: 5 node(s) didn't match node selector"
#   → nodeSelector or nodeAffinity doesn't match any node
#   → Check node labels vs your pod spec

# "0/5 nodes are available: 5 node(s) had taint {key:val}, that pod didn't tolerate"
#   → Node is tainted and pod has no matching toleration
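To see which taints are actually set across your nodes, and what a matching toleration looks like — the `dedicated=gpu` key/value below is purely illustrative, not from any real cluster:

```shell
# Every node's taints in one pass
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# Pod-side fix — a toleration matching a hypothetical dedicated=gpu:NoSchedule taint:
# tolerations:
# - key: "dedicated"
#   operator: "Equal"
#   value: "gpu"
#   effect: "NoSchedule"
```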

Diagnosing resource starvation

# See what's consuming resources on each node
kubectl describe nodes | grep -A5 "Allocated resources"

# Find pods with no resource requests (scheduling wildcards — dangerous)
kubectl get pods -A -o json | python3 -c "
import json,sys
pods=json.load(sys.stdin)
for p in pods['items']:
  for c in p['spec']['containers']:
    if 'resources' not in c or 'requests' not in c.get('resources',{}):
      print(p['metadata']['namespace'], p['metadata']['name'], c['name'])
"

# Check if PVC is unbound (common cause of pending pods)
kubectl get pvc -A | grep -v Bound

ImagePullBackOff — network and credential failures

# Get the exact error
kubectl describe pod <pod> | grep -A5 "Failed to pull image"

# Common causes:
# 1. Image doesn't exist / wrong tag
docker pull <image:tag>  # test locally

# 2. Private registry — missing imagePullSecret
kubectl get secret -n <namespace> | grep docker
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Then reference it in your pod spec:
# spec:
#   imagePullSecrets:
#   - name: regcred

# 3. Rate limiting (DockerHub) — check with
kubectl describe pod <pod> | grep -i "toomanyrequests"
# Fix: use authenticated pulls or mirror to ECR/GCR

Service not reachable — the networking checklist

Service connectivity failures span five layers: the Service object, its Endpoints, the pod itself, NetworkPolicy, and kube-proxy/CNI. Debug each layer in sequence.

# Layer 1: Does the Service exist and have the right selector?
kubectl get svc <service-name> -n <namespace> -o yaml
# Check: spec.selector matches pod labels exactly

# Layer 2: Are there Endpoints? (if empty, selector is wrong or pods aren't Ready)
kubectl get endpoints <service-name> -n <namespace>
# Empty endpoints = pods not matching selector OR pods not passing readiness probe

# Layer 3: Is the pod itself responding?
kubectl exec -it <debug-pod> -- curl http://<pod-ip>:<port>

# Layer 4: Test via service DNS
kubectl exec -it <debug-pod> -- curl http://<service>.<namespace>.svc.cluster.local:<port>

# Layer 5: Check NetworkPolicy — are there policies blocking traffic?
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name> -n <namespace>
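A selector/label mismatch is the usual culprit behind empty Endpoints. A quick side-by-side check (placeholders as above):

```shell
# Print the Service selector, then the actual pod labels — they must match exactly
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
echo
kubectl get pods -n <namespace> --show-labels
# A single-character difference (app=web vs app=webapp) means zero endpoints
```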

DNS resolution failures

# Test DNS from inside the cluster
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# CoreDNS ConfigMap — check for misconfigurations
kubectl get configmap coredns -n kube-system -o yaml
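If CoreDNS itself looks healthy, check the failing pod's resolver config — it should point at the cluster DNS Service IP (`kube-dns` is the default Service name even on CoreDNS installs):

```shell
# nameserver here should be the cluster DNS Service IP
kubectl exec <pod> -- cat /etc/resolv.conf

# Which should match:
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```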

OOMKilled — memory limit diagnosis and tuning

# Confirm OOMKill — look for exit code 137 or OOMKilled reason
kubectl describe pod <pod> | grep -iE "OOMKilled|exit code|137"

# See current memory usage vs limits
kubectl top pod <pod> -n <namespace> --containers

# See what limits are currently set
kubectl get pod <pod> -o json | jq '.spec.containers[].resources'

# Fix: increase memory limit in deployment
kubectl set resources deployment <name> \
  --containers=<container> \
  --requests=memory=256Mi \
  --limits=memory=512Mi

# Better: use VPA (Vertical Pod Autoscaler) recommendations
kubectl describe vpa <name> | grep -A6 "Target"

Don't just raise limits blindly. OOMKilled is often a memory leak in the application, not an under-provisioned limit. Monitor memory over time with Prometheus before deciding whether to raise limits or fix the leak.

Node not ready — control plane to node communication

# Get node status and conditions
kubectl describe node <node-name> | grep -A10 Conditions

# Conditions to look for:
# MemoryPressure=True  → Node running out of memory
# DiskPressure=True    → Node running out of disk
# PIDPressure=True     → Too many processes
# Ready=False          → kubelet not communicating with control plane

# SSH to the node and check kubelet
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Check disk usage on the node
df -h
du -sh /var/lib/docker/overlay2/* 2>/dev/null | sort -hr | head -20
# (containerd-based nodes: check /var/lib/containerd instead)

# Force evict pods from a problem node gracefully
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Exec into a running container for live debugging

# Shell into a running container
kubectl exec -it <pod> -n <namespace> -- /bin/bash
# If bash isn't available:
kubectl exec -it <pod> -- /bin/sh

# Run a debug command without interactive shell
kubectl exec <pod> -- cat /etc/config/app.yaml
kubectl exec <pod> -- env | sort
kubectl exec <pod> -- wget -qO- http://localhost:8080/health

# Debug a distroless container (no shell) using ephemeral containers
kubectl debug -it <pod> --image=busybox --target=<container> -- sh

# Copy files out of a pod for inspection
kubectl cp <namespace>/<pod>:/var/log/app.log ./app.log

RBAC permission errors

# Test if a service account can perform an action
kubectl auth can-i get pods \
  --as=system:serviceaccount:<namespace>:<serviceaccount> \
  -n <namespace>

# See all permissions for a service account
kubectl get rolebindings,clusterrolebindings -A \
  -o json | jq '.items[] | select(.subjects[]?.name=="<sa-name>")'

# Check what a role actually grants
kubectl describe clusterrole <role-name>

# Common fix: bind a service account to the right role
kubectl create rolebinding <binding-name> \
  --role=<role-name> \
  --serviceaccount=<namespace>:<sa-name> \
  -n <namespace>

Persistent volume issues

# Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>

# PVC stuck in Pending — common causes:
# 1. No StorageClass matching the request
kubectl get storageclass

# 2. Wrong access mode (ReadWriteOnce can only bind to one node)
# 3. Capacity not available in StorageClass

# PV stuck in Released — needs manual cleanup to rebind
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'

# Check volume mounts inside pod
kubectl exec <pod> -- df -h
kubectl exec <pod> -- ls -la /mnt/data

Production runbook: the full sequence

When you get paged at 2am, run this sequence. It covers 95% of production Kubernetes incidents in under 10 minutes:

# 1. What's broken right now?
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# 2. Recent events — find the first error
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "warning\|error\|fail" | tail -20

# 3. Describe the broken pod
kubectl describe pod <broken-pod> -n <namespace>

# 4. Logs — current and previous
kubectl logs <pod> --previous --tail=100

# 5. Node pressure?
kubectl top nodes

# 6. Recent deployments — was something just pushed?
kubectl rollout history deployment -n <namespace>

# 7. If a bad deploy — rollback immediately, diagnose after
kubectl rollout undo deployment/<name> -n <namespace>
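After the rollback, confirm it actually converged before closing the page — a stuck rollout fails loudly here instead of silently:

```shell
# Blocks until the rollout completes, or exits non-zero on timeout
kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
```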

Frequently asked questions

How do I find which node a pod is running on?

kubectl get pod <pod-name> -o wide
# The NODE column shows the node name

How do I restart a deployment without changing anything?

kubectl rollout restart deployment/<name> -n <namespace>
# This triggers a rolling restart — no downtime for replicas > 1

How do I see the full YAML of a running pod including injected fields?

kubectl get pod <pod> -o yaml
# This shows the actual running spec, not just what you applied

Pod is Running but my app still isn't working — where do I look?

Running status only means the container process started. Check: readiness probe status (kubectl describe pod → Conditions → Ready), application logs for startup errors, and whether the service endpoint is populated (kubectl get endpoints).
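A one-liner for the first of those checks — the pod's Ready condition (the filter expression is standard kubectl JSONPath):

```shell
# "True" = passing readiness probe; "False" = running but not receiving traffic
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```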

How do I debug a pod that exits too fast to exec into?

# Override the entrypoint to keep it alive for inspection
kubectl run debug-pod \
  --image=<your-image> \
  --rm -it \
  --restart=Never \
  --command -- sleep 3600
# Then exec in and poke around
kubectl exec -it debug-pod -- /bin/sh
Gaurav Kaushal
SENIOR DEVOPS ENGINEER · OPTUM / UHG

8+ years managing large-scale infrastructure, CI/CD systems, and Kubernetes clusters in enterprise environments. Currently at Optum / UnitedHealth Group. I write about what I've learned the hard way — not docs rewrites, real production lessons.
