Why EKS upgrades are painful and how to make them not be
Kubernetes version upgrades are one of the most anxiety-inducing operational tasks in platform engineering. Not because the process is technically complex — AWS has made the control plane upgrade largely push-button — but because the blast radius of getting it wrong is enormous and the payoff for doing it right is invisible. Nobody congratulates you for an upgrade that causes zero downtime; everybody notices the one that doesn't.
After running multiple EKS upgrades across production clusters — some smooth, some not — here's the strategy that actually works.
Understand what's actually upgrading
An EKS upgrade has three distinct components, which must be handled separately and in this order:
- Control plane — upgraded by AWS via the console or CLI. Managed, relatively safe.
- Managed node groups / Fargate — upgraded after the control plane.
- Add-ons — kube-proxy, CoreDNS, VPC CNI, EBS CSI driver — each has its own upgrade process and compatibility matrix.
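Each add-on's compatibility matrix can be queried straight from the CLI. A sketch that asks EKS for the default build of each managed add-on for the target version — the JMESPath query follows the documented `describe-addon-versions` response shape, but verify it against your CLI version:

```shell
# Print the default add-on build for a target Kubernetes version.
# The --query path is an assumption based on the documented response
# shape (addons[].addonVersions[].compatibilities[].defaultVersion).
default_addon_versions() {
  target=$1
  for addon in vpc-cni coredns kube-proxy aws-ebs-csi-driver; do
    printf '%s: ' "$addon"
    aws eks describe-addon-versions \
      --addon-name "$addon" \
      --kubernetes-version "$target" \
      --query 'addons[0].addonVersions[?compatibilities[0].defaultVersion].addonVersion | [0]' \
      --output text
  done
}
# default_addon_versions 1.29
```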
The pre-upgrade checklist
# 1. Check current versions
kubectl version
kubectl get nodes -o wide
aws eks describe-addon-versions --kubernetes-version 1.29
# 2. Check for deprecated APIs in use
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
# Better: use Pluto to scan your cluster and Helm releases
pluto detect-all-in-cluster
pluto detect-helm -owide
# 3. Review AWS release notes for the target version
# https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html
# 4. Check PodDisruptionBudgets won't block node drains
kubectl get pdb -A
Pluto is the most important tool in the pre-upgrade checklist. Deprecated API usage is one of the most common causes of upgrade failures — your Helm charts or custom resources may be using API versions removed in the target Kubernetes version, and you won't find out until the upgrade breaks something.
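Pluto can also scan rendered chart output before it ever reaches the cluster, which makes it easy to wire into CI. A minimal sketch — the release name, chart path, and values file are placeholders:

```shell
# Render a chart and scan the output for APIs deprecated or removed in
# the target version. "my-app", the chart path, and the values file are
# placeholders for your environment.
scan_chart() {
  helm template my-app ./charts/my-app --values values-prod.yaml \
    | pluto detect - --target-versions k8s=v1.29.0
}
# scan_chart   # exits non-zero when Pluto finds problems
```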
The blue-green cluster strategy
For production clusters with strict uptime requirements, the safest upgrade strategy isn't upgrading in place — it's building a new cluster on the target version and migrating workloads.
# Terraform: new cluster module pointing to new version
module "eks_green" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "production-green"
  cluster_version = "1.29" # new version

  # Same VPC, same subnets as blue cluster
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}
The workflow:
- Provision green cluster on new Kubernetes version
- Deploy all workloads to green cluster
- Run smoke tests and validate all services are healthy
- Shift traffic gradually (10% → 50% → 100%) via weighted DNS or load balancer
- Monitor error rates at each step
- Decommission blue cluster after 24–48 hours of stable green
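One way to implement the gradual traffic shift is Route 53 weighted records, assuming each cluster sits behind its own load balancer. A sketch — the hosted zone ID, record name, and ALB hostnames are all placeholders:

```shell
# Shift traffic between blue and green by adjusting Route 53 weighted
# records. Zone ID, record name, and ALB targets are placeholders.
shift_weight() {
  blue_weight=$1; green_weight=$2
  aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --change-batch "$(cat <<EOF
{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "blue", "Weight": ${blue_weight},
      "ResourceRecords": [{"Value": "blue-alb.us-east-1.elb.amazonaws.com"}]}},
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "green", "Weight": ${green_weight},
      "ResourceRecords": [{"Value": "green-alb.us-east-1.elb.amazonaws.com"}]}}
  ]
}
EOF
)"
}
# shift_weight 90 10   # then 50 50, then 0 100 as error rates stay flat
```

The low TTL matters: it bounds how long stale resolutions keep sending traffic to the old weights during each step.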
In-place upgrade: the careful approach
If a full blue-green migration isn't feasible, in-place upgrade can be done safely with the right approach:
# Step 1: Upgrade control plane
aws eks update-cluster-version --name production --kubernetes-version 1.29 --region us-east-1
# Wait for update to complete (~10-15 minutes)
aws eks wait cluster-active --name production
# Step 2: Update add-ons (check compatibility first; consider pinning
# --addon-version for coredns and kube-proxy as well)
aws eks update-addon --cluster-name production --addon-name vpc-cni --addon-version v1.16.0-eksbuild.1
aws eks update-addon --cluster-name production --addon-name coredns
aws eks update-addon --cluster-name production --addon-name kube-proxy
# Step 3: Upgrade managed node groups
aws eks update-nodegroup-version --cluster-name production --nodegroup-name general --kubernetes-version 1.29
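`update-nodegroup-version` returns immediately with an update ID, so it helps to poll until the rollout actually finishes. A sketch built on `aws eks describe-update` — the `--query` paths follow the documented response shape:

```shell
# Kick off a node group upgrade and poll until the update completes.
upgrade_nodegroup() {
  cluster=$1; nodegroup=$2; version=$3
  update_id=$(aws eks update-nodegroup-version \
    --cluster-name "$cluster" --nodegroup-name "$nodegroup" \
    --kubernetes-version "$version" \
    --query 'update.id' --output text)
  while true; do
    status=$(aws eks describe-update \
      --name "$cluster" --nodegroup-name "$nodegroup" \
      --update-id "$update_id" \
      --query 'update.status' --output text)
    echo "update $update_id: $status"
    case "$status" in
      Successful) return 0 ;;
      Failed|Cancelled) return 1 ;;
    esac
    sleep 30
  done
}
# upgrade_nodegroup production general 1.29
```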
Node group upgrade — what actually happens
When you upgrade a managed node group, AWS launches new nodes on the new version, cordons and drains old nodes one at a time, and terminates them after workloads have migrated. This is where PodDisruptionBudgets matter — if your PDBs are too strict, the drain will stall.
# Watch node upgrade progress
kubectl get nodes -w
# If a drain stalls, check which pods are blocking
kubectl get pods -A -o wide | grep <old-node-name>
# Check PDB status
kubectl get pdb -A
kubectl describe pdb <name> -n <namespace>
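The usual culprit is a single-replica workload guarded by `minAvailable: 1`: no voluntary disruption is ever allowed, so the drain waits forever. The durable fix is running more replicas, but during an upgrade window you can temporarily relax the budget. A sketch — the PDB name and namespace are placeholders, and the change should be reverted (or re-applied from git) afterwards:

```shell
# Temporarily relax a PDB that is blocking a drain: clear minAvailable
# and allow one voluntary disruption. "my-app" and "my-namespace" are
# placeholders; revert after the upgrade.
relax_pdb() {
  kubectl patch pdb "$1" -n "$2" --type merge \
    -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'
}
# relax_pdb my-app my-namespace
```

A merge patch with `null` deletes the key, which matters here because a PDB may set only one of `minAvailable` and `maxUnavailable`.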
Post-upgrade validation
# Verify all nodes on new version
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
# Check all system pods healthy
kubectl get pods -n kube-system
# Verify add-on versions
aws eks list-addons --cluster-name production
aws eks describe-addon --cluster-name production --addon-name vpc-cni
# Run your smoke test suite
kubectl apply -f tests/smoke-test-job.yaml
kubectl wait --for=condition=complete job/smoke-test --timeout=300s
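The node version check above can be turned into a hard assertion for CI. A sketch — the kubelet version strings it parses (e.g. `v1.29.3-eks-ae9a62a`) are representative examples, and the check itself is pure string matching:

```shell
# Fail loudly if any node still reports an old kubelet. Reads one
# kubelet version per line on stdin; the kubectl one-liner that feeds
# it is shown commented out below.
target_minor="v1.29."
nodes_on_target() {
  while read -r v; do
    case "$v" in
      "$target_minor"*) ;;                       # node is on the target minor
      *) echo "node still on $v"; return 1 ;;    # anything else fails the check
    esac
  done
  echo "all nodes on ${target_minor}x"
}
# kubectl get nodes \
#   -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
#   | nodes_on_target
```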