Why EKS upgrades are painful and how to make them not be
Kubernetes version upgrades are one of the most anxiety-inducing operational tasks in platform engineering. Not because the process is technically complex — AWS has made the control plane upgrade largely push-button — but because the blast radius of getting it wrong is enormous and the payoff for doing it right is invisible. Nobody congratulates you for an upgrade that causes zero downtime; everybody notices the one that doesn't.
After running multiple EKS upgrades across production clusters — some smooth, some not — here's the strategy that actually works.
Understand what's actually upgrading
An EKS upgrade has three distinct components, which must be handled separately and in this order:
- Control plane — upgraded by AWS via the console or CLI. Managed, relatively safe.
- Managed node groups / Fargate — upgraded after the control plane.
- Add-ons — kube-proxy, CoreDNS, VPC CNI, EBS CSI driver — each has its own upgrade process and compatibility matrix.
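Each add-on's compatibility matrix can be queried straight from the CLI. A sketch that asks EKS for the default build of each managed add-on for the target version — the JMESPath query follows the documented `describe-addon-versions` response shape, but verify it against your CLI version:

```shell
# Print the default add-on build for a target Kubernetes version.
# The --query path is an assumption based on the documented response
# shape (addons[].addonVersions[].compatibilities[].defaultVersion).
default_addon_versions() {
  target=$1
  for addon in vpc-cni coredns kube-proxy aws-ebs-csi-driver; do
    printf '%s: ' "$addon"
    aws eks describe-addon-versions \
      --addon-name "$addon" \
      --kubernetes-version "$target" \
      --query 'addons[0].addonVersions[?compatibilities[0].defaultVersion].addonVersion | [0]' \
      --output text
  done
}
# default_addon_versions 1.29
```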
The pre-upgrade checklist
# 1. Check current versions
kubectl version
kubectl get nodes -o wide
aws eks describe-addon-versions --kubernetes-version 1.29
# 2. Check for deprecated APIs in use
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
# Better: use Pluto to scan your cluster and Helm releases
pluto detect-all-in-cluster
pluto detect-helm -owide
# 3. Review AWS release notes for the target version
# https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html
# 4. Check PodDisruptionBudgets won't block node drains
kubectl get pdb -A
Pluto is the most important tool in the pre-upgrade checklist. Deprecated API usage is one of the most common causes of upgrade failures — your Helm charts or custom resources may be using API versions removed in the target Kubernetes version, and you won't find out until the upgrade breaks something.
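Pluto can also scan rendered chart output before it ever reaches the cluster, which makes it easy to wire into CI. A minimal sketch — the release name, chart path, and values file are placeholders:

```shell
# Render a chart and scan the output for APIs deprecated or removed in
# the target version. "my-app", the chart path, and the values file are
# placeholders for your environment.
scan_chart() {
  helm template my-app ./charts/my-app --values values-prod.yaml \
    | pluto detect - --target-versions k8s=v1.29.0
}
# scan_chart   # exits non-zero when Pluto finds problems
```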
The blue-green cluster strategy
For production clusters with strict uptime requirements, the safest upgrade strategy isn't upgrading in place — it's building a new cluster on the target version and migrating workloads.
# Terraform: new cluster module pointing to new version
module "eks_green" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "production-green"
  cluster_version = "1.29" # new version

  # Same VPC, same subnets as blue cluster
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}
The workflow:
- Provision green cluster on new Kubernetes version
- Deploy all workloads to green cluster
- Run smoke tests and validate all services are healthy
- Shift traffic gradually (10% → 50% → 100%) via weighted DNS or load balancer
- Monitor error rates at each step
- Decommission blue cluster after 24–48 hours of stable green
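One way to implement the gradual traffic shift is Route 53 weighted records, assuming each cluster sits behind its own load balancer. A sketch — the hosted zone ID, record name, and ALB hostnames are all placeholders:

```shell
# Shift traffic between blue and green by adjusting Route 53 weighted
# records. Zone ID, record name, and ALB targets are placeholders.
shift_weight() {
  blue_weight=$1; green_weight=$2
  aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --change-batch "$(cat <<EOF
{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "blue", "Weight": ${blue_weight},
      "ResourceRecords": [{"Value": "blue-alb.us-east-1.elb.amazonaws.com"}]}},
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "green", "Weight": ${green_weight},
      "ResourceRecords": [{"Value": "green-alb.us-east-1.elb.amazonaws.com"}]}}
  ]
}
EOF
)"
}
# shift_weight 90 10   # then 50 50, then 0 100 as error rates stay flat
```

The low TTL matters: it bounds how long stale resolutions keep sending traffic to the old weights during each step.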
In-place upgrade: the careful approach
If a full blue-green migration isn't feasible, in-place upgrade can be done safely with the right approach:
# Step 1: Upgrade control plane
aws eks update-cluster-version --name production --kubernetes-version 1.29 --region us-east-1
# Wait for update to complete (~10-15 minutes)
aws eks wait cluster-active --name production
# Step 2: Update add-ons (check compatibility first; consider pinning
# --addon-version for coredns and kube-proxy as well)
aws eks update-addon --cluster-name production --addon-name vpc-cni --addon-version v1.16.0-eksbuild.1
aws eks update-addon --cluster-name production --addon-name coredns
aws eks update-addon --cluster-name production --addon-name kube-proxy
# Step 3: Upgrade managed node groups
aws eks update-nodegroup-version --cluster-name production --nodegroup-name general --kubernetes-version 1.29
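`update-nodegroup-version` returns immediately with an update ID, so it helps to poll until the rollout actually finishes. A sketch built on `aws eks describe-update` — the `--query` paths follow the documented response shape:

```shell
# Kick off a node group upgrade and poll until the update completes.
upgrade_nodegroup() {
  cluster=$1; nodegroup=$2; version=$3
  update_id=$(aws eks update-nodegroup-version \
    --cluster-name "$cluster" --nodegroup-name "$nodegroup" \
    --kubernetes-version "$version" \
    --query 'update.id' --output text)
  while true; do
    status=$(aws eks describe-update \
      --name "$cluster" --nodegroup-name "$nodegroup" \
      --update-id "$update_id" \
      --query 'update.status' --output text)
    echo "update $update_id: $status"
    case "$status" in
      Successful) return 0 ;;
      Failed|Cancelled) return 1 ;;
    esac
    sleep 30
  done
}
# upgrade_nodegroup production general 1.29
```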
Node group upgrade — what actually happens
When you upgrade a managed node group, AWS launches new nodes on the new version, cordons and drains old nodes one at a time, and terminates them after workloads have migrated. This is where PodDisruptionBudgets matter — if your PDBs are too strict, the drain will stall.
# Watch node upgrade progress
kubectl get nodes -w
# If a drain stalls, check which pods are blocking
kubectl get pods -A -o wide | grep <old-node-name>
# Check PDB status
kubectl get pdb -A
kubectl describe pdb <name> -n <namespace>
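The usual culprit is a single-replica workload guarded by `minAvailable: 1`: no voluntary disruption is ever allowed, so the drain waits forever. The durable fix is running more replicas, but during an upgrade window you can temporarily relax the budget. A sketch — the PDB name and namespace are placeholders, and the change should be reverted (or re-applied from git) afterwards:

```shell
# Temporarily relax a PDB that is blocking a drain: clear minAvailable
# and allow one voluntary disruption. "my-app" and "my-namespace" are
# placeholders; revert after the upgrade.
relax_pdb() {
  kubectl patch pdb "$1" -n "$2" --type merge \
    -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'
}
# relax_pdb my-app my-namespace
```

A merge patch with `null` deletes the key, which matters here because a PDB may set only one of `minAvailable` and `maxUnavailable`.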
Post-upgrade validation
# Verify all nodes on new version
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
# Check all system pods healthy
kubectl get pods -n kube-system
# Verify add-on versions
aws eks list-addons --cluster-name production
aws eks describe-addon --cluster-name production --addon-name vpc-cni
# Run your smoke test suite
kubectl apply -f tests/smoke-test-job.yaml
kubectl wait --for=condition=complete job/smoke-test --timeout=300s
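The node version check above can be turned into a hard assertion for CI. A sketch — the kubelet version strings it parses (e.g. `v1.29.3-eks-ae9a62a`) are representative examples, and the check itself is pure string matching:

```shell
# Fail loudly if any node still reports an old kubelet. Reads one
# kubelet version per line on stdin; the kubectl one-liner that feeds
# it is shown commented out below.
target_minor="v1.29."
nodes_on_target() {
  while read -r v; do
    case "$v" in
      "$target_minor"*) ;;                       # node is on the target minor
      *) echo "node still on $v"; return 1 ;;    # anything else fails the check
    esac
  done
  echo "all nodes on ${target_minor}x"
}
# kubectl get nodes \
#   -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
#   | nodes_on_target
```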