Both blue-green and canary deployments are called “zero-downtime deployment strategies.” That’s true, but it understates the difference. They’re actually solving different problems, and using the wrong one for your situation costs you either money or confidence.
Blue-green: you want an instant rollback
Blue-green is the right choice when your primary concern is instant, complete rollback. You maintain two identical production environments. Traffic switches 100% from one to the other. If something breaks, you flip back.
The cost: double the infrastructure during the transition window. For a fleet of EC2s or EKS node groups, that’s real money.
The benefit: rollback is a single load balancer target group swap or DNS change. Nothing is gradual. With a load balancer swap, users are off a bad deploy in seconds; a DNS-based switch is only as fast as the record's TTL and client caching allow.
```yaml
# Argo Rollouts: blue-green (fragment; replicas, selector, and pod template omitted)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  strategy:
    blueGreen:
      activeService: api-active      # production traffic
      previewService: api-preview    # new version, no traffic
      autoPromotionEnabled: false    # manual promotion required
      scaleDownDelaySeconds: 30      # keep blue alive for rollback
```
With this setup, the new version starts out serving no production traffic. Your smoke tests run against api-preview, and a human (or an automated gate) promotes it when ready.
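One way to automate that gate, as a rough sketch: Argo Rollouts can run an analysis before promotion through `prePromotionAnalysis`. The template name `smoke-tests`, the `curlimages/curl` image, and the `/healthz` path below are illustrative assumptions, not part of the setup above.

```yaml
# Sketch only: template name, image, and health path are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-tests                 # hypothetical template name
spec:
  metrics:
  - name: preview-smoke
    provider:
      job:                          # runs a Kubernetes Job; the metric passes if the Job succeeds
        spec:
          backoffLimit: 0           # do not retry a failed smoke test
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: smoke
                image: curlimages/curl   # assumed image
                # --fail makes curl exit non-zero on HTTP errors; path is assumed
                command: ["curl", "--fail", "--silent", "http://api-preview/healthz"]
---
# Fragment of the Rollout's blue-green strategy that references the template
spec:
  strategy:
    blueGreen:
      activeService: api-active
      previewService: api-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
        - templateName: smoke-tests
```

A failed analysis should stop the rollout before any traffic moves. How the analysis interacts with the manual promotion setting is worth confirming against the Argo Rollouts documentation for the version you run.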
Canary: you want production signal before full rollout
Canary is the right choice when you need production traffic to validate behaviour. You can’t replicate production load in staging. Some bugs only surface under real conditions.
The tradeoff: a percentage of real users hit the new version before you’re confident it works. You need good observability to catch problems early.
```yaml
# Argo Rollouts: canary (fragment; replicas, selector, and pod template omitted)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5        # 5% of traffic
      - pause: {}           # manual gate: review metrics
      - setWeight: 25
      - pause:
          duration: 10m     # timed pause, resumes automatically, then...
      - setWeight: 100      # full rollout
      canaryService: api-canary
      stableService: api-stable
```
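The 5% → pause → 25% → pause → 100% pattern is what I use for most stateless services. The first pause is manual: someone reviews error rates and latency, then promotes. The second is a timed 10-minute pause that resumes on its own. By itself it does not check anything, so it only works as a metrics gate if an analysis runs alongside it and aborts the rollout when the numbers look bad.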
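If you want the second gate to depend on metrics rather than on a timer, one option is an inline `analysis` step. This is a minimal sketch of the `steps` list only; the template name `error-rate-and-latency` is hypothetical and would have to exist as an AnalysisTemplate in the same namespace.

```yaml
# Fragment of spec.strategy.canary (sketch; template name is an assumption)
steps:
- setWeight: 5
- pause: {}                 # manual gate, as before
- setWeight: 25
- analysis:                 # blocks here until the AnalysisRun completes
    templates:
    - templateName: error-rate-and-latency   # hypothetical AnalysisTemplate
- setWeight: 100            # reached only if the analysis succeeds
```

With this shape the rollout holds at 25% for as long as the analysis runs, so the template's measurement interval and count decide how long the gate lasts.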
The real decision criteria
| Situation | Strategy |
|---|---|
| Financial or regulated system | Blue-green — rollback speed matters more than cost |
| High-traffic consumer feature | Canary — need production signal |
| Database schema migration | Blue-green — both versions must work against same schema |
| A/B testing use case | Canary with header-based routing |
| Compliance requires audit trail | Blue-green — clear before/after state |
The database schema constraint is the one people miss. If your deploy includes a schema migration, both versions run against the same database during the transition, so blue-green requires the new schema to be backward-compatible with the old code. That means additive-only migrations (new columns, new tables) while both versions are live, then cleanup in a follow-up deploy.
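For the A/B testing row in the table, header-based routing in Argo Rollouts might look roughly like the fragment below. It assumes Istio is the traffic router and that a VirtualService already exists; the route name, VirtualService name, and header are all hypothetical.

```yaml
# Fragment of a Rollout spec (sketch; assumes Istio, names are placeholders)
spec:
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        managedRoutes:
        - name: ab-header-route            # hypothetical route name, managed by the controller
        istio:
          virtualService:
            name: api-service-vsvc         # hypothetical existing VirtualService
      steps:
      - setHeaderRoute:
          name: ab-header-route            # must match an entry in managedRoutes
          match:
          - headerName: X-Experiment       # hypothetical header
            headerValue:
              exact: variant-b
      - pause: {}                          # hold here while the experiment runs
```

Requests carrying the header go to the canary, everyone else stays on stable. Support for `setHeaderRoute` depends on the traffic router, so check that yours implements it before relying on this.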
What I actually run on EKS
For most microservices: canary with Argo Rollouts, 5%/25%/100% steps, automated analysis against Prometheus metrics for error rate and p99 latency.
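As an illustration of what such an analysis can look like, here is a sketch of an AnalysisTemplate using the Prometheus provider. The Prometheus address, metric names, label values, and thresholds are assumptions and will differ per cluster.

```yaml
# Sketch only: address, metric names, labels, and thresholds are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-and-latency            # hypothetical template name
spec:
  metrics:
  - name: error-rate
    interval: 1m                          # measure once a minute
    count: 10                             # ten measurements in total
    failureLimit: 2                       # tolerate up to two failed measurements
    successCondition: result[0] < 0.01    # assumed threshold: under 1% 5xx responses
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # assumed address
        query: |
          sum(rate(http_requests_total{service="api-canary",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api-canary"}[5m]))
  - name: p99-latency
    interval: 1m
    count: 10
    failureLimit: 2
    successCondition: result[0] < 0.5     # assumed threshold: 0.5 seconds
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # assumed address
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="api-canary"}[5m])) by (le))
```

Both queries return a single value, which is why the conditions index `result[0]`. A query that returns no data produces an error rather than a pass, so it helps to confirm the canary is receiving traffic before the analysis starts.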
For anything touching payments or auth: blue-green with manual promotion gate, 30-minute warm window before switching active service.
The warm window matters. You want the new version to have processed enough requests that JVM warm-up, connection pool population, and cache warming are done before you declare it production.