Blue-Green vs Canary: when each strategy actually makes sense

Both blue-green and canary deployments get called “zero-downtime deployment strategies.” That’s true, but the shared label hides how different they are. They solve different problems, and picking the wrong one for your situation costs you either money or confidence.

Blue-green: you want an instant rollback

Blue-green is the right choice when your primary concern is instant, complete rollback. You maintain two identical production environments. Traffic switches 100% from one to the other. If something breaks, you flip back.

The cost: double the infrastructure during the transition window. For a fleet of EC2 instances or EKS node groups, that’s real money.

The benefit: rollback is a single load balancer target group swap or DNS change. No gradual anything. If your deploy is bad, users are off it in seconds (with a DNS switch, only as fast as TTLs and client caches allow).

# Argo Rollouts: blue-green (strategy only; replicas, selector, and pod template omitted)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  strategy:
    blueGreen:
      activeService: api-active     # production traffic
      previewService: api-preview   # new version, no traffic
      autoPromotionEnabled: false   # manual promotion required
      scaleDownDelaySeconds: 30     # keep blue alive for rollback

With this setup, the new version comes up receiving no production traffic. Your smoke tests run against api-preview. A human (or an automated gate) promotes it when ready.
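
The automated gate can be expressed in Argo Rollouts as a pre-promotion analysis. Here is a rough sketch, not my exact config: the template name, the curl image, and the /healthz URL on port 80 are all assumptions for illustration. As I read the documented sequence, the analysis starts once the preview pods are available, and with autoPromotionEnabled: false the rollout still waits for a manual promote after it succeeds.

```yaml
# Sketch only: template name, image, and URL are illustrative assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: preview-smoke-test          # hypothetical name
spec:
  metrics:
    - name: smoke
      provider:
        job:
          spec:
            backoffLimit: 0         # single attempt; a failed Job fails the analysis
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl:8.8.0   # assumed tag; any image with curl works
                    # --fail makes curl exit non-zero on HTTP 4xx/5xx
                    command: ["curl", "--fail", "--silent", "http://api-preview/healthz"]
---
# Fragment of the blue-green Rollout spec with the gate added.
spec:
  strategy:
    blueGreen:
      activeService: api-active
      previewService: api-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: preview-smoke-test
```

If the Job fails, the rollout is aborted and api-active never moves, which is the whole point of testing against the preview service first.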

Canary: you want production signal before full rollout

Canary is the right choice when you need production traffic to validate behaviour. You can’t replicate production load in staging. Some bugs only surface under real conditions.

The tradeoff: a percentage of real users hit the new version before you’re confident it works. You need good observability to catch problems early.

# Argo Rollouts: canary (strategy only; replicas, selector, and pod template omitted)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5      # ~5% of traffic (approximated by pod ratio unless a traffic router is configured)
        - pause: {}         # manual gate — review metrics
        - setWeight: 25
        - pause:
            duration: 10m   # automatic pause, then...
        - setWeight: 100    # full rollout
      canaryService: api-canary
      stableService: api-stable

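The 5% → pause → 25% → pause → 100% pattern is what I use for most stateless services. The first pause is manual: someone reviews error rates and latency, then promotes. The second resumes on its own after 10 minutes. By itself that is just a timer; it only depends on clean metrics when an analysis run is attached to the rollout and can abort it.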

The real decision criteria

Situation	Strategy
Financial or regulated system	Blue-green — rollback speed matters more than cost
High-traffic consumer feature	Canary — need production signal
Database schema migration	Blue-green — both versions must work against same schema
A/B testing use case	Canary with header-based routing
Compliance requires audit trail	Blue-green — clear before/after state
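
For the A/B testing row, header-based routing in Argo Rollouts could look roughly like the fragment below. It assumes a traffic router that supports header routes (Istio here) and an existing VirtualService; the route name, VirtualService name, and header are hypothetical.

```yaml
# Sketch only: assumes Istio; api-vsvc, ab-header-route, and X-Experiment are made up.
spec:
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        managedRoutes:
          - name: ab-header-route     # must match the setHeaderRoute name below
        istio:
          virtualService:
            name: api-vsvc
      steps:
        - setCanaryScale:
            replicas: 1               # run canary pods without shifting weighted traffic
        - setHeaderRoute:
            name: ab-header-route
            match:
              - headerName: X-Experiment
                headerValue:
                  exact: variant-b    # only requests carrying this header hit the canary
        - pause: {}                   # hold here while the experiment runs
```

Everyone without the header stays on the stable version, so exposure is opt-in rather than a random percentage.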

The database schema constraint is the one people miss. If your deploy includes a schema migration, blue-green requires the new schema to be backward-compatible with the old code. This means additive-only migrations (new columns, new tables) during the transition, then cleanup in a follow-up deploy.

What I actually run on EKS

For most microservices: canary with Argo Rollouts, 5%/25%/100% steps, automated analysis against Prometheus metrics for error rate and p99 latency.
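
A minimal sketch of what that automated analysis could look like. The Prometheus address, metric names, label values, and thresholds are assumptions; swap in whatever your services actually export.

```yaml
# Sketch only: address, metric names, labels, and thresholds are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-health                    # hypothetical name
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 2                 # a third failed sample fails the analysis
      successCondition: result[0] < 0.01   # under 1% 5xx responses
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
    - name: p99-latency
      interval: 1m
      failureLimit: 2
      successCondition: result[0] < 0.5    # under 500 ms
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le))
---
# Fragment of the canary Rollout spec: run the template in the background.
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: api-health
        startingStep: 1               # zero-based; starts once the first setWeight is applied
        args:
          - name: service-name
            value: api-canary         # assumes metrics carry a matching service label
```

Run as background analysis like this, a failing metric aborts the rollout and traffic goes back to the stable version without anyone watching a dashboard.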

For anything touching payments or auth: blue-green with a manual promotion gate and a 30-minute warm window before switching the active service.

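The warm window matters. You want the new version to have processed enough requests that JVM warm-up, connection pool population, and cache warming are done before you declare it production. Because the preview service gets no production traffic, those requests have to come from smoke tests or synthetic load.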