DevOps · Cloud · Platform Engineering
From the Trenches
Long-form writing on DevOps, Kubernetes, cloud infrastructure, and platform engineering. No fluff. No hot takes. Just the stuff that matters from 8 years of running production systems. By Gaurav Kaushal, Lead DevOps Engineer.
8+ Years in production
AWS Primary cloud
K8s Daily driver
20 Articles published
Writing
All Articles
20 posts covering DevOps, cloud infrastructure, and platform engineering.
Monorepo vs Multi-repo: An Honest Take After Operating Both
This debate has been running for a decade without a clean answer, because there isn't one. Here are the real tradeoffs from someone who has operated both approaches at scale, and when each actually makes sense.
Platform Engineering in 2026 — What's Actually Changed
Three years into the platform engineering wave, the hype has settled. Here's what's genuinely different in 2026 — IDPs, golden paths, the product team model — and what teams that did the work actually have to show for it.
HashiCorp Vault in Production: Secrets Management for Kubernetes
Kubernetes Secrets are base64-encoded, not encrypted, by default. Here's how to set up Vault properly: Kubernetes auth, agent injection, secret rotation without restarts, and the audit trail regulated environments require.
AI in DevOps: What's Actually Useful vs What's Still Hype in 2026
After 18 months of using AI tooling in production environments, here's the honest breakdown — where it saves real time, where the demos were better than the reality, and what's actually worth watching.
EKS Upgrade Strategies: How to Upgrade Kubernetes Without Downtime
Control plane, node groups, add-ons: EKS upgrades have three moving parts, and a wrong step in any of them can cause a production incident. Here's the strategy that actually works, including blue-green cluster migration.
When Internal Tooling Becomes a Barrier: A DevOps Mindset Problem
A simple DNS change turned into a multi-hour process. This isn't just a tooling issue — it's a mindset issue. How we build internal tools matters as much as how we build external products.
Cut AWS EC2 Costs by 60% with Instance Scheduling: A Complete Guide
Running non-production EC2 instances 24/7 is one of the most common, and most fixable, sources of AWS waste. Here's the Lambda + EventBridge scheduler that fixed it, with the full Terraform and code.
Hands-On Always Beats Theory: How to Actually Learn DevOps
After 8 years in infrastructure, one thing is consistently true: the knowledge that matters most only comes from operating real systems. Here's how to learn DevOps in a way that actually sticks.
Stop Hardcoding Environment Variables in CI/CD Pipelines
Hardcoding variables in pipeline config is a bomb waiting to go off at scale. Here's how to manage environment-specific configuration properly across GitHub Actions, AWS Secrets Manager, and Terraform.
Why your Terraform modules are too big — and how to fix them
Most IaC problems aren't about syntax. They're about scope. A practical guide to module boundaries that actually scale with your team and survive production.
Kubernetes Troubleshooting: The Complete Production Guide
CrashLoopBackOff, Pending pods, OOMKilled, service connectivity failures — the complete kubectl diagnostic sequence for production Kubernetes clusters.
Blue-Green vs Canary: when each strategy actually makes sense
Both reduce deployment risk — but they're solving different problems. A breakdown with real EKS and ArgoCD examples from production.
GitHub Actions vs Jenkins: The Honest Comparison in 2026
Real tradeoffs between GitHub Actions and Jenkins — actual pipeline examples in both, the migration reality, and a decision framework for enterprise environments.
How we cut AWS spend by 20% without touching a single workload
S3 lifecycle policies, EC2 rightsizing, and Lambda-driven automation. The unglamorous work that actually saves money.
Prometheus and Grafana on EKS: Production Setup Guide
Complete guide to deploying kube-prometheus-stack on EKS — production values, EBS persistent storage, alerting that doesn't create noise, PromQL queries for daily operations.
Dockerfile Best Practices for Production
Most Dockerfiles work in development but create security risks and performance problems in production. Here are the practices that actually matter when your images run in a real cluster.
AWS Cost Optimization: How to Cut Cloud Spend by 20% Systematically
A systematic approach to reducing AWS costs — from tagging and right-sizing to S3 lifecycle policies and NAT Gateway audits. The same process that achieved 20% savings without touching production.
ArgoCD and GitOps: A Production Setup Guide
How to set up ArgoCD properly on EKS: repository structure, Application manifests, progressive delivery with Argo Rollouts, and the pitfalls that will bite you if you overlook them.
Ansible for Server Automation at Scale: A Practical Guide
How to use Ansible to manage 100+ servers reliably — project structure, idempotent tasks, automated patching with serial execution, and running it all from CI/CD.
Building a DevSecOps Pipeline: Security That Doesn't Slow Teams Down
How to integrate SAST, SCA, container scanning, and secret detection into your CI/CD pipeline in a way that actually gets used — not bypassed.