DevOps · Cloud · Platform Engineering

From the Trenches

Long-form writing on DevOps, Kubernetes, cloud infrastructure, and platform engineering. No fluff. No hot takes. Just the stuff that matters from 8 years of running production systems. By Gaurav Kaushal, Lead DevOps Engineer.

Years in production: 8+

Primary cloud: AWS

Daily driver: K8s

Articles published: 20

Writing

All Articles

20 posts covering DevOps, cloud infrastructure, and platform engineering.

Mar 2026 ● 5 min read

Monorepo vs Multi-repo: An Honest Take After Operating Both

This debate has been running for a decade without a clean answer, because there isn't one. Here are the real tradeoffs from someone who has operated both approaches at scale, and when each actually makes sense.

DevOps · CI/CD · Engineering

Feb 2026 ● 5 min read

Platform Engineering in 2026 — What's Actually Changed

Three years into the platform engineering wave, the hype has settled. Here's what's genuinely different in 2026 — IDPs, golden paths, the product team model — and what teams that did the work actually have to show for it.

Platform Engineering · DevOps · 2026

Feb 2026 ● 5 min read

HashiCorp Vault in Production: Secrets Management for Kubernetes

Kubernetes Secrets are base64 encoded, not encrypted. Here's how to set up Vault properly — Kubernetes auth, agent injection, secret rotation without restarts, and the audit trail regulated environments require.

Vault · Kubernetes · Security

Jan 2026 ● 5 min read

AI in DevOps: What's Actually Useful vs What's Still Hype in 2026

After 18 months of using AI tooling in production environments, here's the honest breakdown — where it saves real time, where the demos were better than the reality, and what's actually worth watching.

Jan 2026 ● 5 min read

EKS Upgrade Strategies: How to Upgrade Kubernetes Without Downtime

Control plane, node groups, add-ons: EKS upgrades have three moving parts, and a wrong step in any of them can cause a production incident. Here's the strategy that actually works, including blue-green cluster migration.

EKS · Kubernetes · AWS

May 2025 ● 5 min read

When Internal Tooling Becomes a Barrier: A DevOps Mindset Problem

A simple DNS change turned into a multi-hour process. This isn't just a tooling issue — it's a mindset issue. How we build internal tools matters as much as how we build external products.

Platform Engineering · DevOps · DX

Apr 2025 ● 5 min read

Cut AWS EC2 Costs by 60% with Instance Scheduling: A Complete Guide

Running non-production EC2 instances 24/7 is one of the most common, and most fixable, sources of AWS waste. Here's the Lambda + EventBridge scheduler that fixed it, with full Terraform and code.

AWS · FinOps · Lambda

Mar 2025 ● 5 min read

Hands-On Always Beats Theory: How to Actually Learn DevOps

After 8 years in infrastructure, one thing is consistently true: the knowledge that matters most only comes from operating real systems. Here's how to learn DevOps in a way that actually sticks.

DevOps · Kubernetes · Career

Feb 2025 ● 5 min read

Stop Hardcoding Environment Variables in CI/CD Pipelines

Hardcoding variables in pipeline config is a bomb waiting to go off at scale. Here's how to manage environment-specific configuration properly across GitHub Actions, AWS Secrets Manager, and Terraform.

CI/CD · Security · GitHub Actions

Jan 2025 ● 5 min read

Why your Terraform modules are too big — and how to fix them

Most IaC problems aren't about syntax. They're about scope. A practical guide to module boundaries that actually scale with your team and survive production.

Terraform · Infrastructure as Code

Jan 2025 ● 5 min read

Kubernetes Troubleshooting: The Complete Production Guide

CrashLoopBackOff, Pending pods, OOMKilled, service connectivity failures — the complete kubectl diagnostic sequence for production Kubernetes clusters.

Kubernetes · DevOps · Production

Dec 2024 ● 5 min read

Blue-Green vs Canary: when each strategy actually makes sense

Both reduce deployment risk — but they're solving different problems. A breakdown with real EKS and ArgoCD examples from production.

Kubernetes · CI/CD · ArgoCD

Dec 2024 ● 5 min read

GitHub Actions vs Jenkins: The Honest Comparison in 2026

Real tradeoffs between GitHub Actions and Jenkins — actual pipeline examples in both, the migration reality, and a decision framework for enterprise environments.

CI/CD · Jenkins · GitHub Actions

Nov 2024 ● 5 min read

How we cut AWS spend by 20% without touching a single workload

S3 lifecycle policies, EC2 rightsizing, and Lambda-driven automation. The unglamorous work that actually saves money.

Nov 2024 ● 5 min read

Prometheus and Grafana on EKS: Production Setup Guide

A complete guide to deploying kube-prometheus-stack on EKS: production values, EBS persistent storage, alerting that doesn't create noise, and PromQL queries for daily operations.

EKS · Kubernetes · Observability

Oct 2024 ● 5 min read

Dockerfile Best Practices for Production

Most Dockerfiles work in development but create security risks and performance problems in production. Here are the practices that actually matter when your images run in a real cluster.

Docker · Kubernetes · Security

Sep 2024 ● 5 min read

AWS Cost Optimization: How to Cut Cloud Spend by 20% Systematically

A systematic approach to reducing AWS costs — from tagging and right-sizing to S3 lifecycle policies and NAT Gateway audits. The same process that achieved 20% savings without touching production.

AWS · FinOps · Terraform

Aug 2024 ● 5 min read

ArgoCD and GitOps: A Production Setup Guide

How to set up ArgoCD properly on EKS: repository structure, Application manifests, progressive delivery with Argo Rollouts, and the mistakes that will bite you if you skip these steps.

ArgoCD · GitOps · Kubernetes

Jul 2024 ● 5 min read

Ansible for Server Automation at Scale: A Practical Guide

How to use Ansible to manage 100+ servers reliably — project structure, idempotent tasks, automated patching with serial execution, and running it all from CI/CD.

Ansible · IaC · Automation

Jun 2024 ● 5 min read

Building a DevSecOps Pipeline: Security That Doesn't Slow Teams Down

How to integrate SAST, SCA, container scanning, and secret detection into your CI/CD pipeline in a way that actually gets used — not bypassed.

DevSecOps · GitHub Actions · Security

Explore

Topics

AWS & Cloud: Architecture, cost, services

Kubernetes: EKS, Helm, GitOps

Infrastructure as Code: Terraform, Ansible

CI/CD: Pipelines, GitOps, ArgoCD

Observability: Prometheus, Grafana, ELK

DevSecOps: Security, compliance, IAM

FinOps: Cost governance, optimisation

Career & Teams: Mentorship, growth, process