Why observability before scaling
Every time I've joined a team that's having reliability problems, the root cause is the same: they scaled before they could see. You can't fix what you can't measure, and you can't measure what you haven't instrumented. The Prometheus + Grafana stack on EKS gives you the observability foundation that makes everything else — incident response, capacity planning, performance optimisation — actually possible.
This guide covers the full production setup: installation, the dashboards that matter, alerting that doesn't create noise, and the operational patterns I use to keep observability stacks healthy long-term.
Architecture overview
Before running commands, understand what you're deploying:
# kube-prometheus-stack installs:
# - Prometheus Operator — manages Prometheus/Alertmanager as CRDs
# - Prometheus — time-series metrics collection
# - Alertmanager — alert routing and deduplication
# - Grafana — dashboards and visualisation
# - node-exporter — host-level metrics (DaemonSet)
# - kube-state-metrics — Kubernetes object state metrics
# - Various PrometheusRule CRDs — pre-built alerting rules
Prerequisites and EKS-specific setup
# Required: Helm 3, kubectl connected to your cluster
helm version --short
kubectl cluster-info
# EKS-specific: create dedicated namespace with proper labels
kubectl create namespace monitoring
kubectl label namespace monitoring \
  monitoring=prometheus \
  pod-security.kubernetes.io/enforce=privileged
# node-exporter needs hostNetwork/hostPath access on EKS, which is why the
# namespace above enforces the "privileged" Pod Security Standard.
# Verify your nodes carry the standard EKS labels (managed node groups set these):
kubectl get nodes -o json | jq '.items[].metadata.labels | keys[]' | grep -i instance
Production-grade installation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Don't install with defaults. Create a values file tuned for production:
# values-production.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "40GB"
    # EKS: use EBS for persistent storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    # Scrape all ServiceMonitors across namespaces
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
grafana:
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  adminPassword: "" # set via secret, not here
  # Grafana → AWS SSO or your IdP
  grafana.ini:
    server:
      root_url: "https://grafana.yourdomain.com"
    auth.generic_oauth:
      enabled: false # enable and configure for SSO
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 5Gi

# Disable components you don't need to reduce noise
kubeEtcd:
  enabled: false # EKS manages etcd; not accessible
kubeControllerManager:
  enabled: false # EKS manages the control plane
kubeScheduler:
  enabled: false # EKS manages the control plane
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-production.yaml \
  --version 56.6.2 # pin the chart version for reproducibility
Verify the installation
# All pods should reach Running state within 3-5 minutes
kubectl get pods -n monitoring -w
# Check Prometheus is scraping targets correctly
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets — all should show UP
# Check Alertmanager is running and connected
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Open http://localhost:9093 — verify no config errors
# Verify PVCs are bound
kubectl get pvc -n monitoring
Accessing Grafana securely in production
# Option 1: Port-forward (dev/debugging only)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Option 2: Ingress with TLS (production)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
spec:
  ingressClassName: alb # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
  - host: grafana.internal.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kube-prometheus-stack-grafana
            port:
              number: 80
The dashboards that actually matter
The kube-prometheus-stack ships with 30+ pre-built dashboards. Most engineers never open 80% of them. These are the ones I check daily and during incidents:
# Dashboard IDs to import from grafana.com/grafana/dashboards
# (Grafana → Dashboards → Import → enter ID)
15760 — Kubernetes cluster overview (nodes, pods, namespaces)
15761 — Kubernetes workloads (deployment health, replica counts)
6417 — Kubernetes cluster resource requests vs limits
11074 — Node exporter full (disk, network, CPU per node)
13659 — Kubernetes persistent volumes
3662 — Prometheus 2.0 stats (monitor your monitor)
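Beyond importing by ID, you can version-control dashboards alongside your manifests. The chart's Grafana sidecar (enabled by default) loads any ConfigMap carrying the grafana_dashboard label. A sketch — the ConfigMap name and dashboard JSON here are placeholders, not a real dashboard:

```yaml
# Hypothetical example: ship a custom dashboard as a ConfigMap.
# The Grafana sidecar picks it up without a restart.
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-api-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # the label the sidecar selects on
data:
  team-api.json: |
    { "title": "API service overview", "panels": [] }
```

This keeps dashboards reviewable in Git rather than hand-edited in the UI.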
Instrumenting your own applications
# Tell Prometheus to scrape your service with a ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
  - port: metrics # must match your Service port name
    path: /metrics
    interval: 30s
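The ServiceMonitor selects a Service by label, and `port: metrics` refers to the Service's named port, not a number. A sketch of a matching Service (names and port numbers are illustrative):

```yaml
# The Service the ServiceMonitor above would discover.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
  labels:
    app: api-service # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api-service # selects the application pods
  ports:
  - name: metrics    # the ServiceMonitor references this port *name*
    port: 9090
    targetPort: 9090
```

If targets don't appear in Prometheus, a label or port-name mismatch between these two objects is the most common cause.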
Alerting that doesn't cry wolf
Alert fatigue kills on-call rotations. Every alert that fires and isn't actionable trains your team to ignore alerts. These are the only alerts worth having initially:
# PrometheusRule — production-ready alert set
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # must match the chart's ruleSelector, or the rule is ignored
spec:
  groups:
  - name: critical
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        runbook_url: "https://wiki.internal/runbooks/crash-loop"
    - alert: NodeMemoryPressure
      expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} memory below 10%"
    - alert: PVCUsageHigh
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} at {{ $value | humanizePercentage }}"
    - alert: DeploymentReplicasMismatch
      # subtract rather than compare with != so $value is the number of missing replicas
      expr: kube_deployment_spec_replicas - kube_deployment_status_available_replicas > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Deployment {{ $labels.deployment }} has {{ $value }} missing replicas"
Alertmanager routing to Slack
# alertmanager-config.yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'slack-critical'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  - match:
      severity: critical
    receiver: 'slack-critical'
    continue: true
  - match:
      severity: warning
    receiver: 'slack-warnings'
receivers:
- name: 'slack-critical'
  slack_configs:
  - channel: '#alerts-critical'
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    send_resolved: true
- name: 'slack-warnings'
  slack_configs:
  - channel: '#alerts-warnings'
    send_resolved: true
# Apply the config
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
  --from-file=alertmanager.yaml=alertmanager-config.yaml \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -
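If you'd rather not overwrite the chart-managed secret, the Prometheus Operator also offers a namespaced AlertmanagerConfig CRD that gets merged into the running config. A sketch under assumptions — the names are mine, the fields follow the monitoring.coreos.com/v1alpha1 schema, and you should check your operator version supports it:

```yaml
# Alternative (sketch): team-owned routing via the AlertmanagerConfig CRD.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-routing
  namespace: production
spec:
  route:
    receiver: team-slack
    groupBy: ['alertname']
  receivers:
  - name: team-slack
    slackConfigs:
    - channel: '#team-alerts'
      apiURL:
        name: slack-webhook # Secret in the same namespace holding the webhook URL
        key: url
```

The advantage is that each team can manage its own routing without write access to the monitoring namespace.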
Useful PromQL queries for day-to-day operations
# CPU usage per pod as % of request
# (aggregate both sides by pod so the division matches one-to-one)
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) /
sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"}) * 100
# Memory usage per pod as % of limit
sum by (namespace, pod) (container_memory_working_set_bytes{container!=""}) /
sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"}) * 100
# Pods not running in a namespace
kube_pod_status_phase{phase!~"Running|Succeeded", namespace="production"}
# HTTP error rate for a service (if using Istio or exposing metrics)
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Node disk usage (%), excluding tmpfs/overlay pseudo-filesystems
(node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}) /
node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100
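Queries you run daily are good candidates for recording rules, which precompute the result on a schedule so dashboards load instantly. A sketch — the rule and metric names below are my own, following the conventional level:metric:operations naming:

```yaml
# Sketch: precompute the per-pod CPU query above as a recording rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
spec:
  groups:
  - name: pod-usage.rules
    interval: 1m
    rules:
    - record: namespace_pod:container_cpu_usage_seconds:rate5m
      expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```

Dashboards then query the recorded metric directly instead of re-evaluating the expensive expression on every panel refresh.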
Keeping the monitoring stack healthy
# Check Prometheus storage usage
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  df -h /prometheus
# Check scrape duration — slow scrapes indicate target performance issues
# PromQL: scrape_duration_seconds > 10
# Upgrade kube-prometheus-stack safely — preview the diff first
# (requires the helm-diff plugin; note the chart does not upgrade CRDs,
# apply those manually per the chart's release notes)
helm repo update
helm diff upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-production.yaml
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-production.yaml \
  --version <new-version>
Frequently asked questions
How much storage does Prometheus actually need?
Roughly: bytes_per_sample × samples_per_second × retention_seconds × 2. A medium EKS cluster scraping every 30 seconds with 15-day retention typically uses 20-50GB. Start with 50GB and monitor with the Prometheus self-monitoring dashboard.
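Plugging illustrative numbers into that formula — the series count, interval, and bytes-per-sample below are assumptions for the sake of the arithmetic, not measurements from any particular cluster:

```shell
# Back-of-envelope Prometheus disk sizing. All inputs are illustrative.
series=150000          # active time series
scrape_interval=30     # seconds between scrapes
bytes_per_sample=2     # typical with TSDB compression
retention_days=15

awk -v s="$series" -v i="$scrape_interval" -v b="$bytes_per_sample" -v d="$retention_days" \
  'BEGIN {
     samples_per_sec = s / i                        # ingestion rate
     bytes = samples_per_sec * b * d * 86400 * 2    # x2 safety margin
     printf "%.0f GiB\n", bytes / (1024 ^ 3)        # prints: 24 GiB
   }'
```

With these inputs the formula lands around 24 GiB, comfortably inside the 20-50GB range quoted above, which is why 50GB is a sane starting request.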
Prometheus is showing high memory usage — what do I do?
# Check cardinality — high cardinality is the #1 cause of Prometheus memory issues
# PromQL to find high-cardinality metrics:
topk(10, count by (__name__)({__name__=~".+"}))
# Drop high-cardinality labels you don't need via metric relabeling
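Once topk() has named the offender, the drop happens in the ServiceMonitor. A sketch — "session_id" here is a hypothetical high-cardinality label, substitute whatever your query surfaced:

```yaml
# Sketch: drop a high-cardinality label at scrape time via metricRelabelings.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
  - port: metrics
    metricRelabelings:
    - action: labeldrop
      regex: session_id # hypothetical offending label
```

Relabeling runs before samples are stored, so the cardinality (and the memory it costs) never enters the TSDB at all.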
How do I add a new data source to Grafana without restarting?
Use a Grafana DataSource provisioning ConfigMap — Grafana watches the provisioning directory and picks up changes without restart. Add new sources to the grafana.additionalDataSources section of your Helm values and run helm upgrade.
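A sketch of what that values fragment can look like — the CloudWatch source and region are illustrative, and the authType assumes the Grafana pod has an IAM role via IRSA:

```yaml
# Sketch: add a data source via Helm values, then run `helm upgrade`.
grafana:
  additionalDataSources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default    # use the pod's IAM role (IRSA) rather than static keys
      defaultRegion: us-east-1
```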
What's the difference between node-exporter and kube-state-metrics?
Node-exporter runs on each node and exposes hardware and OS metrics: CPU, memory, disk, network at the host level. Kube-state-metrics talks to the Kubernetes API and exposes object state: deployment replica counts, pod status, PVC binding status. You need both — they measure completely different things.