Why observability before scaling
Every time I've joined a team that's having reliability problems, the root cause is the same: they scaled before they could see. You can't fix what you can't measure, and you can't measure what you haven't instrumented. The Prometheus + Grafana stack on EKS gives you the observability foundation that makes everything else — incident response, capacity planning, performance optimisation — actually possible.
This guide covers the full production setup: installation, the dashboards that matter, alerting that doesn't create noise, and the operational patterns I use to keep observability stacks healthy long-term.
Architecture overview
Before running commands, understand what you're deploying:
# kube-prometheus-stack installs:
# - Prometheus Operator — manages Prometheus/Alertmanager as CRDs
# - Prometheus — time-series metrics collection
# - Alertmanager — alert routing and deduplication
# - Grafana — dashboards and visualisation
# - node-exporter — host-level metrics (DaemonSet)
# - kube-state-metrics — Kubernetes object state metrics
# - Various PrometheusRule CRDs — pre-built alerting rules
Prerequisites and EKS-specific setup
# Required: Helm 3, kubectl connected to your cluster
helm version --short
kubectl cluster-info
# EKS-specific: create dedicated namespace with proper labels
kubectl create namespace monitoring
kubectl label namespace monitoring \
  monitoring=prometheus \
  pod-security.kubernetes.io/enforce=privileged
# node-exporter needs hostNetwork/hostPath access on EKS, which is why the
# namespace above enforces the "privileged" Pod Security Standard.
# Verify your nodes carry the standard EKS labels (managed node groups set these):
kubectl get nodes -o json | jq '.items[].metadata.labels | keys[]' | grep -i instance
Production-grade installation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Don't install with defaults. Create a values file tuned for production:
# values-production.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "40GB"
    # EKS: use EBS for persistent storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    # Scrape all ServiceMonitors across namespaces
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
grafana:
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  adminPassword: "" # set via secret, not here
  # Grafana → AWS SSO or your IdP
  grafana.ini:
    server:
      root_url: "https://grafana.yourdomain.com"
    auth.generic_oauth:
      enabled: false # enable and configure for SSO
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 5Gi

# Disable components you don't need to reduce noise
kubeEtcd:
  enabled: false # EKS manages etcd; not accessible
kubeControllerManager:
  enabled: false # EKS manages the control plane
kubeScheduler:
  enabled: false # EKS manages the control plane
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-production.yaml \
  --version 56.6.2 # pin the chart version for reproducibility
Verify the installation
# All pods should reach Running state within 3-5 minutes
kubectl get pods -n monitoring -w
# Check Prometheus is scraping targets correctly
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets — all should show UP
# Check Alertmanager is running and connected
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Open http://localhost:9093 — verify no config errors
# Verify PVCs are bound
kubectl get pvc -n monitoring
Accessing Grafana securely in production
# Option 1: Port-forward (dev/debugging only)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Option 2: Ingress with TLS (production)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
spec:
  ingressClassName: alb # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
  - host: grafana.internal.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kube-prometheus-stack-grafana
            port:
              number: 80
The dashboards that actually matter
The kube-prometheus-stack ships with 30+ pre-built dashboards. Most engineers never open 80% of them. These are the ones I check daily and during incidents:
# Dashboard IDs to import from grafana.com/grafana/dashboards
# (Grafana → Dashboards → Import → enter ID)
15760 — Kubernetes cluster overview (nodes, pods, namespaces)
15761 — Kubernetes workloads (deployment health, replica counts)
6417 — Kubernetes cluster resource requests vs limits
11074 — Node exporter full (disk, network, CPU per node)
13659 — Kubernetes persistent volumes
3662 — Prometheus 2.0 stats (monitor your monitor)
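Beyond importing by ID, you can version-control dashboards alongside your manifests. The chart's Grafana sidecar (enabled by default) loads any ConfigMap carrying the grafana_dashboard label. A sketch — the ConfigMap name and dashboard JSON here are placeholders, not a real dashboard:

```yaml
# Hypothetical example: ship a custom dashboard as a ConfigMap.
# The Grafana sidecar picks it up without a restart.
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-api-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # the label the sidecar selects on
data:
  team-api.json: |
    { "title": "API service overview", "panels": [] }
```

This keeps dashboards reviewable in Git rather than hand-edited in the UI.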
Instrumenting your own applications
# Tell Prometheus to scrape your service with a ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
  - port: metrics # must match your Service port name
    path: /metrics
    interval: 30s
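The ServiceMonitor selects a Service by label, and `port: metrics` refers to the Service's named port, not a number. A sketch of a matching Service (names and port numbers are illustrative):

```yaml
# The Service the ServiceMonitor above would discover.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
  labels:
    app: api-service # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api-service # selects the application pods
  ports:
  - name: metrics    # the ServiceMonitor references this port *name*
    port: 9090
    targetPort: 9090
```

If targets don't appear in Prometheus, a label or port-name mismatch between these two objects is the most common cause.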
Alerting that doesn't cry wolf
Alert fatigue kills on-call rotations. Every alert that fires and isn't actionable trains your team to ignore alerts. These are the only alerts worth having initially:
# PrometheusRule — production-ready alert set
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # must match the chart's ruleSelector, or the rule is ignored
spec:
  groups:
  - name: critical
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        runbook_url: "https://wiki.internal/runbooks/crash-loop"
    - alert: NodeMemoryPressure
      expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} memory below 10%"
    - alert: PVCUsageHigh
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} at {{ $value | humanizePercentage }}"
    - alert: DeploymentReplicasMismatch
      # subtract rather than compare with != so $value is the number of missing replicas
      expr: kube_deployment_spec_replicas - kube_deployment_status_available_replicas > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Deployment {{ $labels.deployment }} has {{ $value }} missing replicas"
Alertmanager routing to Slack
# alertmanager-config.yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'slack-critical'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  - match:
      severity: critical
    receiver: 'slack-critical'
    continue: true
  - match:
      severity: warning
    receiver: 'slack-warnings'
receivers:
- name: 'slack-critical'
  slack_configs:
  - channel: '#alerts-critical'
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    send_resolved: true
- name: 'slack-warnings'
  slack_configs:
  - channel: '#alerts-warnings'
    send_resolved: true
# Apply the config
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
  --from-file=alertmanager.yaml=alertmanager-config.yaml \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -
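If you'd rather not overwrite the chart-managed secret, the Prometheus Operator also offers a namespaced AlertmanagerConfig CRD that gets merged into the running config. A sketch under assumptions — the names are mine, the fields follow the monitoring.coreos.com/v1alpha1 schema, and you should check your operator version supports it:

```yaml
# Alternative (sketch): team-owned routing via the AlertmanagerConfig CRD.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-routing
  namespace: production
spec:
  route:
    receiver: team-slack
    groupBy: ['alertname']
  receivers:
  - name: team-slack
    slackConfigs:
    - channel: '#team-alerts'
      apiURL:
        name: slack-webhook # Secret in the same namespace holding the webhook URL
        key: url
```

The advantage is that each team can manage its own routing without write access to the monitoring namespace.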
Useful PromQL queries for day-to-day operations
# CPU usage per pod as % of request
# (aggregate both sides by pod so the division matches one-to-one)
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) /
sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"}) * 100
# Memory usage per pod as % of limit
sum by (namespace, pod) (container_memory_working_set_bytes{container!=""}) /
sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"}) * 100
# Pods not running in a namespace
kube_pod_status_phase{phase!~"Running|Succeeded", namespace="production"}
# HTTP error rate for a service (if using Istio or exposing metrics)
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Node disk usage (%), excluding tmpfs/overlay pseudo-filesystems
(node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}) /
node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100
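Queries you run daily are good candidates for recording rules, which precompute the result on a schedule so dashboards load instantly. A sketch — the rule and metric names below are my own, following the conventional level:metric:operations naming:

```yaml
# Sketch: precompute the per-pod CPU query above as a recording rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
spec:
  groups:
  - name: pod-usage.rules
    interval: 1m
    rules:
    - record: namespace_pod:container_cpu_usage_seconds:rate5m
      expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```

Dashboards then query the recorded metric directly instead of re-evaluating the expensive expression on every panel refresh.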
Keeping the monitoring stack healthy
# Check Prometheus storage usage
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  df -h /prometheus
# Check scrape duration — slow scrapes indicate target performance issues
# PromQL: scrape_duration_seconds > 10
# Upgrade kube-prometheus-stack safely — preview the diff first
# (requires the helm-diff plugin; note the chart does not upgrade CRDs,
# apply those manually per the chart's release notes)
helm repo update
helm diff upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-production.yaml
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values-production.yaml \
  --version <new-version>
Frequently asked questions
How much storage does Prometheus actually need?
Roughly: bytes_per_sample × samples_per_second × retention_seconds × 2. A medium EKS cluster scraping every 30 seconds with 15-day retention typically uses 20-50GB. Start with 50GB and monitor with the Prometheus self-monitoring dashboard.
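Plugging illustrative numbers into that formula — the series count, interval, and bytes-per-sample below are assumptions for the sake of the arithmetic, not measurements from any particular cluster:

```shell
# Back-of-envelope Prometheus disk sizing. All inputs are illustrative.
series=150000          # active time series
scrape_interval=30     # seconds between scrapes
bytes_per_sample=2     # typical with TSDB compression
retention_days=15

awk -v s="$series" -v i="$scrape_interval" -v b="$bytes_per_sample" -v d="$retention_days" \
  'BEGIN {
     samples_per_sec = s / i                        # ingestion rate
     bytes = samples_per_sec * b * d * 86400 * 2    # x2 safety margin
     printf "%.0f GiB\n", bytes / (1024 ^ 3)        # prints: 24 GiB
   }'
```

With these inputs the formula lands around 24 GiB, comfortably inside the 20-50GB range quoted above, which is why 50GB is a sane starting request.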
Prometheus is showing high memory usage — what do I do?
# Check cardinality — high cardinality is the #1 cause of Prometheus memory issues
# PromQL to find high-cardinality metrics:
topk(10, count by (__name__)({__name__=~".+"}))
# Drop high-cardinality labels you don't need via metric relabeling
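Once topk() has named the offender, the drop happens in the ServiceMonitor. A sketch — "session_id" here is a hypothetical high-cardinality label, substitute whatever your query surfaced:

```yaml
# Sketch: drop a high-cardinality label at scrape time via metricRelabelings.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
  - port: metrics
    metricRelabelings:
    - action: labeldrop
      regex: session_id # hypothetical offending label
```

Relabeling runs before samples are stored, so the cardinality (and the memory it costs) never enters the TSDB at all.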
How do I add a new data source to Grafana without restarting?
Use a Grafana DataSource provisioning ConfigMap — Grafana watches the provisioning directory and picks up changes without restart. Add new sources to the grafana.additionalDataSources section of your Helm values and run helm upgrade.
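A sketch of what that values fragment can look like — the CloudWatch source and region are illustrative, and the authType assumes the Grafana pod has an IAM role via IRSA:

```yaml
# Sketch: add a data source via Helm values, then run `helm upgrade`.
grafana:
  additionalDataSources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default    # use the pod's IAM role (IRSA) rather than static keys
      defaultRegion: us-east-1
```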
What's the difference between node-exporter and kube-state-metrics?
Node-exporter runs on each node and exposes hardware and OS metrics: CPU, memory, disk, network at the host level. Kube-state-metrics talks to the Kubernetes API and exposes object state: deployment replica counts, pod status, PVC binding status. You need both — they measure completely different things.