Why module design is a long-term infrastructure decision
Terraform modules written under deadline pressure become the infrastructure your team maintains for years. A module that's too large becomes a change-risk monolith. A module that's too granular creates orchestration overhead that costs more than it saves. Getting the boundaries right matters more than almost any other decision in an IaC codebase.
This guide is drawn from managing Terraform at scale at Optum — real module refactors, real state disasters, and the design patterns that actually survived production use.
The three failure patterns I see constantly
The monolith module
A single module that provisions VPC, subnets, security groups, EC2 instances, RDS, and IAM roles. Looks clean from the outside. In practice: a one-line security group change requires planning the entire stack, every apply touches unrelated resources, and state lock contention blocks the whole team.
The nano-module
A separate module for every single resource — a module for a security group rule, a module for a single IAM policy. Creates hundreds of module blocks in root configs, complex dependency graphs, and massive orchestration overhead for changes that should be trivial.
The leaky interface module
A module that exposes every internal resource ID as an output "just in case." Creates hidden dependencies between modules that make refactoring impossible without breaking everything that consumes the outputs.
The structure that actually works
Module boundaries should follow lifecycle and ownership, not resource type. Resources that always change together, owned by the same team, belong in the same module.
# Good module boundary: network layer
# Everything here changes together and is owned by platform team
modules/
  network/
    main.tf       # VPC, subnets, route tables, internet gateway
    variables.tf  # cidr_block, availability_zones, environment
    outputs.tf    # vpc_id, private_subnet_ids, public_subnet_ids ONLY
# Good module boundary: application cluster
modules/
  eks-cluster/
    main.tf       # EKS control plane, managed node groups, OIDC provider
    variables.tf  # cluster_version, node_instance_type, min/max nodes
    outputs.tf    # cluster_endpoint, cluster_name, node_role_arn
# Bad: mixing lifecycle boundaries
modules/
  everything/
    main.tf       # VPC + EKS + RDS + S3 + IAM — changes to RDS risk EKS
Variable design: the interface contract
Your module's variables.tf is a public API. Once other teams consume it, changing variable names or types is a breaking change. Design it carefully upfront.
# variables.tf — the right way
variable "environment" {
  description = "Deployment environment: dev, staging, prod"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod"
  }
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid CIDR block"
  }
}

variable "availability_zones" {
  description = "List of AZs to deploy subnets into"
  type        = list(string)

  validation {
    condition     = length(var.availability_zones) >= 2
    error_message = "At least 2 availability zones required for HA"
  }
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}
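From the consumer's side, the contract looks like a normal module call. A minimal sketch (the relative source path and tag values here are illustrative, not from the module above) — if any variable fails its validation block, the plan aborts immediately with the custom error message instead of failing deep inside AWS:

```hcl
module "network" {
  source = "../../modules/network" # illustrative path to the module

  environment        = "prod"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b"]

  tags = {
    team        = "platform" # hypothetical tag keys for illustration
    cost_center = "infra"
  }
}
```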
Output design: minimal surface area
Only export what downstream modules actually need. Every output you add becomes a potential hidden dependency.
# outputs.tf — expose only what consumers need
output "vpc_id" {
  description = "ID of the created VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of private subnets, one per AZ"
  value       = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  description = "IDs of public subnets, one per AZ"
  value       = aws_subnet.public[*].id
}
# Don't export: route table IDs, NACL IDs, individual AZ details
# unless a specific consumer has asked for them
State isolation strategy
State isolation is the most important operational decision in a Terraform codebase. The rule: resources that should never be destroyed together must have separate state files.
# Directory structure that enforces state isolation
infrastructure/
  environments/
    prod/
      network/    # state: s3://tf-state/prod/network.tfstate
        main.tf
        backend.tf
      eks/        # state: s3://tf-state/prod/eks.tfstate
        main.tf
        backend.tf
      databases/  # state: s3://tf-state/prod/databases.tfstate
        main.tf
        backend.tf
    staging/
      # mirrors prod structure
# backend.tf — S3 with DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "your-terraform-state-bucket"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}
# Create the DynamoDB lock table (do this once)
aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
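The state bucket itself also needs one-time setup. A sketch, assuming the same bucket name as the backend block above — enabling versioning matters because it gives you point-in-time recovery if a state file is ever corrupted or overwritten:

```shell
# One-time: create the state bucket and turn on versioning
aws s3api create-bucket \
  --bucket your-terraform-state-bucket \
  --region us-east-1

aws s3api put-bucket-versioning \
  --bucket your-terraform-state-bucket \
  --versioning-configuration Status=Enabled

# Move any existing local state into the new backend
terraform init -migrate-state
```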
Remote state data sources — safe cross-module references
# In eks/main.tf — safely reference network state outputs
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "your-terraform-state-bucket"
    key    = "prod/network.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  name     = "production"
  role_arn = aws_iam_role.eks_cluster.arn

  vpc_config {
    # Referencing network module outputs safely.
    # Note: vpc_config has no vpc_id argument — the VPC is implied
    # by the subnets you pass in.
    subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  }
}
Avoiding state drift — the most common production problem
State drift happens when someone modifies infrastructure outside of Terraform — manually in the AWS console, via CLI, or through another tool. The next terraform plan either shows unexpected changes or, worse, tries to "fix" the drift by destroying the manual change.
# Detect drift — run plan and look for unexpected changes
terraform plan -out=tfplan
terraform show -json tfplan | jq '.resource_changes[] | select(.change.actions != ["no-op"])'
# Import manually-created resources into state
terraform import aws_security_group.manually_created sg-0abc123def456
# Reconcile state with real AWS state without changing infrastructure
# ('terraform refresh' is deprecated in favor of refresh-only mode)
terraform apply -refresh-only
# Remove a resource from state without destroying it (careful)
terraform state rm aws_instance.accidentally_imported
Never run terraform apply on drift without reviewing the plan first. Terraform will try to reconcile drift by destroying the manual changes. Always plan first, understand every change in the output, then apply.
Terraform in CI/CD — the production pattern
# GitHub Actions pipeline pattern for safe Terraform deploys
name: Terraform
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure/environments/prod/network
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"
      - name: Terraform Init
        run: terraform init
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        # Plan runs on every event so the saved plan exists when apply needs it
        run: terraform plan -no-color -out=tfplan
      - name: Terraform Apply
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        # Applying a saved plan file needs no -auto-approve
        run: terraform apply tfplan
Module versioning with the Terraform Registry
# Pin module versions — never use latest in production
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.0" # pinned — safe
  # NOT: version = "~> 5.0" ← minor version drift can break things

  name = "production-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false # true costs less, false is HA
}
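The same pinning discipline applies to Terraform itself and to providers. A minimal sketch (the exact versions shown are illustrative — pin whatever your team has actually tested):

```hcl
terraform {
  # Allow patch releases within 1.7.x, nothing newer
  required_version = "~> 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.31.0" # exact pin; upgrades are deliberate PRs
    }
  }
}
```

Commit the generated .terraform.lock.hcl alongside this so every engineer and CI run resolves identical provider builds.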
The pre-merge checklist
Before merging any Terraform PR to main:
# 1. Format
terraform fmt -recursive
# 2. Validate
terraform validate
# 3. Security scan
tfsec .
checkov -d .
# 4. Cost estimate
infracost breakdown --path .
# 5. Plan reviewed by second engineer for:
# - Any destroy actions (red flag)
# - Any changes to shared networking resources
# - Any IAM changes
# - Resource count changes (scaling events)
Frequently asked questions
What's the difference between terraform destroy and removing a resource from config?
Both will delete the resource. terraform destroy destroys everything in the state file. Removing from config and running apply destroys just that resource. Use terraform state rm if you want to stop managing a resource without destroying it.
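On Terraform 1.7 and later there is also a declarative alternative to terraform state rm: a removed block, which records the "stop managing this" decision in code where it can be reviewed. A sketch, reusing the resource address from the CLI example above:

```hcl
# Terraform 1.7+: forget the resource without destroying it
removed {
  from = aws_instance.accidentally_imported

  lifecycle {
    destroy = false # drop from state only; the EC2 instance keeps running
  }
}
```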
How do I rename a resource without destroying and recreating it?
# Use terraform state mv to rename in state
terraform state mv aws_instance.old_name aws_instance.new_name
# Then update the resource block name in your config
# Next plan will show no changes if done correctly
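On Terraform 1.1 and later, a moved block does the same rename declaratively, which is safer for shared modules because every consumer picks up the migration automatically instead of each operator running state mv by hand:

```hcl
# Declarative equivalent of the state mv above
moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}
```

After the first successful apply records the move, the block can be deleted.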
How do I share outputs between workspaces?
Use terraform_remote_state data sources (shown above) or a parameter store like AWS SSM Parameter Store. Avoid direct workspace-to-workspace output references — they create coupling that makes isolated deploys impossible.
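The SSM approach looks like this in practice — a sketch, with an illustrative parameter path; the producer publishes the value, and consumers read it without needing access to the producer's state bucket at all:

```hcl
# Producer (network stack): publish the output
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/prod/network/vpc_id" # illustrative naming convention
  type  = "String"
  value = aws_vpc.main.id
}

# Consumer (any other stack): read it back
data "aws_ssm_parameter" "vpc_id" {
  name = "/prod/network/vpc_id"
}

# Reference as: data.aws_ssm_parameter.vpc_id.value
```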
When should I use count vs for_each?
Use for_each with maps or sets whenever possible. count uses index-based state keys — removing element 1 from a 5-element list shifts indices 2-4, causing Terraform to destroy and recreate them. for_each uses stable string keys that don't shift.
# count — fragile (index-based keys)
resource "aws_subnet" "private" {
  count      = length(var.private_cidrs)
  vpc_id     = var.vpc_id # required argument for aws_subnet
  cidr_block = var.private_cidrs[count.index]
}

# for_each — stable (string keys)
resource "aws_subnet" "private" {
  # zipmap already returns an az => cidr map, usable directly
  for_each          = zipmap(var.azs, var.private_cidrs)
  vpc_id            = var.vpc_id
  cidr_block        = each.value
  availability_zone = each.key
}