Why module design is a long-term infrastructure decision
Terraform modules written under deadline pressure become the infrastructure your team maintains for years. A module that's too large becomes a change-risk monolith. A module that's too granular creates orchestration overhead that costs more than it saves. Getting the boundaries right matters more than almost any other decision in an IaC codebase.
This guide is drawn from managing Terraform at scale at Optum — real module refactors, real state disasters, and the design patterns that actually survived production use.
The three failure patterns I see constantly
The monolith module
A single module that provisions VPC, subnets, security groups, EC2 instances, RDS, and IAM roles. Looks clean from the outside. In practice: a one-line security group change requires planning the entire stack, every apply touches unrelated resources, and state lock contention blocks the whole team.
The nano-module
A separate module for every single resource — a module for a security group rule, a module for a single IAM policy. Creates hundreds of module blocks in root configs, complex dependency graphs, and massive orchestration overhead for changes that should be trivial.
The leaky interface module
A module that exposes every internal resource ID as an output "just in case." Creates hidden dependencies between modules that make refactoring impossible without breaking everything that consumes the outputs.
The structure that actually works
Module boundaries should follow lifecycle and ownership, not resource type. Resources that always change together, owned by the same team, belong in the same module.
# Good module boundary: network layer
# Everything here changes together and is owned by platform team
modules/
  network/
    main.tf       # VPC, subnets, route tables, internet gateway
    variables.tf  # cidr_block, availability_zones, environment
    outputs.tf    # vpc_id, private_subnet_ids, public_subnet_ids ONLY
# Good module boundary: application cluster
modules/
  eks-cluster/
    main.tf       # EKS control plane, managed node groups, OIDC provider
    variables.tf  # cluster_version, node_instance_type, min/max nodes
    outputs.tf    # cluster_endpoint, cluster_name, node_role_arn
# Bad: mixing lifecycle boundaries
modules/
  everything/
    main.tf       # VPC + EKS + RDS + S3 + IAM — changes to RDS risk EKS
Variable design: the interface contract
Your module's variables.tf is a public API. Once other teams consume it, changing variable names or types is a breaking change. Design it carefully upfront.
# variables.tf — the right way
variable "environment" {
  description = "Deployment environment: dev, staging, prod"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod"
  }
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid CIDR block"
  }
}

variable "availability_zones" {
  description = "List of AZs to deploy subnets into"
  type        = list(string)

  validation {
    condition     = length(var.availability_zones) >= 2
    error_message = "At least 2 availability zones required for HA"
  }
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}
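From the consumer's side, the contract looks like a normal module call. A minimal sketch (the relative source path and tag values here are illustrative, not from the module above) — if any variable fails its validation block, the plan aborts immediately with the custom error message instead of failing deep inside AWS:

```hcl
module "network" {
  source = "../../modules/network" # illustrative path to the module

  environment        = "prod"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b"]

  tags = {
    team        = "platform" # hypothetical tag keys for illustration
    cost_center = "infra"
  }
}
```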
Output design: minimal surface area
Only export what downstream modules actually need. Every output you add becomes a potential hidden dependency.
# outputs.tf — expose only what consumers need
output "vpc_id" {
  description = "ID of the created VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of private subnets, one per AZ"
  value       = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  description = "IDs of public subnets, one per AZ"
  value       = aws_subnet.public[*].id
}
# Don't export: route table IDs, NACL IDs, individual AZ details
# unless a specific consumer has asked for them
State isolation strategy
State isolation is the most important operational decision in a Terraform codebase. The rule: resources that should never be destroyed together must have separate state files.
# Directory structure that enforces state isolation
infrastructure/
  environments/
    prod/
      network/    # state: s3://tf-state/prod/network.tfstate
        main.tf
        backend.tf
      eks/        # state: s3://tf-state/prod/eks.tfstate
        main.tf
        backend.tf
      databases/  # state: s3://tf-state/prod/databases.tfstate
        main.tf
        backend.tf
    staging/
      # mirrors prod structure
# backend.tf — S3 with DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "your-terraform-state-bucket"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}
# Create the DynamoDB lock table (do this once)
aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
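The state bucket itself also needs one-time setup. A sketch, assuming the same bucket name as the backend block above — enabling versioning matters because it gives you point-in-time recovery if a state file is ever corrupted or overwritten:

```shell
# One-time: create the state bucket and turn on versioning
aws s3api create-bucket \
  --bucket your-terraform-state-bucket \
  --region us-east-1

aws s3api put-bucket-versioning \
  --bucket your-terraform-state-bucket \
  --versioning-configuration Status=Enabled

# Move any existing local state into the new backend
terraform init -migrate-state
```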
Remote state data sources — safe cross-module references
# In eks/main.tf — safely reference network state outputs
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "your-terraform-state-bucket"
    key    = "prod/network.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  name     = "production"
  role_arn = aws_iam_role.eks_cluster.arn

  vpc_config {
    # Referencing network module outputs safely.
    # Note: vpc_config has no vpc_id argument — the VPC is implied
    # by the subnets you pass in.
    subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  }
}
Avoiding state drift — the most common production problem
State drift happens when someone modifies infrastructure outside of Terraform — manually in the AWS console, via CLI, or through another tool. The next terraform plan either shows unexpected changes or, worse, tries to "fix" the drift by destroying the manual change.
# Detect drift — run plan and look for unexpected changes
terraform plan -out=tfplan
terraform show -json tfplan | jq '.resource_changes[] | select(.change.actions != ["no-op"])'
# Import manually-created resources into state
terraform import aws_security_group.manually_created sg-0abc123def456
# Reconcile state with real AWS state without changing infrastructure
# ('terraform refresh' is deprecated in favor of refresh-only mode)
terraform apply -refresh-only
# Remove a resource from state without destroying it (careful)
terraform state rm aws_instance.accidentally_imported
Never run terraform apply on drift without reviewing the plan first. Terraform will try to reconcile drift by destroying the manual changes. Always plan first, understand every change in the output, then apply.
Terraform in CI/CD — the production pattern
# GitHub Actions pipeline pattern for safe Terraform deploys
name: Terraform
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure/environments/prod/network
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"
      - name: Terraform Init
        run: terraform init
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        # Plan runs on every event so the saved plan exists when apply needs it
        run: terraform plan -no-color -out=tfplan
      - name: Terraform Apply
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        # Applying a saved plan file needs no -auto-approve
        run: terraform apply tfplan
Module versioning with the Terraform Registry
# Pin module versions — never use latest in production
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.0" # pinned — safe
  # NOT: version = "~> 5.0" ← minor version drift can break things

  name = "production-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false # true costs less, false is HA
}
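The same pinning discipline applies to Terraform itself and to providers. A minimal sketch (the exact versions shown are illustrative — pin whatever your team has actually tested):

```hcl
terraform {
  # Allow patch releases within 1.7.x, nothing newer
  required_version = "~> 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.31.0" # exact pin; upgrades are deliberate PRs
    }
  }
}
```

Commit the generated .terraform.lock.hcl alongside this so every engineer and CI run resolves identical provider builds.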
The pre-merge checklist
Before merging any Terraform PR to main:
# 1. Format
terraform fmt -recursive
# 2. Validate
terraform validate
# 3. Security scan
tfsec .
checkov -d .
# 4. Cost estimate
infracost breakdown --path .
# 5. Plan reviewed by second engineer for:
# - Any destroy actions (red flag)
# - Any changes to shared networking resources
# - Any IAM changes
# - Resource count changes (scaling events)
Frequently asked questions
What's the difference between terraform destroy and removing a resource from config?
Both will delete the resource. terraform destroy destroys everything in the state file. Removing from config and running apply destroys just that resource. Use terraform state rm if you want to stop managing a resource without destroying it.
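On Terraform 1.7 and later there is also a declarative alternative to terraform state rm: a removed block, which records the "stop managing this" decision in code where it can be reviewed. A sketch, reusing the resource address from the CLI example above:

```hcl
# Terraform 1.7+: forget the resource without destroying it
removed {
  from = aws_instance.accidentally_imported

  lifecycle {
    destroy = false # drop from state only; the EC2 instance keeps running
  }
}
```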
How do I rename a resource without destroying and recreating it?
# Use terraform state mv to rename in state
terraform state mv aws_instance.old_name aws_instance.new_name
# Then update the resource block name in your config
# Next plan will show no changes if done correctly
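On Terraform 1.1 and later, a moved block does the same rename declaratively, which is safer for shared modules because every consumer picks up the migration automatically instead of each operator running state mv by hand:

```hcl
# Declarative equivalent of the state mv above
moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}
```

After the first successful apply records the move, the block can be deleted.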
How do I share outputs between workspaces?
Use terraform_remote_state data sources (shown above) or a parameter store like AWS SSM Parameter Store. Avoid direct workspace-to-workspace output references — they create coupling that makes isolated deploys impossible.
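The SSM approach looks like this in practice — a sketch, with an illustrative parameter path; the producer publishes the value, and consumers read it without needing access to the producer's state bucket at all:

```hcl
# Producer (network stack): publish the output
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/prod/network/vpc_id" # illustrative naming convention
  type  = "String"
  value = aws_vpc.main.id
}

# Consumer (any other stack): read it back
data "aws_ssm_parameter" "vpc_id" {
  name = "/prod/network/vpc_id"
}

# Reference as: data.aws_ssm_parameter.vpc_id.value
```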
When should I use count vs for_each?
Use for_each with maps or sets whenever possible. count uses index-based state keys — removing element 1 from a 5-element list shifts indices 2-4, causing Terraform to destroy and recreate them. for_each uses stable string keys that don't shift.
# count — fragile (index-based keys)
resource "aws_subnet" "private" {
  count      = length(var.private_cidrs)
  vpc_id     = var.vpc_id # required argument for aws_subnet
  cidr_block = var.private_cidrs[count.index]
}

# for_each — stable (string keys)
resource "aws_subnet" "private" {
  # zipmap already returns an az => cidr map, usable directly
  for_each          = zipmap(var.azs, var.private_cidrs)
  vpc_id            = var.vpc_id
  cidr_block        = each.value
  availability_zone = each.key
}