Why your Terraform modules are too big — and how to fix them

Most Terraform problems I’ve seen in production aren’t about syntax errors or provider quirks. They’re about scope — specifically, modules that have grown to own too much.

Here’s the pattern I’ve seen at least a dozen times: someone creates a vpc module that absorbs subnets, route tables, and NAT gateways, then slowly accumulates security groups, flow logs, and VPC endpoints. Six months later, the module is 2,000 lines, and changing anything means a 40-minute plan whose output nobody reads carefully anymore.

What a module boundary actually is

A module boundary should answer one question cleanly: what does this unit of infrastructure own, and what does it depend on?

When a module owns too much, the answer becomes “everything related to networking.” That’s not ownership; that’s sprawl.

The test I use: can you destroy and recreate this module without touching anything that owns data or serves live traffic? If the answer is “it depends,” the boundary is wrong.

The three failure modes

1. The god module

Everything in one folder. One main.tf with 1,500 lines. The blast radius of any change is enormous because every resource shares one state and one plan.

# This is the god module pattern
module "everything" {
  source = "./infra"
  
  # VPC config
  vpc_cidr = "10.0.0.0/16"
  
  # RDS config
  db_instance_class = "db.t3.medium"
  
  # EKS config
  cluster_version = "1.28"
  
  # And so on for 40 more variables...
}

2. The deeply nested module

Modules calling modules calling modules. Three levels deep and you can’t tell which module creates which resource without following a trail through four files.

# networking/main.tf calls vpc/main.tf which calls subnet/main.tf
# Debugging this is a nightmare
module "networking" {
  source = "../networking"
  # ...
}
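
One way out of the nesting, sketched under assumptions: keep the module tree one level deep and wire sibling modules together in the root. The module paths, variable names, and the vpc_id output here are hypothetical, not taken from a real codebase.

```hcl
# Root module: every module call is visible in one file.
# Paths, inputs, and outputs are illustrative assumptions.
module "vpc" {
  source = "./modules/vpc"

  cidr_block = "10.0.0.0/16"
}

module "subnets" {
  source = "./modules/subnets"

  # Passed in explicitly instead of the vpc module calling subnets itself.
  vpc_id       = module.vpc.vpc_id
  subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24"]
}
```

Every resource address is then at most one module deep, so a line in the plan points straight at the folder that defines it.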

3. The cross-concern module

A module that mixes stateful and stateless resources. Databases and application infrastructure in the same plan means a database replacement can happen during an app rollout.

Rule of thumb: Stateful resources (RDS, S3 with data, ElastiCache) should never share a module with stateless compute resources (EC2 ASGs, ECS services, Lambda).
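
Separate modules keep the two kinds of resources out of each other’s plans. As an extra guard inside the stateful layer, a sketch like the following makes an accidental replacement fail at plan time instead of at 2 a.m. The resource name, identifier, engine, and sizes are hypothetical; prevent_destroy and deletion_protection are standard Terraform and AWS provider settings.

```hcl
# data/main.tf (hypothetical): a database that refuses to be destroyed by a plan.
resource "aws_db_instance" "main" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 50

  username                    = "app"
  manage_master_user_password = true # let RDS keep the password in Secrets Manager

  deletion_protection = true # AWS-side guard against deletion

  lifecycle {
    prevent_destroy = true # Terraform errors on any plan that would destroy this
  }
}
```

Note that prevent_destroy lives in the resource block, so it only protects the database while that block is still in the configuration.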

What good boundaries look like

Here’s the structure I settled on after several migrations:

infra/
  foundation/       # VPC, subnets, IGW, NAT — changes rarely
  platform/         # EKS cluster, IAM roles — changes quarterly  
  data/             # RDS, ElastiCache, S3 buckets — changes carefully
  services/         # App-specific ECS/Lambda — changes frequently
    api/
    worker/

Each layer depends only on outputs from the layers below it, never on their internals. The services layer never reaches into foundation directly; anything it needs from there, such as subnet IDs, is re-exported by platform.

# services/api/main.tf
data "terraform_remote_state" "platform" {
  backend = "s3"
  config = {
    bucket = "my-tfstate"
    key    = "platform/terraform.tfstate"
    region = "ap-south-1"
  }
}

resource "aws_ecs_service" "api" {
  cluster = data.terraform_remote_state.platform.outputs.ecs_cluster_arn
  # ...
}
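
For that lookup to work, the platform layer has to write its state to the same bucket and key and publish the value as an output. A minimal sketch of the producing side, assuming platform also owns an ECS cluster; the resource name main and the cluster name are hypothetical.

```hcl
# platform/main.tf (sketch)
terraform {
  backend "s3" {
    bucket = "my-tfstate"
    key    = "platform/terraform.tfstate" # must match the key services/api reads
    region = "ap-south-1"
  }
}

resource "aws_ecs_cluster" "main" {
  name = "platform"
}

# Only root-module outputs are visible through terraform_remote_state.
output "ecs_cluster_arn" {
  value = aws_ecs_cluster.main.arn
}
```

Outputs like this one are the layer’s public interface: renaming or removing one breaks every consumer, so treat them with the same care as an API.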

The practical migration

If you’re staring at a 2,000-line module right now, don’t try to refactor it all at once. The approach that’s worked for me:

Identify state boundaries first. Use terraform state list and group resources by what they logically own. Resources that share a lifecycle belong together.
Extract leaves first. Find resources with no other resources depending on them. Move those out first — lowest risk.
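Use moved blocks. Since Terraform 1.1, a moved block tells Terraform that a resource has changed address, for example into a child module, so it is not destroyed and recreated. A moved block only works within a single state; moving a resource into a layer with its own state needs terraform state mv or removed and import blocks instead.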

# In the root module that declares module.api_service (not inside the new module), add a moved block
moved {
  from = aws_security_group.api
  to   = module.api_service.aws_security_group.api
}

Never refactor and change config simultaneously. One PR moves the resource. A separate PR changes its configuration.

Warning: Always run terraform plan after adding moved blocks and verify zero destroys before applying. A missing or mistyped moved block makes Terraform plan to destroy the resource at its old address and create a new one at the new address.
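
Moved blocks stop at the edge of a state file. If the security group is headed for a layer with its own state, one option is a removed block in the old configuration plus an import block in the new one. This is a sketch assuming Terraform 1.7 or later (removed blocks arrived in 1.7, import blocks in 1.5); the security group ID, the vpc_id variable, and the resource arguments are placeholders.

```hcl
# --- Old configuration ---
# Forget the resource without destroying it.
# The original resource block must be deleted from this configuration.
removed {
  from = aws_security_group.api

  lifecycle {
    destroy = false
  }
}

# --- New configuration (for example services/api) ---
variable "vpc_id" {
  type = string
}

# Adopt the existing security group into this state.
import {
  to = aws_security_group.api
  id = "sg-0123456789abcdef0" # placeholder ID
}

resource "aws_security_group" "api" {
  name   = "api"      # must match the real group, or the plan shows a replacement
  vpc_id = var.vpc_id # same here
}
```

Apply the old configuration first so the group is never tracked by two states at once, then plan the new one and expect an import with no other changes.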

The payoff

The teams I’ve worked with who’ve done this migration report the same outcomes: plans become readable again, CI runs faster because each layer has its own state and can plan in parallel, and on-call gets less stressful because the blast radius of any change is predictable.

The goal isn’t small modules for the sake of it. The goal is modules where you can answer “what does this own?” in one sentence.