The most common mistake I see with AWS cost optimisation is treating it like a one-time project. You run Cost Explorer, find the obvious waste, fix it, declare victory. Three months later the bill is back up.
Sustainable cost reduction is about building systems that continuously surface waste and eliminate it automatically. Here’s what we did to achieve a consistent 20% reduction.
Step 1: S3 lifecycle policies (the easiest win)
S3 is cheap per GB but the costs compound invisibly. Old snapshots, build artifacts, log archives — they accumulate without anyone noticing.
The fix is lifecycle policies that automatically transition objects to cheaper storage classes:
{
  "Rules": [
    {
      "ID": "artifacts-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "builds/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
Build artifacts older than 30 days are rarely accessed. AWS quotes Glacier Instant Retrieval as up to 68% cheaper than Standard-IA for data read about once a quarter, and the gap to Standard is wider still. For a team with significant CI/CD output, this alone can move the needle.
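To apply a policy like this from a script rather than the console, something along these lines could work. It is a minimal sketch, not our exact tooling: `build_artifact_rule` and `apply_lifecycle` are hypothetical helper names, the bucket name in the usage comment is a placeholder, and the S3 client is passed in so the logic can be tried without AWS credentials. The only AWS call used is boto3's `put_bucket_lifecycle_configuration`.

```python
def build_artifact_rule(rule_id: str, prefix: str, ia_days: int = 30,
                        glacier_days: int = 90, expire_days: int = 365) -> dict:
    """Build one lifecycle rule in the shape the S3 API expects."""
    # S3 does not allow a transition to STANDARD_IA before an object is 30 days old.
    if ia_days < 30:
        raise ValueError("STANDARD_IA transitions need at least 30 days")
    # Sanity check (our own, not an AWS rule): each step must come after the previous one.
    if not ia_days < glacier_days < expire_days:
        raise ValueError("expected ia_days < glacier_days < expire_days")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER_IR"},
        ],
        "Expiration": {"Days": expire_days},
    }


def apply_lifecycle(s3_client, bucket: str, rules: list) -> None:
    """Write the rules to the bucket.

    Note: this call replaces the bucket's entire lifecycle configuration,
    so pass every rule the bucket should keep, not only the new one.
    """
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


# Usage (placeholder bucket name, needs boto3 and credentials):
#   import boto3
#   rule = build_artifact_rule("artifacts-lifecycle", "builds/")
#   apply_lifecycle(boto3.client("s3"), "my-build-bucket", [rule])
```

Keeping the rule as data built by a function makes it easy to reuse the same policy across buckets and to review changes in version control.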
Step 2: EC2 rightsizing with actual data
The instinct is to provision generously and worry about downsizing later. “Later” never comes without a forcing function.
We used CloudWatch metrics to find instances with consistently low CPU and memory utilisation:
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch', region_name='ap-south-1')

def get_avg_cpu(instance_id: str, days: int = 14) -> float:
    """Return the mean of the daily average CPUUtilization over the window."""
    # Timezone-aware "now"; datetime.utcnow() is deprecated since Python 3.12
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=86400,  # daily aggregates
        Statistics=['Average'],
    )
    datapoints = response.get('Datapoints', [])
    if not datapoints:
        # No data (e.g. instance stopped for the whole window) is reported as 0.0
        return 0.0
    return sum(d['Average'] for d in datapoints) / len(datapoints)
Anything averaging below 15% CPU over 14 days was a rightsizing candidate. We moved from m5.xlarge to m5.large for several dev/staging instances — 50% cost reduction per instance.
Don’t rightsize based on CPU alone. Check memory with CloudWatch agent metrics and network throughput before downsizing. An instance that looks idle on CPU might be doing significant I/O.
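A memory check can sit next to the CPU check. The sketch below is illustrative and rests on several assumptions: the CloudWatch agent is installed and publishing `mem_used_percent` under the `CWAgent` namespace (its default on Linux), the agent is configured so that `InstanceId` is the only dimension on that metric, and the 40% memory threshold is a made-up starting point rather than a figure we measured. `get_avg_metric` and `is_rightsizing_candidate` are hypothetical helper names, and the CloudWatch client is passed in as an argument.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def get_avg_metric(cloudwatch, namespace: str, metric_name: str,
                   instance_id: str, days: int = 14) -> Optional[float]:
    """Mean of the daily averages for one metric, or None if there is no data."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        # Assumption: the metric is published with InstanceId as its only dimension.
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=86400,  # daily aggregates
        Statistics=['Average'],
    )
    datapoints = response.get('Datapoints', [])
    if not datapoints:
        return None  # distinguish "no data" from "idle"
    return sum(d['Average'] for d in datapoints) / len(datapoints)


def is_rightsizing_candidate(avg_cpu: Optional[float], avg_mem: Optional[float],
                             cpu_threshold: float = 15.0,
                             mem_threshold: float = 40.0) -> bool:
    """Flag an instance only when both CPU and memory are known and low."""
    if avg_cpu is None or avg_mem is None:
        return False  # missing data is not evidence of idleness
    return avg_cpu < cpu_threshold and avg_mem < mem_threshold


# Usage (needs boto3 and credentials; the instance ID is a placeholder):
#   cw = boto3.client('cloudwatch', region_name='ap-south-1')
#   cpu = get_avg_metric(cw, 'AWS/EC2', 'CPUUtilization', 'i-0123456789abcdef0')
#   mem = get_avg_metric(cw, 'CWAgent', 'mem_used_percent', 'i-0123456789abcdef0')
#   print(is_rightsizing_candidate(cpu, mem))
```

Returning `None` for missing data keeps an instance without the agent from being flagged by accident. The same helper could be pointed at `NetworkIn` and `NetworkOut` in `AWS/EC2` for the throughput check.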
Step 3: Lambda-driven automated cleanup
Manual cleanup doesn’t scale. We wrote Lambda functions triggered on a schedule to find and remove waste automatically:
- Unattached EBS volumes (provisioned, attached to a terminated instance, forgotten)
- Unused Elastic IPs (allocated but not associated with a running instance — $0.005/hr each, adds up)
- Old AMIs and their snapshots (pre-deploy baking process creating a new AMI on every deploy with no cleanup)
from datetime import datetime, timezone

def cleanup_unattached_volumes(ec2_client, dry_run: bool = True) -> list[str]:
    """Find (and optionally delete) EBS volumes in the 'available' state, i.e. unattached."""
    paginator = ec2_client.get_paginator('describe_volumes')
    volumes_to_delete = []
    for page in paginator.paginate(Filters=[{'Name': 'status', 'Values': ['available']}]):
        for vol in page['Volumes']:
            # Skip if tagged as intentionally unattached
            tags = {t['Key']: t['Value'] for t in vol.get('Tags', [])}
            if tags.get('KeepUnattached') == 'true':
                continue
            # Only delete volumes created more than 7 days ago. CreateTime is the
            # creation time; the API does not report when a volume was detached.
            create_time = vol['CreateTime']  # timezone-aware datetime from boto3
            age_days = (datetime.now(timezone.utc) - create_time).days
            if age_days > 7:
                volumes_to_delete.append(vol['VolumeId'])
                if not dry_run:
                    ec2_client.delete_volume(VolumeId=vol['VolumeId'])
    return volumes_to_delete
Always run with dry_run=True for the first few weeks to build confidence in the logic before enabling deletion.
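The Elastic IP cleanup can follow the same shape. This is a rough sketch rather than our production function: `release_unassociated_eips` is a hypothetical name, reusing the `KeepUnattached` tag as an opt-out is an assumption, and it only catches addresses with no association at all (an address attached to a stopped instance still has an `AssociationId` and would need a separate check). It uses the `describe_addresses` and `release_address` EC2 calls.

```python
def release_unassociated_eips(ec2_client, dry_run: bool = True) -> list[str]:
    """Find Elastic IPs with no association and optionally release them."""
    to_release = []
    response = ec2_client.describe_addresses()
    for addr in response.get('Addresses', []):
        # An associated address carries an AssociationId; skip those.
        if 'AssociationId' in addr:
            continue
        # Assumed opt-out tag, mirroring the volume cleanup.
        tags = {t['Key']: t['Value'] for t in addr.get('Tags', [])}
        if tags.get('KeepUnattached') == 'true':
            continue
        allocation_id = addr.get('AllocationId')
        if allocation_id is None:
            continue  # nothing we can release by allocation ID
        to_release.append(allocation_id)
        if not dry_run:
            ec2_client.release_address(AllocationId=allocation_id)
    return to_release


# Usage (needs boto3 and credentials):
#   import boto3
#   ec2 = boto3.client('ec2', region_name='ap-south-1')
#   print(release_unassociated_eips(ec2, dry_run=True))
```

Releasing an address is not reversible in general, since the same IP may not be available again, so the dry-run period matters here at least as much as it does for volumes.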
The numbers
After 90 days of running these three tracks in parallel:
- S3 storage costs down 35% (lifecycle policies + deletion of orphaned buckets)
- EC2 costs down 22% (rightsizing + reserved instance coverage for stable workloads)
- Miscellaneous waste (EBS, EIPs, old AMIs) eliminated: ~$180/month
- Overall bill: down ~20%
The reserved instance piece was the multiplier. Once we had accurate utilisation data from rightsizing, we could confidently buy 1-year standard RIs for baseline compute — saving an additional 40% on those instances versus on-demand.
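Why the utilisation data matters for this purchase comes down to simple arithmetic. A reserved instance is paid for across every hour of the term whether or not the instance runs, so at a 40% discount it only beats on-demand when the instance would otherwise run for more than 60% of the hours. The sketch below assumes a flat discount and ignores upfront-payment options; the hourly rate in the usage comment is a hypothetical figure, not an AWS price, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def on_demand_cost(hourly_rate: float, utilisation: float) -> float:
    """Yearly on-demand cost for an instance running `utilisation` (0..1) of the time."""
    return hourly_rate * HOURS_PER_YEAR * utilisation


def reserved_cost(hourly_rate: float, discount: float) -> float:
    """Yearly reserved cost: the discounted rate applies to every hour of the term."""
    return hourly_rate * (1 - discount) * HOURS_PER_YEAR


def breakeven_utilisation(discount: float) -> float:
    """Fraction of hours above which the reservation is cheaper than on-demand."""
    return 1 - discount


# Usage with a hypothetical $0.10/hour rate and the 40% discount from the text:
#   rate, discount = 0.10, 0.40
#   for utilisation in (0.5, 0.6, 1.0):
#       print(utilisation,
#             on_demand_cost(rate, utilisation),
#             reserved_cost(rate, discount))
```

A dev box that runs half the time stays cheaper on-demand, which is why only the stable baseline was reserved.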