The problem hiding in plain sight
When I started working with one client on their AWS cost optimisation, the monthly bill was significantly higher than it should have been — and nobody had a clear picture of where the money was going. After enabling Cost Explorer with proper tag filtering, the answer was immediately obvious: non-production EC2 instances were running 24 hours a day, 7 days a week. Instances that developers used during business hours were sitting idle every night, every weekend, every public holiday — fully running, fully charged.
The fix wasn't complicated. But the savings were significant — over 60% reduction in EC2 costs for non-production environments, with zero impact on any production workload or developer workflow.
Why this happens at most organisations
Non-production environments get created quickly — someone needs a dev or staging environment, it gets spun up, work gets done. What rarely happens is someone sitting down to configure lifecycle management for that environment. The instance runs, the work continues, and nobody notices the idle time accumulating on the bill.
At scale, this adds up fast. A t3.large running 24/7 costs roughly $60/month. The same instance running 10 hours a day on weekdays costs about $18/month. Across 10 non-production instances, that's $420/month in waste — $5,040/year — for compute that nobody is using.
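The arithmetic behind those numbers can be sketched in a few lines. The hourly rate below is an approximate us-east-1 on-demand price for a t3.large; the exact figure varies by region:

```python
HOURLY = 0.0832          # approx us-east-1 on-demand t3.large rate (assumption)
HOURS_PER_MONTH = 730    # average hours in a month

always_on = HOURLY * HOURS_PER_MONTH          # ~ $60.7/month, running 24/7
weekday_hours = 10 * 5 * 52 / 12              # ~ 217 hours/month at 10h weekdays
scheduled = HOURLY * weekday_hours            # ~ $18/month on the schedule
fleet_savings = (always_on - scheduled) * 10  # across 10 instances

print(round(always_on, 2), round(scheduled, 2), round(fleet_savings, 2))
```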
The solution: Lambda + EventBridge scheduler
The architecture is straightforward: EventBridge triggers a Lambda function on a schedule. Lambda checks for instances tagged with AutoStop: true and stops or starts them based on the time. No third-party tools, no agents on the instances, no changes to how developers use them.
# Tag your non-production instances
aws ec2 create-tags --resources i-xxxxxxxxxxxxxxxxx --tags Key=AutoStop,Value=true Key=Environment,Value=dev
The Lambda function
import boto3
import os

def handler(event, context):
    ec2 = boto3.client('ec2', region_name=os.environ['AWS_REGION'])
    action = event.get('action', 'stop')  # 'stop' or 'start'

    # Find instances tagged AutoStop=true, in the state the action applies to
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:AutoStop', 'Values': ['true']},
            {'Name': 'instance-state-name',
             'Values': ['running'] if action == 'stop' else ['stopped']},
        ]
    )

    instance_ids = [
        i['InstanceId']
        for r in response['Reservations']
        for i in r['Instances']
    ]

    if not instance_ids:
        return {'message': 'No instances to act on', 'action': action}

    if action == 'stop':
        ec2.stop_instances(InstanceIds=instance_ids)
    else:
        ec2.start_instances(InstanceIds=instance_ids)

    print(f"{action}: {instance_ids}")
    return {'actioned': instance_ids, 'action': action}
EventBridge rules — the schedule
# Terraform: stop at 8pm, start at 8am — weekdays only
resource "aws_cloudwatch_event_rule" "stop_instances" {
  name                = "stop-dev-instances"
  schedule_expression = "cron(0 20 ? * MON-FRI *)" # 8pm UTC weekdays
  description         = "Stop non-prod instances after business hours"
}

resource "aws_cloudwatch_event_rule" "start_instances" {
  name                = "start-dev-instances"
  schedule_expression = "cron(0 8 ? * MON-FRI *)" # 8am UTC weekdays
  description         = "Start non-prod instances for business hours"
}

resource "aws_cloudwatch_event_target" "stop_target" {
  rule      = aws_cloudwatch_event_rule.stop_instances.name
  target_id = "StopInstances"
  arn       = aws_lambda_function.scheduler.arn
  input     = jsonencode({ action = "stop" })
}

resource "aws_cloudwatch_event_target" "start_target" {
  rule      = aws_cloudwatch_event_rule.start_instances.name
  target_id = "StartInstances"
  arn       = aws_lambda_function.scheduler.arn
  input     = jsonencode({ action = "start" })
}

# EventBridge needs explicit permission to invoke the Lambda
resource "aws_lambda_permission" "allow_stop_rule" {
  statement_id  = "AllowStopRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_instances.arn
}

resource "aws_lambda_permission" "allow_start_rule" {
  statement_id  = "AllowStartRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_instances.arn
}
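One subtlety worth checking: EventBridge cron expressions run in UTC, so "8pm" drifts by an hour whenever daylight saving starts or ends. A quick sketch with Python's standard zoneinfo module shows the effect ("Europe/London" here is an illustrative timezone, not something from the setup above):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def utc_hour_for(local_hour: int, tz: str, month: int) -> int:
    """UTC hour that a given local wall-clock hour falls on, mid-month."""
    local = datetime(2024, month, 15, local_hour, tzinfo=ZoneInfo(tz))
    return local.astimezone(ZoneInfo("UTC")).hour

# 8pm in London maps to different UTC hours in winter vs summer (DST)
print(utc_hour_for(20, "Europe/London", 1))  # 20 (GMT, winter)
print(utc_hour_for(20, "Europe/London", 7))  # 19 (BST, summer)
```

If an hour of drift matters for your team, pick a UTC hour that is acceptable year-round, or use EventBridge Scheduler, which supports timezone-aware schedules.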
Manual override for critical testing
The first thing developers ask when you implement this: "What if I'm doing a late deployment or need the instance outside business hours?" The answer is a simple tag override:
# Exclude a specific instance from auto-stop (it stays excluded until you flip the tag back)
aws ec2 create-tags --resources i-xxxxxxxxxxxxxxxxx --tags Key=AutoStop,Value=false
# Re-enable it when done
aws ec2 create-tags --resources i-xxxxxxxxxxxxxxxxx --tags Key=AutoStop,Value=true
The Lambda function only acts on instances where AutoStop=true, so flipping the tag is all it takes. No pipeline changes, no exceptions list to maintain.
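The selection rule is easy to reason about in isolation. Here is a minimal in-memory sketch of it — the instance dicts are illustrative, not the boto3 response shape, but the logic mirrors the server-side filters the Lambda uses:

```python
def eligible(instances, action):
    """Select instances tagged AutoStop='true' that are in the state
    the action applies to (running for stop, stopped for start)."""
    want_state = "running" if action == "stop" else "stopped"
    return [
        i["id"] for i in instances
        if i["tags"].get("AutoStop") == "true" and i["state"] == want_state
    ]

fleet = [
    {"id": "i-aaa", "tags": {"AutoStop": "true"},  "state": "running"},
    {"id": "i-bbb", "tags": {"AutoStop": "false"}, "state": "running"},  # overridden
    {"id": "i-ccc", "tags": {},                    "state": "running"},  # untagged
]
print(eligible(fleet, "stop"))  # ['i-aaa']
```

Both the overridden instance and the untagged one fall through untouched, which is exactly the behaviour the override relies on.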
What else I found while investigating
The scheduler was the main fix, but the cost investigation surfaced three other silent budget drainers that are common across AWS accounts:
- Orphaned EBS volumes — volumes that persist after their EC2 instance was terminated, charged at full price for storing nothing useful. Find them with aws ec2 describe-volumes --filters Name=status,Values=available.
- Unused Elastic IPs — AWS charges for EIPs not associated with a running instance. Find them with aws ec2 describe-addresses --filters Name=domain,Values=vpc and check the AssociationId field.
- Oversized RDS instances — dev databases running on db.r5.2xlarge because that's what production uses. A db.t3.medium handles development workloads at a fraction of the cost.
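For a rough sense of what orphaned volumes cost, the arithmetic is just size times the per-GB storage rate. The rate and volume sizes below are illustrative assumptions (approximate us-east-1 gp3 pricing; your region and volume types will differ):

```python
GP3_PER_GB_MONTH = 0.08  # approx us-east-1 gp3 rate (assumption; varies by region/type)

orphaned_gib = [100, 500, 50]  # example: sizes of 'available' volumes found above
monthly_waste = sum(orphaned_gib) * GP3_PER_GB_MONTH
print(f"${monthly_waste:.2f}/month")  # $52.00/month
```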
The required IAM policy for the Lambda
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StartInstances",
        "ec2:StopInstances"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/Environment": ["dev", "staging"]
        }
      }
    }
  ]
}
Scope the start and stop actions with a condition on the Environment tag. This ensures the Lambda can only start and stop non-production instances — it cannot accidentally touch production even if the tag logic has a bug. Note that ec2:DescribeInstances does not support resource-level conditions, which is why it sits in its own unconditioned statement; putting the tag condition on it would make the describe call fail.