A note on this story: The numbers below are a composite of 9 EC2 right-sizing engagements we've run on fleets sized 80–2,400 instances. Mix of EKS worker nodes, ECS hosts, and standalone application servers across us-east-1 / eu-west-1 / ap-south-1.
A CTO sent us this Slack DM last quarter:
"Our AWS bill is $1.4M/year. EC2 alone is $720K. I've already done RIs, Spot for stateless, and Compute Optimizer keeps telling me 'not enough data'. I'm out of ideas."
Here's what we found in the first 4 hours of the audit:
- Median CPU utilization across the fleet: 14.2% (over 30 days)
- 96 instances averaging <5% CPU for 90 days straight
- 41 m5.4xlarge running workloads that fit on m7i.xlarge
- Annualized waste: ~$203K
By day 21, we'd cut their EC2 bill from $720K to $474K/year. Zero performance regressions. No engineers complained. No incidents.
This is the playbook.
Why right-sizing beats every other AWS lever
Most teams chase Reserved Instances first because the savings look big in the calculator (40-60% off list). But here's the trap: RIs lock you into the wrong size for 1-3 years. If you over-provisioned by 2x and then buy a 3-year RI, you've committed to three years of paying for double the capacity you need. Half of that spend is waste, discounted or not.
Right-size first. Buy RIs second. Always.
| Lever | Typical savings | Reversibility | When to use |
|---|---|---|---|
| Right-sizing | 30-55% | Hours (just resize) | Always, first |
| Generation upgrade (m5→m7i) | 15-25% | Hours | After right-sizing |
| Spot instances | 60-90% | No commitment | Stateless workloads: EKS workers, batch |
| Reserved Instances | 30-50% | 1-3 year commit | After steady-state |
| Savings Plans | 20-35% | 1-3 year commit | Compute-flexible workloads |
The compounding math: right-size (-40%) → upgrade generation (-20% of remainder) → buy 1y SP on what's left (-30%). End-state: 66% off list. Without locking yourself into the wrong size.
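To make the compounding arithmetic concrete, here is a minimal Python sketch. The function name `compound_savings` is ours, and the three percentages are simply the example figures from the paragraph above, not a prediction for any particular fleet.

```python
def compound_savings(*cuts):
    """Return the total fractional saving when each cut applies to what remains."""
    remaining = 1.0
    for cut in cuts:
        remaining *= 1.0 - cut
    return 1.0 - remaining


# Right-size (-40%), generation upgrade (-20%), 1-year Savings Plan (-30%)
total = compound_savings(0.40, 0.20, 0.30)
print(f"{total:.1%} off list")  # 0.6 * 0.8 * 0.7 = 0.336 of list price remains
```

Because each cut applies to the remainder, the order of the three levers does not change the end-state percentage. The order matters for risk: the commitment goes last so it is sized against the smaller bill.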
Step 1: Pull 30 days of real metrics (not 7, not 14)
The single biggest right-sizing mistake is using a 7-day window. Workloads breathe. Month-end batch jobs spike. Quarterly reports double the load. You need 30 days minimum.
Pull these 6 CloudWatch metrics for every instance:
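# Average and Maximum. Hourly datapoints keep 30 days under the 1,440-datapoint cap per call.
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123 \
--start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 3600 \
--statistics Average Maximum
# Percentiles need a second call (the API accepts one or the other, not both):
# replace the last line with: --extended-statistics p99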
| Metric | Why it matters |
|---|---|
| CPUUtilization Max | Catches CPU bursts: never right-size below your max |
| CPUUtilization p99 | Sustained peak: informs target headroom |
| CPUUtilization Avg | Overall efficiency baseline |
| mem_used_percent (CW agent) | EC2 doesn't expose memory natively, so install the CW agent |
| NetworkIn / NetworkOut Sum | Bandwidth-bound workloads need different families |
| DiskWriteOps Sum (instance store only; use EBSWriteOps for EBS volumes) | I/O bound? Consider i-class or attached io2 EBS |
Critical: if mem_used_percent isn't in CloudWatch, install the CW agent first. Right-sizing on CPU alone misses memory-bound workloads, and you'll learn that the hard way at 3am.
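If you would rather script the pull than loop over CLI calls, the sketch below uses the CloudWatch `GetMetricData` API, which returns Average, Maximum, and p99 in one request. It assumes you pass in a boto3 CloudWatch client; the helper names `build_cpu_queries` and `fetch_cpu_summary` are ours. Note that `p99_peak` is the highest hourly p99 in the window, a conservative stand-in for a true 30-day p99, and pagination is ignored because 30 days of hourly points fits in one response.

```python
from datetime import datetime, timedelta, timezone


def build_cpu_queries(instance_id, period=3600):
    """Build GetMetricData queries for Average, Maximum and p99 CPU."""
    metric = {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
    }
    return [
        {
            "Id": query_id,
            "MetricStat": {"Metric": metric, "Period": period, "Stat": stat},
            "ReturnData": True,
        }
        for query_id, stat in (("avg", "Average"), ("max", "Maximum"), ("p99", "p99"))
    ]


def fetch_cpu_summary(cloudwatch, instance_id, days=30):
    """Summarize hourly CPU datapoints; return None if any series is empty."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=build_cpu_queries(instance_id),
        StartTime=end - timedelta(days=days),
        EndTime=end,
    )
    series = {r["Id"]: r["Values"] for r in response["MetricDataResults"]}
    if not all(series.get(key) for key in ("avg", "max", "p99")):
        return None  # no datapoints, e.g. the instance was stopped all month
    return {
        "avg": sum(series["avg"]) / len(series["avg"]),  # mean of hourly averages
        "max": max(series["max"]),                       # highest hourly maximum
        "p99_peak": max(series["p99"]),                  # highest hourly p99
    }
```

A call would look like `fetch_cpu_summary(boto3.client("cloudwatch"), "i-0abc123")`. The same query shape should work for `mem_used_percent` by swapping the namespace and metric name to whatever your CloudWatch agent configuration publishes.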
Step 2: Classify instances into 4 buckets
Once you have 30 days of metrics, sort every instance into exactly one bucket:
Bucket A — Idle (kill candidates)
- Max CPU < 5% for 30 days
- Avg CPU < 2%
- Network In/Out < 1 MB/day
- Action: stop, observe for 7 days. If nobody screams → terminate.
In our 9 audits: 8.4% of fleet lands here on average. These are forgotten dev/test boxes, abandoned pipelines, "temporary" instances from 2 years ago.
Bucket B — Way over-provisioned (drop 2+ sizes)
- p99 CPU < 25%
- Memory < 40%
- Action: drop 2 instance sizes (e.g., m5.4xlarge → m5.xlarge)
In our audits: 23-31% of fleet lands here. The biggest dollar lever.
Bucket C — Mildly over-provisioned (drop 1 size)
- p99 CPU < 50%
- Memory < 60%
- Not already in Bucket B (this catches instances with p99 CPU under 25% but memory at 40-60%)
- Action: drop 1 instance size
In our audits: 18-24% of fleet. Steady gains.
Bucket D — Right-sized (leave alone)
- p99 CPU ≥ 50% OR memory ≥ 60%
- Action: consider generation upgrade only (Step 4)
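The four buckets translate into a few lines of Python. This is a sketch under stated assumptions rather than our production classifier: the `InstanceStats` fields are hypothetical names, boundary values (exactly 50% CPU or 60% memory) are treated as Bucket D, instances that miss Bucket B only on memory fall into Bucket C, and an instance with no memory data is never downsized.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceStats:
    max_cpu: float             # percent, 30-day maximum
    avg_cpu: float             # percent, 30-day average
    p99_cpu: float             # percent, 30-day p99
    mem_pct: Optional[float]   # percent; None if the CloudWatch agent is missing
    net_mb_per_day: float      # NetworkIn + NetworkOut, in MB per day


def classify(s: InstanceStats) -> str:
    """Return the bucket letter (A, B, C or D) for one instance."""
    if s.max_cpu < 5 and s.avg_cpu < 2 and s.net_mb_per_day < 1:
        return "A"  # idle: stop, observe, then terminate
    if s.mem_pct is None:
        return "D"  # assumption: no memory data means no downsizing
    if s.p99_cpu >= 50 or s.mem_pct >= 60:
        return "D"  # right-sized: generation upgrade only
    if s.p99_cpu < 25 and s.mem_pct < 40:
        return "B"  # drop 2 sizes
    return "C"      # drop 1 size


print(classify(InstanceStats(max_cpu=40, avg_cpu=10, p99_cpu=20,
                             mem_pct=30, net_mb_per_day=500)))  # B
```

Checking Bucket D before B and C means memory pressure always wins over low CPU, which is the failure mode that produces the 3am page.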
Step 3: Validate with AWS Compute Optimizer (but don't trust it blindly)
AWS Compute Optimizer is free. Turn it on for every linked account 14+ days before your audit. By the time you start, you'll have a recommendation for every instance that has enough metric history.
What it gets right:
- Instance family suggestions (e.g., "this is memory-bound, try r6i")
- Confidence scoring based on metric coverage
- Cross-region pricing comparisons
What it gets wrong (every time):
- Its default 14-day lookback window is too short for batch workloads with monthly or quarterly cycles
- It under-recommends generation upgrades (still suggests m5 when m7i is cheaper)
- It ignores Spot eligibility entirely
Our rule: treat Compute Optimizer as a second opinion, not gospel. If it disagrees with your manual classification, dig deeper. We've had cases where it recommended downsizing a database instance that was correctly sized for memory headroom — would have caused OOM kills in week 2 of month-end close.
Step 4: Generation upgrades (the silent 20% lever)
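After right-sizing, upgrade the generation. Newer generations usually deliver more work per dollar, which nets out to roughly 7-20% lower cost for equivalent performance. Treat the figures below as approximate effective cost changes for the same workload, not list-price cuts: on-demand list prices for the same size do not always fall between generations, so check current pricing in your region.
| From | To | Same vCPU/RAM? | Approx. effective cost change |
|---|---|---|---|
| m5.xlarge | m6i.xlarge | Yes | -7% |
| m5.xlarge | m7i.xlarge | Yes | -13% |
| r5.xlarge | r6i.xlarge | Yes | -7% |
| r5.xlarge | r7i.xlarge | Yes | -13% |
| c5.xlarge | c7i.xlarge | Yes | -15% |
| t3.large | t3a.large (AMD) | Yes | -10% |
| m5.xlarge | m7g.xlarge (Graviton) | Yes | -20% |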
Graviton (m7g, c7g, r7g) is the best deal in EC2 today. ARM-native workloads run 20-40% cheaper with comparable or better single-threaded performance. The catch: your AMI must be ARM-compatible (Amazon Linux 2023 ✓, Ubuntu 22.04 ✓, Java apps ✓, most Node/Python ✓, native binaries with x86_64 deps ✗).
The migration order:
- Stateless EKS workers → Graviton (m7g) first — easy rollback, blast radius is one node
- ECS apps with Linux AMIs → m7i (no architecture change)
- Stateful databases / cache → m7g if app supports it; otherwise m7i
- Windows EC2 → m7i (Graviton instances do not run Windows Server)
We've never seen a generation upgrade cause a regression. The compounding savings on top of right-sizing are massive.
Step 5: The safe rollout pattern
Right-sizing scares engineers because the worst-case outcome is a 3am page when prod runs hot. Here's the pattern we use to make rollouts boring:
For Auto Scaling Groups (ASG)
# 1. Update the launch template to the new instance type
aws ec2 create-launch-template-version \
--launch-template-id lt-0abc123 \
--source-version 5 \
--launch-template-data '{"InstanceType":"m7i.xlarge"}'
# 2. Set the new version as default
aws ec2 modify-launch-template \
--launch-template-id lt-0abc123 \
--default-version 6
# 3. Trigger an instance refresh with 67% min healthy (at most 33% out of service at once)
aws autoscaling start-instance-refresh \
--auto-scaling-group-name prod-web-asg \
--preferences '{"MinHealthyPercentage":67,"InstanceWarmup":300}'
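With a 67% minimum healthy percentage, the ASG replaces instances in batches of up to a third of the group at a time. CloudWatch alarms catch any regression, and you can cancel or roll back the refresh at any point. This assumes the ASG tracks the launch template's $Default version; if it pins a specific version number, update the ASG as well.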
For standalone EC2
- Stop the instance during a low-traffic window
- Change instance type via console / CLI
- Start it
- Watch CPU/mem for 24 hours
- Keep the old AMI for 7 days as rollback insurance
Always stop, don't terminate. A stopped instance can be re-started at the old size in 60 seconds. A terminated one is gone forever.
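The stop, resize, start sequence for a standalone instance can be scripted. The sketch below assumes a boto3 EC2 client is passed in and that the instance is EBS-backed (instance store data does not survive a stop); `resize_instance` is our own helper name, and the low-traffic scheduling and 24-hour watch are left to you.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop, retype and restart one EBS-backed instance; return the previous type."""
    described = ec2.describe_instances(InstanceIds=[instance_id])
    old_type = described["Reservations"][0]["Instances"][0]["InstanceType"]

    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id, InstanceType={"Value": new_type}
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return old_type  # keep this so you can roll back with the same function
```

Rolling back is the same call with the returned type, for example `resize_instance(ec2, "i-0abc123", old_type)`.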
Step 6: The "did this break anything?" dashboard
Right-sizing is only safe if you can detect regressions fast. Set up these 4 alarms in CloudWatch (or Datadog) before the migration:
- CPU sustained >85% for 15 minutes → page someone
- Memory sustained >85% for 15 minutes → page someone
- App-level p95 latency +30% vs 30-day baseline → Slack alert
- Error rate +50% vs 7-day baseline → Slack alert
Run the migration in waves: 10% on day 1 → wait 48h → 30% → wait 48h → remaining 60%. Total elapsed: ~7 days for the whole fleet.
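Here is a small sketch of the alarm thresholds and the wave split, useful for dry-running the plan before you wire up real alarms. It assumes the CPU, memory, latency, and error-rate inputs are already 15-minute sustained values and that the baselines come from your own monitoring; the function names and dict keys are ours.

```python
def regression_alerts(current, baseline):
    """Return the alerts that would fire, given current and baseline readings."""
    alerts = []
    if current["cpu_pct"] > 85:
        alerts.append("page: CPU > 85%")
    if current["mem_pct"] > 85:
        alerts.append("page: memory > 85%")
    if current["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.30:
        alerts.append("slack: p95 latency +30% vs baseline")
    if current["error_rate"] > baseline["error_rate"] * 1.50:
        alerts.append("slack: error rate +50% vs baseline")
    return alerts


def plan_waves(instance_ids, fractions=(0.10, 0.30, 0.60)):
    """Split the fleet into ordered waves; the last wave takes the remainder."""
    waves, start = [], 0
    for index, fraction in enumerate(fractions):
        is_last = index == len(fractions) - 1
        end = len(instance_ids) if is_last else start + round(len(instance_ids) * fraction)
        waves.append(instance_ids[start:end])
        start = end
    return waves


fleet = [f"i-{n:04d}" for n in range(20)]  # hypothetical instance IDs
print([len(wave) for wave in plan_waves(fleet)])  # [2, 6, 12]
```

Putting the least critical instances at the front of the list means the 10% wave doubles as your canary.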
Real numbers from one audit (composite)
Customer: B2B SaaS, ~200 EKS worker nodes (m5.4xlarge), ~80 standalone app servers, ~120 batch workers.
| Stage | Annualized cost | Savings (vs previous stage) |
|---|---|---|
| Baseline | $720K | n/a |
| After Steps 1-3 (right-sizing) | $480K | $240K (-33%) |
| After Step 4 (m5→m7g/m7i) | $401K | $79K more (-16%) |
| After a 1y Compute Savings Plan on the remaining steady-state spend | $321K | $80K more (-20%) |
| Total annual | $321K | $399K saved (55% vs baseline) |
Fleet utilization went from 14% → 41% average CPU. Zero regressions. CFO bonus: 1.5x annual.
The 5-minute self-audit anyone can do today
Don't have 4 hours? Try this in 5 minutes:
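-- Illustrative query, not a built-in AWS feature: Cost Explorer has no SQL interface.
-- Assumes a table ec2_running_instances that you build yourself (e.g., in Athena)
-- by joining Cost and Usage Report data with 30-day CloudWatch CPU averages,
-- and a month column stored as 'YYYY-MM' text.
SELECT
instance_type,
COUNT(*) AS instance_count,
ROUND(SUM(unblended_cost), 0) AS monthly_cost,
ROUND(AVG(cpu_avg_30d), 1) AS avg_cpu_pct
FROM ec2_running_instances
WHERE month = date_format(date_add('month', -1, current_date), '%Y-%m')
GROUP BY instance_type
-- Repeat the aggregates here: many engines reject column aliases in HAVING
HAVING AVG(cpu_avg_30d) < 20 AND SUM(unblended_cost) > 1000
ORDER BY monthly_cost DESC
LIMIT 20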
If any row shows avg_cpu_pct < 15% and monthly_cost > $5K, you've probably found $30K/year or more of waste: that row costs at least $60K/year, and dropping one size roughly halves it. Drop a size. Watch a week. Bank the savings.
Common objections (and the answers)
"We can't right-size databases — they need headroom for spikes."
True for the primary, false for read replicas. Read replicas can almost always drop a size.
"Compute Optimizer says we're already right-sized."
Pull the raw metrics. CO looks back 14 days by default; you need 30. We've found 30%+ waste on fleets CO calls "optimized".
"What about JVM heap pressure?"
Memory-bound JVM workloads should move to r-class, which has double the RAM of m-class at the same vCPU count. That is cheaper than going up an m-class size just to get the extra memory.
"Generation upgrades broke something for us before."
That was probably an old AMI. m7i / m7g need recent Linux kernels (5.10+) for full performance. Update the AMI first.
How CARTIE AI helps
CARTIE AI's EC2 right-sizer connects read-only to your AWS account, pulls 30 days of CloudWatch metrics across every region, and generates per-instance recommendations grouped by the 4 buckets above. Typical first-scan: $8K–$45K/month projected savings.
We also flag generation-upgrade candidates (m5 → m7i / m7g), idle instances (Bucket A), and Compute Optimizer disagreements (where our 30-day analysis catches what CO's 14-day window missed).
Even without a tool, the 5-minute SQL audit above typically finds $1K–$5K/month of waste in fleets spending over $20K/month on EC2.
Now go pull your CPU metrics. 🚀