The EC2 Right-Sizing Playbook: How We Cut $200K/Year From a 600-Instance Fleet (Without a Single Performance Regression)

A note on this story: The numbers below are a composite of 9 EC2 right-sizing engagements we've run on fleets sized 80–2,400 instances. Mix of EKS worker nodes, ECS hosts, and standalone application servers across us-east-1 / eu-west-1 / ap-south-1.

A CTO sent us this Slack DM last quarter:

"Our AWS bill is $1.4M/year. EC2 alone is $720K. I've already done RIs, Spot for stateless, and Compute Optimizer keeps telling me 'not enough data'. I'm out of ideas."

Here's what we found in the first 4 hours of the audit:

Median CPU utilization across the fleet: 14.2% (over 30 days)
96 instances averaging <5% CPU for 90 days straight
41 m5.4xlarge running workloads that fit on m7i.xlarge
Annualized waste: ~$203K

By day 21, we'd cut their EC2 bill from $720K to $474K/year. Zero performance regressions. No engineers complained. No incidents.

This is the playbook.

Why right-sizing beats every other AWS lever

Most teams chase Reserved Instances first because the savings look big in the calculator (40-60% off list). But here's the trap: RIs lock you into the wrong size for 1-3 years. If you over-provisioned by 2x and then buy a 3-year RI, you've committed to three years of paying for double the capacity you need.

Right-size first. Buy RIs second. Always.

Lever	Typical savings	Reversibility	When to use
Right-sizing	30-55%	Hours (just resize)	Always, first
Generation upgrade (m5→m7i)	15-25%	Hours	After right-sizing
Spot instances	60-90%	Minutes (fall back to On-Demand)	Stateless workloads: EKS workers, batch
Reserved Instances	30-50%	1-3 year commit	After steady-state
Savings Plans	20-35%	1-3 year commit	Compute-flexible workloads

The compounding math: right-size (-40%) → upgrade generation (-20% of remainder) → buy 1y SP on what's left (-30%). End-state: 66% off list. Without locking yourself into the wrong size.
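The compounding arithmetic is easy to get wrong because each discount applies to what is left, not to the original bill. A minimal sketch in Python (the function name is ours, purely for illustration):

```python
def compound_discount(*step_discounts: float) -> float:
    """Return the total fractional discount after applying each step to the remainder."""
    remaining = 1.0
    for discount in step_discounts:
        remaining *= 1.0 - discount  # each lever only acts on what is still being paid
    return 1.0 - remaining


# right-size (-40%), then generation upgrade (-20%), then 1y Savings Plan (-30%)
total = compound_discount(0.40, 0.20, 0.30)
print(f"{total:.1%} off list")  # 0.6 * 0.8 * 0.7 = 0.336 of the bill remains
```

Note that simply adding the three percentages (90%) would badly overstate the result.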

Step 1: Pull 30 days of real metrics (not 7, not 14)

The single biggest right-sizing mistake is using a 7-day window. Workloads breathe. Month-end batch jobs spike. Quarterly reports double the load. You need 30 days minimum.

Pull these 6 CloudWatch metrics for every instance:

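# GNU date syntax. An hourly period keeps 30 days (720 datapoints) under the
# 1,440-datapoints-per-call limit; a 300-second period would exceed it.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --statistics Maximum Average

# Percentiles cannot be combined with --statistics. Run the same call again
# with the last line replaced by:
#   --extended-statistics p99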

Metric	Why it matters
`CPUUtilization` Max	Catches CPU bursts: never right-size below your max
`CPUUtilization` p99	Sustained peak: informs target headroom
`CPUUtilization` Avg	Overall efficiency baseline
`mem_used_percent` (CW agent, `CWAgent` namespace)	EC2 doesn't publish memory metrics natively; install the CloudWatch agent
`NetworkIn/Out` Sum	Bandwidth-bound workloads need different families
`DiskWriteOps` Sum	I/O bound? Consider i-class or attached io2 EBS. This metric covers instance store volumes only; for EBS-backed Nitro instances use `EBSWriteOps`, or the volume-level `VolumeWriteOps` in `AWS/EBS`

Critical: if mem_used_percent isn't in CloudWatch, install the CW agent first. Right-sizing on CPU alone misses memory-bound workloads, and you'll learn that the hard way at 3am.
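To see why Max, p99, and Avg each tell a different story, here is a small sketch that summarizes a list of utilization samples. The sample data is invented for illustration, and the nearest-rank percentile is our own simple approximation, not necessarily how CloudWatch computes p99 internally.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Collapse a series of CPU (or memory) readings into the three numbers used for sizing."""
    return {
        "avg": round(mean(samples), 1),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


# Hypothetical data: 720 hourly readings (30 days), mostly quiet,
# with a month-end batch window and a handful of short bursts.
samples = [12.0] * 700 + [45.0] * 15 + [88.0] * 5
summary = summarize(samples)
print(summary)
```

On this made-up series the average sits around 13%, the p99 lands on the batch window, and the max is set by the bursts. A 7-day window that happened to miss the batch window would report only the quiet baseline.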

Step 2: Classify instances into 4 buckets

Once you have 30 days of metrics, sort every instance into exactly one bucket:

Bucket A — Idle (kill candidates)

Max CPU < 5% for 30 days
Avg CPU < 2%
Network In/Out < 1 MB/day
Action: stop, observe for 7 days. If nobody screams → terminate.

In our 9 audits: 8.4% of fleet lands here on average. These are forgotten dev/test boxes, abandoned pipelines, "temporary" instances from 2 years ago.

Bucket B — Way over-provisioned (drop 2+ sizes)

p99 CPU < 25%
Memory < 40%
Action: drop 2 instance sizes (e.g., m5.4xlarge → m5.xlarge)

In our audits: 23-31% of fleet lands here. The biggest dollar lever.

Bucket C — Mildly over-provisioned (drop 1 size)

p99 CPU 25-50%
Memory < 60%
Action: drop 1 instance size

In our audits: 18-24% of fleet. Steady gains.

Bucket D — Right-sized (leave alone)

p99 CPU > 50% OR memory > 60%
Action: consider generation upgrade only (Step 4)
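The four buckets can be expressed as a small classifier. This is a sketch under a few assumptions of ours that the thresholds above leave open: values exactly on a boundary (p99 of 50%, memory of 60%) are sent to the more conservative bucket D, an instance with low CPU but memory between 40% and 60% is treated as bucket C, and the network threshold is applied to combined in/out traffic. The field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceMetrics:
    max_cpu: float         # 30-day maximum CPU %
    avg_cpu: float         # 30-day average CPU %
    p99_cpu: float         # 30-day p99 CPU %
    mem_pct: float         # peak mem_used_percent from the CloudWatch agent
    net_mb_per_day: float  # combined NetworkIn + NetworkOut, MB/day


def classify(m: InstanceMetrics) -> str:
    """Return 'A' (idle), 'B' (drop 2+ sizes), 'C' (drop 1 size) or 'D' (leave alone)."""
    if m.max_cpu < 5 and m.avg_cpu < 2 and m.net_mb_per_day < 1:
        return "A"
    # Check D before B/C so a memory-bound instance is never downsized on CPU alone.
    if m.p99_cpu >= 50 or m.mem_pct >= 60:
        return "D"
    if m.p99_cpu < 25 and m.mem_pct < 40:
        return "B"
    # Everything else, including low CPU with memory at 40-60% (our assumption).
    return "C"


fleet = {
    "forgotten-dev-box": InstanceMetrics(3, 1, 3, 10, 0.2),
    "oversized-api": InstanceMetrics(40, 8, 20, 30, 500),
    "memory-heavy-jvm": InstanceMetrics(40, 8, 20, 75, 500),
}
for name, metrics in fleet.items():
    print(name, "->", classify(metrics))
```

Checking bucket D before B and C is deliberate: it keeps the CPU-only view from overriding memory pressure.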

Step 3: Validate with AWS Compute Optimizer (but don't trust it blindly)

AWS Compute Optimizer's standard recommendations are free. Turn it on for every linked account 14+ days before your audit so its default 14-day lookback window is full. By the time you start, you should have a recommendation for most of the instances it supports.

What it gets right:

Instance family suggestions (e.g., "this is memory-bound, try r6i")
Confidence scoring based on metric coverage
Cross-region pricing comparisons

What it tends to get wrong:

Its default 14-day lookback window is too short for batch workloads
It under-recommends generation upgrades (still suggests m5 when m7i would do the same work for less)
It ignores Spot eligibility entirely

Our rule: treat Compute Optimizer as a second opinion, not gospel. If it disagrees with your manual classification, dig deeper. We've had cases where it recommended downsizing a database instance that was correctly sized for memory headroom — would have caused OOM kills in week 2 of month-end close.

Step 4: Generation upgrades (the silent 20% lever)

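After right-sizing, upgrade the generation. Newer generations typically do the same work for less, though the gain varies by family. Treat the figures below as approximate cost reductions for equivalent work, not list-price cuts: on-demand rates for same-size Intel instances have tended to stay flat or rise slightly between generations, so check current pricing for your region before you commit.

From	To	Same vCPU/RAM?	Approx. cost drop for equivalent work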
m5.xlarge	m6i.xlarge	Yes	-7%
m5.xlarge	m7i.xlarge	Yes	-13%
r5.xlarge	r6i.xlarge	Yes	-7%
r5.xlarge	r7i.xlarge	Yes	-13%
c5.xlarge	c7i.xlarge	Yes	-15%
t3.large	t3a.large (AMD)	Yes	-10%
m5.xlarge	m7g.xlarge (Graviton)	Yes	-20%

Graviton (m7g, c7g, r7g) is the best deal in EC2 today. ARM-native workloads run 20-40% cheaper with comparable or better single-threaded performance. The catch: your AMI must be ARM-compatible (Amazon Linux 2023 ✓, Ubuntu 22.04 ✓, Java apps ✓, most Node/Python ✓, native binaries with x86_64 deps ✗).

The migration order:

Stateless EKS workers → Graviton (m7g) first — easy rollback, blast radius is one node
ECS apps with Linux AMIs → m7i (no architecture change)
Stateful databases / cache → m7g if app supports it; otherwise m7i
Windows EC2 → m7i (Graviton instances don't run Windows Server)

In these engagements we haven't seen a generation upgrade cause a regression once the AMI and drivers were current, and the savings compound on top of right-sizing.

Step 5: The safe rollout pattern

Right-sizing scares engineers because the worst-case outcome is a 3am page when prod runs hot. Here's the pattern we use to make rollouts boring:

For Auto Scaling Groups (ASG)

# 1. Create a new launch template version with the new instance type
#    (everything else is copied from version 5)
aws ec2 create-launch-template-version \
  --launch-template-id lt-0abc123 \
  --source-version 5 \
  --launch-template-data '{"InstanceType":"m7i.xlarge"}'

# 2. Set the new version as default (use the VersionNumber returned by step 1,
#    6 here). The ASG must reference $Default or $Latest to pick it up.
aws ec2 modify-launch-template \
  --launch-template-id lt-0abc123 \
  --default-version 6

# 3. Trigger an instance refresh that keeps at least 67% of capacity healthy
#    (so at most about 33% of the group is replaced at once)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name prod-web-asg \
  --preferences '{"MinHealthyPercentage":67,"InstanceWarmup":300}'

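With MinHealthyPercentage set to 67, the ASG replaces instances in batches and keeps at least 67% of the group in service, so up to roughly a third of the group is cycled at a time. CloudWatch alarms (Step 6) catch regressions, and you can stop the refresh with aws autoscaling cancel-instance-refresh or revert it with aws autoscaling rollback-instance-refresh.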

For standalone EC2

Stop the instance during a low-traffic window
Change instance type via console / CLI
Start it
Watch CPU/mem for 24 hours
Keep the old AMI for 7 days as rollback insurance

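Always stop, don't terminate. A stopped instance can be switched back to the old size and restarted within minutes. A terminated one is gone for good. One caveat: stopping an instance discards instance store data and releases its public IPv4 address unless you use an Elastic IP, so check both before the window.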
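For more than a handful of standalone instances, the stop / change type / start sequence is worth scripting. Below is a minimal sketch, not a production tool. The function takes an EC2 client as an argument and uses the boto3 EC2 client methods describe_instance_attribute, stop_instances, modify_instance_attribute, start_instances, and the instance_stopped / instance_running waiters. The DryRunEC2 class is a hypothetical stand-in we added so the sketch can be exercised without AWS credentials.

```python
def resize_instance(ec2, instance_id: str, new_type: str) -> str:
    """Stop, retype, and restart one instance. Returns the previous type for rollback."""
    attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="instanceType")
    old_type = attr["InstanceType"]["Value"]

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # The instance type can only be changed while the instance is stopped.
    ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": new_type})

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return old_type


class DryRunEC2:
    """Hypothetical stand-in for boto3.client("ec2"): records calls, touches nothing."""

    def __init__(self, current_type):
        self.current_type = current_type
        self.calls = []

    def describe_instance_attribute(self, InstanceId, Attribute):
        self.calls.append("describe")
        return {"InstanceType": {"Value": self.current_type}}

    def stop_instances(self, InstanceIds):
        self.calls.append("stop")

    def start_instances(self, InstanceIds):
        self.calls.append("start")

    def modify_instance_attribute(self, InstanceId, InstanceType):
        self.calls.append("modify")
        self.current_type = InstanceType["Value"]

    def get_waiter(self, name):
        client = self

        class _Waiter:
            def wait(self, InstanceIds):
                client.calls.append(f"wait:{name}")

        return _Waiter()


fake = DryRunEC2("m5.4xlarge")
previous = resize_instance(fake, "i-0abc123", "m5.xlarge")
print(previous, "->", fake.current_type, fake.calls)
# Against a real account: resize_instance(boto3.client("ec2"), "i-0abc123", "m5.xlarge")
```

Returning the previous type makes rollback a second call to the same function with the arguments swapped.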

Step 6: The "did this break anything?" dashboard

Right-sizing is only safe if you can detect regressions fast. Set up these 4 alarms in CloudWatch (or Datadog) before the migration:

CPU sustained >85% for 15 minutes → page someone
Memory sustained >85% for 15 minutes → page someone
App-level p95 latency +30% vs 30-day baseline → Slack alert
Error rate +50% vs 7-day baseline → Slack alert
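The logic behind these four alarms is simple enough to sketch in plain Python, which helps when porting them between CloudWatch and Datadog. The function names are ours, and the sketch assumes 5-minute datapoints, so "15 minutes" means three consecutive samples.

```python
def sustained_above(samples, limit, consecutive):
    """True if `consecutive` samples in a row exceed `limit` (e.g. 3 x 5-minute points)."""
    run = 0
    for value in samples:
        run = run + 1 if value > limit else 0
        if run >= consecutive:
            return True
    return False


def exceeds_baseline(current, baseline, allowed_increase):
    """True if `current` is more than `allowed_increase` (a fraction) above `baseline`."""
    return current > baseline * (1 + allowed_increase)


# CPU alarm: sustained >85% for 15 minutes
cpu_page = sustained_above([80, 90, 91, 86], limit=85, consecutive=3)
# A single dip below the limit resets the run
cpu_ok = sustained_above([90, 84, 90, 90], limit=85, consecutive=3)
# Latency alarm: p95 more than 30% above the 30-day baseline (values in ms)
latency_alert = exceeds_baseline(current=140, baseline=100, allowed_increase=0.30)
# Error-rate alarm: more than 50% above the 7-day baseline
error_alert = exceeds_baseline(current=0.016, baseline=0.010, allowed_increase=0.50)
print(cpu_page, cpu_ok, latency_alert, error_alert)
```

Requiring consecutive breaches rather than any single breach is what keeps a short, harmless burst from paging someone.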

Run the migration in waves: 10% on day 1 → wait 48h → 30% → wait 48h → remaining 60%. Total elapsed: ~7 days for the whole fleet.
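Splitting a fleet into the 10% / 30% / 60% waves can be done with a few lines of Python. This is an illustrative sketch: the function name is ours, and the rounding rule (at least one instance per early wave, with the last wave taking whatever is left) is an assumption.

```python
def plan_waves(instance_ids, fractions=(0.10, 0.30, 0.60)):
    """Split instance_ids into consecutive waves sized by `fractions`."""
    waves, start = [], 0
    total = len(instance_ids)
    for i, fraction in enumerate(fractions):
        if i == len(fractions) - 1:
            end = total  # final wave absorbs any rounding remainder
        else:
            end = min(total, start + max(1, round(total * fraction)))
        waves.append(instance_ids[start:end])
        start = end
    return waves


# Hypothetical fleet of 20 instances
fleet = [f"i-{n:04d}" for n in range(20)]
waves = plan_waves(fleet)
print([len(w) for w in waves])
```

Ordering the input list so the least critical instances come first puts them in the 10% wave, where a surprise is cheapest.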

Real numbers from one audit (composite)

Customer: B2B SaaS, ~200 EKS worker nodes (m5.4xlarge), ~80 standalone app servers, ~120 batch workers.

Stage	Annualized cost	Savings vs. previous stage
Baseline	$720K	—
After Steps 1-3 (right-sizing)	$480K	$240K (-33%)
After Step 4 (m5→m7g/m7i)	$401K	$79K more (-16%)
After a 1-year Compute Savings Plan on the remaining steady-state usage	$321K	$80K more (-20%)
Total annual	$321K	$399K saved (55% off baseline)

Fleet utilization went from 14% → 41% average CPU. Zero regressions. CFO bonus: 1.5x annual.

The 5-minute self-audit anyone can do today

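Don't have 4 hours? If your billing data and 30-day CPU averages already sit in a queryable table, this takes about 5 minutes. Cost Explorer has no SQL interface, so the query below assumes a hypothetical ec2_running_instances table (for example, a Cost and Usage Report joined with CloudWatch CPU averages in Athena):

-- Athena-style SQL against a hypothetical ec2_running_instances table
-- (one row per instance per billing month; `month` is assumed to be a DATE)
SELECT
  instance_type,
  COUNT(*) AS instance_count,
  ROUND(SUM(unblended_cost), 0) AS monthly_cost,
  ROUND(AVG(cpu_avg_30d), 1) AS avg_cpu_pct
FROM ec2_running_instances
WHERE month = date_trunc('month', current_date - interval '1' month)
GROUP BY instance_type
-- Most dialects don't allow SELECT aliases in HAVING, so repeat the aggregates
HAVING AVG(cpu_avg_30d) < 20 AND SUM(unblended_cost) > 1000
ORDER BY monthly_cost DESC
LIMIT 20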

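If any row shows avg_cpu_pct < 15% and monthly_cost > $5K, you've likely found tens of thousands of dollars a year in waste: that row costs at least $60K/year, and dropping one size roughly halves the instance price. Drop a size. Watch a week. Bank the savings.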
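If the data lives in a CSV export rather than a SQL table, the same grouping and filtering can be sketched in Python. The row format (instance_type, monthly_cost, cpu_avg_30d) and the sample numbers are hypothetical, and the thresholds mirror the query above.

```python
from collections import defaultdict


def flag_oversized(rows, max_avg_cpu=20.0, min_monthly_cost=1000.0, limit=20):
    """Group per-instance rows by type and return low-CPU, high-cost groups, costliest first."""
    groups = defaultdict(lambda: {"count": 0, "cost": 0.0, "cpu_sum": 0.0})
    for row in rows:
        g = groups[row["instance_type"]]
        g["count"] += 1
        g["cost"] += row["monthly_cost"]
        g["cpu_sum"] += row["cpu_avg_30d"]

    flagged = []
    for itype, g in groups.items():
        avg_cpu = g["cpu_sum"] / g["count"]
        if avg_cpu < max_avg_cpu and g["cost"] > min_monthly_cost:
            flagged.append((itype, g["count"], round(g["cost"]), round(avg_cpu, 1)))
    return sorted(flagged, key=lambda r: r[2], reverse=True)[:limit]


def make_rows(instance_type, monthly_cost, cpu_values):
    return [
        {"instance_type": instance_type, "monthly_cost": monthly_cost, "cpu_avg_30d": cpu}
        for cpu in cpu_values
    ]


# Hypothetical fleet: costs and CPU figures are made up for illustration.
rows = (
    make_rows("m5.4xlarge", 560.0, [10, 14, 12])
    + make_rows("m5.2xlarge", 280.0, [18, 19, 17, 16])
    + make_rows("c5.xlarge", 124.0, [8, 6])      # low CPU but too cheap to matter
    + make_rows("r5.2xlarge", 368.0, [55, 60])   # busy, leave alone
)
report = flag_oversized(rows)
for line in report:
    print(line)
```

Each tuple is (instance type, instance count, monthly cost, average CPU %), in the same order as the SQL columns.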

Common objections (and the answers)

"We can't right-size databases — they need headroom for spikes." True for the primary, false for read replicas. Read replicas can almost always drop a size.

"Compute Optimizer says we're already right-sized." Pull the raw metrics. CO looks back 14 days by default; you need 30. We've found 30%+ waste on fleets CO calls "optimized".

"What about JVM heap pressure?" Memory-bound JVM workloads should move to r-class, which has double the RAM of m-class at the same vCPU count. That is cheaper than going up an m-class size just to get the memory.

"Generation upgrades broke something for us before." That was probably an old AMI. m7i / m7g need recent Linux kernels (5.10+) for full performance. Update the AMI first.

How CARTIEAI helps

CARTIEAI's EC2 right-sizer connects read-only to your AWS account, pulls 30 days of CloudWatch metrics across every region, and generates per-instance recommendations grouped by the 4 buckets above. Typical first-scan: $8K–$45K/month projected savings.

We also flag generation-upgrade candidates (m5 → m7i / m7g), idle instances (Bucket A), and Compute Optimizer disagreements (where our 30-day analysis catches what CO's 14-day window missed).

Even without a tool, the 5-minute SQL audit above typically finds $1K–$5K/month of waste in fleets spending over $20K/month on EC2.

Now go pull your CPU metrics. 🚀