Best Practices
May 2, 2026

The EC2 Right-Sizing Playbook: How We Cut $200K/Year From a 600-Instance Fleet (Without a Single Performance Regression)

Most EC2 fleets run at 12-18% average CPU. Right-sizing is the single biggest lever in AWS — bigger than RIs, bigger than Spot. Here's the 6-step playbook that worked across 9 audits.

Lakshmi Kiranmai Guduru

Founder, CARTIEAI

A note on this story: The numbers below are a composite of 9 EC2 right-sizing engagements we've run on fleets sized 80–2,400 instances. Mix of EKS worker nodes, ECS hosts, and standalone application servers across us-east-1 / eu-west-1 / ap-south-1.

A CTO sent us this Slack DM last quarter:

"Our AWS bill is $1.4M/year. EC2 alone is $720K. I've already done RIs, Spot for stateless, and Compute Optimizer keeps telling me 'not enough data'. I'm out of ideas."

Here's what we found in the first 4 hours of the audit:

  • Median CPU utilization across the fleet: 14.2% (over 30 days)
  • 96 instances averaging <5% CPU for 90 days straight
  • 41 m5.4xlarge running workloads that fit on m7i.xlarge
  • Annualized waste: ~$203K

By day 21, we'd cut their EC2 bill from $720K to $474K/year. Zero performance regressions. No engineers complained. No incidents.

This is the playbook.


Why right-sizing beats every other AWS lever

Most teams chase Reserved Instances first because the savings look big in the calculator (40-60% off list). But here's the trap: RIs lock you into the wrong size for 1-3 years. If you over-provisioned by 2x and then buy a 3-year RI, you've committed to three years of paying for twice the capacity you need.

Right-size first. Buy RIs second. Always.

| Lever | Typical savings | Reversibility | When to use |
| --- | --- | --- | --- |
| Right-sizing | 30-55% | Hours (just resize) | Always, first |
| Generation upgrade (m5→m7i) | 15-25% | Hours | After right-sizing |
| Spot instances | 60-90% | Stateless workloads | EKS workers, batch |
| Reserved Instances | 30-50% | 1-3 year commit | After steady-state |
| Savings Plans | 20-35% | 1-3 year commit | Compute-flexible workloads |

The compounding math: right-size (-40%) → upgrade generation (-20% of remainder) → buy 1y SP on what's left (-30%). End-state: 66% off list. Without locking yourself into the wrong size.
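The arithmetic behind that 66% can be checked in a few lines. This is a minimal sketch; the function name is made up for illustration, and the three percentages are the ones quoted above.

```python
def compounded_discount(*cuts):
    """Total fractional saving when each cut applies to what remains."""
    remaining = 1.0
    for cut in cuts:
        remaining *= 1.0 - cut
    return 1.0 - remaining


# right-size (-40%), generation upgrade (-20%), 1y Savings Plan (-30%)
total = compounded_discount(0.40, 0.20, 0.30)
print(f"{total:.1%} off list")  # 0.60 * 0.80 * 0.70 = 0.336 of list remains
```

The cuts multiply rather than add: three cuts that sum to 90% leave about a third of the original bill, not a tenth.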


Step 1: Pull 30 days of real metrics (not 7, not 14)

The single biggest right-sizing mistake is using a 7-day window. Workloads breathe. Month-end batch jobs spike. Quarterly reports double the load. You need 30 days minimum.

Pull these 6 CloudWatch metrics for every instance:

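# Maximum and Average over the last 30 days (GNU date syntax).
# 30 days at --period 300 is 8,640 datapoints, over the 1,440-per-call limit,
# so request hourly datapoints instead.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --statistics Maximum Average

# Percentiles go through --extended-statistics, and this API does not accept
# --statistics and --extended-statistics in the same call. For p99, rerun the
# command above with the last line replaced by:
#   --extended-statistics p99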

| Metric | Why it matters |
| --- | --- |
| CPUUtilization Max | Spot CPU bursts; never right-size below your max |
| CPUUtilization p99 | Sustained peak; informs target headroom |
| CPUUtilization Avg | Overall efficiency baseline |
| mem_used_percent (CW agent) | EC2 doesn't expose memory natively; install the CW agent |
| NetworkIn/Out Sum | Bandwidth-bound workloads need different families |
| DiskWriteOps Sum | I/O bound? Consider i-class or attached io2 EBS |

Critical: if mem_used_percent isn't in CloudWatch, install the CW agent first. Right-sizing on CPU alone misses memory-bound workloads, and you'll learn that the hard way at 3am.
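For a whole fleet, the same pull is easier to script. Here is a rough Python sketch built on CloudWatch's GetMetricData call, which returns Maximum, Average, and p99 in one request. Assumptions to note: the client is passed in (in real use, boto3.client("cloudwatch")), datapoints are hourly, the helper names are hypothetical, and "p99_peak_hour" is the highest hourly p99, not a true 30-day percentile.

```python
from datetime import datetime, timedelta, timezone

STATS = ("Maximum", "Average", "p99")


def build_queries(instance_id, period=3600):
    """One query per statistic for a single instance's CPUUtilization."""
    return [
        {
            "Id": "cpu_" + stat.lower(),  # cpu_maximum, cpu_average, cpu_p99
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": period,  # hourly datapoints (assumption)
                "Stat": stat,
            },
            "ReturnData": True,
        }
        for stat in STATS
    ]


def cpu_summary(cloudwatch, instance_id, days=30):
    """Collapse `days` of hourly datapoints into max / avg / peak-hour p99."""
    end = datetime.now(timezone.utc)
    kwargs = {
        "MetricDataQueries": build_queries(instance_id),
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
    }
    series = {}
    while True:  # follow NextToken until every page has been read
        page = cloudwatch.get_metric_data(**kwargs)
        for result in page["MetricDataResults"]:
            series.setdefault(result["Id"], []).extend(result["Values"])
        if not page.get("NextToken"):
            break
        kwargs["NextToken"] = page["NextToken"]
    if not all(series.get("cpu_" + s.lower()) for s in STATS):
        return None  # no datapoints: instance too new, or metrics missing
    return {
        "max": max(series["cpu_maximum"]),
        "avg": sum(series["cpu_average"]) / len(series["cpu_average"]),
        "p99_peak_hour": max(series["cpu_p99"]),
    }
```

Memory needs a second query against whatever namespace and dimensions your CloudWatch agent config publishes mem_used_percent under, so it is left out of this sketch.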


Step 2: Classify instances into 4 buckets

Once you have 30 days of metrics, sort every instance into exactly one bucket:

Bucket A — Idle (kill candidates)

  • Max CPU < 5% for 30 days
  • Avg CPU < 2%
  • Network In/Out < 1 MB/day
  • Action: stop, observe for 7 days. If nobody screams → terminate.

In our 9 audits: 8.4% of fleet lands here on average. These are forgotten dev/test boxes, abandoned pipelines, "temporary" instances from 2 years ago.

Bucket B — Way over-provisioned (drop 2+ sizes)

  • p99 CPU < 25%
  • Memory < 40%
  • Action: drop 2 instance sizes (e.g., m5.4xlarge → m5.xlarge)

In our audits: 23-31% of fleet lands here. The biggest dollar lever.

Bucket C — Mildly over-provisioned (drop 1 size)

  • p99 CPU 25-50%
  • Memory < 60%
  • Action: drop 1 instance size

In our audits: 18-24% of fleet. Steady gains.

Bucket D — Right-sized (leave alone)

  • p99 CPU > 50% OR memory > 60%
  • Action: consider generation upgrade only (Step 4)
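
The four buckets translate into a small classifier. This is a sketch under stated assumptions, not a definitive rule set: the class and function names are hypothetical, and two edge cases the thresholds above leave open are resolved conservatively. An instance with p99 CPU under 25% but memory between 40% and 60% goes to Bucket C (drop one size, not two), and memory at exactly 60% counts as Bucket D.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    max_cpu: float         # highest CPU % seen over 30 days
    avg_cpu: float         # average CPU % over 30 days
    p99_cpu: float         # p99 CPU % over 30 days
    mem_pct: float         # mem_used_percent from the CW agent
    net_mb_per_day: float  # NetworkIn + NetworkOut, MB per day


def classify(m: Metrics) -> str:
    """Return 'A', 'B', 'C' or 'D' using the bucket thresholds above."""
    # Bucket A: idle, kill candidate
    if m.max_cpu < 5 and m.avg_cpu < 2 and m.net_mb_per_day < 1:
        return "A"
    # Bucket D: already right-sized (memory == 60% lands here by assumption)
    if m.p99_cpu > 50 or m.mem_pct >= 60:
        return "D"
    # Bucket B: way over-provisioned, drop 2+ sizes
    if m.p99_cpu < 25 and m.mem_pct < 40:
        return "B"
    # Bucket C: everything left, including low CPU with 40-60% memory
    # (assumption: the conservative choice is to drop one size only)
    return "C"


print(classify(Metrics(max_cpu=30, avg_cpu=8, p99_cpu=20, mem_pct=35, net_mb_per_day=500)))
```

Checking D before B and C means a memory-heavy instance never gets downsized on the strength of low CPU alone.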

Step 3: Validate with AWS Compute Optimizer (but don't trust it blindly)

AWS Compute Optimizer is free. Turn it on for every linked account 14+ days before your audit. By the time you start, it should have a recommendation for most instances (it can still report insufficient data for some, as the CTO above found).

What it gets right:

  • Instance family suggestions (e.g., "this is memory-bound, try r6i")
  • Confidence scoring based on metric coverage
  • Cross-region pricing comparisons

What it gets wrong, in our experience:

  • Its default 14-day lookback is too short for batch workloads with monthly or quarterly peaks (the lookback can be lengthened in its recommendation preferences)
  • It under-recommends generation upgrades (still suggests m5 when m7i is cheaper)
  • It ignores Spot eligibility entirely

Our rule: treat Compute Optimizer as a second opinion, not gospel. If it disagrees with your manual classification, dig deeper. We've had cases where it recommended downsizing a database instance that was correctly sized for memory headroom — would have caused OOM kills in week 2 of month-end close.


Step 4: Generation upgrades (the silent 20% lever)

After right-sizing, upgrade the generation. Newer generations are cheaper for equivalent performance, by roughly 7-20% in the comparisons below:

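| From | To | Same vCPU/RAM? | Price drop |
| --- | --- | --- | --- |
| m5.xlarge | m6i.xlarge | Yes | -7% |
| m5.xlarge | m7i.xlarge | Yes | -13% |
| r5.xlarge | r6i.xlarge | Yes | -7% |
| r5.xlarge | r7i.xlarge | Yes | -13% |
| c5.xlarge | c7i.xlarge | Yes | -15% |
| t3.large | t3a.large (AMD) | Yes | -10% |
| Graviton: m5.xlarge | m7g.xlarge | Yes | -20% |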

Graviton (m7g, c7g, r7g) is the best deal in EC2 today. ARM-native workloads run 20-40% cheaper with comparable or better single-threaded performance. The catch: your AMI must be ARM-compatible (Amazon Linux 2023 ✓, Ubuntu 22.04 ✓, Java apps ✓, most Node/Python ✓, native binaries with x86_64 deps ✗).

The migration order:

  1. Stateless EKS workers → Graviton (m7g) first — easy rollback, blast radius is one node
  2. ECS apps with Linux AMIs → m7i (no architecture change)
  3. Stateful databases / cache → m7g if app supports it; otherwise m7i
  4. Windows EC2 → m7i (Graviton instances don't run Windows)

We've never seen a generation upgrade cause a regression. The compounding savings on top of right-sizing are massive.
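
The target-picking rule from the migration order can be sketched as a small helper. Everything here is illustrative: the function name and flags are hypothetical, only the m5, r5 and c5 families from the table are handled, and the code does not verify that the target family offers the same size (for example, the largest m5 sizes have no m7g equivalent), so check availability before using anything like it.

```python
UPGRADABLE = {"m5": "m", "r5": "r", "c5": "c"}  # families covered in the table above


def upgrade_target(instance_type, *, windows=False, arm_ready=False):
    """Suggest a 7th-generation target type, or return the input unchanged."""
    family, size = instance_type.split(".", 1)
    series = UPGRADABLE.get(family)
    if series is None:
        return instance_type  # outside this sketch's scope (t3, m5d, ...)
    # Graviton only when the workload is ARM-compatible and not Windows
    suffix = "7g" if (arm_ready and not windows) else "7i"
    return f"{series}{suffix}.{size}"


print(upgrade_target("m5.xlarge", arm_ready=True))  # Graviton path
print(upgrade_target("c5.xlarge", windows=True))    # Intel path
```

Variants such as m5d or m5a are deliberately left alone, since mapping them to plain m7i would silently drop local NVMe or change the CPU vendor.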


Step 5: The safe rollout pattern

Right-sizing scares engineers because the worst-case outcome is a 3am page when prod runs hot. Here's the pattern we use to make rollouts boring:

For Auto Scaling Groups (ASG)

# 1. Update the launch template to the new instance type
aws ec2 create-launch-template-version \
  --launch-template-id lt-0abc123 \
  --source-version 5 \
  --launch-template-data '{"InstanceType":"m7i.xlarge"}'

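# 2. Set the new version as default (assumes the version created above is 6;
#    the ASG must reference $Default or $Latest to pick it up)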
aws ec2 modify-launch-template \
  --launch-template-id lt-0abc123 \
  --default-version 6

# 3. Trigger an instance refresh that keeps at least 67% of the group healthy
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name prod-web-asg \
  --preferences '{"MinHealthyPercentage":67,"InstanceWarmup":300}'

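With MinHealthyPercentage set to 67, the ASG replaces up to a third of the group at a time and waits for each batch to pass health checks and the 300-second warmup before moving on. CloudWatch alarms catch any regression. You can cancel the refresh at any moment (aws autoscaling cancel-instance-refresh) and roll back by setting the launch template default back to the previous version and starting a new refresh.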

For standalone EC2

  • Stop the instance during a low-traffic window
  • Change instance type via console / CLI
  • Start it
  • Watch CPU/mem for 24 hours
  • Keep the old AMI for 7 days as rollback insurance

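Always stop, don't terminate. A stopped instance can be restarted at the old size in about a minute. A terminated one is gone forever. Stopping still wipes instance-store volumes and releases a public IP that isn't an Elastic IP, so check for both first.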
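
The standalone steps can be scripted along these lines. This is a minimal sketch that assumes an EBS-backed instance and takes the EC2 client as an argument (in real use, boto3.client("ec2")); the function name is hypothetical. It returns the previous type so a rollback is the same call with the arguments swapped.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop, change instance type, start. Returns the previous type."""
    desc = ec2.describe_instances(InstanceIds=[instance_id])
    old_type = desc["Reservations"][0]["Instances"][0]["InstanceType"]

    # The instance type can only be changed while the instance is stopped
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return old_type  # keep this: it is the rollback target
```

Calling resize_instance(ec2, "i-0abc123", "m5.xlarge") and storing the return value gives you the exact argument for undoing the change if the 24-hour watch turns up a problem.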


Step 6: The "did this break anything?" dashboard

Right-sizing is only safe if you can detect regressions fast. Set up these 4 alarms in CloudWatch (or Datadog) before the migration:

  1. CPU sustained >85% for 15 minutes → page someone
  2. Memory sustained >85% for 15 minutes → page someone
  3. App-level p95 latency +30% vs 30-day baseline → Slack alert
  4. Error rate +50% vs 7-day baseline → Slack alert

Run the migration in waves: 10% on day 1 → wait 48h → 30% → wait 48h → remaining 60%. Total elapsed: ~7 days for the whole fleet.
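
Both the alarm thresholds and the wave split are simple enough to express as code. The following is a sketch with hypothetical function names; it assumes CPU and memory arrive as 5-minute samples (so three consecutive samples make 15 minutes) and that baselines are computed elsewhere.

```python
def wave_sizes(total, percents=(10, 30, 60)):
    """Split a fleet into waves; the last wave absorbs rounding leftovers."""
    sizes = [total * p // 100 for p in percents[:-1]]
    sizes.append(total - sum(sizes))
    return sizes


def sustained(samples, limit=85.0, points=3):
    """True if `points` consecutive samples exceed `limit`."""
    run = 0
    for value in samples:
        run = run + 1 if value > limit else 0
        if run >= points:
            return True
    return False


def regression_alerts(cpu_5min, mem_5min, p95_ms, p95_baseline_ms,
                      err_rate, err_baseline):
    """Apply the four alarm rules above; returns (channel, signal) pairs."""
    alerts = []
    if sustained(cpu_5min):
        alerts.append(("page", "cpu"))
    if sustained(mem_5min):
        alerts.append(("page", "memory"))
    if p95_ms > 1.30 * p95_baseline_ms:   # +30% vs 30-day baseline
        alerts.append(("slack", "latency"))
    if err_rate > 1.50 * err_baseline:    # +50% vs 7-day baseline
        alerts.append(("slack", "errors"))
    return alerts


print(wave_sizes(600))
```

In practice the same thresholds would live in CloudWatch or Datadog alarm definitions; the point of the sketch is only to pin down what "sustained" and "+30%" mean before the first wave goes out.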


Real numbers from one audit (composite)

Customer: B2B SaaS, ~200 EKS worker nodes (m5.4xlarge), ~80 standalone app servers, ~120 batch workers.

| Stage | Annualized cost | Savings |
| --- | --- | --- |
| Baseline | $720K | |
| After Steps 1-3 (right-sizing) | $480K | $240K (-33%) |
| After Step 4 (m5→m7g/m7i) | $401K | $79K more (-16%) |
| After a 1y Compute Savings Plan on what's left | $321K | $80K more (-20%) |
| Total annual | $321K | $399K saved (55%) |

Fleet utilization went from 14% → 41% average CPU. Zero regressions. CFO bonus: 1.5x annual.


The 5-minute self-audit anyone can do today

Don't have 4 hours? Try this in 5 minutes:

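-- Illustrative SQL only: Cost Explorer itself has no SQL interface.
-- Assumes a table ec2_running_instances (hypothetical) that you build yourself,
-- e.g. Cost and Usage Report rows joined with 30-day CloudWatch CPU averages.
-- (Or: bash + aws cli.)
SELECT
  instance_type,
  COUNT(*) AS instance_count,
  ROUND(SUM(unblended_cost), 0) AS monthly_cost,
  ROUND(AVG(cpu_avg_30d), 1) AS avg_cpu_pct
FROM ec2_running_instances
WHERE month = '2026-04'  -- last full month; example value
GROUP BY instance_type
-- Aggregates are repeated in HAVING because many engines reject SELECT aliases there
HAVING AVG(cpu_avg_30d) < 20 AND SUM(unblended_cost) > 1000
ORDER BY monthly_cost DESC
LIMIT 20;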

If any row shows avg_cpu_pct < 15% and monthly_cost > $5K, you've found at least $40K/year of waste. Drop a size. Watch a week. Bank the savings.


Common objections (and the answers)

"We can't right-size databases — they need headroom for spikes." True for the primary, false for read replicas. Read replicas can almost always drop a size.

"Compute Optimizer says we're already right-sized." Pull the raw metrics. CO looks back 14 days by default; you need 30. We've found 30%+ waste on fleets CO calls "optimized".

"What about JVM heap pressure?" Memory-bound JVM workloads should move to r-class: double the RAM of the same-size m-class at the same vCPU count, and cheaper than going up an m-class size to get that RAM.

"Generation upgrades broke something for us before." That was probably an old AMI. m7i / m7g need recent Linux kernels (5.10+) for full performance. Update the AMI first.


How CARTIE AI helps

CARTIE AI's EC2 right-sizer connects read-only to your AWS account, pulls 30 days of CloudWatch metrics across every region, and generates per-instance recommendations grouped by the 4 buckets above. Typical first-scan: $8K–$45K/month projected savings.

We also flag generation-upgrade candidates (m5 → m7i / m7g), idle instances (Bucket A), and Compute Optimizer disagreements (where our 30-day analysis catches what CO's 14-day window missed).

Even without a tool, the 5-minute SQL audit above will find $1K–$5K/month of waste in any fleet over $20K/month of EC2 spend.

Now go pull your CPU metrics. 🚀

