Best Practices
May 1, 2026 8 min read

Cloud Cost Anomaly Detection: Why MTTD Matters More Than Total Spend

A $48K AWS overspend caught in 23 minutes vs 8 days is the difference between a non-event and a board-level conversation. Here's how to build anomaly detection that actually catches things in time.


Lakshmi Kiranmai Guduru

Founder, CARTIEAI

The most expensive metric in FinOps is the one nobody tracks.

It's not "% of spend optimized". It's not "cost per customer". It's MTTD — Mean Time To Detection.

Two FinOps teams. Same $1.2M monthly AWS bill. Same suite of tools. One catches a misconfigured Lambda runaway in 23 minutes. The other catches it in 8.5 days.

Math: at $0.20/sec runaway burn, that's the difference between $280 and $146,880.
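The arithmetic is worth checking by hand. Here is a minimal sketch, assuming a constant burn rate for the whole detection window (real runaways usually ramp up or down); the function name `runaway_cost` is just illustrative.

```python
def runaway_cost(burn_per_sec: float, seconds_to_detect: float) -> float:
    """Cost accrued by a runaway resource before anyone detects it."""
    return burn_per_sec * seconds_to_detect


BURN_PER_SEC = 0.20  # dollars per second, the figure used in this post

fast = runaway_cost(BURN_PER_SEC, 23 * 60)             # detected in 23 minutes
slow = runaway_cost(BURN_PER_SEC, 8.5 * 24 * 60 * 60)  # detected in 8.5 days

print(f"23 min:   ${fast:,.0f}")
print(f"8.5 days: ${slow:,.0f}")
```

The 23-minute case comes out at $276, which the post rounds to $280.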

The tool stack is identical. The difference is the anomaly detection.

Here's how to build anomaly detection that actually catches the mistakes — fast.


Why default tools miss things

AWS Cost Anomaly Detection. Datadog Watchdog. GCP Recommender. They all share the same architectural flaw:

They run once per day.

The cheapest way to detect anomalies is to compare today's daily total to a baseline. So that's what every default tool does. But the median runaway resource fires for <6 hours before someone notices the bill — meaning your daily-aggregated detector is structurally incapable of catching it inside the same billing day.

By the time the daily roll-up runs, the meter may already have been ticking for 24 hours or more.


The four detection methods, ranked by speed

1. Hourly statistical baseline (the workhorse)

For every resource × every service × every hour, compute a 14-day median and median absolute deviation (MAD). If the current hour exceeds median + 4 × 1.4826 × MAD (four scaled MADs, roughly four standard deviations for normally distributed data), fire.

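import numpy as np

def detect_hourly_anomaly(history_hours, current_hour_cost):
    """history_hours = costs for the same hour-of-day over the last 14 days."""
    history = np.asarray(history_hours, dtype=float)  # accept plain lists too
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    # 1.4826 scales MAD to a stddev-equivalent for normally distributed data.
    # Note: with a perfectly flat history (mad == 0), any increase fires.
    threshold = median + 4 * 1.4826 * mad
    return current_hour_cost > threshold, threshold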

Why MAD instead of stddev: MAD is robust to outliers. If yesterday had a spike, mean+stddev gets dragged up; MAD doesn't. Robust statistics matter when your training window is dirty.
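To make the robustness point concrete, here is a small comparison sketch. The cost values are made up for illustration, and the two helper functions are hypothetical names, not part of any library. One spike in the 14-day window barely moves the MAD threshold but inflates the mean+stddev threshold enough to hide a later anomaly.

```python
import numpy as np


def mean_std_threshold(history, k=4.0):
    """Non-robust threshold: mean + k standard deviations."""
    h = np.asarray(history, dtype=float)
    return h.mean() + k * h.std()


def median_mad_threshold(history, k=4.0):
    """Robust threshold: median + k scaled MADs."""
    h = np.asarray(history, dtype=float)
    med = np.median(h)
    mad = np.median(np.abs(h - med))
    return med + k * 1.4826 * mad


# 14 days of hourly cost for one hour-of-day (illustrative numbers)
clean = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0,
         9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0]
dirty = clean[:-1] + [200.0]  # yesterday had a spike

for name, window in (("clean", clean), ("dirty", dirty)):
    print(name,
          "mean+std:", round(mean_std_threshold(window), 2),
          "median+MAD:", round(median_mad_threshold(window), 2))
```

With the dirty window, a $50 hour sits below the mean+stddev threshold and goes unnoticed, while the MAD threshold is unchanged and still fires.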

Latency: 1 hour from cost incurred. Better but still slow.

2. CloudWatch metric-based detection (fast)

Skip the bill. Detect from the cause: CPU/memory/network. If Lambda invocations jump 100x, you don't need to wait for the bill — you can alert on the invocation count itself.

alarm:
  metric: AWS/Lambda Invocations (sum, 5min)
  threshold: 5x rolling 1-hour median
  trigger_after: 2 consecutive periods

Latency: 5 minutes. This is what catches infinite loops.
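The alarm rule above is pseudo-config, so here is one way its logic could be sketched in plain Python. It assumes you already have 5-minute invocation sums in a list (oldest first); the function name and parameters are hypothetical, and the baseline deliberately excludes the periods being tested so a spike cannot inflate its own threshold.

```python
from statistics import median


def invocation_alarm(counts, multiplier=5.0, baseline_periods=12, trigger_after=2):
    """Return True when the last `trigger_after` periods all exceed
    `multiplier` x the median of the `baseline_periods` before them.

    counts: 5-minute invocation sums, oldest first.
    baseline_periods=12 corresponds to a rolling 1-hour window.
    """
    if len(counts) < baseline_periods + trigger_after:
        return False  # not enough data to judge
    baseline = counts[-(baseline_periods + trigger_after):-trigger_after]
    threshold = multiplier * median(baseline)
    return all(c > threshold for c in counts[-trigger_after:])


normal_hour = [100] * 12
print(invocation_alarm(normal_hour + [10_000, 12_000]))  # two breaches in a row
print(invocation_alarm(normal_hour + [100, 12_000]))     # only one breach so far
```

Requiring two consecutive breaches trades about five minutes of latency for fewer false alarms from a single noisy datapoint.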

3. Tag-coverage drift (catches the slow drain)

When a new resource appears with no owner tag, that's a leading indicator. Untagged resources are 7x more likely to become orphans.

Daily query: "Resources created in last 24h with empty/missing required tags". One Slack post per morning, 3-line summary. Cheap to implement, high signal.

Latency: 1 day. Used to catch future bleed, not active fires.
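The daily query can be sketched as a filter over a resource inventory. This is a minimal example under stated assumptions: the inventory is a list of dicts with `id`, `created_at`, and `tags` keys, and the required tag names (`owner`, `team`, `env`) are hypothetical. Substitute whatever your inventory export and tagging policy actually use.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("owner", "team", "env")  # hypothetical tagging policy


def untagged_new_resources(resources, now, required=REQUIRED_TAGS,
                           window=timedelta(hours=24)):
    """Return (resource_id, missing_tags) for resources created inside
    the window that have empty or missing required tags."""
    flagged = []
    for r in resources:
        if now - r["created_at"] > window:
            continue  # older than the window, not part of today's digest
        tags = r.get("tags") or {}
        missing = [t for t in required if not (tags.get(t) or "").strip()]
        if missing:
            flagged.append((r["id"], missing))
    return flagged


now = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
inventory = [
    {"id": "new-untagged", "created_at": now - timedelta(hours=3), "tags": {}},
    {"id": "new-tagged", "created_at": now - timedelta(hours=5),
     "tags": {"owner": "ana", "team": "payments", "env": "prod"}},
    {"id": "new-partial", "created_at": now - timedelta(hours=1),
     "tags": {"owner": "", "team": "payments", "env": "prod"}},
    {"id": "old-untagged", "created_at": now - timedelta(days=3), "tags": {}},
]

for resource_id, missing in untagged_new_resources(inventory, now):
    print(f"{resource_id}: missing {', '.join(missing)}")
```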

4. Forecast-divergence (the canary)

Maintain a 30-day forecast. Compare actual to forecast every 4 hours. Alert when actual exceeds the upper 80% confidence band.

Forecast: $1,150 by hour 17:00
Upper P80: $1,210
Actual: $1,290 ← FIRE

Latency: 4-8 hours. Used as a backstop for #1 (catches drift the hourly detector misses).
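How the upper band gets computed depends on your forecasting model. As one possible sketch, assume "Upper P80" means the forecast plus the 80th percentile of past forecast errors (actual minus forecast). That reading is an assumption, and the error values below are made up so the band matches the example above.

```python
import numpy as np


def upper_band(forecast, past_errors, percentile=80):
    """Forecast plus the given percentile of historical (actual - forecast) errors."""
    return forecast + float(np.percentile(past_errors, percentile))


def forecast_diverged(actual, forecast, past_errors, percentile=80):
    """Return (fired, band): fire when actual spend is above the upper band."""
    band = upper_band(forecast, past_errors, percentile)
    return actual > band, band


# Illustrative forecast errors in dollars from earlier checkpoints
past_errors = [-30.0, -15.0, 0.0, 30.0, 60.0, 75.0]

fired, band = forecast_diverged(actual=1290.0, forecast=1150.0,
                                past_errors=past_errors)
print(f"Upper P80: ${band:,.0f}  fired: {fired}")
```

Running this check every 4 hours against cumulative daily spend reproduces the forecast, band, and actual figures in the example.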


Alert design: the 3-second rule

A cost anomaly alert that takes >3 seconds to comprehend is a useless alert. The on-call will dismiss it.

Bad:

🚨 Cost anomaly detected on AWS account 123456789012, service AWSLambda, resource arn:aws:lambda:us-east-1:123456789012:function:payments-prod, with current period cost of $4823.10 vs. baseline of $124.30, deviation of 38.8x median.

Good:

🚨 payments-prod Lambda runaway $4.8K / 1h (38x normal), burning $80/min
Owner: payments-team · Stop function · View logs

Three things in the title (resource, severity, money). Burn rate, not cumulative. Owner. Action button. Done.
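A formatter for this kind of alert is short. The sketch below is illustrative only: the function name, parameters, and the plain-text action labels are assumptions, and wiring the actions to real buttons depends on your chat or paging tool.

```python
def format_cost_alert(resource, service, cost, baseline, window_hours,
                      owner, actions):
    """Build a two-line alert: title with resource, severity, and money,
    then owner and available actions."""
    burn_per_min = cost / (window_hours * 60)
    multiple = int(cost / baseline)  # truncated, so 38.8 shows as 38
    title = (
        f"🚨 {resource} {service} runaway ${cost / 1000:.1f}K / {window_hours:g}h "
        f"({multiple}x normal), burning ${burn_per_min:.0f}/min"
    )
    footer = " · ".join([f"Owner: {owner}", *actions])
    return f"{title}\n{footer}"


print(format_cost_alert(
    resource="payments-prod",
    service="Lambda",
    cost=4823.10,
    baseline=124.30,
    window_hours=1,
    owner="payments-team",
    actions=["Stop function", "View logs"],
))
```

The account ID and full ARN are left out of the message on purpose; they belong behind the "View logs" link, not in the three seconds the on-call spends reading the title.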


The MTTD trend

Track MTTD as a KPI. Plot it on the dashboard the leadership team looks at.

Most teams' MTTD starts at 5–10 days (basically: "we noticed when the bill arrived"). Walking backward through the 4 detection methods above, here's what each phase moves it to:

| Phase | MTTD | Tools needed |
|---|---|---|
| Baseline | 5–10 days | None (bill arrives) |
| + Daily aggregate detection | 1.5 days | AWS Cost Anomaly Detection (free) |
| + Hourly statistical baseline | 1–4 hours | Custom (or CARTIE AI) |
| + CloudWatch metric alarms | 5–15 min | CloudWatch + ops review |
| + Forecast-divergence | sub-hour drift | Custom (or CARTIE AI) |

Driving MTTD from 8.5 days → 23 minutes is a roughly 530x improvement (12,240 minutes vs 23) on the metric that actually saves money.
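Tracking MTTD only needs two timestamps per incident: when the anomaly started and when someone (or something) detected it. Here is a minimal sketch; the incident records are made up, and pinning down "started" from billing or metric data is the hard part this example skips.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttd_minutes(incidents):
    """Mean time to detection in minutes.

    incidents: iterable of (started_at, detected_at) datetime pairs.
    """
    delays = [(detected - started).total_seconds() / 60
              for started, detected in incidents]
    return mean(delays)


t0 = datetime(2026, 4, 1, 12, 0)
incidents = [
    (t0, t0 + timedelta(minutes=23)),
    (t0 + timedelta(days=7), t0 + timedelta(days=7, minutes=47)),
]

print(f"MTTD: {mttd_minutes(incidents):.0f} min")

# Improvement ratio for the 8.5 days -> 23 minutes example
improvement = (8.5 * 24 * 60) / 23
print(f"Improvement: {improvement:.0f}x")
```

A mean is sensitive to one very slow detection, so it can be worth plotting the median next to it.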


What CARTIE AI does

We run all four detection methods in parallel:

  1. Hourly statistical baseline with MAD-based thresholds.
  2. CloudWatch / Stackdriver metric alarms wired into our alerting.
  3. Tag-coverage drift as a daily 9 AM Slack digest.
  4. Forecast-divergence at the cost-line level.

Plus the alerts are written for the 3-second rule — title, burn rate, owner, action button.

Internal MTTD: 23 minutes. Customer median: 47 minutes (we lose some time because customers' tagging hygiene varies).


How to start tomorrow

If you have nothing in place, here's the 1-day plan:

  1. Morning — turn on AWS Cost Anomaly Detection (it's free). MTTD ≈ 1.5 days. Acceptable starting point.
  2. Afternoon — set up 5 CloudWatch alarms on your top 5 services (Lambda invocations, EC2 NetworkOut, RDS connections, DynamoDB capacity, S3 bytes-out). MTTD on the catastrophic stuff drops to <15 min.
  3. Tomorrow — start tracking MTTD as a metric.

That's it for week 1. The hourly statistical baseline (#1) and forecast divergence (#4) are week 2 onward — but the CloudWatch alarms alone will catch 80% of the catastrophic anomalies.


TL;DR

  • The metric that matters is MTTD, not "% optimized".
  • Default tools (daily aggregates) can't catch fires inside the same billing day.
  • Run 4 detection methods in parallel: hourly stats, CloudWatch, tag drift, forecast divergence.
  • Write alerts for the 3-second rule: resource, burn rate, owner, action button.
  • Drive MTTD from 8.5 days → <1 hour. That single change is worth 6-figure savings the first time it catches a runaway.

CARTIE AI's MTTD page shows your team's MTTD trend out of the box. Connect AWS, run for 30 days, and watch the curve drop.
