Cloud Cost Anomaly Detection: Why MTTD Matters More Than Total Spend

The most expensive metric in FinOps is the one nobody tracks.

It's not "% of spend optimized". It's not "cost per customer". It's MTTD — Mean Time To Detection.

Two FinOps teams. Same $1.2M monthly AWS bill. Same suite of tools. One catches a misconfigured Lambda runaway in 23 minutes. The other catches it in 8.5 days.

Math: at $0.20/sec runaway burn, that's the difference between $280 and $146,880.
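The arithmetic is easy to check. The snippet below is a minimal sketch: `runaway_cost` is a hypothetical helper name, and it assumes the burn rate stays constant at the illustrative $0.20/sec for the whole undetected period.

```python
def runaway_cost(burn_per_sec: float, seconds_to_detect: float) -> float:
    """Cost accrued by a constant-rate runaway before anyone detects it."""
    return burn_per_sec * seconds_to_detect

BURN_PER_SEC = 0.20  # illustrative burn rate from the example above

fast = runaway_cost(BURN_PER_SEC, 23 * 60)             # detected after 23 minutes
slow = runaway_cost(BURN_PER_SEC, 8.5 * 24 * 60 * 60)  # detected after 8.5 days

print(f"23 minutes: ${fast:,.0f}")
print(f"8.5 days:   ${slow:,.0f}")
```

The first figure works out to $276, which rounds to the roughly $280 quoted above. The second is $146,880.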

The tool stack is identical. The difference is the anomaly detection.

Here's how to build anomaly detection that actually catches the mistakes — fast.

Why default tools miss things

AWS Cost Anomaly Detection. Datadog Watchdog. GCP Recommender. They all share the same architectural flaw:

They work from daily cost aggregates.

The cheapest way to detect anomalies is to compare today's daily total to a baseline. So that's what every default tool does. But the median runaway resource fires for <6 hours before someone notices the bill — meaning your daily-aggregated detector is structurally incapable of catching it inside the same billing day.

By the time the daily roll-up runs, the meter has already been ticking for 24+ hours.

The four detection methods, with the latency of each

1. Hourly statistical baseline (the workhorse)

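For each resource and service, take the last 14 days of cost at the same hour of day and compute the median and the median absolute deviation (MAD). If the current hour exceeds median + 4 × 1.4826 × MAD (the scaled MAD, roughly equivalent to 4 standard deviations on clean data), fire.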

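import numpy as np

def detect_hourly_anomaly(history_hours, current_hour_cost):
    """Flag the current hour if it exceeds median + 4 scaled MADs.

    history_hours: costs for the last 14 days at the same hour-of-day.
    Returns (is_anomaly, threshold).
    """
    history = np.asarray(history_hours, dtype=float)  # accept plain lists too
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    # 1.4826 scales MAD to match the stddev of normally distributed data
    threshold = median + 4 * 1.4826 * mad
    return bool(current_hour_cost > threshold), float(threshold)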

Why MAD instead of stddev: MAD is robust to outliers. If yesterday had a spike, mean+stddev gets dragged up; MAD doesn't. Robust statistics matter when your training window is dirty.
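To see the robustness claim in numbers, here is a small sketch with a made-up 14-day history that contains one old spike. The cost values and the $15 test hour are invented for illustration only.

```python
import numpy as np

# Hypothetical 14-day history for one hour-of-day, with one past spike (40.0).
history = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 40.0,
                    10.0, 10.4, 9.7, 10.2, 10.1, 9.9, 10.0])

# Mean + stddev baseline: the single spike inflates both terms.
mean_threshold = history.mean() + 4 * history.std()

# Median + scaled MAD baseline: the spike barely moves either term.
median = np.median(history)
mad = np.median(np.abs(history - median))
mad_threshold = median + 4 * 1.4826 * mad

print(f"mean + 4 * std:          {mean_threshold:.2f}")
print(f"median + 4 * scaled MAD: {mad_threshold:.2f}")

# A new hour costing $15 is about 50% above the usual ~$10.
new_cost = 15.0
print("flagged by mean/std:  ", new_cost > mean_threshold)
print("flagged by median/MAD:", new_cost > mad_threshold)
```

With this data the mean/stddev threshold lands far above the old spike's neighborhood and lets the $15 hour through, while the median/MAD threshold stays close to the normal level and flags it.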

Latency: 1 hour from cost incurred. Better but still slow.

2. CloudWatch metric-based detection (fast)

Skip the bill. Detect from the cause: CPU/memory/network. If Lambda invocations jump 100x, you don't need to wait for the bill — you can alert on the invocation count itself.

alarm:
  metric: AWS/Lambda Invocations (sum, 5min)
  threshold: 5x rolling 1-hour median
  trigger_after: 2 consecutive periods

Latency: 5 minutes. This is what catches infinite loops.
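The alarm rule above is pseudo-config, so here is one way its logic could be sketched in plain Python. Everything here is an assumption for illustration: `should_alarm` is a hypothetical function, the counts are expected as 5-minute sums you have already fetched, and the `min_baseline` floor is an addition to keep an idle function from alarming on its first few invocations.

```python
from statistics import median

def should_alarm(counts, window=12, multiplier=5.0, trigger_after=2, min_baseline=1.0):
    """Return True if each of the last `trigger_after` periods exceeds
    `multiplier` x the median of the `window` periods before them.

    counts: invocation sums per 5-minute period, oldest first.
    window=12 five-minute periods approximates a rolling 1-hour median.
    """
    if len(counts) < window + trigger_after:
        return False  # not enough history to form a baseline
    # Exclude the periods under test so a spike cannot inflate its own baseline.
    baseline_periods = counts[-(window + trigger_after):-trigger_after]
    baseline = max(median(baseline_periods), min_baseline)
    threshold = multiplier * baseline
    return all(c > threshold for c in counts[-trigger_after:])

normal_hour = [100, 110, 95, 105, 98, 102, 101, 99, 104, 97, 103, 100]
print(should_alarm(normal_hour + [9000, 12000]))  # two breaching periods in a row
print(should_alarm(normal_hour + [9000, 120]))    # one-off blip, then back to normal
```

Requiring two consecutive breaching periods trades about 5 extra minutes of latency for fewer pages on one-off blips.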

3. Tag-coverage drift (catches the slow drain)

When a new resource appears with no owner tag, that's a leading indicator. Untagged resources are 7x more likely to become orphans.

Daily query: "Resources created in last 24h with empty/missing required tags". One Slack post per morning, 3-line summary. Cheap to implement, high signal.
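A rough sketch of that daily check, assuming you already have an inventory export as a list of dicts. The record shape (`id`, `created_at`, `tags`), the `REQUIRED_TAGS` policy, and the resource ids are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("owner", "team", "cost-center")  # assumed tag policy

def untagged_new_resources(resources, now, required=REQUIRED_TAGS, max_age_hours=24):
    """Return (resource_id, missing_tags) for resources created in the last
    `max_age_hours` whose required tags are absent or blank."""
    cutoff = now - timedelta(hours=max_age_hours)
    findings = []
    for res in resources:
        if res["created_at"] < cutoff:
            continue  # older than the window; out of scope for this digest
        tags = res.get("tags") or {}
        missing = [t for t in required if not (tags.get(t) or "").strip()]
        if missing:
            findings.append((res["id"], missing))
    return findings

now = datetime(2026, 1, 15, 9, 0, tzinfo=timezone.utc)
inventory = [
    {"id": "res-a", "created_at": now - timedelta(hours=3),
     "tags": {"owner": "payments-team", "team": "payments", "cost-center": "cc-101"}},
    {"id": "res-b", "created_at": now - timedelta(hours=5),
     "tags": {"owner": " ", "team": "data"}},
    {"id": "res-c", "created_at": now - timedelta(days=3), "tags": {}},
]
for resource_id, missing in untagged_new_resources(inventory, now):
    print(f"{resource_id}: missing {', '.join(missing)}")
```

Blank values count as missing here, since an `owner` tag set to whitespace is no more useful than no tag at all.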

Latency: 1 day. Used to catch future bleed, not active fires.

4. Forecast-divergence (the canary)

Maintain a 30-day forecast. Compare actual to forecast every 4 hours. Alert when actual exceeds the upper 80% confidence band.

Forecast: $1,150 by hour 17:00
Upper P80: $1,210
Actual: $1,290 ← FIRE

Latency: 4-8 hours. Used as a backstop for #1 (catches drift the hourly detector misses).
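One simple way to get an upper band is to add the 80th percentile of past forecast errors to the current forecast. That is an assumption about how the band is built, not the only option, and the residual values below are invented so the numbers line up with the example above.

```python
import numpy as np

def upper_band(forecast, past_residuals, quantile=0.80):
    """Forecast plus the empirical `quantile` of past (actual - forecast) errors."""
    return forecast + float(np.quantile(past_residuals, quantile))

def diverged(actual, forecast, past_residuals, quantile=0.80):
    """Return (fire, band): fire is True when actual exceeds the upper band."""
    band = upper_band(forecast, past_residuals, quantile)
    return actual > band, band

# Hypothetical history of (actual - forecast) errors at this hour, in dollars.
residuals = [-40, -25, -10, 0, 5, 15, 30, 45, 60, 75, 90]

fire, band = diverged(actual=1290, forecast=1150, past_residuals=residuals)
print(f"upper band: ${band:,.0f}, fire: {fire}")
```

Using empirical residuals avoids assuming the errors are normally distributed, at the price of needing enough history to make the percentile meaningful.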

Alert design: the 3-second rule

A cost anomaly alert that takes >3 seconds to comprehend is a useless alert. The on-call will dismiss it.

Bad:

🚨 Cost anomaly detected on AWS account 123456789012, service AWSLambda, resource arn:aws:lambda:us-east-1:123456789012:function:payments-prod, with current period cost of $4823.10 vs. baseline of $124.30, deviation of 38.8x median.

Good:

🚨 payments-prod Lambda runaway $4.8K / 1h (39x normal), burning $80/min
Owner: payments-team · Stop function · View logs

Three things in the title (resource, severity, money). Burn rate, not cumulative. Owner. Action button. Done.
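A formatter for that shape fits in a few lines. This is a sketch only: `format_cost_alert` and its parameters are hypothetical, the action labels are plain-text placeholders where a real alert would carry links or buttons, and it assumes a non-zero hourly baseline.

```python
def format_cost_alert(resource, kind, cost_last_hour, baseline_hour, owner):
    """Build a two-line alert: title with resource, severity and money,
    then owner and actions. Assumes baseline_hour > 0."""
    ratio = cost_last_hour / baseline_hour  # severity as a multiple of normal
    burn_per_min = cost_last_hour / 60      # burn rate, not cumulative spend
    title = (f"🚨 {resource} {kind} runaway ${cost_last_hour / 1000:.1f}K / 1h "
             f"({ratio:.0f}x normal), burning ${burn_per_min:.0f}/min")
    actions = f"Owner: {owner} · Stop function · View logs"
    return f"{title}\n{actions}"

print(format_cost_alert("payments-prod", "Lambda", 4823.10, 124.30, "payments-team"))
```

Rounding is deliberate: nobody on call needs cents or a decimal on the multiplier to decide whether to act.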

The MTTD trend

Track MTTD as a KPI. Plot it on the dashboard the leadership team looks at.
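Computing the KPI itself is straightforward once each incident has a start time and a detection time. A minimal sketch, assuming you log incidents as `(started_at, detected_at)` pairs; the function name and the sample incidents are hypothetical.

```python
from datetime import datetime
from statistics import mean

def mttd_minutes(incidents):
    """Mean time to detection, in minutes, over (started_at, detected_at) pairs."""
    delays = [(detected - started).total_seconds() / 60
              for started, detected in incidents]
    return mean(delays)

# Hypothetical incident log: when the anomaly began vs. when someone was alerted.
incidents = [
    (datetime(2026, 1, 5, 14, 0), datetime(2026, 1, 5, 14, 20)),
    (datetime(2026, 1, 19, 2, 10), datetime(2026, 1, 19, 2, 36)),
]
print(f"MTTD: {mttd_minutes(incidents):.0f} minutes")
```

The hard part in practice is the start time. Backfilling it from metrics after each incident is usually good enough to make the trend visible.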

Most teams' MTTD starts at 5–10 days (basically: "we noticed when the bill arrived"). Adding the detection methods above one layer at a time, here's where each phase moves it:

Phase	MTTD	Tools needed
Baseline	5–10 days	None — bill arrives
+ Daily aggregate detection	1.5 days	AWS Cost Anomaly Detection (free)
+ Hourly statistical baseline	1–4 hours	Custom (or CARTIEAI)
+ CloudWatch metric alarms	5–15 min	CloudWatch + ops review
+ Forecast-divergence	sub-hour drift	Custom (or CARTIEAI)

Driving MTTD from 8.5 days → 23 minutes is a roughly 530x improvement (12,240 minutes down to 23) on the metric that actually saves money.

What CARTIEAI does

We run all four detection methods in parallel:

Hourly statistical baseline with MAD-based thresholds.
CloudWatch / Stackdriver metric alarms wired into our alerting.
Tag-coverage drift as a daily 9 AM Slack digest.
Forecast-divergence at the cost-line level.

Plus the alerts are written for the 3-second rule — title, burn rate, owner, action button.

Internal MTTD: 23 minutes. Customer median: 47 minutes (we lose some time because customers' tagging hygiene varies).

How to start tomorrow

If you have nothing in place, here's the 1-day plan:

Morning — turn on AWS Cost Anomaly Detection (it's free). MTTD ≈ 1.5 days. Acceptable starting point.
Afternoon — set up 5 CloudWatch alarms on your top 5 services (Lambda invocations, EC2 NetworkOut, RDS connections, DynamoDB capacity, S3 bytes-out). MTTD on the catastrophic stuff drops to <15 min.
Tomorrow — start tracking MTTD as a metric.

That's it for week 1. The hourly statistical baseline (#1) and forecast divergence (#4) are week 2 onward — but the CloudWatch alarms alone will catch 80% of the catastrophic anomalies.

TL;DR

The metric that matters is MTTD, not "% of spend optimized".
Default tools (daily aggregates) can't catch fires inside the same billing day.
Run 4 detection methods in parallel: hourly stats, CloudWatch, tag drift, forecast divergence.
Write alerts for the 3-second rule: resource, burn rate, owner, action button.
Drive MTTD from 8.5 days → <1 hour. That single change is worth 6-figure savings the first time it catches a runaway.

CARTIEAI's MTTD page shows your team's MTTD trend out of the box. Connect AWS, run for 30 days, and watch the curve drop.

Lakshmi Kiranmai Guduru
