The most expensive metric in FinOps is the one nobody tracks.
It's not "% of spend optimized". It's not "cost per customer". It's MTTD — Mean Time To Detection.
Two FinOps teams. Same $1.2M monthly AWS bill. Same suite of tools. One catches a misconfigured Lambda runaway in 23 minutes. The other catches it in 8.5 days.
Math: at a $0.20/sec runaway burn, that's the difference between about $276 and $146,880.
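The arithmetic is easy to reproduce. A minimal sketch using the figures above ($0.20/sec, 23 minutes vs. 8.5 days); the function name `runaway_cost` is illustrative and assumes a constant burn rate:

```python
def runaway_cost(burn_per_sec: float, seconds_undetected: float) -> float:
    """Cost accrued by a runaway resource before anyone detects it."""
    return burn_per_sec * seconds_undetected

BURN_PER_SEC = 0.20  # dollars per second, the figure used above

fast = runaway_cost(BURN_PER_SEC, 23 * 60)             # detected in 23 minutes
slow = runaway_cost(BURN_PER_SEC, 8.5 * 24 * 60 * 60)  # detected in 8.5 days

print(f"23 minutes: ${fast:,.0f}")
print(f"8.5 days:   ${slow:,.0f}")
```

The 23-minute case comes to roughly $276, and the 8.5-day case to $146,880.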
The tool stack is identical. The difference is the anomaly detection.
Here's how to build anomaly detection that actually catches the mistakes — fast.
Why default tools miss things
AWS Cost Anomaly Detection. Datadog Watchdog. GCP Recommender. They all share the same architectural flaw:
They score daily-aggregated cost data, not a live feed.
The cheapest way to detect anomalies is to compare today's daily total to a baseline. So that's what every default tool does. But the median runaway resource fires for <6 hours before someone notices the bill — meaning your daily-aggregated detector is structurally incapable of catching it inside the same billing day.
By the time the daily roll-up runs, the meter has already been ticking for 24+ hours.
The four detection methods, ranked by speed
1. Hourly statistical baseline (the workhorse)
For every resource × every service × every hour, compute a 14-day median + median absolute deviation (MAD). If the current hour exceeds the median by more than 4 scaled MADs (MAD × 1.4826, which makes it comparable to a standard deviation), fire.
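```python
import numpy as np

def detect_hourly_anomaly(history_hours, current_hour_cost):
    """Flag the current hour if it exceeds median + 4 scaled MADs.

    history_hours: costs for the same hour-of-day over the last 14 days.
    Returns (is_anomaly, threshold).
    """
    history = np.asarray(history_hours, dtype=float)  # accept plain lists too
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    threshold = median + 4 * 1.4826 * mad  # 1.4826 = MAD-to-stddev correction
    return bool(current_hour_cost > threshold), float(threshold)
```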
Why MAD instead of stddev: MAD is robust to outliers. If yesterday had a spike, mean+stddev gets dragged up; MAD doesn't. Robust statistics matter when your training window is dirty.
Latency: 1 hour from cost incurred. Better but still slow.
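To see why the robust version matters, here is a small comparison on made-up data. It is a sketch, not a benchmark: the 14 hourly costs are hypothetical, and `mad_threshold` / `stddev_threshold` are illustrative helpers that use the same 4× multiplier described above.

```python
import numpy as np

def mad_threshold(history, k=4.0):
    """median + k * scaled MAD (1.4826 makes MAD comparable to a stddev)."""
    history = np.asarray(history, dtype=float)
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    return float(median + k * 1.4826 * mad)

def stddev_threshold(history, k=4.0):
    """mean + k * stddev, the non-robust alternative."""
    history = np.asarray(history, dtype=float)
    return float(history.mean() + k * history.std())

# Hypothetical data: 14 days of cost for the same hour-of-day, in dollars.
clean = [10.0, 11.0, 9.5, 10.5, 10.0, 12.0, 9.0,
         10.0, 11.5, 10.5, 9.5, 10.0, 11.0, 10.0]
dirty = clean[:-1] + [400.0]  # yesterday contained a spike

for name, window in [("clean", clean), ("dirty", dirty)]:
    print(name,
          "MAD threshold:", round(mad_threshold(window), 2),
          "stddev threshold:", round(stddev_threshold(window), 2))
```

On this data the MAD threshold moves by less than two dollars when the spike enters the window, while the mean + stddev threshold climbs into the hundreds of dollars, so a repeat of yesterday's runaway at a quarter of the size would pass unnoticed.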
2. CloudWatch metric-based detection (fast)
Skip the bill. Detect from the cause: CPU/memory/network. If Lambda invocations jump 100x, you don't need to wait for the bill — you can alert on the invocation count itself.
```yaml
alarm:
  metric: AWS/Lambda Invocations (sum, 5min)
  threshold: 5x rolling 1-hour median
  trigger_after: 2 consecutive periods
```
Latency: roughly 10 minutes with the settings above (two consecutive 5-minute periods). This is what catches infinite loops.
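The alarm rule above is pseudo-config, so here is one way the logic could look in plain Python. This is a sketch under assumptions: the class name `InvocationAlarm` is hypothetical, the input is a stream of 5-minute invocation sums you have already fetched, and breaching samples are deliberately kept out of the baseline (a design choice that the rule above does not specify).

```python
from collections import deque
from statistics import median

class InvocationAlarm:
    """Fires when a 5-minute invocation sum exceeds `factor` x the rolling
    median for `trigger_after` consecutive periods."""

    def __init__(self, window_periods=12, factor=5.0, trigger_after=2):
        self.history = deque(maxlen=window_periods)  # 12 x 5 min = 1 hour
        self.factor = factor
        self.trigger_after = trigger_after
        self.breaches = 0

    def observe(self, invocations: float) -> bool:
        """Feed one 5-minute sum; return True if the alarm should fire."""
        fired = False
        if len(self.history) == self.history.maxlen:
            baseline = median(self.history)
            if invocations > self.factor * baseline:
                self.breaches += 1
            else:
                self.breaches = 0
            fired = self.breaches >= self.trigger_after
        # Keep breaching samples out of the baseline so a runaway
        # cannot raise its own threshold.
        if self.breaches == 0:
            self.history.append(invocations)
        return fired

# Usage: one hour of normal traffic, one more normal period, then a runaway.
alarm = InvocationAlarm()
results = [alarm.observe(x) for x in [100] * 13 + [12_000, 15_000]]
print(results[-2:])  # first breach does not fire, second consecutive one does
```

The alarm stays silent until the one-hour window is full, so a cold start needs an hour of history before it can fire.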
3. Tag-coverage drift (catches the slow drain)
When a new resource appears with no owner tag, that's a leading indicator. Untagged resources are 7x more likely to become orphans.
Daily query: "Resources created in last 24h with empty/missing required tags". One Slack post per morning, 3-line summary. Cheap to implement, high signal.
Latency: 1 day. Used to catch future bleed, not active fires.
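The daily query can be sketched as a filter over a resource inventory. Everything specific here is an assumption for illustration: the `REQUIRED_TAGS` policy, the record shape (`id`, `created_at`, `tags`), and the helper names. In practice the inventory would come from whatever tagging or config API you already use.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("owner", "team", "env")  # hypothetical tagging policy

def untagged_new_resources(resources, now, required=REQUIRED_TAGS, window_hours=24):
    """Return (resource_id, missing_tags) for resources created inside the
    window that lack a required tag or carry it with an empty value."""
    cutoff = now - timedelta(hours=window_hours)
    findings = []
    for res in resources:
        if res["created_at"] < cutoff:
            continue  # older than the window, not part of today's digest
        tags = res.get("tags") or {}
        missing = [t for t in required if not (tags.get(t) or "").strip()]
        if missing:
            findings.append((res["id"], missing))
    return findings

def digest(findings, max_lines=3):
    """Render the short morning summary as plain text."""
    lines = [f"{len(findings)} new resource(s) missing required tags"]
    for rid, missing in findings[: max_lines - 1]:
        lines.append(f"- {rid}: missing {', '.join(missing)}")
    return "\n".join(lines)

# Hypothetical inventory records.
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
resources = [
    {"id": "i-aaa", "created_at": now - timedelta(hours=3),
     "tags": {"owner": "payments-team", "team": "payments", "env": "prod"}},
    {"id": "i-bbb", "created_at": now - timedelta(hours=5),
     "tags": {"owner": "", "env": "dev"}},
    {"id": "i-ccc", "created_at": now - timedelta(days=3), "tags": {}},
]
findings = untagged_new_resources(resources, now)
print(digest(findings))
```

Only `i-bbb` is reported: `i-aaa` is fully tagged and `i-ccc` falls outside the 24-hour window.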
4. Forecast-divergence (the canary)
Maintain a 30-day forecast. Compare actual to forecast every 4 hours. Alert when actual exceeds the upper 80% confidence band.
Forecast: $1,150 by hour 17:00
Upper P80: $1,210
Actual: $1,290 ← FIRE
Latency: 4-8 hours. Used as a backstop for #1 (catches drift the hourly detector misses).
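A minimal sketch of the divergence check, assuming the upper band is built by adding the 80th percentile of past forecast errors (actual minus forecast) to the current forecast. That construction, the function names, and the residual values are assumptions for illustration; a real forecasting library would supply its own interval.

```python
import numpy as np

def upper_band(forecast, past_residuals, quantile=0.80):
    """Forecast plus the chosen quantile of past (actual - forecast) errors."""
    return forecast + float(np.quantile(past_residuals, quantile))

def diverged(actual, forecast, past_residuals, quantile=0.80):
    """Return (fire, band): fire is True when actual exceeds the upper band."""
    band = upper_band(forecast, past_residuals, quantile)
    return actual > band, band

# Hypothetical past errors in dollars, chosen so the band matches the example.
residuals = [-50, -30, -10, 0, 10, 20, 40, 60, 60, 80]

fire, band = diverged(actual=1290, forecast=1150, past_residuals=residuals)
print(f"Upper band: ${band:,.0f}  fire={fire}")
```

With these residuals the band lands at $1,210, so the $1,290 actual from the example fires while an actual of $1,200 would not.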
Alert design: the 3-second rule
A cost anomaly alert that takes >3 seconds to comprehend is a useless alert. The on-call will dismiss it.
Bad:
🚨 Cost anomaly detected on AWS account 123456789012, service AWSLambda, resource arn:aws:lambda:us-east-1:123456789012:function:payments-prod, with current period cost of $4823.10 vs. baseline of $124.30, deviation of 38.8x median.
Good:
🚨 payments-prod Lambda runaway
$4.8K / 1h (38x normal) — burning $80/min
Owner: payments-team · Stop function · View logs
Resource and failure mode in the title. Money on the next line, as a burn rate and not just a cumulative total. Owner. Action button. Done.
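One way to keep alerts in this shape is to generate them from a single template. The sketch below is illustrative: `format_alert` and its parameters are hypothetical names, the burn rate is simply the last hour's cost divided by 60, and the action labels are plain text standing in for real buttons.

```python
def format_alert(resource, kind, cost_last_hour, baseline_hour, owner):
    """Render a three-line alert: title, money, owner plus actions."""
    ratio = cost_last_hour / baseline_hour  # how far above normal
    burn_per_min = cost_last_hour / 60      # dollars per minute right now
    return "\n".join([
        f"🚨 {resource} {kind} runaway",
        f"${cost_last_hour / 1000:.1f}K / 1h ({ratio:.0f}x normal), "
        f"burning ${burn_per_min:.0f}/min",
        f"Owner: {owner} · Stop function · View logs",
    ])

alert = format_alert(
    resource="payments-prod",
    kind="Lambda",
    cost_last_hour=4823.10,
    baseline_hour=124.30,
    owner="payments-team",
)
print(alert)
```

Fed the numbers from the bad example, the template produces the same three lines as the good one, with the ratio rounded to the nearest whole multiple.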
The MTTD trend
Track MTTD as a KPI. Plot it on the dashboard the leadership team looks at.
Most teams' MTTD starts at 5–10 days (basically: "we noticed when the bill arrived"). Walking backward through the 4 detection methods above, here's what each phase moves it to:
| Phase | MTTD | Tools needed |
|---|---|---|
| Baseline | 5–10 days | None — bill arrives |
| + Daily aggregate detection | 1.5 days | AWS Cost Anomaly Detection (free) |
| + Hourly statistical baseline | 1–4 hours | Custom (or CARTIE AI) |
| + CloudWatch metric alarms | 5–15 min | CloudWatch + ops review |
| + Forecast-divergence | 4–8 hours for slow drift | Custom (or CARTIE AI) |
Driving MTTD from 8.5 days → 23 minutes is a more than 500x improvement (about 530x) on the metric that actually saves money.
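Tracking MTTD needs nothing more than two timestamps per incident. A minimal sketch, assuming you log when each anomaly started and when it was detected; the `mttd` helper and the sample incidents are hypothetical:

```python
from datetime import datetime, timedelta
from statistics import mean

def mttd(incidents):
    """Mean time to detection across (started_at, detected_at) pairs."""
    delays = [(detected - started).total_seconds() for started, detected in incidents]
    return timedelta(seconds=mean(delays))

# Hypothetical incident log.
t0 = datetime(2024, 3, 1, 10, 0)
t1 = datetime(2024, 3, 9, 2, 30)
incidents = [
    (t0, t0 + timedelta(minutes=23)),
    (t1, t1 + timedelta(minutes=47)),
]
print("MTTD:", mttd(incidents))

# The improvement factor quoted above, from the two headline figures.
improvement = timedelta(days=8.5) / timedelta(minutes=23)
print(f"8.5 days vs 23 minutes: {improvement:.0f}x")
```

Plotting this value per week or per month gives the trend line for the dashboard.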
What CARTIE AI does
We run all four detection methods in parallel:
- Hourly statistical baseline with MAD-based thresholds.
- CloudWatch / Google Cloud Monitoring (formerly Stackdriver) metric alarms wired into our alerting.
- Tag-coverage drift as a daily 9 AM Slack digest.
- Forecast-divergence at the cost-line level.
Plus the alerts are written for the 3-second rule — title, burn rate, owner, action button.
Internal MTTD: 23 minutes. Customer median: 47 minutes (we lose some time because customers' tagging hygiene varies).
How to start tomorrow
If you have nothing in place, here's the 1-day plan:
- Morning — turn on AWS Cost Anomaly Detection (it's free). MTTD ≈ 1.5 days. Acceptable starting point.
- Afternoon — set up 5 CloudWatch alarms on your top 5 services (Lambda invocations, EC2 NetworkOut, RDS connections, DynamoDB capacity, S3 bytes-out). MTTD on the catastrophic stuff drops to <15 min.
- Tomorrow — start tracking MTTD as a metric.
That's it for week 1. The hourly statistical baseline (#1) and forecast divergence (#4) are week 2 onward — but the CloudWatch alarms alone will catch 80% of the catastrophic anomalies.
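For the afternoon step, the five alarms can be described as data and turned into parameters for CloudWatch's PutMetricAlarm API. This is a sketch under assumptions: the thresholds are placeholders you would tune to your own traffic, the alarm names and SNS topic ARN are made up, and the per-resource `Dimensions` (function name, table name, bucket and so on) are left out and would need to be added for most of these metrics. The S3 metric also requires request metrics to be enabled on the bucket.

```python
# (namespace, metric, statistic, placeholder threshold per 5-minute period)
STARTER_ALARMS = [
    ("AWS/Lambda", "Invocations", "Sum", 50_000),
    ("AWS/EC2", "NetworkOut", "Sum", 5e9),                  # bytes
    ("AWS/RDS", "DatabaseConnections", "Maximum", 500),
    ("AWS/DynamoDB", "ConsumedWriteCapacityUnits", "Sum", 100_000),
    ("AWS/S3", "BytesDownloaded", "Sum", 5e9),              # bytes
]

def alarm_params(namespace, metric, statistic, threshold, sns_topic_arn):
    """Keyword arguments shaped for CloudWatch's PutMetricAlarm API."""
    service = namespace.split("/")[-1].lower()
    return {
        "AlarmName": f"cost-guard-{service}-{metric}",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": statistic,
        "Period": 300,                # 5-minute periods
        "EvaluationPeriods": 2,       # two consecutive breaches before firing
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }

TOPIC = "arn:aws:sns:us-east-1:123456789012:cost-alerts"  # placeholder ARN
params = [alarm_params(*row, sns_topic_arn=TOPIC) for row in STARTER_ALARMS]
for p in params:
    print(p["AlarmName"], p["Threshold"])
```

Each dictionary is intended to be passed to boto3 as `boto3.client("cloudwatch").put_metric_alarm(**p)` once dimensions and real thresholds are filled in. Static thresholds are the simplest starting point; they do not implement the rolling-median rule from method #2.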
TL;DR
- The metric that matters is MTTD, not "% optimized".
- Default tools (daily aggregates) can't catch fires inside the same billing day.
- Run 4 detection methods in parallel: hourly stats, CloudWatch, tag drift, forecast divergence.
- Write alerts for the 3-second rule: resource, burn rate, owner, action button.
- Drive MTTD from 8.5 days → <1 hour. That single change is worth 6-figure savings the first time it catches a runaway.
CARTIE AI's MTTD page shows your team's MTTD trend out of the box. Connect AWS, run for 30 days, and watch the curve drop.