A note on this story: Numbers below are aggregated from 11 production Datadog accounts we've audited (8 SaaS, 3 fintech, monthly spend $8K–$95K). Same patterns, every time.
Datadog is the BMW of observability tools. Beautiful product. Excellent UX. Insanely expensive when nobody's watching.
We've audited 11 production Datadog accounts in the last year. Median overspend: 47%.
The pattern is always the same: a team adopts Datadog because something broke and they needed observability fast. Six months later, the monthly bill has quadrupled and nobody can explain why. By month 12 it's a six-figure line item nobody wants to question.
This guide is the line-by-line breakdown. By the end you'll know:
- The 8 patterns driving 40–60% of Datadog overspend
- The 3 settings nobody touches that cost the most
- The 5-minute audit any engineer can run today
- When to switch tools (rarely) and when to optimise (almost always)
Pattern 1: Custom metrics with infinite cardinality
Datadog charges per unique metric × tag combination. The default custom metric allowance is 100 per host. Going over: $0.05/metric/month. Sounds tiny. Watch what happens with bad tagging:
from datadog import statsd

# BAD: `user_id` in the tag value creates a new metric per user
statsd.increment('api.request', tags=[f'endpoint:/users/{user_id}'])
100 endpoints × 50,000 users = 5,000,000 unique metrics. That's a $250K/month custom-metrics line item from ONE bad metric.
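The arithmetic above can be sketched in a few lines. This is a rough estimator, not Datadog's billing logic: the function name and parameters are hypothetical, the $0.05 rate and 100-per-host allowance are the figures quoted in this guide, and the price is held in integer cents to avoid float rounding.

```python
def custom_metric_overage_usd(unique_series, hosts=0, allowance_per_host=100, price_cents=5):
    """Estimate monthly custom-metric overage (illustrative only)."""
    included = hosts * allowance_per_host        # series covered by the per-host allowance
    billable = max(0, unique_series - included)  # everything above the allowance is overage
    return billable * price_cents / 100


series = 100 * 50_000  # 100 endpoints x 50,000 users
print(custom_metric_overage_usd(series))             # 250000.0, allowance ignored as in the text
print(custom_metric_overage_usd(series, hosts=200))  # 249000.0, 200 hosts barely dent the bill
```

The second call shows why the per-host allowance does not rescue a high-cardinality metric: the included series are a rounding error next to the overage.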
The fix:
- Move high-cardinality identifiers (user_id, request_id, transaction_id) to logs, not metrics
- Use @dimension annotations to mark which tags are intentional
- Set up Metrics without Limits controls per metric (Datadog UI → Metrics → Manage Tags)
Typical savings: $5K–$50K/month. Single biggest line item we find.
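One way to keep identifiers out of tags is to normalise the path before tagging. A minimal sketch, assuming numeric IDs and UUIDs are the only variable path segments; `normalize_endpoint` and `request_tags` are hypothetical helpers, not part of the Datadog client.

```python
import re

# Path segments that are purely numeric or look like UUIDs are treated as identifiers.
_UUID = re.compile(r"/[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}(?=/|$)")
_NUMERIC = re.compile(r"/\d+(?=/|$)")


def normalize_endpoint(path):
    """Replace identifier-like path segments with a fixed placeholder."""
    path = _UUID.sub("/{id}", path)
    return _NUMERIC.sub("/{id}", path)


def request_tags(path):
    """Build a bounded tag list for the api.request metric."""
    return [f"endpoint:{normalize_endpoint(path)}"]


print(request_tags("/users/12345"))  # ['endpoint:/users/{id}']
```

The tag now has one value per route instead of one per user. The raw `user_id` can still go into the log line for the same request.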
Pattern 2: Log retention default = 15 days. Most logs need 3.
Datadog logs are billed in two ways:
- Indexed logs: $1.27/million events for 15 days, fully searchable
- Archived logs: $0.10/million events, stored in your S3, slower to query
The fix:
- 15 days indexed → too long for 90% of logs. Drop to 3 days indexed + archive to S3 for 30 days
- Critical security/audit logs: 30 days indexed (compliance need)
- Application debug logs: Drop them in production altogether — they're more noise than signal
Typical savings: $1K–$8K/month.
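Dropping debug logs in production is usually easiest at the source, before the agent ever sees them. A minimal sketch using Python's standard `logging` module; the `APP_ENV` variable name and the `configure_logging` helper are assumptions for illustration.

```python
import logging
import os


def configure_logging(env=None):
    """Configure the root logger: INFO and above in production, DEBUG elsewhere."""
    env = env or os.environ.get("APP_ENV", "development")
    level = logging.INFO if env == "production" else logging.DEBUG
    # force=True replaces any handlers set up earlier, so the call is repeatable.
    logging.basicConfig(level=level, force=True)
    return level


configure_logging("production")
logging.getLogger("api").debug("dropped: below the INFO threshold")
logging.getLogger("api").info("emitted: passes the level check")
```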
Pattern 3: APM trace ingestion at 100% sampling
Datadog APM bills per ingested trace. Most teams default to 100% sampling, so every single request gets a trace. For a service doing 1 billion requests/month, that's 1 billion ingested traces/month.
The fix:
- Production traffic > 1k req/sec: sample at 10%
- Production traffic > 10k req/sec: sample at 1%
- Always-on for errors: DD_TRACE_SAMPLE_ALL_SPANS=false + error-tracing rule
# datadog.yaml — head-based + tail-based sampling combo
apm_config:
  sampling_rules:
    - service: my-service
      name: error
      sample_rate: 1.0  # always trace errors
    - service: my-service
      sample_rate: 0.1  # 10% of normal traffic
Typical savings: 30–60% of APM cost.
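The two rules amount to a simple decision: keep every error trace, and keep a fixed fraction of everything else. Below is a sketch of that logic in plain Python. It is not the Datadog tracer's implementation; `keep_trace` and the hashing constant are assumptions, chosen so the decision is deterministic per trace ID.

```python
MAX_ID = 2**64
HASH_MULTIPLIER = 1111111111111111111  # large odd constant used to spread sequential IDs


def keep_trace(trace_id, is_error, sample_rate=0.1):
    """Return True if the trace should be kept.

    Errors are always kept. Other traces are kept when a hash of the trace ID
    falls below the sample-rate threshold, so the same ID always gets the same answer.
    """
    if is_error:
        return True
    hashed = (trace_id * HASH_MULTIPLIER) % MAX_ID
    return hashed < sample_rate * MAX_ID


kept = sum(keep_trace(tid, is_error=False) for tid in range(100_000))
print(f"kept {kept} of 100000 non-error traces")
```

Hashing the ID instead of calling a random number generator means every service that sees the same trace ID reaches the same keep/drop decision, which avoids half-sampled traces.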
Pattern 4: Synthetic tests running every 60 seconds
Datadog Synthetic tests are great. They're also priced per test run, so frequency and location count both multiply the bill.
A "test the homepage from 5 locations every 1 minute" test = 5 × 60 × 24 × 30 = 216,000 runs/month per test. At ~$5 per 100K runs, that's about $10.80/test/month.
The fix:
- 1-minute frequency → only for mission-critical login + checkout
- Most pages → 5 minutes is fine
- Internal tools / staging → 30 minutes or kill them entirely
Typical savings: $200–$2K/month per team.
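To see how much the interval matters, the run-count arithmetic can be put in a small helper. The function names are hypothetical, and the ~$5 per 100K runs rate is the approximate figure quoted above, not a price pulled from your plan.

```python
def monthly_runs(locations, interval_minutes, days=30):
    """Number of synthetic test runs per month for one test."""
    runs_per_day = (24 * 60) // interval_minutes
    return locations * runs_per_day * days


def monthly_cost_usd(runs, usd_per_100k=5):
    """Cost at the approximate rate quoted in this guide (check your own contract)."""
    return runs * usd_per_100k / 100_000


for interval in (1, 5, 30):
    runs = monthly_runs(locations=5, interval_minutes=interval)
    print(f"every {interval} min: {runs} runs, ${monthly_cost_usd(runs):.2f}/month")
```

Moving one test from a 1-minute to a 5-minute interval cuts its runs from 216,000 to 43,200 per month, a fivefold reduction that repeats across every test in the account.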
Pattern 5: Hosts you forgot
Datadog charges per host. Auto-scaling clusters that scale up to 200 nodes for a 5-minute load spike → you're paying for 200 hosts that month even if 195 of them only existed for 5 minutes.
The fix:
- Switch to per-second host billing (newer Datadog plans)
- Or pin agents to specific node pools that scale conservatively
- Run the "Host Map" weekly — kill any host with 0 incoming metrics
Typical savings: $500–$5K/month.
Pattern 6: Integrations you turned on for "let me try" experiments
Datadog has 800+ integrations. They're free to enable. The metrics they feed into your billable counters are NOT free.
The Kubernetes integration alone can pump 500+ metrics per node per minute. The MongoDB integration: 200+ metrics per cluster. Most teams have 30+ integrations enabled, half of which they never look at.
The audit:
- Datadog UI → Integrations → list installed
- For each: "Do I have a dashboard, monitor, or SLO using this?" If no → uninstall
Typical savings: $1K–$3K/month.
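The audit boils down to a set difference: installed integrations minus the ones that some dashboard, monitor, or SLO actually uses. A tiny sketch, assuming both lists were collected by hand from the UI; the helper name and the sample inventory are hypothetical.

```python
def unused_integrations(installed, referenced_by_assets):
    """Return installed integrations that no dashboard, monitor, or SLO references."""
    return sorted(set(installed) - set(referenced_by_assets))


# Hypothetical inventory, e.g. copied from the Integrations page.
installed = ["kubernetes", "mongodb", "redis", "nginx", "kafka"]
referenced = ["kubernetes", "redis"]  # used by at least one dashboard, monitor, or SLO
print(unused_integrations(installed, referenced))  # ['kafka', 'mongodb', 'nginx']
```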
Pattern 7: Watchdog Insights / RUM features turned on org-wide
Watchdog (Datadog's anomaly detection) and RUM (Real User Monitoring) are per-feature charges layered on top of base APM/Logs. Most teams turn them on for the demo and forget.
- Watchdog: $0.30/host/month
- RUM: $1.50 per 10K sessions
- CI Visibility: $0.50/test-execution
The fix: if the team isn't actively reviewing the data weekly, disable the feature. You can turn it back on when you actually need it.
Typical savings: $1K–$4K/month.
Pattern 8: Dev/staging running the production agent config
Most teams deploy the Datadog agent with the same config across all environments. So your dev cluster ships every metric, every log, every trace to Datadog — and gets billed for it.
The fix:
# datadog-dev.yaml — heavy reduction in dev
logs_enabled: false
apm_config:
  enabled: false
process_config:
  enabled: false
# Just keep host-level metrics
Typical savings: 30–50% of total Datadog spend (dev+staging combined).
The 5-minute audit any engineer can run
- Custom metrics overage: Datadog UI → Plan & Usage → Custom Metrics. If >100 per host, you're paying overage. Find the high-cardinality metrics (Manage Tags page).
- Log retention check: Logs → Configuration. If retention >7 days for non-security logs, drop it.
- APM sample rate: Service Map → click any high-traffic service → check sample rate. >50% on a >1k req/sec service = paying 5–10x what you need.
- Unused integrations: Integrations → installed list. Kill anything without an active dashboard or monitor.
- Synthetic frequency: Synthetics → list tests by frequency. Anything running more often than every 5 minutes that isn't mission-critical → lengthen the interval.
The first three checks alone usually cut a Datadog bill by 30%.
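The checklist thresholds can also be written down as code, which makes the audit repeatable month to month. This is a sketch under stated assumptions: the numbers are read by hand from the Datadog UI, and `UsageSnapshot`, its field names, and `audit` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UsageSnapshot:
    """Numbers read by hand from the Datadog UI (all field names are hypothetical)."""
    custom_metrics_per_host: float
    non_security_log_retention_days: int
    apm_sample_rate: float            # 0.0 to 1.0
    requests_per_second: float
    unused_integrations: int
    fast_noncritical_synthetics: int  # non-critical tests running more often than every 5 min


def audit(s):
    """Apply the thresholds from the checklist and return a list of findings."""
    findings = []
    if s.custom_metrics_per_host > 100:
        findings.append("custom metrics overage")
    if s.non_security_log_retention_days > 7:
        findings.append("log retention too long")
    if s.apm_sample_rate > 0.5 and s.requests_per_second > 1000:
        findings.append("APM oversampling")
    if s.unused_integrations > 0:
        findings.append("unused integrations")
    if s.fast_noncritical_synthetics > 0:
        findings.append("synthetics too frequent")
    return findings


snapshot = UsageSnapshot(
    custom_metrics_per_host=340,
    non_security_log_retention_days=15,
    apm_sample_rate=1.0,
    requests_per_second=2500,
    unused_integrations=0,
    fast_noncritical_synthetics=3,
)
for finding in audit(snapshot):
    print("-", finding)
```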
When to switch off Datadog (rarely the right answer)
The usual alternatives that come up in the "Datadog is too expensive, let's go open-source" conversation are:
- Grafana Cloud: ~50% cheaper but UX is rougher
- Self-hosted Prometheus + Loki + Grafana: essentially free for tooling, massive SRE overhead (replication, retention, query optimisation)
- Honeycomb / New Relic: comparable price, different UX
Math: 2 dedicated SREs to run self-hosted observability cost $400K/year. Datadog at $200K/year is cheaper for any company under ~30 engineers. The "let's self-host" conversation is almost always a false economy.
The right move is to optimise Datadog first, then revisit only if you're at $400K+/year and growing fast.
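The break-even logic is simple enough to check for your own numbers. A minimal sketch that uses only the staffing figure from this guide ($400K/year for 2 SREs, so $200K each) and ignores infrastructure cost; the function names are hypothetical.

```python
def self_host_annual_usd(sre_count=2, usd_per_sre=200_000):
    """Staffing cost of self-hosting, using the guide's figure of $400K for 2 SREs."""
    return sre_count * usd_per_sre


def worth_revisiting(datadog_annual_usd, sre_count=2, usd_per_sre=200_000):
    """True once the Datadog bill reaches the self-hosting staffing cost."""
    return datadog_annual_usd >= self_host_annual_usd(sre_count, usd_per_sre)


print(worth_revisiting(200_000))  # False: Datadog is still the cheaper option
print(worth_revisiting(400_000))  # True: the self-hosting conversation becomes reasonable
```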
How CARTIE AI helps
CARTIE AI's Datadog cost optimizer ingests your Datadog API key, runs all 8 patterns automatically, and gives you a dollar number for each. Typical first-scan finds $4K–$15K/month of waste.
Even without a tool, the 5-minute audit will find $2K–$5K/month of savings in any company over $10K/month spend. Promise.
Now go check your custom-metrics page. 🥃