The Datadog Cost Optimization Guide: 8 Patterns That Cut Bills 45%

A note on this story: Numbers below are aggregated from 11 production Datadog accounts we've audited (8 SaaS, 3 fintech, monthly spend $8K–$95K). Same patterns, every time.

Datadog is the BMW of observability tools. Beautiful product. Excellent UX. Insanely expensive when nobody's watching.

We've audited 11 production Datadog accounts in the last year. Median overspend: 47%.

The pattern is always the same: a team adopts Datadog because something broke and they needed observability fast. Six months later, the monthly bill has quadrupled and nobody can explain why. By month 12 it's a six-figure line item nobody wants to question.

This guide is the line-by-line breakdown. By the end you'll know:

The 8 patterns driving 40–60% of Datadog overspend
The 3 settings nobody touches that cost the most
The 5-minute audit any engineer can run today
When to switch tools (rarely) and when to optimise (almost always)

Pattern 1: Custom metrics with infinite cardinality

Datadog charges per unique metric × tag combination. The default custom metric allowance is 100 per host. Going over: $0.05/metric/month. Sounds tiny. Watch what happens with bad tagging:

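from datadog import statsd  # DogStatsD client from the `datadog` Python package

# BAD: putting user_id in a tag value creates a new metric per user
statsd.increment('api.request', tags=[f'endpoint:/users/{user_id}'])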

If 100 endpoints embed the user ID this way: 100 endpoints × 50,000 users = 5,000,000 unique metrics. At $0.05 each, that's a $250K/month custom-metrics line item from ONE bad metric.
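The arithmetic above can be sketched as a small helper. This is a rough model, assuming the $0.05/metric/month overage rate quoted earlier and the worst case where every combination of tag values actually occurs. The function name and the `included` parameter are illustrative and not part of any Datadog API.

```python
def custom_metric_overage(tag_value_counts, included=0, price_per_metric=0.05):
    """Estimate series count and monthly overage (dollars) for one metric name.

    tag_value_counts: number of distinct values for each tag on the metric.
    included: metrics already covered by the plan allowance (assumed 0 here).
    price_per_metric: overage rate in dollars per metric per month.
    """
    series = 1
    for count in tag_value_counts:
        series *= count  # worst case: every combination of tag values occurs
    billable = max(series - included, 0)
    return series, billable * price_per_metric


# 100 endpoints x 50,000 users, as in the example above
series, cost = custom_metric_overage([100, 50_000])
print(f"{series:,} series -> ${cost:,.0f}/month")
```

Swapping the 50,000 for a bounded tag (say, 5 plan tiers) shows how quickly the number collapses once the identifier leaves the tag set.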

The fix:

Move high-cardinality identifiers (user_id, request_id, transaction_id) to logs, not metrics
Use @dimension annotations to mark which tags are intentional
Set up Metrics without Limits tag configurations per metric (Datadog UI → Metrics → Summary → Manage Tags)
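One way to apply the first fix is to tag with the route template instead of the raw path, and send the identifier to logs. The sketch below assumes IDs show up as numeric or UUID path segments; `endpoint_tag` is a hypothetical helper, and the `:id` placeholder is just a convention chosen here.

```python
import re

# Assumption: IDs appear as purely numeric path segments or as UUIDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def endpoint_tag(path):
    """Collapse ID-like path segments so the tag has bounded cardinality."""
    segments = [
        ":id" if _ID_SEGMENT.match(segment) else segment
        for segment in path.split("/")
    ]
    return "endpoint:" + "/".join(segments)


# Usage with the DogStatsD client from the bad example (not executed here):
#   statsd.increment('api.request', tags=[endpoint_tag(request_path)])
#   logger.info("api.request", extra={"user_id": user_id})  # ID goes to logs

print(endpoint_tag("/users/48213"))
print(endpoint_tag("/users/48213/orders"))
```

With this in place the metric's cardinality is bounded by the number of routes, not the number of users.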

Typical savings: $5K–$50K/month. Single biggest line item we find.

Pattern 2: Log retention default = 15 days. Most logs need 3.

Datadog logs are billed in two ways:

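Indexed logs: roughly $1.70/million events at 15-day retention (list price; shorter retention tiers such as 7 or 3 days cost less per million), fully searchable
Archived logs: you pay ingestion only (roughly $0.10/GB at list price), stored in your S3, slower to query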

The fix:

15 days indexed → too long for 90% of logs. Drop to 3 days indexed + archive to S3 for 30 days
Critical security/audit logs: 30 days indexed (compliance need)
Application debug logs: Drop them in production altogether — they're more noise than signal
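The three rules above can be written down as a small policy table, which makes them easier to review with the team before touching index settings. This is only a sketch: the category names and the `Retention` type are hypothetical, and the archive window for security logs is an assumption since the text only specifies the indexed period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retention:
    indexed_days: int   # days kept searchable in Datadog
    archive_days: int   # days kept in your own S3 archive


def retention_for(category, environment="production"):
    """Map a log category to the retention tiers suggested above.

    Returns None when the log should be dropped before indexing.
    """
    if category in {"security", "audit"}:
        # 30 days indexed for compliance; archive window is an assumption
        return Retention(indexed_days=30, archive_days=30)
    if category == "debug" and environment == "production":
        return None  # drop production debug logs entirely
    return Retention(indexed_days=3, archive_days=30)


for category in ("audit", "application", "debug"):
    print(category, retention_for(category))
```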

Typical savings: $1K–$8K/month.

Pattern 3: APM trace ingestion at 100% sampling

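Datadog APM bills per APM host, plus ingested and indexed spans beyond the included allowance, so trace volume shows up directly on the invoice. Most teams default to 100% sampling, meaning every single request gets a trace. For a service doing 1 billion requests/month, that's 1 billion traces ingested every month.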

The fix:

Production traffic > 1k req/sec: sample at 10%
Production traffic > 10k req/sec: sample at 1%
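Always-on for errors: keep the Agent's error sampler enabled (apm_config.errors_per_second, or DD_APM_ERROR_TPS as an environment variable) so traces containing error spans survive head-based sampling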

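# Tracer side (service environment): sampling rules are read from DD_TRACE_SAMPLING_RULES
DD_TRACE_SAMPLING_RULES='[{"service": "my-service", "sample_rate": 0.1}]'   # 10% of normal traffic

# datadog.yaml (Agent side): the error sampler keeps traces that contain error spans
apm_config:
  errors_per_second: 10     # target error traces/sec kept on top of head-based sampling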
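The traffic thresholds above translate into a simple lookup, which is handy when deciding rates for a long list of services. A minimal sketch, assuming services below 1k req/sec keep 100% sampling (the text doesn't say) and ignoring the extra error traces the Agent keeps; both function names are illustrative.

```python
def recommended_sample_rate(requests_per_second):
    """Return the head-based sample rate for a service's traffic level."""
    if requests_per_second > 10_000:
        return 0.01
    if requests_per_second > 1_000:
        return 0.10
    return 1.0  # assumption: low-traffic services keep every trace


def traces_per_month(requests_per_month, sample_rate):
    """Traces ingested per month, ignoring always-kept error traces."""
    return round(requests_per_month * sample_rate)


monthly_requests = 1_000_000_000  # the 1 billion requests/month example
for rps in (500, 5_000, 50_000):
    rate = recommended_sample_rate(rps)
    print(f"{rps:>6} req/sec -> rate {rate:.2f}, "
          f"{traces_per_month(monthly_requests, rate):,} traces/month")
```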

Typical savings: 30–60% of APM cost.

Pattern 4: Synthetic tests running every 60 seconds

Datadog Synthetic tests are great. They're also priced per test run, and each location counts as its own run, so frequency and location count multiply the bill.

A "test the homepage from 5 locations every 1 minute" test = 5 × 60 × 24 × 30 = 216,000 runs/month per test. At a list price of roughly $5 per 10K API test runs, that's about $108/test/month, and browser tests are priced considerably higher per run.

The fix:

1-minute frequency → only for mission-critical login + checkout
Most pages → 5 minutes is fine
Internal tools / staging → 30 minutes or kill them entirely
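To see what each frequency tier does to run volume, the same arithmetic can be parameterised. A small sketch, assuming a 30-day month as in the calculation above; pricing is deliberately left out because the per-run rate depends on test type and contract.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month, as in the arithmetic above


def runs_per_month(locations, interval_minutes):
    """Test runs per month for one synthetic test."""
    return locations * MINUTES_PER_MONTH // interval_minutes


for interval in (1, 5, 30):
    runs = runs_per_month(5, interval)
    print(f"every {interval:>2} min from 5 locations: {runs:,} runs/month")
```

Moving a test from 1-minute to 5-minute frequency cuts its runs by 80%; going to 30 minutes cuts them by about 97%.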

Typical savings: $200–$2K/month per team.

Pattern 5: Hosts you forgot

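Datadog charges per host, metered hourly, and bills on a high-water mark: the top 1% of hours in the month is discarded and you pay for the peak of what's left. A single 5-minute spike to 200 nodes is forgiven, but an auto-scaling cluster that hits 200 nodes for an hour or two most days is billed at roughly 200 hosts even though the average is far lower.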

The fix:

Move bursty workloads to hourly host billing where your plan offers it
Or pin agents to specific node pools that scale conservatively
Run the "Host Map" weekly — kill any host with 0 incoming metrics

Typical savings: $500–$5K/month.

Pattern 6: Integrations you turned on for "let me try" experiments

Datadog has 800+ integrations. They're free to enable. What isn't free is the hosts, logs, and metrics they feed into your billable usage.

The Kubernetes integration alone can pump 500+ metrics per node per minute. The MongoDB integration: 200+ metrics per cluster. Most teams have 30+ integrations enabled, half of which they never look at.

The audit:

Datadog UI → Integrations → list installed
For each: "Do I have a dashboard, monitor, or SLO using this?" If no → uninstall
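The audit question above is a set difference: installed integrations minus the ones some dashboard, monitor, or SLO actually queries. A sketch under the assumption that you've assembled those mappings by hand or from your own inventory; the data shapes and names here are hypothetical, not a Datadog API response.

```python
def unused_integrations(installed, dashboards, monitors, slos):
    """Return installed integrations that no dashboard, monitor, or SLO uses.

    Each of dashboards/monitors/slos maps an asset name to the set of
    integration names it queries (a hand-built inventory, for illustration).
    """
    in_use = set()
    for assets in (dashboards, monitors, slos):
        for integrations in assets.values():
            in_use |= set(integrations)
    return sorted(set(installed) - in_use)


installed = ["kubernetes", "mongodb", "redis", "nginx"]
dashboards = {"cluster-overview": {"kubernetes"}}
monitors = {"redis-latency": {"redis"}}
slos = {}

print(unused_integrations(installed, dashboards, monitors, slos))
```

Anything in the returned list is a candidate for uninstalling, after a quick check that nobody queries it ad hoc.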

Typical savings: $1K–$3K/month.

Pattern 7: Watchdog Insights / RUM features turned on org-wide

Watchdog (Datadog's anomaly detection) and RUM (Real User Monitoring) are per-feature charges layered on top of base APM/Logs. Most teams turn them on for the demo and forget.

Watchdog: $0.30/host/month
RUM: $1.50 per 10K sessions
CI Visibility: $0.50/test-execution

The fix: if the team isn't actively reviewing the data weekly, disable the feature. You can turn it back on when you actually need it.

Typical savings: $1K–$4K/month.

Pattern 8: Dev/staging running the production agent config

Most teams deploy the Datadog agent with the same config across all environments. So your dev cluster ships every metric, every log, every trace to Datadog — and gets billed for it.

The fix:

# datadog-dev.yaml: heavy reduction in dev
logs_enabled: false    # stop shipping dev logs
apm_config:
  enabled: false       # stop accepting dev traces
process_config:
  enabled: false       # no live process collection
# Just keep host-level metrics

Typical savings: 30–50% of total Datadog spend (dev+staging combined).

The 5-minute audit any engineer can run

1. Custom metrics overage: Datadog UI → Plan & Usage → Custom Metrics. If >100 per host, you're paying overage. Find the high-cardinality metrics (Manage Tags page).
2. Log retention check: Logs → Configuration. If retention >7 days for non-security logs, drop it.
3. APM sample rate: Service Map → click any high-traffic service → check sample rate. >50% on a >1k req/sec service = paying 5–10x what you need.
4. Unused integrations: Integrations → installed list. Kill anything without an active dashboard or monitor.
5. Synthetic frequency: Synthetics → list tests by frequency. Anything running more often than every 5 minutes that isn't mission-critical → lengthen the interval.

Steps 1–3 alone usually cut a Datadog bill 30%.
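If you'd rather record the audit than eyeball it, the five checks reduce to a handful of threshold comparisons. A minimal sketch: the numbers are read off the Datadog UI by hand, and every key name in the `account` dict is made up for illustration.

```python
def five_minute_audit(account):
    """Apply the audit thresholds above to numbers read off the Datadog UI.

    `account` is a plain dict filled in by hand; the key names are
    illustrative and do not correspond to any Datadog API fields.
    """
    findings = []
    if account["custom_metrics"] > 100 * account["hosts"]:
        findings.append("custom metrics exceed 100 per host")
    if account["non_security_log_retention_days"] > 7:
        findings.append("log retention above 7 days for non-security logs")
    for service, (rps, sample_rate) in account["apm_services"].items():
        if rps > 1_000 and sample_rate > 0.5:
            findings.append(f"{service}: sample rate above 50% at >1k req/sec")
    if account["integrations_without_dashboard_or_monitor"]:
        findings.append("unused integrations installed")
    for test, (interval_min, critical) in account["synthetic_tests"].items():
        if interval_min < 5 and not critical:
            findings.append(f"{test}: non-critical test runs more often than every 5 min")
    return findings


example = {
    "hosts": 40,
    "custom_metrics": 9_500,
    "non_security_log_retention_days": 15,
    "apm_services": {"checkout": (2_500, 1.0), "admin": (20, 1.0)},
    "integrations_without_dashboard_or_monitor": ["mongodb"],
    "synthetic_tests": {"homepage": (1, False), "login": (1, True)},
}
for finding in five_minute_audit(example):
    print("-", finding)
```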

When to switch off Datadog (rarely the right answer)

The alternatives that come up in the "Datadog is too expensive, let's switch" conversation are:

Grafana Cloud: ~50% cheaper but UX is rougher
Self-hosted Prometheus + Loki + Grafana: essentially free for tooling, massive SRE overhead (replication, retention, query optimisation)
Honeycomb / New Relic: comparable price, different UX

Math: 2 dedicated SREs to run self-hosted observability cost $400K/year. Datadog at $200K/year is cheaper for any company under ~30 engineers. The "let's self-host" conversation is almost always a false economy.

The right move is to optimise Datadog first, then revisit only if you're at $400K+/year and growing fast.
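The break-even behind that $400K figure is just headcount times loaded cost. A sketch assuming the 2 SREs at $200K each implied by the math above, and ignoring infrastructure and storage costs for the self-hosted stack (which would push the break-even higher); the function names are illustrative.

```python
def self_hosting_break_even(sre_count=2, cost_per_sre=200_000):
    """Annual Datadog spend above which the self-hosting staff costs less."""
    return sre_count * cost_per_sre


def should_revisit(datadog_annual_spend, sre_count=2, cost_per_sre=200_000):
    """True once Datadog spend reaches the staffing cost of self-hosting."""
    return datadog_annual_spend >= self_hosting_break_even(sre_count, cost_per_sre)


print(self_hosting_break_even())
print(should_revisit(200_000))
print(should_revisit(450_000))
```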

How CARTIEAI helps

CARTIEAI's Datadog cost optimizer connects with your Datadog API key, checks for all 8 patterns automatically, and gives you a dollar number for each. A typical first scan finds $4K–$15K/month of waste.

Even without a tool, the 5-minute audit will find $2K–$5K/month of savings in any company over $10K/month spend. Promise.

Now go check your custom-metrics page. 🥃

Lakshmi Kiranmai Guduru
