"Why did our Databricks bill double?"
I've heard this question 14 times in the last 6 months. Every time, the answer is the same: it's not one big mistake — it's 6 hidden costs stacking on top of each other.
Each one looks small in the docs. Each one is invisible in the default UI. Each one is a multiplier on your bill.
Here are the 6, in order of how often I find them in audits.
Hidden Cost #1: Photon's Silent 2x DBU Markup
The trap: Photon's marketing makes it look like a free perf upgrade. The footnote: Photon-enabled clusters consume 2x DBUs per hour.
If your job took 10 minutes on standard compute and now takes 4 minutes on Photon, you're saving 60% of wall time but burning DBUs at 2x the rate: 4 minutes × 2 = 8 standard-minutes' worth of DBUs, versus 10 before. Net: ~80% of the original DBU cost. That's a 20% saving, not 60%.
If the job is CPU-bound (no I/O bottleneck) and Photon more than halves its runtime, Photon saves money. If the job is I/O-bound (most ETL is), the speedup is small and Photon turns into a tax.
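The break-even point can be checked with a few lines of Python. This is a minimal sketch that counts DBUs only, as the example above does, and assumes a flat 2x Photon multiplier; `photon_cost_ratio` and `break_even_minutes` are illustrative helper names, not part of any Databricks API.

```python
def photon_cost_ratio(baseline_minutes: float, photon_minutes: float,
                      dbu_multiplier: float = 2.0) -> float:
    """Photon DBU cost as a fraction of the non-Photon DBU cost for one run."""
    if baseline_minutes <= 0 or photon_minutes <= 0:
        raise ValueError("runtimes must be positive")
    return (photon_minutes * dbu_multiplier) / baseline_minutes


def break_even_minutes(baseline_minutes: float,
                       dbu_multiplier: float = 2.0) -> float:
    """Runtime Photon has to beat before the DBU bill goes down."""
    return baseline_minutes / dbu_multiplier


ratio = photon_cost_ratio(10, 4)    # the 10-minute job from the text
threshold = break_even_minutes(10)  # Photon only pays off below this runtime
print(f"cost ratio: {ratio:.2f}, break-even runtime: {threshold:.1f} min")
```

A ratio above 1.0 means the job got more expensive with Photon, which is what happens when an I/O-bound job only drops from, say, 10 minutes to 8.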
Find it
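-- Sketch against the documented system-table schema; verify column names in your workspace.
WITH io_wait AS (
    -- Average I/O wait per cluster, from node-level telemetry
    SELECT cluster_id, AVG(cpu_wait_percent) AS avg_io_wait_pct
    FROM system.compute.node_timeline
    WHERE start_time >= current_date() - INTERVAL 30 DAYS
    GROUP BY cluster_id
)
SELECT
    u.usage_metadata.cluster_id AS cluster_id,
    u.usage_metadata.job_id AS job_id,
    SUM(IF(u.product_features.is_photon, u.usage_quantity, 0)) AS photon_dbus,
    SUM(IF(NOT u.product_features.is_photon, u.usage_quantity, 0)) AS standard_dbus,
    MAX(w.avg_io_wait_pct) AS avg_io_wait_pct
FROM system.billing.usage AS u
JOIN io_wait AS w ON w.cluster_id = u.usage_metadata.cluster_id
WHERE u.usage_date >= current_date() - INTERVAL 30 DAYS
  AND u.usage_unit = 'DBU'
GROUP BY 1, 2
HAVING avg_io_wait_pct > 30
ORDER BY photon_dbus DESC;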
Any cluster with >30% I/O wait time AND Photon enabled is paying the markup for nothing.
Hidden Cost #2: Serverless's "Convenience Premium"
The trap: Serverless SQL Warehouses cost ~25% more in DBUs than Classic, plus they include a markup on the underlying VMs (which you don't see — Databricks bills you a flat rate).
The pitch is "no cluster management". The math: at scale, you're paying 25–40% extra for the privilege of not running a Terraform module.
Rule of thumb
- Spiky workloads (analyst queries 9-5): Serverless wins. Cold-start penalty hurts Classic.
- Predictable workloads (24/7 ELT, scheduled BI refreshes): Classic wins by 25–40%.
Mixed environments are common — and you're probably running everything Serverless because it was the default.
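A rough way to place a workload on either side of that rule of thumb is to compare busy hours against billed hours. The sketch below is a simplification under stated assumptions: Classic bills every hour the warehouse is up, Serverless bills only busy hours at a premium, and auto-stop tails and cold starts are ignored. The function names are illustrative.

```python
def cheaper_option(busy_hours: float, classic_on_hours: float,
                   premium: float = 0.25) -> str:
    """Compare relative cost in units of 'one Classic hour'."""
    classic_cost = classic_on_hours            # pays for idle time too
    serverless_cost = busy_hours * (1.0 + premium)
    return "serverless" if serverless_cost < classic_cost else "classic"


def break_even_utilization(premium: float) -> float:
    """Busy fraction above which Classic becomes the cheaper choice."""
    return 1.0 / (1.0 + premium)


# Analysts: warehouse up 8 hours, actually busy for about 3 of them.
print(cheaper_option(busy_hours=3, classic_on_hours=8))
# 24/7 ELT: busy essentially the whole time it is up.
print(cheaper_option(busy_hours=24, classic_on_hours=24))
print(f"break-even at 25% premium: {break_even_utilization(0.25):.0%} busy")
print(f"break-even at 40% premium: {break_even_utilization(0.40):.0%} busy")
```

Under these assumptions, a warehouse that is busy more than roughly 70–80% of the time it is running is a candidate for Classic.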
Hidden Cost #3: Idle Cluster Minutes (the "120-minute auto-terminate" trap)
The trap: The default cluster auto-terminate is 120 minutes. That's 2 hours of paid DBUs after every notebook closes.
Worse: every cluster in your workspace probably inherited the default. A team running 40 ad-hoc clusters/day, each idling for 2 hours, is paying for 80 cluster-hours of pure idle time, every day. At a Standard DBS-DLT rate of $0.55/DBU and a typical 2 DBU/hour mid-size cluster, that's 80 × 2 × $0.55 = $88/day, or about $2,640/month of pure waste.
The fix (one click)
- Set workspace default autotermination_minutes = 10.
- For SQL warehouses: auto_stop_mins = 5.
- Use cluster policies to enforce this, so that even your senior engineers can't override it.
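The policy itself is a small JSON document. Here is a minimal sketch that builds one in Python, assuming the cluster-policy attribute is spelled `autotermination_minutes` and that a `range` rule with `minValue`, `maxValue` and `defaultValue` is the right shape; check both against the policy reference for your workspace. The 30-minute ceiling is an arbitrary choice for illustration.

```python
import json

# Policy definition: new clusters default to 10 minutes of idle time,
# and nobody can set more than 30 or disable auto-termination (value 0).
policy_definition = {
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 30,
        "defaultValue": 10,
    },
}

# The definition is submitted as a JSON string when the policy is created.
policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```

Using a range rather than a fixed value leaves room for the occasional long-running notebook without reopening the door to 120-minute idle windows.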
Find it
SELECT cluster_id, AVG(auto_termination_minutes) AS avg_auto_term_min
FROM system.compute.clusters
GROUP BY cluster_id
HAVING avg_auto_term_min > 30
ORDER BY avg_auto_term_min DESC;
Hidden Cost #4: Cross-Region Egress Nobody Sees
The trap: Your Databricks workspace is in us-east-1. Your S3 source data lives in us-west-2 because that's where the data engineering team set it up two re-orgs ago. Every job pays AWS about $0.02/GB for inter-region data transfer.
For a daily 500GB ETL job, that's 500 GB × 30 days × $0.02/GB = $300/month in egress alone. It's invisible in your Databricks bill (it's an AWS line item) and never attributed back to the job that caused it.
Find it
- AWS Cost Explorer → filter by AWSDataTransfer.
- Group by Resource (turn on resource tagging).
- Look for any S3 bucket with >$50/month of InterRegion cost.
The fix
Move the bucket. Or move the workspace. Don't tolerate inter-region for high-volume sources.
Hidden Cost #5: Default Cluster Sizes (the "i3.xlarge tax")
The trap: Databricks defaults new general-purpose clusters to i3.xlarge workers. Every analyst's first cluster is i3.xlarge. Most analysts never resize.
i3.xlarge has expensive NVMe SSDs baked in — great if you're shuffling 100GB+. Useless and expensive if your analyst is querying a 5GB table.
Find it
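-- Note: worker_node_type, dbu_consumed and p95_disk_io_mb_s are illustrative names.
-- Stock system.billing.usage does not expose them; join in node types
-- (e.g. from system.compute.clusters) and disk metrics from your own monitoring.
SELECT
    worker_node_type,
    COUNT(*) AS cluster_count,
    SUM(dbu_consumed) AS total_dbus,
    AVG(p95_disk_io_mb_s) AS avg_disk_io
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY 1
HAVING avg_disk_io < 10 -- low disk usage = wasted i3
ORDER BY total_dbus DESC;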
Any i3.* cluster with <10MB/s p95 disk I/O is paying for SSDs it doesn't use. Switch to m5.* (general purpose, no NVMe) and save 30–50%.
Hidden Cost #6: Autoscale's "Minimum Floor" Trap
The trap: an autoscaling cluster configured with min_workers=2 and max_workers=8, where most jobs only ever need the minimum. If the cluster is kept always-on (no auto-termination), that 2-worker floor runs 24/7, even when no jobs are scheduled.
If your workspace has 12 always-on autoscale clusters with min_workers=2, that's 24 worker-hours/hour = 17,280 worker-hours/month of minimum floor — even at zero job activity.
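The floor arithmetic is easy to rerun for your own cluster count. A minimal sketch, assuming a 30-day month as in the figure above; `floor_worker_hours` is an illustrative helper name.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, matching the figure in the text


def floor_worker_hours(clusters: int, min_workers: int,
                       hours: int = HOURS_PER_MONTH) -> int:
    """Worker-hours consumed by the autoscale minimum alone, with zero job activity."""
    return clusters * min_workers * hours


current = floor_worker_hours(clusters=12, min_workers=2)
halved = floor_worker_hours(clusters=12, min_workers=1)
print(f"min_workers=2: {current:,} worker-hours/month")
print(f"min_workers=1: {halved:,} worker-hours/month")
```

Dropping the floor to one worker halves the figure; adding auto-termination removes most of what is left, since the cluster no longer runs between jobs.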
The fix
{
  "autoscale": {
    "min_workers": 1,
    "max_workers": 8
  },
  "autotermination_minutes": 10
}
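With min_workers set to 1 and auto-termination on, the floor drops to a single worker while the cluster is up, and the cluster shuts down entirely once it has been idle for 10 minutes between jobs.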
Putting It All Together: A Real Audit
A SaaS company running on Databricks Premium, ~$42K/month. We ran the diagnostics for all 6 hidden costs above. Here's what we found:
| Hidden cost | Monthly waste | Effort to fix |
|---|---|---|
| Photon on I/O-bound jobs | $4,800 | 1 day (toggle off) |
| Serverless on 24/7 ELT | $3,200 | 1 sprint (migrate 4 jobs) |
| Idle cluster minutes | $2,100 | 30 min (workspace setting) |
| Cross-region egress | $1,400 | 2 weeks (move bucket) |
| Default i3 sizing | $5,500 | 1 week (cluster policy) |
| Autoscale min floor | $1,800 | 30 min (policy update) |
| TOTAL | $18,800/mo | ~$226K/year |
$42K → $23K. 45% off the bill, in 3 weeks.
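The totals can be reproduced from the table rows. A short sketch using only the figures above; the dictionary simply restates the "Monthly waste" column.

```python
monthly_bill = 42_000  # USD, from the audit description

monthly_waste = {
    "Photon on I/O-bound jobs": 4_800,
    "Serverless on 24/7 ELT": 3_200,
    "Idle cluster minutes": 2_100,
    "Cross-region egress": 1_400,
    "Default i3 sizing": 5_500,
    "Autoscale min floor": 1_800,
}

total = sum(monthly_waste.values())
annual = total * 12
remaining = monthly_bill - total
pct_saved = total / monthly_bill * 100
biggest = max(monthly_waste, key=monthly_waste.get)

print(f"total waste: ${total:,}/mo (${annual:,}/yr)")
print(f"bill after fixes: ${remaining:,}/mo ({pct_saved:.0f}% lower)")
print(f"largest single item: {biggest}")
```

Sorting the same dictionary by value gives a quick fix order when you swap in your own numbers.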
Why your Databricks bill doubled
Because all 6 of these stack. You didn't change any one thing — you changed your workload, and every multiplier amplified.
If you want to see which of the 6 are happening in your workspace, request a free Databricks cost audit — we'll run the diagnostics, return a numbered fix list, and quote the savings to the dollar.
No DBU usage, no card needed.
TL;DR
6 hidden Databricks costs, in order of frequency:
- Photon on I/O-bound jobs (2x DBU markup, no speedup)
- Serverless on 24/7 workloads (25–40% premium for a feature that only pays off on spiky workloads)
- Idle cluster minutes (default 120-min auto-terminate)
- Cross-region egress (invisible AWS line item)
- Default i3.xlarge worker tax (expensive NVMe nobody uses)
- Autoscale minimum floor (min_workers ≥ 2 running 24/7)
Stack-effect: these typically combine to 40–50% of total spend. Run the SQL above. Find your worst 2. Fix them this sprint.