The Databricks Bill Problem
Snowflake gets all the FinOps attention. But quietly, Databricks accounts are bleeding worse.
A typical mid-market Databricks account wastes 30–45% of DBUs. Higher than Snowflake. This guide is the playbook to fix it.
The DBU Pricing Cheat Sheet
| Tier | All-Purpose | Jobs Compute |
|---|---|---|
| Standard | ~$0.40 / DBU | ~$0.15 / DBU |
| Premium | ~$0.55 / DBU | ~$0.30 / DBU |
| Enterprise | ~$0.65 / DBU | ~$0.40 / DBU |
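Key insight: At these list rates, Jobs Compute costs roughly 1.6–2.7x less per DBU than All-Purpose, depending on tier (about 2.7x on Standard, 1.8x on Premium, 1.6x on Enterprise). If your scheduled jobs are running on All-Purpose clusters, you're paying that premium for no reason.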
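As a quick check on the cheat sheet, the sketch below computes the All-Purpose to Jobs Compute price ratio for each tier and the saving from moving a job across at equal DBU usage. The rates are the approximate list prices from the table; the dictionary and function names are illustrative, not part of any Databricks API.

```python
# Approximate list rates ($/DBU) copied from the cheat sheet above.
RATES = {
    "Standard": {"all_purpose": 0.40, "jobs": 0.15},
    "Premium": {"all_purpose": 0.55, "jobs": 0.30},
    "Enterprise": {"all_purpose": 0.65, "jobs": 0.40},
}


def jobs_vs_all_purpose(tier: str) -> tuple[float, float]:
    """Return (price ratio, fractional saving) for moving a job to Jobs Compute."""
    r = RATES[tier]
    ratio = r["all_purpose"] / r["jobs"]
    saving = 1 - r["jobs"] / r["all_purpose"]
    return round(ratio, 2), round(saving, 3)


for tier in RATES:
    ratio, saving = jobs_vs_all_purpose(tier)
    print(f"{tier}: {ratio}x price ratio, {saving:.0%} saving at equal DBU usage")
```

Swap in your negotiated rates to see where your own account lands; discounts often change the ratio.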
Pattern 1: Move Scheduled Jobs Off All-Purpose Clusters
In your Databricks Jobs UI:
- Edit the task → Cluster → "New job cluster" instead of "Existing cluster"
- Set min/max workers based on actual job needs
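Expected savings: roughly 40–60% of DBU cost on every job that previously ran on All-Purpose, depending on tier (about 62% on Standard, 45% on Premium, 38% on Enterprise at the rates above).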
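If you manage jobs as JSON through the Jobs API rather than the UI, the same change amounts to replacing a task's `existing_cluster_id` with a `new_cluster` spec. The following is a minimal sketch of that rewrite on a plain dictionary; the helper name `to_job_cluster`, the task, and every value in the cluster spec (runtime version, node type, worker counts, cluster ID) are hypothetical placeholders.

```python
import copy


def to_job_cluster(task: dict, new_cluster: dict) -> dict:
    """Return a copy of a Jobs API task that runs on a fresh job cluster.

    Removes `existing_cluster_id` (an all-purpose cluster) and attaches a
    `new_cluster` spec instead. The input task is not modified.
    """
    migrated = copy.deepcopy(task)
    migrated.pop("existing_cluster_id", None)
    migrated["new_cluster"] = copy.deepcopy(new_cluster)
    return migrated


# Hypothetical values: replace the runtime version, node type, and sizes with your own.
job_cluster_spec = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 1, "max_workers": 4},
}

# Hypothetical task definition pointing at a shared all-purpose cluster.
task = {
    "task_key": "nightly_etl",
    "existing_cluster_id": "0000-000000-example",
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
}

migrated = to_job_cluster(task, job_cluster_spec)
print(migrated)
```

The helper only transforms the definition; submitting the updated job is left to whatever tooling you already use.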
Pattern 2: Photon Where It Doesn't Help
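Photon is brilliant for SQL workloads on Delta tables: it bills at roughly 2x the DBU rate but typically runs 2–5x faster, so the job finishes sooner and costs the same or less.
But teams enable Photon globally and end up paying double the DBU rate on workloads where it provides little or no speedup: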
- Pure Python notebooks
- ML training jobs
- Anything UDF-heavy
- Streaming workloads
The fix: Audit your jobs. Keep Photon ON only for SQL/Delta-heavy ETL.
Expected savings: 30–50% on misconfigured jobs.
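The break-even logic behind this pattern is simple arithmetic: at 2x the DBU rate, Photon only pays off when the job runs at least 2x faster. The sketch below makes that explicit. It assumes DBU cost scales linearly with runtime and ignores the underlying instance cost (which also falls when a job finishes sooner); the function name and the example speedups are illustrative.

```python
def photon_cost_ratio(speedup: float, dbu_multiplier: float = 2.0) -> float:
    """DBU cost of a Photon run relative to the same run without Photon.

    A value below 1.0 means Photon is cheaper; above 1.0 means it costs more.
    Assumes DBU cost scales linearly with runtime.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return dbu_multiplier / speedup


# Example speedups are illustrative; measure your own jobs with Photon on and off.
for label, speedup in [
    ("UDF-heavy job, no speedup", 1.0),
    ("break-even", 2.0),
    ("SQL/Delta ETL", 4.0),
]:
    print(f"{label}: {photon_cost_ratio(speedup):.2f}x the non-Photon DBU cost")
```

A job that shows no measurable speedup is the clearest candidate for turning Photon off.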
Pattern 3: The Always-On All-Purpose Cluster
Set auto-terminate to 30 minutes for shared clusters. Use Databricks SQL Warehouses for BI tools. For ML researchers, use Personal Compute clusters.
Expected savings: 30–60% on shared cluster costs.
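To find the offenders, you can scan cluster definitions for a missing or overly long `autotermination_minutes` setting. This is a minimal sketch over plain dictionaries shaped like Clusters API responses; the cluster names are made up, and fetching the real list from your workspace is left out.

```python
def needs_auto_terminate(cluster: dict, max_minutes: int = 30) -> bool:
    """True if the cluster never auto-terminates or waits longer than max_minutes.

    In the Clusters API, an `autotermination_minutes` of 0 (or an unset value)
    means auto-termination is disabled.
    """
    minutes = cluster.get("autotermination_minutes", 0)
    return minutes == 0 or minutes > max_minutes


# Hypothetical clusters for illustration.
clusters = [
    {"cluster_name": "shared-analytics", "autotermination_minutes": 0},
    {"cluster_name": "team-etl-dev", "autotermination_minutes": 120},
    {"cluster_name": "ml-sandbox", "autotermination_minutes": 30},
]

flagged = [c["cluster_name"] for c in clusters if needs_auto_terminate(c)]
print(flagged)
```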
Pattern 4: Over-Provisioned Auto-Scaling
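{
  "autoscale": {
    "min_workers": 1,
    "max_workers": 8,
    "mode": "ENHANCED"
  }
}
Set min_workers = 1 (or even 0 for serverless-eligible workloads). Note that min_workers and max_workers belong inside the autoscale object, and that the mode field applies to Delta Live Tables pipeline clusters; omit it for regular job or all-purpose clusters.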
Expected savings: 20–40% on cluster costs.
Pattern 5: Driver Node Way Too Big
Engineers default to i3.4xlarge driver nodes "to be safe." For most ETL workloads, the driver only coordinates — i3.xlarge or i3.2xlarge is plenty.
Expected savings: 10–25% on cluster costs.
Pattern 6: Spot Instances for Workers
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1,
    "spot_bid_price_percent": 100
  }
}
Use SPOT_WITH_FALLBACK so workers run on spot (cheap) and fall back to on-demand when spot capacity is unavailable. Setting first_on_demand to 1 keeps the driver on-demand (resilient).
Expected savings: 50–70% on worker costs for spot-eligible workloads.
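Because the driver stays on-demand, the saving on the whole cluster's instance bill is somewhat lower than the saving on workers alone. The sketch below works through that arithmetic, assuming the driver and workers use the same instance type and that spot pricing affects only the cloud instance cost, not DBUs. The function name and the 60% discount are illustrative; actual spot discounts vary by instance type, region, and time.

```python
def blended_instance_saving(workers: int, spot_discount: float,
                            on_demand_nodes: int = 1) -> float:
    """Fractional instance-cost saving for a cluster of identical nodes.

    `on_demand_nodes` nodes (the driver when first_on_demand is 1) pay full
    price; the `workers` spot nodes pay (1 - spot_discount) of that price.
    """
    if not 0 <= spot_discount <= 1:
        raise ValueError("spot_discount must be between 0 and 1")
    total_nodes = on_demand_nodes + workers
    cost_after = on_demand_nodes + workers * (1 - spot_discount)
    return 1 - cost_after / total_nodes


# Illustrative: 8 spot workers at an assumed 60% discount, 1 on-demand driver.
print(f"{blended_instance_saving(workers=8, spot_discount=0.6):.1%}")
```

The gap between worker-level and cluster-level savings shrinks as the worker count grows, since the single on-demand driver becomes a smaller share of the bill.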
Pattern 7: Streaming Jobs With No Trigger Interval
Use Trigger.AvailableNow for incremental batch processing if real-time isn't required:
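# The checkpoint path is a placeholder; point it at your own storage location.
(df.writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(availableNow=True)
    .toTable("my_table"))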
Expected savings: 80–95% for non-real-time streaming workloads.
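The saving comes from cluster uptime: a continuous stream keeps a cluster running around the clock, while a scheduled availableNow run only pays for the minutes it needs. A rough sketch of that arithmetic follows; the function name and the example schedules are illustrative, and `minutes_per_run` should include cluster startup time.

```python
def available_now_saving(runs_per_day: int, minutes_per_run: float) -> float:
    """Fraction of cluster uptime avoided compared with a 24/7 streaming cluster."""
    minutes_per_day = 24 * 60
    active_minutes = runs_per_day * minutes_per_run
    if active_minutes > minutes_per_day:
        raise ValueError("scheduled runs exceed 24 hours per day")
    return 1 - active_minutes / minutes_per_day


# Illustrative schedules: hourly runs of 5 and 10 minutes each.
print(f"{available_now_saving(24, 5):.1%}")
print(f"{available_now_saving(24, 10):.1%}")
```

If your runs are long or frequent enough that the cluster is up most of the day anyway, the saving shrinks accordingly and a continuous stream may be the simpler choice.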
Diagnostic Query: Find Your Top DBU-Burning Clusters
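-- cluster_id lives inside the usage_metadata struct of system.billing.usage.
-- Cluster names and sources live in system.compute.clusters; join on cluster_id if you need them.
SELECT
  workspace_id,
  usage_metadata.cluster_id AS cluster_id,
  sku_name,
  ROUND(SUM(usage_quantity), 2) AS total_dbus,
  -- 0.55 is the Premium All-Purpose rate from the cheat sheet; substitute your own rate
  ROUND(SUM(usage_quantity) * 0.55, 2) AS estimated_dollars
FROM system.billing.usage
WHERE usage_date > DATE_SUB(CURRENT_DATE(), 30)
  AND usage_unit = 'DBU'
  AND usage_metadata.cluster_id IS NOT NULL
GROUP BY 1, 2, 3
ORDER BY total_dbus DESC
LIMIT 25;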
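The flat 0.55 multiplier in the query overstates cost for anything billed at the Jobs Compute rate. One way to refine the estimate after exporting the results is to pick a rate based on the SKU name, as sketched below. The SKU strings shown are assumed examples of the naming pattern, the substring matching is deliberately simplistic (it will misprice serverless or other SKUs that happen to contain the same keyword), and the rates are the Premium list prices from the cheat sheet.

```python
from typing import Optional

# Premium-tier rates ($/DBU) from the cheat sheet; substitute your contract rates.
PREMIUM_RATES = {"ALL_PURPOSE": 0.55, "JOBS": 0.30}


def estimate_dollars(sku_name: str, dbus: float,
                     rates: dict = PREMIUM_RATES) -> Optional[float]:
    """Price DBUs by the compute type inferred from the SKU name.

    Returns None when the SKU matches no known key, so unknown SKUs are
    surfaced instead of being silently priced at the wrong rate.
    """
    for key, rate in rates.items():
        if key in sku_name.upper():
            return round(dbus * rate, 2)
    return None


# Assumed example SKU names; check the sku_name values in your own usage table.
rows = [
    ("PREMIUM_ALL_PURPOSE_COMPUTE", 1000.0),
    ("PREMIUM_JOBS_COMPUTE", 1000.0),
    ("PREMIUM_SQL_COMPUTE", 1000.0),
]
for sku, dbus in rows:
    print(sku, estimate_dollars(sku, dbus))
```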
The Quick-Win Checklist
If you only do 5 things this week:
- ✅ Move all scheduled jobs to Job Clusters
- ✅ Turn off Photon on non-SQL/Delta workloads
- ✅ Set shared all-purpose clusters to auto-terminate at 30 min
- ✅ Set min_workers = 1 on every auto-scaling cluster
- ✅ Use SPOT_WITH_FALLBACK on worker nodes for non-critical jobs
Combined, these typically cut Databricks DBUs by 30–45%.