A note on this story: Numbers below are from a composite of 8 production K8s clusters we've audited (50–500 nodes, $40K–$300K/month spend). The pain points are universal.
If you're running Kubernetes at any meaningful scale, you've had this conversation:
CFO: "Our K8s bill jumped $40K last month. What changed?"
You: "Uh… I'll look into it."
(8 hours later, after manually correlating Cost Explorer with kubectl output, Prometheus queries, and 3 Jira tickets:)
You: "Looks like the data team's new Spark workload spun up 50 GPU pods on weekends and nobody turned them off."
CFO: "Why did it take you 8 hours to figure that out?"
Because Kubernetes is great at hiding cost. AWS / Azure / GCP bills you for the underlying nodes. Kubernetes runs whatever workloads you throw at it on those nodes. The link between "Pod X spent $4K this month" and "Team Y owns Pod X" is not built into the platform.
This guide fixes that. By the end, you'll have:
- A labelling convention every team can follow in 5 minutes
- The 4 query patterns that turn raw cost into chargeback reports
- A YAML template you can paste into your cluster today
- The 3 mistakes that make 90% of K8s cost-allocation projects fail
The fundamental mistake: trying to allocate cost AFTER the fact
Most teams start their K8s FinOps journey by buying a tool (Kubecost, OpenCost, Cloudability) and pointing it at their cluster. The tool reports "namespace-A spent $12K, namespace-B spent $8K, untagged spent $34K".
That untagged line is the killer. Without labels, no tool can tell you who owns what. And labels can't be applied retroactively — you can't go back in time and tag last month's pods.
The fix is upstream: make labels mandatory at deploy time, before workloads ever hit the cluster.
The labelling convention: 5 keys, every workload, no exceptions
Here are the labels we install on every workload (Deployments, StatefulSets, DaemonSets, Jobs):
metadata:
  labels:
    cost-team: data-platform            # which team pays
    cost-project: feature-store-v2      # which project this is for
    cost-environment: production        # prod / staging / dev
    cost-service: feature-store-api     # which microservice
    cost-tier: critical                 # critical / standard / batch
Why these 5? Each one answers a CFO question:
| Label | CFO question it answers |
|---|---|
| cost-team | "Which team's budget does this come out of?" |
| cost-project | "Is this for the new product launch or the old infra refresh?" |
| cost-environment | "How much are we spending on staging that nobody uses?" |
| cost-service | "Which microservice is the cost driver this month?" |
| cost-tier | "Can we kill the 'standard' tier pods on weekends?" |
Critical: these are labels, not annotations. Labels can be used in selectors (for example, kubectl get pods -l cost-team=data-platform); annotations cannot.
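One detail the convention above leaves implicit: labels on a Deployment's own metadata are not copied to the Pods it creates, and Pod-level metrics such as kube_pod_labels only see what is in the Pod template. A minimal sketch, assuming a hypothetical feature-store-api Deployment (the app label, replica count and image are placeholders, not part of the convention):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feature-store-api
  labels:                                # labels on the Deployment object itself
    cost-team: data-platform
    cost-project: feature-store-v2
    cost-environment: production
    cost-service: feature-store-api
    cost-tier: critical
spec:
  replicas: 2                            # placeholder
  selector:
    matchLabels:
      app: feature-store-api             # hypothetical selector label
  template:
    metadata:
      labels:                            # repeated here so the Pods carry them too
        app: feature-store-api
        cost-team: data-platform
        cost-project: feature-store-v2
        cost-environment: production
        cost-service: feature-store-api
        cost-tier: critical
    spec:
      containers:
        - name: api
          image: registry.example.com/feature-store-api:1.0.0   # placeholder image
```

The cost labels are deliberately kept out of spec.selector: selectors are immutable on a Deployment, so a team or project rename should only ever touch labels.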
The 3 enforcement mechanisms (pick at least one)
1. OPA / Gatekeeper policy (the gold standard)
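# Assumes the K8sRequiredLabels ConstraintTemplate is already installed.
# The plain-string list below matches the template from the Gatekeeper docs;
# the gatekeeper-library version expects "- key: cost-team" entries instead.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: must-have-cost-labels
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
      - apiGroups: ["batch"]
        kinds: ["Job", "CronJob"]
  parameters:
    labels:
      - "cost-team"
      - "cost-project"
      - "cost-environment"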
Any kubectl apply without those labels gets rejected at admission time. Hard but effective — the labels can never be skipped.
2. Kyverno policy (lighter weight)
If Gatekeeper feels heavy, Kyverno does the same with simpler YAML. We've seen 50% faster rollout when teams pick Kyverno over Gatekeeper for cost-tagging policies.
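For comparison, a Kyverno equivalent might look like the sketch below. It assumes Kyverno is installed and uses the kyverno.io/v1 ClusterPolicy API; the policy and rule names are made up for illustration. Newer Kyverno releases move the failure action to the rule level, so check the version you run before copying the spec-level field shown here.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels              # hypothetical name
spec:
  validationFailureAction: Enforce       # reject non-compliant resources instead of only auditing
  background: true                       # also report on resources that already exist
  rules:
    - name: check-cost-labels            # hypothetical name
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - Job
                - CronJob
      validate:
        message: "cost-team, cost-project and cost-environment labels are required."
        pattern:
          metadata:
            labels:
              cost-team: "?*"            # "?*" means any non-empty value
              cost-project: "?*"
              cost-environment: "?*"
```

Switching validationFailureAction to Audit first is a low-risk way to see how many existing workloads would fail before turning enforcement on.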
3. CI/CD gate (the soft fallback)
If your platform team can't enforce admission control yet, add a kubeval-style check to every CI pipeline:
# Reject any deploy that doesn't have cost-team + cost-project
yq -r '.metadata.labels | (has("cost-team") and has("cost-project"))' deploy.yaml | grep -q true || exit 1
Less bulletproof (developers can still run kubectl apply directly), but it unblocks 80% of teams.
The 4 query patterns most teams miss
Once labels are in, the queries are stupid simple. Here are the four every FinOps lead needs:
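One prerequisite worth stating, assuming the kube_pod_labels series in the queries below come from kube-state-metrics v2 or later: by default it does not export arbitrary Kubernetes labels, so label_cost_team and friends only appear once they are allow-listed. A sketch of the relevant container args (the rest of the kube-state-metrics Deployment is omitted):

```yaml
# Fragment of the kube-state-metrics pod spec; merge into your existing manifest or Helm values.
containers:
  - name: kube-state-metrics
    args:
      # Export the five cost labels from Pods as label_cost_* on kube_pod_labels.
      - --metric-labels-allowlist=pods=[cost-team,cost-project,cost-environment,cost-service,cost-tier]
```

With this in place, the Kubernetes label cost-team shows up as label_cost_team on kube_pod_labels (dashes become underscores).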
Pattern 1: Spend by team (the chargeback report)
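# kube_pod_resource_request comes from kube-scheduler's /metrics/resources endpoint;
# with kube-state-metrics alone, use kube_pod_container_resource_requests instead.
sum by (label_cost_team) (
  kube_pod_resource_request{resource="cpu"}
  * on(namespace, pod) group_left(label_cost_team)
  kube_pod_labels
) * <cost_per_core_hour>  # node hourly price / cores per node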
Output: { data-platform: $34K, web-app: $12K, ml-research: $8K }. Send to Finance monthly.
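The query itself yields an instantaneous figure, so a monthly report needs it sampled over time. One way to do that, sketched here as a Prometheus recording rule file (the group name and the recorded series name are assumptions chosen for this example):

```yaml
groups:
  - name: cost-allocation                          # hypothetical group name
    interval: 5m
    rules:
      # Requested CPU cores per team, stored as its own series for cheap long-range queries.
      - record: cost_team:cpu_requested_cores:sum  # hypothetical series name
        expr: |
          sum by (label_cost_team) (
            kube_pod_resource_request{resource="cpu"}
            * on(namespace, pod) group_left(label_cost_team)
            kube_pod_labels
          )
```

A 30-day figure then comes from avg_over_time(cost_team:cpu_requested_cores:sum[30d]) * 720 * your cost per core-hour, where 720 is the number of hours in 30 days.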
Pattern 2: Wasted spend by tier
sum by (label_cost_tier) (
  kube_pod_status_phase{phase="Running"}
  * on(namespace, pod) group_left(label_cost_tier)
  kube_pod_labels{label_cost_tier="batch"}
)
This counts the batch-tier pods currently running. Check which of them sit on prod-tier (on-demand) nodes and move those to spot. Save 60–80%.
Pattern 3: Idle resources by service
(
  sum by (namespace, pod) (kube_pod_resource_request{resource="cpu"})
  - sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1h]))
) > 0.5 # requesting >0.5 cores more than using
Find pods over-requesting CPU. Right-size them. Typical savings: 15–30% of cluster cost.
Pattern 4: Spend trend by environment
sum by (label_cost_environment) (
  kube_pod_resource_request{resource="memory"}
  * on(namespace, pod) group_left(label_cost_environment)
  kube_pod_labels
)
If staging requests more than 40% of what production does, staging is over-provisioned. Cut it in half.
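The 40% rule of thumb can also be checked automatically. A sketch as a Prometheus alerting rule, assuming the cost-environment values are exactly "staging" and "production" (the group name, alert name and severity are made up for this example):

```yaml
groups:
  - name: cost-guardrails                    # hypothetical group name
    rules:
      - alert: StagingOverProvisioned        # hypothetical alert name
        # Ratio of staging to production memory requests, cluster-wide.
        expr: |
          sum(
            kube_pod_resource_request{resource="memory"}
            * on(namespace, pod) group_left()
            kube_pod_labels{label_cost_environment="staging"}
          )
          /
          sum(
            kube_pod_resource_request{resource="memory"}
            * on(namespace, pod) group_left()
            kube_pod_labels{label_cost_environment="production"}
          )
          > 0.4
        for: 24h                             # ignore short-lived spikes such as load tests
        labels:
          severity: info
        annotations:
          summary: "Staging memory requests exceed 40% of production"
```

The long "for" window is a design choice: the point is to catch a staging environment that stays large, not one that briefly scales up for a test run.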
The 3 mistakes that kill K8s cost projects
Mistake 1: Allocating cost without enforcement
Teams add Kubecost, see "untagged: $34K", and assume the labels will eventually appear. They never do. Without OPA/Gatekeeper/Kyverno enforcement, untagged spend grows. It NEVER shrinks.
Mistake 2: Charging back without educating
You can't drop a $34K bill on the data team without warning. The first month after rolling out chargeback should be "shadow billing" — show the numbers but don't actually charge. Give teams 30 days to optimise. Then go live.
Mistake 3: Per-namespace allocation when teams share namespaces
Many K8s setups have one production namespace shared across teams. Allocating by namespace doesn't work — you need labels at the workload level, not the namespace level. This is why cost-team is a label on every Deployment, not just on namespaces.
What does success look like?
Here's a real before/after we ran with a 200-engineer SaaS company:
| Metric | Before | After 90 days |
|---|---|---|
| % of cluster spend allocatable | 38% | 96% |
| Time to answer "what changed?" | 6–8 hrs | <5 min |
| Total cluster spend | $180K/mo | $132K/mo (-27%) |
| Teams with their own optimisation goals | 0 | 7 |
The interesting number is the 27% spend reduction. Nothing about the labels themselves saves money. But once teams can SEE their own spend, they optimise it. That's the whole game.
How CARTIE AI helps
If you want this without the YAML-wrestling phase, CARTIE AI's Kubernetes Cost Allocation tool ingests your cluster, applies our default labelling convention, and gives you the 4 queries pre-built as dashboards. 10 minutes to first chargeback report.
But honestly? Even if you don't use a tool — just enforce the 5 labels with a Gatekeeper policy and write the 4 PromQL queries. That's 60% of the value of any FinOps platform, for free.
Now go tag your pods. 🥃