Current capacity (Feb 2026)
- FastAPI backend: single-pod (Kubernetes), 2 vCPU + 4GB RAM, supervisor-managed
- React frontend: single-pod static-served via nginx, behind the same ingress
- MongoDB Atlas: M10 dedicated cluster, ~300 GB storage, 3-node replica set
- APScheduler: in-process AsyncIO scheduler (single instance) running ~12 cron jobs
Scaling triggers
Each trigger pairs a measurable threshold with a documented response. We do NOT over-provision speculatively; we DO act when a threshold trips.
FastAPI backend
| Metric | Trigger | Response |
|---|---|---|
| p95 request latency | > 800ms for 24h | Add horizontal replica (HPA target: 60% CPU) |
| CPU utilisation | > 70% sustained over 4h | Scale pod RAM/CPU up one tier |
| Error rate | > 0.5% over 1h | SEV-2 incident; do NOT auto-scale, investigate first |
| Concurrent tenants | > 50 active orgs | Switch from the in-process scheduler to an externalized (Redis-backed) scheduler |
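
The rows above can be read as data: a metric, a threshold with an observation window, and a response. The following is a minimal sketch of that shape, not the production implementation. It assumes each metric arrives already aggregated over its window, and the `Trigger` class, the metric keys, and the `tripped` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str       # hypothetical metric key, not an existing identifier
    threshold: float  # value the metric must exceed
    window: str       # observation window taken from the table
    response: str     # documented response


# Rows transcribed from the FastAPI backend table.
BACKEND_TRIGGERS = [
    Trigger("p95_latency_ms", 800, "24h", "Add horizontal replica (HPA target: 60% CPU)"),
    Trigger("cpu_utilisation_pct", 70, "4h", "Scale pod RAM/CPU up one tier"),
    Trigger("error_rate_pct", 0.5, "1h", "SEV-2 incident; investigate before scaling"),
    Trigger("active_orgs", 50, "current", "Externalize scheduler (Redis-backed)"),
]


def tripped(metrics: dict[str, float], triggers=BACKEND_TRIGGERS) -> list[Trigger]:
    """Return the triggers whose metric strictly exceeds its threshold.

    Assumes every value in `metrics` is already aggregated over the
    trigger's window; metrics missing from the dict are skipped.
    """
    return [
        t for t in triggers
        if t.metric in metrics and metrics[t.metric] > t.threshold
    ]


if __name__ == "__main__":
    sample = {"p95_latency_ms": 910, "cpu_utilisation_pct": 55, "error_rate_pct": 0.2}
    for t in tripped(sample):
        print(f"{t.metric} > {t.threshold} over {t.window}: {t.response}")
```

Keeping the response text next to the threshold means an alert can carry the documented action with it, which matters most for the error-rate row, where the response is to investigate rather than to scale.
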
MongoDB Atlas
| Metric | Trigger | Response |
|---|---|---|
| Storage used | > 80% of cluster size | Upgrade tier (M10 → M20 → M30 → sharded) |
| Connections | > 70% of pool | Increase Motor pool size (maxPoolSize); if connections stay above the threshold, increase replica count |
| Per-tenant DB count | > 200 tenant DBs on one cluster | Move to multi-cluster sharding (router lookup by tenant prefix) |
| Replication lag | > 5 sec on secondaries | SEV-3 ticket; investigate slow queries and audit indexes |
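
As an illustration of the storage row, the upgrade path can be treated as an ordered ladder. This is a rough sketch under stated assumptions: `storage_action`, `TIER_LADDER`, and the gigabyte inputs are hypothetical, and the sizes would in practice come from the Atlas dashboard rather than being passed in by hand.

```python
from typing import Optional

# Upgrade path from the Storage row; "sharded" is the last rung.
TIER_LADDER = ["M10", "M20", "M30", "sharded"]
STORAGE_TRIGGER = 0.80  # fraction of cluster size, from the table


def storage_action(used_gb: float, cluster_size_gb: float, tier: str) -> Optional[str]:
    """Return the next tier when storage use exceeds 80% of cluster size.

    Returns None when the trigger has not tripped, or when the cluster
    is already on the last rung of the ladder.
    """
    if cluster_size_gb <= 0:
        raise ValueError("cluster_size_gb must be positive")
    if used_gb / cluster_size_gb <= STORAGE_TRIGGER:
        return None
    idx = TIER_LADDER.index(tier)  # raises ValueError for an unknown tier
    if idx == len(TIER_LADDER) - 1:
        return None  # already sharded; nothing further on this ladder
    return TIER_LADDER[idx + 1]
```

For example, `storage_action(250, 300, "M10")` reports `"M20"` because 250 GB is about 83% of 300 GB, while exactly 80% does not trip the trigger.
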
APScheduler / cron
| Metric | Trigger | Response |
|---|---|---|
| Job execution time | > 50% of interval | Optimize query; split job by tenant batch |
| Job count | > 20 concurrent jobs | Externalize scheduler to dedicated worker pod |
| Missed runs | ≥ 2/week | Increase misfire_grace_time and move to the externalized scheduler |
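
The execution-time and missed-run rows reduce to two small checks. The sketch below uses only the standard library and assumes run durations and weekly missed-run counts are collected elsewhere (for example from job logs); the function names are hypothetical and do not call APScheduler itself.

```python
from datetime import timedelta


def job_overrunning(duration: timedelta, interval: timedelta) -> bool:
    """True when a run takes more than 50% of its scheduling interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    return duration > interval * 0.5


def missed_runs_tripped(missed_this_week: int) -> bool:
    """True at two or more missed runs in a week, matching the table."""
    return missed_this_week >= 2


if __name__ == "__main__":
    # A 6-minute run of a job scheduled every 10 minutes trips the trigger.
    print(job_overrunning(timedelta(minutes=6), timedelta(minutes=10)))
```

A job that regularly uses more than half its interval leaves little slack for a slow run before the next one is due, which is why the first response is to optimize or split the job rather than to relax the schedule.
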
Headroom check cadence
Monthly headroom review (first Tuesday of each month): the founder reads the Atlas and Sentry dashboards and files a "headroom report" in capacity_reports. If any metric is within 25% of its trigger, we act BEFORE the trigger fires.
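
One way to make the 25% rule concrete is shown below. It is a sketch under one interpretation, namely that "within 25% of its trigger" means the metric has reached at least 75% of the trigger value; the helper name `within_headroom_margin` is hypothetical.

```python
def within_headroom_margin(value: float, trigger: float, margin: float = 0.25) -> bool:
    """True when `value` has reached at least (1 - margin) of `trigger`.

    Assumes a higher-is-worse metric with a positive trigger value,
    which holds for every row in the tables in this runbook.
    """
    if trigger <= 0:
        raise ValueError("trigger must be positive")
    return value >= trigger * (1 - margin)


if __name__ == "__main__":
    # p95 latency trigger is 800 ms, so the early-action line sits at 600 ms.
    print(within_headroom_margin(650, 800))
    print(within_headroom_margin(500, 800))
```

Under this reading, the early-action line for p95 latency is 600 ms and for CPU utilisation it is 52.5%.
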
Forecasting
We use CARTIE's own Roadmap-Aware Forecast internally to predict cost (and therefore capacity needs) as the customer count grows. The Jira automation that imports product-launch events also files a capacity-review ticket whenever projected MAU growth exceeds 50%.
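
The 50% growth condition is simple arithmetic, sketched here for clarity. The function name and inputs are hypothetical, and the forecast horizon is left to the caller because it is not specified above.

```python
def growth_exceeds(current_mau: int, projected_mau: int, limit: float = 0.50) -> bool:
    """True when projected MAU growth relative to current MAU exceeds `limit`."""
    if current_mau <= 0:
        raise ValueError("current_mau must be positive")
    growth = (projected_mau - current_mau) / current_mau
    return growth > limit


if __name__ == "__main__":
    # 1,000 -> 1,600 MAU is 60% growth, so a capacity-review ticket would fire.
    print(growth_exceeds(1000, 1600))
```
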
Sign-off
Runbook approved Feb 18, 2026. Next review: May 18, 2026.