Severity classification
| Sev | Definition | MTTA target | MTTR target | Customer comms |
|---|---|---|---|---|
| SEV-1 | Production outage; all customers affected; or any confirmed data loss/leak | 15 min | 4 hr | Status page + email within 60 min |
| SEV-2 | Single critical flow broken (auth, billing, savings calculation); > 25% of customers affected | 1 hr | 1 business day | Status page within 2 hr; email if the incident lasts > 4 hr |
| SEV-3 | Non-critical feature degraded; ≤ 25% of customers affected | 4 hr | 3 business days | In-product banner; no email |
| SEV-4 | Cosmetic; no customer-facing impact | 1 business day | 2 weeks | None |
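The table can be sketched as a lookup plus a small classifier. This is a minimal illustration, not the team's tooling: the names `TARGETS` and `classify` and the boolean inputs are hypothetical. The table does not say how the two SEV-2 conditions combine, so the sketch assumes either one is enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityTargets:
    mtta: str            # time-to-acknowledge target
    mttr: str            # time-to-resolve target
    customer_comms: str  # required customer communication


# Values copied from the severity table.
TARGETS = {
    "SEV-1": SeverityTargets("15 min", "4 hr", "Status page + email within 60 min"),
    "SEV-2": SeverityTargets("1 hr", "1 business day", "Status page within 2 hr; email if > 4 hr"),
    "SEV-3": SeverityTargets("4 hr", "3 business days", "In-product banner; no email"),
    "SEV-4": SeverityTargets("1 business day", "2 weeks", "None"),
}


def classify(*, production_outage: bool, data_loss_or_leak: bool,
             critical_flow_broken: bool, pct_affected: float,
             customer_facing: bool) -> str:
    """Return the severity label, checking the most severe rule first."""
    if production_outage or data_loss_or_leak or pct_affected >= 100:
        return "SEV-1"
    # Assumption: a broken critical flow OR > 25% of customers is enough for SEV-2.
    if critical_flow_broken or pct_affected > 25:
        return "SEV-2"
    if customer_facing:
        return "SEV-3"
    return "SEV-4"
```

For example, `classify(production_outage=False, data_loss_or_leak=False, critical_flow_broken=True, pct_affected=5, customer_facing=True)` returns `"SEV-2"`, and `TARGETS["SEV-2"].mtta` gives the matching acknowledgement target.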
On-call
Single-founder stage (Feb 2026): The founder is on-call 24/7 and is paged in Slack through a PagerDuty-equivalent webhook (Sentry alerts → Slack #alerts channel, with an @here mention for SEV-1/2). When the team grows beyond 3 engineers, we will adopt a weekly rotation with PagerDuty primary/secondary.
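As a rough illustration of the paging rule, the sketch below builds the JSON body for a Slack incoming webhook and adds the @here mention only for SEV-1/2. The function name `build_slack_alert` is hypothetical, and the sketch makes no network call. It assumes the webhook itself is configured to post into #alerts; `<!here>` is how Slack message text encodes an @here mention.

```python
import json

# Severities that should notify everyone currently active in the channel.
PAGING_SEVERITIES = {"SEV-1", "SEV-2"}


def build_slack_alert(severity: str, summary: str) -> dict:
    """Return a Slack incoming-webhook payload for an alert (hypothetical helper)."""
    mention = "<!here> " if severity in PAGING_SEVERITIES else ""
    return {"text": f"{mention}[{severity}] {summary}"}


if __name__ == "__main__":
    # The body that would be POSTed to the webhook URL.
    print(json.dumps(build_slack_alert("SEV-1", "API returning 500s")))
```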
Detection sources
- Sentry — backend exceptions; alert if > 10 errors/min for 5 min
- Uptime monitor — pings /api/health every 30s; alert if 3 consecutive failures
- Customer report — any email to lakshmi@cartie.ai or in-app support ticket → triaged within MTTA target
- Security audit cron — weekly run flags anomalies in admin audit log
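The two numeric thresholds above can be expressed as small checks. This is a sketch under stated assumptions, not the actual Sentry or uptime-monitor configuration: the class and function names are hypothetical. "More than 10 errors/min for 5 min" is read here as every one of the last five one-minute buckets exceeding 10.

```python
from typing import Sequence


class ConsecutiveFailureAlert:
    """Fires once the health check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        # Any success resets the streak; a failure extends it.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold


def error_rate_breached(errors_per_minute: Sequence[int],
                        limit: int = 10, window: int = 5) -> bool:
    """True if each of the last `window` minutes had more than `limit` errors."""
    recent = list(errors_per_minute)[-window:]
    return len(recent) == window and all(n > limit for n in recent)
```

With a 30s ping interval, three consecutive failures means the uptime alert fires roughly 60 to 90 seconds after the endpoint goes down.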
Triage flow
- Acknowledge in Slack #alerts within MTTA target. State "I am on it"
- Classify severity using the table above
- SEV-1 only: open a war-room thread in #alerts. Tag the incident with a sequential ID (INC-YYYYMMDD-NN)
- Mitigate first, root-cause second. If the fix will take more than 30 min, consider turning the feature flag off or rolling back to the last green deploy
- Verify with the same detection source that originally fired
- Communicate per the "Customer comms" column of the severity table
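The incident ID format from the triage steps could be generated as follows. This is a sketch: `next_incident_id` is a hypothetical helper, and it assumes NN is a two-digit counter that restarts at 01 each day, which the playbook does not spell out.

```python
from datetime import date


def next_incident_id(day: date, existing_ids: list[str]) -> str:
    """Return the next INC-YYYYMMDD-NN identifier for the given day."""
    prefix = f"INC-{day:%Y%m%d}-"
    # Collect the sequence numbers already used on this day.
    used = [int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)]
    return f"{prefix}{max(used, default=0) + 1:02d}"
```

For instance, the first incident opened on 18 Feb 2026 would be tagged `INC-20260218-01`.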
Customer communication templates
SEV-1 status page (within 60 min)
[INVESTIGATING] {start_time} — We're investigating reports of {what_is_broken}. Customers may experience {impact}. We'll post the next update at {now + 30min}.
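Filling the status-page template can be sketched like this. The function name is hypothetical, and the template string is a lightly adapted copy: the `{now + 30min}` placeholder is renamed `{next_update}` because `str.format` needs a plain field name. The sketch assumes timezone-aware datetimes and formats everything in UTC.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=30)  # "next update" cadence from the template
TIME_FORMAT = "%Y-%m-%d %H:%M UTC"

TEMPLATE = (
    "[INVESTIGATING] {start_time}: We're investigating reports of "
    "{what_is_broken}. Customers may experience {impact}. "
    "We'll post the next update at {next_update}."
)


def render_status_update(template: str, *, start_time: datetime,
                         what_is_broken: str, impact: str,
                         now: datetime) -> str:
    """Fill the template; `start_time` and `now` must be timezone-aware."""
    next_update = (now + UPDATE_INTERVAL).astimezone(timezone.utc)
    return template.format(
        start_time=start_time.astimezone(timezone.utc).strftime(TIME_FORMAT),
        what_is_broken=what_is_broken,
        impact=impact,
        next_update=next_update.strftime(TIME_FORMAT),
    )
```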
SEV-1 customer email (post-resolution)
Subject: Service incident on {date} — your data is safe
On {date} between {start} and {end} UTC, CARTIE experienced {1-sentence what}. {Affected feature} was unavailable. No customer data was lost or exposed.
Root cause: {1 sentence}.
Fix: {1 sentence}.
Prevention: {1 sentence describing the controls we are adding}.
We don't think this kind of failure should happen, and we're committed to making sure it doesn't happen again. Reply to this email if you have any questions; I read every one.
— Lakshmi
Post-mortem requirements (mandatory for SEV-1 & SEV-2)
Within 5 business days of resolution, the on-call writes a blameless post-mortem with:
- Incident ID + severity + duration
- Timeline (UTC) — every action, every alert
- Root cause (technical) + contributing causes (process, organizational)
- What worked / what didn't
- Action items: each with an owner and a due date (no "TBD"s)
- Customer impact estimate (# customers, $ revenue impact if any)
Post-mortems are filed in incident_post_mortems and reviewed in the next quarterly Risk Assessment.
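Two of the requirements above are easy to check mechanically: the 5-business-day deadline and the "no TBDs" rule for action items. The sketch below is illustrative only. The helper names and the dictionary keys `owner` and `due` are assumptions, and public holidays are ignored when counting business days.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days -= 1
    return current


def incomplete_action_items(items: list[dict]) -> list[dict]:
    """Return action items whose owner or due date is missing or 'TBD'."""
    def missing(value) -> bool:
        return value is None or str(value).strip().upper() in {"", "TBD"}

    return [it for it in items if missing(it.get("owner")) or missing(it.get("due"))]
```

An incident resolved on Wednesday 2026-02-18 would have its post-mortem due by `add_business_days(date(2026, 2, 18), 5)`, which is Wednesday 2026-02-25.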
Practice drills
We run a quarterly chaos drill (next: 2026-05-18):
- Founder picks an unannounced failure scenario from a hat (e.g., "MongoDB primary down", "Stripe webhook receiver returns 500", "LLM provider rate-limits us")
- Inject the failure in staging (NEVER prod)
- Time how long it takes to: detect → triage → mitigate → notify customers (using the templates above)
- File results in incident_drill_results; update playbook if any gap found
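The drill timings can be derived from a handful of timestamps. This is a minimal sketch: the phase names and `phase_durations` are hypothetical, and it assumes one timestamp is noted by hand at each step of the drill.

```python
from datetime import datetime

# Ordered drill phases: inject, then detect -> triage -> mitigate -> notify.
PHASES = ["injected", "detected", "triaged", "mitigated", "customers_notified"]


def phase_durations(timestamps: dict[str, datetime]) -> dict[str, float]:
    """Return minutes elapsed between each pair of consecutive phases."""
    durations = {}
    for prev, cur in zip(PHASES, PHASES[1:]):
        delta = timestamps[cur] - timestamps[prev]
        durations[f"{prev}->{cur}"] = delta.total_seconds() / 60
    return durations
```

Comparing the `injected->detected` and `detected->triaged` figures against the MTTA target for the simulated severity is one way to spot gaps worth recording in the drill results.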
Sign-off
Playbook approved Feb 18, 2026. Next review: May 18, 2026 (after the quarterly chaos drill scheduled for the same day).