SOC 2 evidence document

Incident Response Playbook

Severity classification, on-call rotation, customer-comms templates, post-mortem requirements.

Version 1.0 | Last reviewed: 2026-02-18 | Owner: Founder

Severity classification

Sev | Definition | MTTA target | MTTR target | Customer comms
SEV-1 | Production outage; all customers affected; or any confirmed data loss/leak | 15 min | 4 hr | Status page + email within 60 min
SEV-2 | Single critical flow broken (auth, billing, savings calculation); > 25% of customers affected | 1 hr | 1 business day | Status page within 2 hr; email if > 4 hr
SEV-3 | Non-critical feature degraded; ≤ 25% of customers affected | 4 hr | 3 business days | In-product banner; no email
SEV-4 | Cosmetic; no customer-facing impact | 1 business day | 2 weeks | None
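The table above can be read as a decision rule evaluated from the most severe row down. The sketch below is one possible encoding, not the playbook's official tooling. The `Incident` fields and the `classify` function are hypothetical names, and the code assumes the semicolon-separated conditions in each row are alternatives (any one of them is enough to reach that severity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical triage inputs; field names are illustrative only."""
    production_outage: bool = False
    data_loss_or_leak: bool = False       # confirmed loss/leak only
    critical_flow_broken: bool = False    # auth, billing, savings calculation
    pct_customers_affected: float = 0.0   # 0 to 100
    customer_facing: bool = True


def classify(incident: Incident) -> str:
    """Return the severity label, checking the most severe row first."""
    # SEV-1: outage, all customers affected, or confirmed data loss/leak.
    if (incident.production_outage
            or incident.data_loss_or_leak
            or incident.pct_customers_affected >= 100):
        return "SEV-1"
    # SEV-2: a critical flow is broken, or more than 25% of customers are hit
    # (assumption: either condition alone qualifies).
    if incident.critical_flow_broken or incident.pct_customers_affected > 25:
        return "SEV-2"
    # SEV-3: customer-facing degradation at or below the 25% threshold.
    if incident.customer_facing:
        return "SEV-3"
    # SEV-4: cosmetic, no customer-facing impact.
    return "SEV-4"
```

For example, `classify(Incident(critical_flow_broken=True, pct_customers_affected=10))` returns `"SEV-2"` under these assumptions, because a broken critical flow outranks the percentage threshold.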

On-call

Single-founder stage (Feb 2026): the Founder is on call 24/7. Paging runs through a PagerDuty-equivalent webhook: Sentry alerts post to the Slack #alerts channel, with an @here mention for SEV-1 and SEV-2. Once the team grows beyond 3 engineers, we will adopt a weekly rotation with PagerDuty primary and secondary on-call.

Detection sources

  • Sentry — backend exceptions; alert if > 10 errors/min for 5 min
  • Uptime monitor — pings /api/health every 30s; alert if 3 consecutive failures
  • Customer report — any email to lakshmi@cartie.ai or in-app support ticket → triaged within MTTA target
  • Security audit cron — weekly run flags anomalies in admin audit log
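The two numeric thresholds above can be expressed as small predicates. This is a minimal sketch of the logic only; it does not call Sentry or any uptime-monitor API, and the function names are hypothetical. It assumes "> 10 errors/min for 5 min" means each of the last five one-minute buckets exceeded 10 errors.

```python
from typing import Sequence


def sentry_should_alert(errors_per_minute: Sequence[int],
                        threshold: int = 10,
                        window: int = 5) -> bool:
    """True if each of the last `window` one-minute buckets exceeds `threshold`."""
    recent = errors_per_minute[-window:]
    # Require a full window so a short history never triggers an alert.
    return len(recent) == window and all(count > threshold for count in recent)


def uptime_should_alert(checks: Sequence[bool], max_failures: int = 3) -> bool:
    """True if the last `max_failures` health checks all failed.

    `checks` holds one boolean per /api/health ping (True means healthy),
    oldest first.
    """
    recent = checks[-max_failures:]
    return len(recent) == max_failures and not any(recent)
```

With a 30-second ping interval, three consecutive failures correspond to roughly 60 to 90 seconds of downtime before the uptime alert fires.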

Triage flow

  1. Acknowledge in Slack #alerts within the MTTA target. State "I am on it".
  2. Classify severity using the table above
  3. SEV-1 only: open a war-room thread in #alerts. Tag the incident with a sequential ID (INC-YYYYMMDD-NN)
  4. Mitigate first, root-cause second. If the fix will take more than 30 min, consider turning the feature off via its feature flag or rolling back to the last green deploy
  5. Verify with the same detection source that originally fired
  6. Communicate per the severity-comms table
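The sequential ID format from step 3 (INC-YYYYMMDD-NN) can be generated mechanically. The sketch below assumes NN restarts at 01 each day and that previously issued IDs are available as a list; `next_incident_id` is a hypothetical helper, not an existing tool.

```python
from datetime import date
from typing import List


def next_incident_id(day: date, existing_ids: List[str]) -> str:
    """Return the next INC-YYYYMMDD-NN ID for `day`.

    Assumption: NN is a per-day counter that starts at 01.
    """
    prefix = f"INC-{day:%Y%m%d}-"
    # Collect the counters already used on this day.
    used = [int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)]
    return f"{prefix}{max(used, default=0) + 1:02d}"
```

For example, calling it for 2026-02-18 with no prior incidents that day yields `INC-20260218-01`.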

Customer communication templates

SEV-1 status page (within 60 min)

[INVESTIGATING] {start_time} — We're investigating reports of {what_is_broken}. Customers may experience {impact}. We'll post the next update at {now + 30min}.

SEV-1 customer email (post-resolution)

Subject: Service incident on {date} — your data is safe

On {date} between {start} and {end} UTC, CARTIE experienced {1-sentence what}. {Affected feature} was unavailable. No customer data was lost or exposed.

Root cause: {1 sentence}.
Fix: {1 sentence}.
Prevention: {1 sentence describing the controls we are adding}.

We don't think this kind of failure should happen, and we're committed to making it not happen again. Reply to this email if you have any questions; I read every one.

— Lakshmi

Post-mortem requirements (mandatory for SEV-1 & SEV-2)

Within 5 business days of resolution, the on-call writes a blameless post-mortem with:

  • Incident ID + severity + duration
  • Timeline (UTC) — every action, every alert
  • Root cause (technical) + contributing causes (process, organizational)
  • What worked / what didn't
  • Action items: each with an owner and a due date (no "TBD"s)
  • Customer impact estimate (# customers, $ revenue impact if any)

Post-mortems are filed in incident_post_mortems and reviewed in the next quarterly Risk Assessment.
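Two of the requirements above are easy to check mechanically: the 5-business-day deadline and the rule that every action item has an owner and a due date. The following is a sketch under stated assumptions: business days are Monday to Friday with no holiday calendar, and `ActionItem`, `add_business_days`, and `invalid_action_items` are hypothetical names rather than part of any existing system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri, no holidays)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days -= 1
    return current


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]


def invalid_action_items(items: List[ActionItem]) -> List[ActionItem]:
    """Return items that lack a real owner or a due date ("TBD" counts as missing)."""
    return [
        item for item in items
        if not item.owner.strip()
        or item.owner.strip().upper() == "TBD"
        or item.due is None
    ]
```

An incident resolved on Wednesday 2026-02-18 would have its post-mortem due by `add_business_days(date(2026, 2, 18), 5)`, which is Wednesday 2026-02-25 when holidays are ignored.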

Practice drills

We run a quarterly chaos drill (next: 2026-05-18):

  1. Founder picks an unannounced failure scenario from a hat (e.g., "MongoDB primary down", "Stripe webhook receiver returns 500", "LLM provider rate-limits us")
  2. Inject the failure in staging (NEVER prod)
  3. Time how long it takes to: detect → triage → mitigate → notify customers (using the templates above)
  4. File results in incident_drill_results; update playbook if any gap found
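The timing in step 3 can be captured by recording a timestamp at each phase boundary and taking the differences. This is an illustrative sketch only: the phase names and the `phase_durations` helper are assumptions, and the schema of incident_drill_results is not specified here.

```python
from datetime import datetime
from typing import Dict

# Assumed phase order: failure injected, then detect, triage, mitigate, notify.
PHASES = ["injected", "detected", "triaged", "mitigated", "notified"]


def phase_durations(timestamps: Dict[str, datetime]) -> Dict[str, float]:
    """Return minutes spent reaching each phase from the previous one."""
    durations = {}
    for previous, current in zip(PHASES, PHASES[1:]):
        delta = timestamps[current] - timestamps[previous]
        durations[current] = delta.total_seconds() / 60
    return durations
```

The "detected" duration can then be compared against the MTTA target for the severity the scenario simulates, and the sum of all phases against the customer-comms deadline.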

Sign-off

Playbook approved Feb 18, 2026. Next review: May 18, 2026 (after the next quarterly chaos drill).

Linked SOC 2 controls: CC7.2
