SOC 2 evidence document

MongoDB Backup & Recovery Runbook

Snapshot frequency, retention, restoration drill schedule, RTO & RPO targets.

Version 1.0 · Last reviewed 2026-02-18 · Owner: Founder

Backup strategy

CARTIE runs on MongoDB Atlas. Atlas Cloud Backup provides continuous snapshot-based protection at the cluster level. We layer per-tenant export jobs on top for auditability.

Tier 1: Atlas Cloud Backup (cluster-level)

  • Snapshot cadence: full snapshot every 6 hours, plus continuous oplog capture for point-in-time replay
  • Retention: 7 daily · 4 weekly · 12 monthly = ~13-month rolling window
  • Cross-region copy: enabled (us-east-1 → us-west-2)
  • Encryption: AES-256 server-side at rest in S3; AWS KMS-managed keys
  • Point-in-time recovery: down to 1-second granularity within last 7 days
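
As an illustration of the point-in-time window above, here is a minimal sketch that checks whether a desired restore target still falls inside the 7-day window. The function name and the constant are hypothetical and are not part of Atlas or of CARTIE's tooling.

```python
from datetime import datetime, timedelta, timezone

# From the runbook: point-in-time recovery is available within the last 7 days.
PITR_WINDOW = timedelta(days=7)


def within_pitr_window(target, now):
    """Return True if `target` is reachable by point-in-time recovery.

    Both arguments are timezone-aware datetimes. This helper is hypothetical.
    """
    if target > now:
        raise ValueError("restore target is in the future")
    return now - target <= PITR_WINDOW


now = datetime(2026, 2, 18, 12, 0, tzinfo=timezone.utc)
print(within_pitr_window(now - timedelta(days=3), now))
print(within_pitr_window(now - timedelta(days=8), now))
```

A target older than the window would have to be served from the nearest retained daily, weekly, or monthly snapshot instead.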

Tier 2: Per-tenant export (auditability)

A weekly cron job (Sundays 03:00 UTC, immediately after the Pipeline Dedup cron) iterates over every tenant DB and writes a Cloud Backup snapshot reference into backup_audit_log. This gives operators a "tenant X exists in snapshot Y" attestation without taking redundant data copies.
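
The attestation records could be built roughly as follows. This is a sketch under stated assumptions: the field names, the snapshot ID format, and the function name are all hypothetical, since the runbook does not specify the schema of backup_audit_log.

```python
from datetime import datetime, timezone


def build_audit_entries(tenant_dbs, snapshot_id, snapshot_taken_at, recorded_at=None):
    """Build one "tenant X exists in snapshot Y" record per tenant DB.

    Field names are assumptions; the real backup_audit_log schema may differ.
    """
    recorded_at = recorded_at or datetime.now(timezone.utc)
    return [
        {
            "tenant_db": name,
            "snapshot_id": snapshot_id,
            "snapshot_taken_at": snapshot_taken_at,
            "recorded_at": recorded_at,
        }
        for name in sorted(tenant_dbs)
    ]


entries = build_audit_entries(
    ["tenant_b", "tenant_a"],
    "snap-001",  # hypothetical snapshot reference
    datetime(2026, 2, 15, 0, 0, tzinfo=timezone.utc),
    datetime(2026, 2, 15, 3, 0, tzinfo=timezone.utc),
)
# With pymongo, the records could then be written via
# db["backup_audit_log"].insert_many(entries); that step is omitted here.
print(len(entries))
```

Only references are stored, which matches the stated goal of attesting tenant coverage without copying data.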

RTO & RPO targets

| Scenario | RPO target | RTO target | Achievement |
| --- | --- | --- | --- |
| Single-document corruption (operator deletes a wrong record) | 0 min (oplog replay) | 30 min | Atlas point-in-time → 1 min |
| Single-collection corruption (logic bug wipes a table) | ≤ 6 hours | 2 hours | Atlas snapshot restore → 1 hour |
| Whole-cluster loss (region outage) | ≤ 6 hours | 4 hours | Cross-region snapshot promote → 2 hours |
| Total catastrophic loss (Atlas data loss) | ≤ 7 days | 1 day | Cross-region copy + offsite weekly export |
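
One way to make these targets machine-checkable is to encode them as data and compare a measured incident against them. The sketch below assumes hypothetical scenario keys and a hypothetical function name; only the target durations come from the table.

```python
from datetime import timedelta

# RPO/RTO targets copied from the table; the dictionary keys are hypothetical labels.
TARGETS = {
    "single_document": {"rpo": timedelta(0), "rto": timedelta(minutes=30)},
    "single_collection": {"rpo": timedelta(hours=6), "rto": timedelta(hours=2)},
    "whole_cluster": {"rpo": timedelta(hours=6), "rto": timedelta(hours=4)},
    "catastrophic": {"rpo": timedelta(days=7), "rto": timedelta(days=1)},
}


def meets_targets(scenario, data_loss_window, time_to_restore):
    """Compare measured data loss and restore time against the targets."""
    target = TARGETS[scenario]
    return {
        "rpo_met": data_loss_window <= target["rpo"],
        "rto_met": time_to_restore <= target["rto"],
    }


result = meets_targets("single_collection", timedelta(hours=3), timedelta(hours=1))
print(result)
```

A missed RTO in this check would correspond to the "file ticket if missed" step of the restoration drill.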

Restoration drill

We run a quarterly restoration drill (next: 2026-05-18):

  1. Pick a random tenant DB (excluding production-customer DBs)
  2. Trigger Atlas Cloud Backup restore to a fresh staging cluster
  3. Verify document count matches snapshot; spot-check 10 random documents for content equality
  4. Verify time-to-restore vs RTO target; file ticket if missed
  5. Tear down staging cluster

Drill results are filed in the backup_drill_results collection and summarized in the quarterly Risk Assessment review.
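
The verification in drill step 3 (document count plus a 10-document spot check) could be sketched as below. The sketch works on in-memory dictionaries keyed by _id so it stays self-contained; in a real drill the documents would come from the source snapshot and the restored staging cluster. The function name and result fields are hypothetical.

```python
import random


def verify_restore(source_docs, restored_docs, sample_size=10, seed=None):
    """Compare counts and spot-check a random sample for content equality.

    Both inputs map _id -> document. Returns a small report dictionary.
    """
    count_ok = len(source_docs) == len(restored_docs)
    rng = random.Random(seed)
    ids = list(source_docs)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    mismatched = [i for i in sample if restored_docs.get(i) != source_docs[i]]
    return {"count_ok": count_ok, "sampled": len(sample), "mismatched_ids": mismatched}


source = {i: {"_id": i, "value": i * 2} for i in range(20)}
report = verify_restore(source, dict(source), seed=1)
print(report)
```

A report like this could be stored as one record per drill in backup_drill_results, though the actual schema of that collection is not described here.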

Operator emergency runbook

Step 1 — identify the loss

  • Was a single record deleted? → use oplog replay (1 min)
  • Was a collection emptied? → use point-in-time recovery (1 hour)
  • Is the whole tenant gone? → use cross-region snapshot promotion (2 hours)
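
The triage above amounts to a small lookup from loss type to recovery method. A minimal sketch, with hypothetical enum and function names, might look like this:

```python
from enum import Enum


class Loss(Enum):
    SINGLE_RECORD = "single record deleted"
    COLLECTION_EMPTIED = "collection emptied"
    WHOLE_TENANT = "whole tenant gone"


# Methods and expected durations as listed in Step 1 of the runbook.
RECOVERY = {
    Loss.SINGLE_RECORD: ("oplog replay", "1 min"),
    Loss.COLLECTION_EMPTIED: ("point-in-time recovery", "1 hour"),
    Loss.WHOLE_TENANT: ("cross-region snapshot promotion", "2 hours"),
}


def recovery_for(loss):
    """Return the recommended recovery method and its expected duration."""
    method, expected = RECOVERY[loss]
    return f"{method} (expected: {expected})"


print(recovery_for(Loss.COLLECTION_EMPTIED))
```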

Step 2 — restore

  1. In the Atlas UI, navigate to Backup → Restore and select the most recent snapshot taken before the loss event
  2. Target: staging cluster first (NEVER restore directly over production)
  3. Once the staging restore is verified, promote it to production via cluster swap
  4. Notify affected tenants within 1 hour of any data loss event (per Incident Response playbook)
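
Two of the restore steps reduce to simple time arithmetic: choosing the latest snapshot that predates the loss event, and computing the 1-hour tenant notification deadline. The sketch below assumes the operator already has a list of snapshot timestamps (for example, read from the Atlas UI); both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def pick_snapshot(snapshot_times, loss_time):
    """Return the most recent snapshot time strictly before the loss event."""
    candidates = [t for t in snapshot_times if t < loss_time]
    if not candidates:
        raise ValueError("no snapshot predates the loss event")
    return max(candidates)


def notification_deadline(loss_time):
    """Tenants must be notified within 1 hour of the data loss event."""
    return loss_time + timedelta(hours=1)


# Snapshots every 6 hours, as in the Tier 1 cadence.
snapshots = [datetime(2026, 2, 18, h, tzinfo=timezone.utc) for h in (0, 6, 12, 18)]
loss = datetime(2026, 2, 18, 13, 30, tzinfo=timezone.utc)
print(pick_snapshot(snapshots, loss))
print(notification_deadline(loss))
```

Using a strict "before" comparison avoids selecting a snapshot taken at the same instant as the loss, which might already contain the damage.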

Step 3 — post-mortem

  • Document the loss and recovery in the Incident Response registry within 24 hours
  • Update the Risk Assessment if a new failure mode is discovered
  • Customer comms: status page banner + email to affected tenants

Sign-off

Runbook approved 2026-02-18. Next review: 2026-05-18 (post Q1 drill).

Linked SOC 2 controls
CC7.1
