Backup strategy
CARTIE runs on MongoDB Atlas. Atlas Cloud Backup provides continuous snapshot-based protection at the cluster level. We layer per-tenant export jobs on top for auditability.
Tier 1: Atlas Cloud Backup (cluster-level)
- Snapshot cadence: full snapshot every 6 hours, with the oplog captured continuously between snapshots for replay
- Retention: 7 daily · 4 weekly · 12 monthly snapshots (≈ 12-month rolling window)
- Cross-region copy: enabled (us-east-1 → us-west-2)
- Encryption: AES-256 server-side at rest in S3; AWS KMS-managed keys
- Point-in-time recovery: down to 1-second granularity within last 7 days
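As a rough illustration of how the Tier 1 parameters interact, the sketch below decides whether a requested restore time is still reachable by point-in-time recovery or only by a retained snapshot. The function name `recovery_option` and the constants are hypothetical; only the 7-day window and 6-hour cadence come from the list above.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the Tier 1 list above.
PITR_WINDOW = timedelta(days=7)          # point-in-time recovery window
SNAPSHOT_INTERVAL = timedelta(hours=6)   # full-snapshot cadence


def recovery_option(target: datetime, now: datetime) -> str:
    """Return which Tier 1 mechanism can reach `target` (hypothetical helper)."""
    if target > now:
        raise ValueError("cannot restore to a time in the future")
    if now - target <= PITR_WINDOW:
        # Inside the window, any second can be reached by oplog replay.
        return "point-in-time"
    # Outside the window, only a retained snapshot is available.
    return "snapshot"
```

For example, a restore target three days old resolves to `"point-in-time"`, while one thirty days old falls back to `"snapshot"`.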
Tier 2: Per-tenant export (auditability)
A weekly cron job (Sundays at 03:00 UTC, immediately after the Pipeline Dedup cron) iterates over every tenant DB and writes a Cloud Backup snapshot reference into backup_audit_log. This gives operators a "tenant X exists in snapshot Y" attestation without taking redundant data copies.
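A minimal sketch of the attestation step, assuming one audit record per tenant DB. The field names (`tenant_db`, `snapshot_id`, `snapshot_taken_at`, `attested_at`) and the function `build_attestations` are illustrative assumptions, not the actual backup_audit_log schema.

```python
from datetime import datetime, timezone


def build_attestations(tenant_dbs, snapshot_id, taken_at, now=None):
    """Build one 'tenant X exists in snapshot Y' record per tenant DB.

    All field names here are hypothetical; adapt them to the real schema.
    """
    now = now or datetime.now(timezone.utc)
    return [
        {
            "tenant_db": name,
            "snapshot_id": snapshot_id,
            "snapshot_taken_at": taken_at,
            "attested_at": now,
        }
        for name in sorted(tenant_dbs)  # stable order makes audits diffable
    ]
```

The cron job would then write the returned list into backup_audit_log in a single bulk insert. Because only a snapshot reference is stored, the job copies no tenant data.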
RTO & RPO targets
| Scenario | RPO target | RTO target | Recovery method → achieved time |
|---|---|---|---|
| Single-document corruption (operator deletes the wrong record) | 0 min (oplog replay) | 30 min | Atlas point-in-time → 1 min |
| Single-collection corruption (logic bug wipes a collection) | ≤ 6 hours | 2 hours | Atlas snapshot restore → 1 hour |
| Whole-cluster loss (region outage) | ≤ 6 hours | 4 hours | Cross-region snapshot promote → 2 hours |
| Total catastrophic loss (Atlas data loss) | ≤ 7 days | 1 day | Cross-region copy + offsite weekly export |
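To make the headroom in the table explicit, the sketch below encodes each row's RTO target and achieved time in minutes and computes the difference. The scenario keys and the helper `rto_headroom` are hypothetical names; the catastrophic-loss row has no achieved time in the table, so it is recorded as `None`.

```python
# RTO targets and achieved times from the table above, in minutes.
RTO_TABLE = {
    "single_document": {"rto_min": 30, "achieved_min": 1},
    "single_collection": {"rto_min": 120, "achieved_min": 60},
    "whole_cluster": {"rto_min": 240, "achieved_min": 120},
    "catastrophic": {"rto_min": 1440, "achieved_min": None},  # no time recorded
}


def rto_headroom(scenario):
    """Minutes of slack between the RTO target and the achieved time.

    Returns None when no achieved time has been recorded for the scenario.
    """
    row = RTO_TABLE[scenario]
    if row["achieved_min"] is None:
        return None
    return row["rto_min"] - row["achieved_min"]
```

A negative result would mean the target was missed.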
Restoration drill
We run a quarterly restoration drill (next: 2026-05-18):
- Pick a random tenant DB (excluding production-customer DBs)
- Trigger Atlas Cloud Backup restore to a fresh staging cluster
- Verify document count matches snapshot; spot-check 10 random documents for content equality
- Compare time-to-restore against the RTO target; file a ticket if the target is missed
- Tear down staging cluster
Drill results are filed in the backup_drill_results collection and summarized in the quarterly Risk Assessment review.
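The verification step of the drill (count match plus a 10-document spot check) could look roughly like the sketch below. It assumes both sides have already been loaded as dictionaries keyed by document `_id`; the function name `verify_restore` and its parameters are illustrative.

```python
import random


def verify_restore(snapshot_docs, restored_docs, sample_size=10, seed=None):
    """Compare a restored tenant DB against its snapshot.

    Both arguments map document _id -> document (an assumption of this sketch).
    Returns a list of human-readable problems; an empty list means the check passed.
    """
    problems = []
    if len(snapshot_docs) != len(restored_docs):
        problems.append(
            f"count mismatch: snapshot={len(snapshot_docs)} "
            f"restored={len(restored_docs)}"
        )
    rng = random.Random(seed)  # a fixed seed makes a drill reproducible
    ids = list(snapshot_docs)
    for doc_id in rng.sample(ids, min(sample_size, len(ids))):
        if restored_docs.get(doc_id) != snapshot_docs[doc_id]:
            problems.append(f"document {doc_id!r} differs or is missing")
    return problems
```

Any non-empty result would be recorded alongside the drill's timing.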
Operator emergency runbook
Step 1 — identify the loss
- Was a single record deleted? → use Atlas point-in-time recovery (oplog replay, about 1 min)
- Was a collection emptied? → use Atlas snapshot restore (about 1 hour)
- Is the whole tenant gone? → use cross-region snapshot promotion (about 2 hours)
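The triage questions in Step 1 can be expressed as a small decision function that checks the broadest loss first. This is only a sketch: the inputs (`tenant_db_exists`, `empty_collections`, `missing_records`) are hypothetical signals an operator would gather by hand, and the expected durations are the ones quoted in the list above.

```python
from datetime import timedelta
from enum import Enum


class LossScope(Enum):
    SINGLE_RECORD = "single record deleted"
    COLLECTION = "collection emptied"
    WHOLE_TENANT = "whole tenant gone"


# Expected recovery times quoted in Step 1.
EXPECTED_RECOVERY = {
    LossScope.SINGLE_RECORD: timedelta(minutes=1),
    LossScope.COLLECTION: timedelta(hours=1),
    LossScope.WHOLE_TENANT: timedelta(hours=2),
}


def classify_loss(tenant_db_exists, empty_collections, missing_records):
    """Return the broadest matching LossScope, or None if nothing is missing."""
    if not tenant_db_exists:
        return LossScope.WHOLE_TENANT
    if empty_collections > 0:
        return LossScope.COLLECTION
    if missing_records > 0:
        return LossScope.SINGLE_RECORD
    return None
```

Checking the broadest scope first avoids treating a missing tenant as a set of individual record deletions.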
Step 2 — restore
- In the Atlas UI, navigate to Backup → Restore and select the most recent snapshot taken before the loss event
- Target: staging cluster first (NEVER restore directly over production)
- Once the staging restore is verified, cut production over to it via cluster swap
- Notify affected tenants within 1 hour of any data loss event (per Incident Response playbook)
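Step 2 calls for a snapshot from before the loss event; choosing the most recent such snapshot keeps data loss to a minimum. A minimal sketch of that selection, assuming the snapshot timestamps have already been listed (the function name `pick_snapshot` is hypothetical):

```python
from datetime import datetime, timezone


def pick_snapshot(snapshot_times, loss_time):
    """Return the latest snapshot time strictly before the loss event."""
    candidates = [t for t in snapshot_times if t < loss_time]
    if not candidates:
        raise LookupError("no snapshot predates the loss event")
    return max(candidates)


# Example with the 6-hour cadence: a loss at 13:30 UTC selects the 12:00 snapshot.
day = [datetime(2026, 2, 18, h, 0, tzinfo=timezone.utc) for h in (0, 6, 12, 18)]
chosen = pick_snapshot(day, datetime(2026, 2, 18, 13, 30, tzinfo=timezone.utc))
```

The comparison is strict on purpose: a snapshot taken at the exact moment of the loss cannot be assumed clean.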
Step 3 — post-mortem
- Document the loss + recovery in Incident Response registry within 24 hours
- Update the Risk Assessment if a new failure mode is discovered
- Customer comms: status page banner + email to affected tenants
Sign-off
Runbook approved 2026-02-18. Next review: 2026-05-18 (after the Q1 drill).