Backup strategy

CARTIE runs on MongoDB Atlas. Atlas Cloud Backup provides continuous snapshot-based protection at the cluster level. We layer per-tenant export jobs on top for auditability.

Tier 1: Atlas Cloud Backup (cluster-level)

Snapshot cadence: full snapshot every 6 hours, plus continuous oplog capture for point-in-time replay
Retention: 7 daily · 4 weekly · 12 monthly (roughly a 12-month rolling window)
Cross-region copy: enabled (us-east-1 → us-west-2)
Encryption: AES-256 server-side at rest in S3; AWS KMS-managed keys
Point-in-time recovery: down to 1-second granularity within the last 7 days
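
The interaction between the 7-day point-in-time window and the snapshot schedule can be sketched as below. This is a minimal illustration, not Atlas tooling: the function name choose_restore_point and the idea of passing snapshot timestamps in as a list are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence, Tuple

PITR_WINDOW = timedelta(days=7)  # point-in-time window from the Tier 1 settings


def choose_restore_point(
    loss_time: datetime,
    now: datetime,
    snapshot_times: Sequence[datetime],
) -> Tuple[str, datetime]:
    """Pick a restore target strictly before the loss event.

    Returns ("pitr", t) when t falls inside the point-in-time window,
    otherwise ("snapshot", t) for the newest snapshot taken before the loss.
    """
    if loss_time > now:
        raise ValueError("loss_time is in the future")
    target = loss_time - timedelta(seconds=1)  # 1-second granularity
    if now - target <= PITR_WINDOW:
        return ("pitr", target)
    earlier = [t for t in snapshot_times if t < loss_time]
    if not earlier:
        raise LookupError("no snapshot predates the loss event")
    return ("snapshot", max(earlier))


if __name__ == "__main__":
    now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
    print(choose_restore_point(now - timedelta(days=2), now, []))
```

A loss older than 7 days falls back to the newest snapshot before the event, so the effective data loss is bounded by the snapshot interval rather than by the oplog.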

Tier 2: Per-tenant export (auditability)

A weekly cron (Sundays 03:00 UTC, immediately after the Pipeline Dedup cron) iterates over every tenant DB and writes a Cloud Backup snapshot reference into backup_audit_log. This gives operators a "tenant X exists in snapshot Y" attestation without taking redundant data copies.
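
A rough sketch of what such an audit job might look like follows. Several things here are assumptions rather than the actual CARTIE implementation: the tenant_ database prefix, the record field names, the ops database that holds backup_audit_log, and the expectation that the snapshot ID is obtained elsewhere and passed in. The client argument is expected to behave like a pymongo MongoClient.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List

TENANT_PREFIX = "tenant_"  # hypothetical naming convention for tenant DBs


def build_audit_records(
    db_names: Iterable[str], snapshot_id: str, recorded_at: datetime
) -> List[Dict[str, Any]]:
    """Build one "tenant X exists in snapshot Y" record per tenant DB."""
    return [
        {"tenant_db": name, "snapshot_id": snapshot_id, "recorded_at": recorded_at}
        for name in sorted(db_names)
        if name.startswith(TENANT_PREFIX)
    ]


def run_weekly_audit(client: Any, snapshot_id: str, audit_db: str = "ops") -> int:
    """Write attestation records; returns how many were written.

    `client` is assumed to be a pymongo.MongoClient (or anything with the
    same list_database_names / item-access / insert_many interface).
    """
    records = build_audit_records(
        client.list_database_names(), snapshot_id, datetime.now(timezone.utc)
    )
    if records:  # insert_many rejects an empty list
        client[audit_db]["backup_audit_log"].insert_many(records)
    return len(records)
```

Keeping record construction separate from the write makes the attestation logic testable without a live cluster.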

RTO & RPO targets

Scenario	RPO target	RTO target	Recovery method → time
Single-document corruption (operator deletes a wrong record)	0 min (oplog replay)	30 min	Atlas point-in-time → 1 min
Single-collection corruption (logic bug wipes a collection)	≤ 6 hours	2 hours	Atlas snapshot restore → 1 hour
Whole-cluster loss (region outage)	≤ 6 hours	4 hours	Cross-region snapshot promote → 2 hours
Total catastrophic loss (Atlas data loss)	≤ 7 days	1 day	Cross-region copy + offsite weekly export
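
The targets in the table can be encoded as data so that a drill or incident can be checked against them mechanically. The sketch below is illustrative only; the scenario keys and the check_recovery helper are names invented for this example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict


@dataclass(frozen=True)
class Target:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated time to restore


# Values copied from the RTO & RPO table; keys are hypothetical labels.
TARGETS: Dict[str, Target] = {
    "single_document": Target(rpo=timedelta(0), rto=timedelta(minutes=30)),
    "single_collection": Target(rpo=timedelta(hours=6), rto=timedelta(hours=2)),
    "whole_cluster": Target(rpo=timedelta(hours=6), rto=timedelta(hours=4)),
    "catastrophic": Target(rpo=timedelta(days=7), rto=timedelta(days=1)),
}


def check_recovery(
    scenario: str, data_loss: timedelta, restore_time: timedelta
) -> Dict[str, bool]:
    """Compare an observed recovery against the targets for its scenario."""
    target = TARGETS[scenario]
    return {
        "rpo_met": data_loss <= target.rpo,
        "rto_met": restore_time <= target.rto,
    }


if __name__ == "__main__":
    print(check_recovery("single_collection", timedelta(hours=1), timedelta(hours=1)))
```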

Restoration drill

We run a quarterly restoration drill (next: 2026-05-18):

Pick a random tenant DB (excluding production-customer DBs)
Trigger Atlas Cloud Backup restore to a fresh staging cluster
Verify document count matches snapshot; spot-check 10 random documents for content equality
Compare time-to-restore against the RTO target; file a ticket if the target is missed
Tear down staging cluster

Drill results are filed in the backup_drill_results collection and summarized in the quarterly Risk Assessment review.
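
The verification steps of the drill (count match, 10-document spot check, RTO comparison) could be expressed roughly as follows. This is a sketch under assumptions: documents are represented as in-memory dicts keyed by _id, and the result field names are illustrative, not the actual backup_drill_results schema.

```python
import random
from datetime import timedelta
from typing import Any, Dict, Hashable, Mapping, Optional


def verify_restore(
    source: Mapping[Hashable, Any],
    restored: Mapping[Hashable, Any],
    restore_duration: timedelta,
    rto_target: timedelta,
    sample_size: int = 10,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Check a restored tenant DB against its snapshot contents.

    `source` and `restored` map document _id to the document body.
    """
    rng = random.Random(seed)
    ids = list(source)
    # Spot-check up to `sample_size` randomly chosen documents.
    sample = rng.sample(ids, min(sample_size, len(ids)))
    mismatched = [i for i in sample if restored.get(i) != source[i]]
    count_match = len(source) == len(restored)
    rto_met = restore_duration <= rto_target
    return {
        "count_match": count_match,
        "sampled": len(sample),
        "mismatched_ids": mismatched,
        "rto_met": rto_met,
        "passed": count_match and not mismatched and rto_met,
    }
```

In a real drill the two mappings would be replaced by queries against the snapshot metadata and the staging cluster; a failed rto_met is the condition that triggers the ticket.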

Operator emergency runbook

Step 1 — identify the loss

Was a single record deleted? → use Atlas point-in-time recovery via oplog replay (1 min)
Was a collection emptied? → use Atlas snapshot restore (1 hour)
Is the whole tenant gone? → use cross-region snapshot promotion (2 hours)
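
The triage can be captured as a small lookup. The sketch below takes its method names and expected times from the RTO & RPO table; the LossScope enum and triage function are hypothetical names introduced for illustration.

```python
from datetime import timedelta
from enum import Enum
from typing import Iterable, NamedTuple


class LossScope(Enum):
    SINGLE_RECORD = 1
    COLLECTION = 2
    CLUSTER = 3


class Recovery(NamedTuple):
    method: str
    expected: timedelta


# Methods and times as listed in the RTO & RPO table.
PLAYBOOK = {
    LossScope.SINGLE_RECORD: Recovery("Atlas point-in-time", timedelta(minutes=1)),
    LossScope.COLLECTION: Recovery("Atlas snapshot restore", timedelta(hours=1)),
    LossScope.CLUSTER: Recovery("Cross-region snapshot promote", timedelta(hours=2)),
}


def triage(observed: Iterable[LossScope]) -> Recovery:
    """Return the recovery path for the widest loss scope observed."""
    scopes = list(observed)
    if not scopes:
        raise ValueError("no loss scope reported")
    widest = max(scopes, key=lambda s: s.value)
    return PLAYBOOK[widest]


if __name__ == "__main__":
    print(triage([LossScope.SINGLE_RECORD, LossScope.COLLECTION]).method)
```

Choosing the widest scope reflects the assumption that a broader restore also covers any narrower loss reported in the same incident.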

Step 2 — restore

In the Atlas UI, navigate to Backup → Restore and select a snapshot taken before the loss event
Target: staging cluster first (NEVER restore directly over production)
Once the staging restore is verified, restore over production via cluster swap
Notify affected tenants within 1 hour of any data loss event (per Incident Response playbook)

Step 3 — post-mortem

Document the loss and recovery in the Incident Response registry within 24 hours
Update the Risk Assessment if a new failure mode was discovered
Customer comms: status page banner plus email to affected tenants
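
The two time limits in this runbook (tenant notification within 1 hour, incident documentation within 24 hours) can be computed from a single timestamp. This is a minimal sketch; the function name is hypothetical, and it assumes both clocks start when the loss event is detected, which the runbook does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

NOTIFY_WITHIN = timedelta(hours=1)       # tenant notification limit
POSTMORTEM_WITHIN = timedelta(hours=24)  # Incident Response registry limit


def incident_deadlines(detected_at: datetime) -> Dict[str, datetime]:
    """Derive runbook deadlines from the detection time (assumed start point)."""
    return {
        "notify_tenants_by": detected_at + NOTIFY_WITHIN,
        "postmortem_by": detected_at + POSTMORTEM_WITHIN,
    }


if __name__ == "__main__":
    detected = datetime(2026, 3, 1, 9, 30, tzinfo=timezone.utc)
    for name, due in incident_deadlines(detected).items():
        print(name, due.isoformat())
```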

Sign-off

Runbook approved 2026-02-18. Next review: 2026-05-18 (post Q1 drill).

MongoDB Backup & Recovery Runbook