Local AI Backup and Disaster Recovery: The Playbook Nobody Wrote
Published April 23, 2026 - 20 min read
The first time I lost a local AI deployment, it was 11pm on a Friday in October 2024. A 4TB NVMe drive in our team server failed without warning. Gone with it: 187 GB of pulled Ollama models, three fine-tuned adapters representing maybe 60 hours of training time, a Chroma vector store with 2.4 million chunks of internal documentation, and most painfully, the audit log database we used to satisfy our SOC 2 controls. We had backups of the application database. We did not have backups of any of the AI-specific assets, because at the time none of us thought of them as "data."
That outage cost the team about a week of work. The second time something similar happened - this time a corrupted file system on a different server - we lost twenty minutes. The difference was a backup playbook specifically designed for the way local AI stores state. It is a different shape from typical application backups, and most generic backup tools will quietly miss the parts that matter.
This guide is that playbook. Tested. Drilled. Audited.
Quick Start: Your First Backup in 15 Minutes
If you do nothing else after reading this, do this:
# 1. Install restic (encrypted, deduplicated backup)
brew install restic # or apt install restic
# 2. Create a repository on a separate disk
restic init --repo /mnt/backup-disk/ai-repo
# 3. Back up the four critical paths
restic -r /mnt/backup-disk/ai-repo backup \
~/.ollama/models \
/var/lib/ollama \
/opt/anythingllm/storage \
/var/lib/audit-db
# 4. Verify
restic -r /mnt/backup-disk/ai-repo snapshots
That alone puts you ahead of about 80% of self-hosted AI deployments. The rest of this guide is about doing it right - encryption, off-site copies, retention, and, crucially, the part everyone skips: practicing the restore.
Table of Contents
- Why Local AI Needs Its Own Backup Strategy
- The Six Things You Must Back Up
- Setting RTO and RPO Targets That Are Honest
- The 3-2-1 Rule, Adapted for AI
- Tooling: Restic, BorgBackup, or rsync + ZFS
- Backing Up Ollama Model Files
- Backing Up Fine-Tunes and Adapters
- Backing Up RAG Vector Stores
- Backing Up Audit Logs and Application State
- Restore Drills: The Part Everyone Skips
- Off-Site and Air-Gapped Copies
- Pitfalls
- FAQ
Why Local AI Needs Its Own Backup Strategy {#why-special}
Standard application backups - "back up the database every night, back up uploads weekly" - miss most of what matters in a local AI deployment. The state lives in unusual places:
- Model files are large (5-100 GB each), binary, and identical across many users. Naive backups duplicate them; smart backups dedupe.
- Fine-tuned adapters are small (50-500 MB) but represent hours of GPU time. Losing them is unrecoverable without re-training.
- Vector indices are constantly mutating. A backup taken mid-write can be unrecoverably corrupt.
- Audit logs must be immutable. Backups must preserve the hash chain.
- Prompt templates and system configurations live in dotfiles that are often not in version control.
The blast radius of a single drive failure is therefore much bigger than people expect. Worse, the warning signs are different. A model file that becomes corrupted may load successfully and return subtly wrong outputs - you do not get a clean error.
If you have not yet hardened the rest of your local AI stack, our Ollama production deployment guide is the natural prerequisite.
The Six Things You Must Back Up {#what-to-backup}
In rough order of recovery cost (highest first):
| Asset | Typical Size | If Lost |
|---|---|---|
| Fine-tuned adapters / LoRA weights | 50 MB - 5 GB | Hours-to-weeks of GPU re-training |
| Vector store (Chroma, Qdrant, FAISS) | 1-50 GB | Hours-to-days re-embedding documents |
| Audit log database | 100 MB - 10 GB | Compliance failure, legal exposure |
| Prompt templates and system configs | < 100 MB | Multiple iterations of prompt engineering |
| Application state DB (users, sessions) | 100 MB - 50 GB | Standard application loss |
| Base model files | 5-200 GB total | Re-downloadable, but slow over hotel WiFi |
The base model files are the largest by far but the cheapest to lose - you can re-pull them from the registry. Everything else is irreplaceable or expensive. Your backup strategy should weight by recovery cost, not by storage size.
Setting RTO and RPO Targets That Are Honest {#rto-rpo}
Two metrics, one decision per asset class:
- RTO (Recovery Time Objective) - how long can you tolerate the system being down?
- RPO (Recovery Point Objective) - how much data loss can you tolerate?
For most small-team local AI deployments I have helped with:
| Asset Class | RTO | RPO | Backup Frequency |
|---|---|---|---|
| Audit logs | 4 hours | 5 minutes | Continuous (WAL ship) |
| Fine-tuned adapters | 1 hour | 24 hours | Daily + on-change |
| Vector store | 8 hours | 1 hour | Hourly snapshots |
| Application DB | 4 hours | 15 minutes | 15-min PITR |
| Prompt configs | 8 hours | 24 hours | Daily |
| Base model files | 24 hours | 7 days | Weekly |
If those numbers feel aggressive, write down your actual tolerances. The goal is to be honest now so the backup tooling matches reality. Promising 5-minute RPO and not delivering it is worse than promising 24 hours and over-delivering.
The 3-2-1 Rule, Adapted for AI {#321-rule}
The standard 3-2-1 rule says: 3 copies of your data, on 2 different media, with 1 off-site. For local AI I add a fourth:
3 copies, 2 media, 1 off-site, 1 air-gapped.
The air-gapped copy matters specifically because of ransomware. Crypto-ransomware now targets backup directories. An air-gapped copy - meaning physically disconnected, not just "on a different drive" - is the only reliably recoverable backup if you are hit. We use a USB SSD that gets connected for a one-hour weekly sync and then unplugged.
The full layout I run:
Primary: /var/lib/ollama (NVMe, the running system)
Copy 1: /mnt/backup-nvme (separate physical disk, hourly)
Copy 2: s3://ai-backup-bucket (encrypted, daily)
Copy 3: USB SSD in a drawer (air-gapped, weekly)
Three copies, three media (internal NVMe, S3, USB), off-site (S3 in a different region), air-gapped (USB).
For air-gapped deployments specifically, our air-gapped AI deployment guide covers the operational patterns end to end.
Tooling: Restic, BorgBackup, or rsync + ZFS {#tooling}
Three excellent open-source choices, in order of how I rank them for local AI:
Restic
Restic is the one I default to. Encrypted by default, deduplicated, supports local disk, S3, B2, Azure Blob, SFTP. The deduplication is the killer feature - back up two machines that both have the same llama3.3:70b weights and you store the bytes once.
# Initialize
export RESTIC_PASSWORD_FILE=/root/.restic-password
restic init --repo s3:s3.amazonaws.com/my-ai-backup
# Daily backup script
restic -r s3:s3.amazonaws.com/my-ai-backup backup \
--tag ai \
--exclude='*.log.gz' \
~/.ollama/models \
/var/lib/ollama \
/opt/anythingllm/storage \
/var/lib/audit-db
# Retention - keep 7 daily, 4 weekly, 6 monthly
restic -r s3:s3.amazonaws.com/my-ai-backup forget --prune \
--keep-daily 7 --keep-weekly 4 --keep-monthly 6
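To run this unattended, wrap the backup and forget commands above into a script and schedule it. A minimal cron sketch - the script names and log path are placeholders for whatever you actually use:
# /etc/cron.d/ai-backup - nightly backup at 02:15, retention prune on Sunday mornings
15 2 * * * root /usr/local/bin/ai-backup.sh >> /var/log/ai-backup.log 2>&1
30 3 * * 0 root /usr/local/bin/ai-backup-prune.sh >> /var/log/ai-backup.log 2>&1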
BorgBackup
Borg is restic's older sibling - slightly less polished UX, slightly better dedup ratios on some workloads, and the only tool I know that supports append-only mode reliably. If audit-grade immutability matters, Borg's append-only repository is what you want.
borg init --encryption=repokey-blake2 /mnt/borg-repo
borg create --stats --progress \
/mnt/borg-repo::ai-{now} \
~/.ollama/models /var/lib/ollama
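To actually get the append-only behavior, the repository flag has to be flipped (or enforced server-side for remote repos). A sketch, assuming Borg 1.1 or later:
# Make the local repository append-only
borg config /mnt/borg-repo append_only 1
# For SSH-based repos, enforce it in the backup host's authorized_keys instead:
# command="borg serve --append-only --restrict-to-path /srv/borg" ssh-ed25519 AAAA... backup-key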
rsync + ZFS Snapshots
If you run on ZFS or Btrfs, the snapshot mechanism gives you near-instant point-in-time copies for free. Combine with rsync to a remote box and you have a fast, simple, native solution. The downside is no built-in encryption and no dedup across machines.
SNAP="tank/ai@$(date +%Y%m%d-%H%M)"   # capture the name once so snapshot and send match
zfs snapshot "$SNAP"
zfs send "$SNAP" | \
ssh backup-box zfs receive tank/ai-backup
For most teams, restic + S3 hits the right balance of features, simplicity, and cost.
Backing Up Ollama Model Files {#backup-models}
Ollama stores model files in ~/.ollama/models (Linux/Mac) or C:\Users\<user>\.ollama\models (Windows). The structure is content-addressed - blobs in models/blobs and manifests in models/manifests. This is great for backup because:
- Identical layers across models are stored once.
- Blob filenames are SHA-256 digests, so corruption is detectable.
- Restoring is just "put the files back in the right place."
The catch: Ollama writes to this directory while it runs. Backing up live can produce inconsistent state. Two safe approaches:
Option A: Stop Ollama briefly.
systemctl stop ollama
restic backup ~/.ollama/models
systemctl start ollama
For a 50-GB model directory and a fast disk, restic finishes the incremental in 30-90 seconds. Your AI is offline that long.
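If you go this route, wrap it so Ollama restarts even when restic errors out. A minimal sketch - the repo path is the one from the quick start, adjust to yours:
#!/usr/bin/env bash
# ollama-backup.sh - stop Ollama, back up the model directory, always restart
set -euo pipefail
trap 'systemctl start ollama' EXIT   # runs even if restic fails

systemctl stop ollama
restic -r /mnt/backup-disk/ai-repo backup ~/.ollama/models /var/lib/ollama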
Option B: Use a filesystem snapshot.
# ZFS
zfs snapshot tank/ollama@backup
restic backup /tank/.zfs/snapshot/backup/ollama/models
zfs destroy tank/ollama@backup
# LVM
lvcreate -L1G -s -n ollama-snap /dev/vg0/ollama
mount /dev/vg0/ollama-snap /mnt/snap
restic backup /mnt/snap/ollama/models
umount /mnt/snap
lvremove -f /dev/vg0/ollama-snap
Snapshots take milliseconds. Ollama keeps running during the backup.
Verify Model Integrity After Restore
# The fast check
ollama list
# The thorough check
for blob in ~/.ollama/models/blobs/sha256-*; do
expected=$(basename "$blob" | sed 's/sha256-//')
actual=$(sha256sum "$blob" | awk '{print $1}')
if [ "$expected" != "$actual" ]; then
echo "CORRUPT: $blob"
fi
done
This script has caught two silent corruptions for me - both in cases where the underlying disk had developed a fault. Worth running monthly even without a restore event.
Backing Up Fine-Tunes and Adapters {#backup-finetunes}
Fine-tunes are the most painful asset to lose. A LoRA adapter for a 70B model represents 8-30 hours of GPU time. The training data, the hyperparameter sweeps, the validation runs - none of that survives a drive failure unless you backed up the whole pipeline.
What to back up:
adapter_model.safetensors # The weights
adapter_config.json # The architecture
training_args.bin # Hyperparameters used
trainer_state.json # Final loss, eval metrics
training_data.jsonl # Source dataset (CRITICAL)
training_script.py # The exact code
That last one - the script - is the difference between "we have a backup" and "we can rebuild from scratch if needed." A binary backup of the weights without the script means you can use the fine-tune but cannot iterate on it.
I version-control everything except weights and data. Weights go into restic. Data goes into a separate encrypted bucket because it often contains client information.
# Daily fine-tune backup script
cd /opt/finetunes
git add training_script.py training_args.bin trainer_state.json
git commit -m "checkpoint $(date +%Y%m%d)" || true
git push origin main
restic -r s3:.../finetunes backup ./adapter_model.safetensors
Backing Up RAG Vector Stores {#backup-rag}
Vector stores are the most failure-prone backup target because they are constantly being written. A naive copy of a Chroma directory mid-write produces a corrupt restore that may load successfully and return wrong results.
Chroma
Chroma persists to a SQLite-backed directory. The right approach:
# Use Chroma's built-in snapshot
import chromadb
client = chromadb.PersistentClient(path="/data/chroma")
client.persist()  # flush pending writes (newer Chroma versions persist automatically and may not expose this method)
Then back up /data/chroma with restic. SQLite supports online backup, so even without stopping the application this is generally safe - but persist() first.
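If you want an explicitly consistent copy of the SQLite portion without touching the application, the sqlite3 CLI's online backup works. A sketch - the chroma.sqlite3 filename and staging path are assumptions that may differ by Chroma version, and the rest of /data/chroma still goes through restic as above:
# Consistent point-in-time copy of Chroma's SQLite file while the app keeps running
mkdir -p /data/chroma-staging
sqlite3 /data/chroma/chroma.sqlite3 ".backup '/data/chroma-staging/chroma.sqlite3'"
restic -r /mnt/backup-disk/ai-repo backup /data/chroma-staging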
Qdrant
Qdrant has a first-class snapshot API:
curl -X POST http://localhost:6333/collections/my-collection/snapshots
# Returns a snapshot path, e.g. /qdrant/snapshots/...
restic backup /qdrant/snapshots
FAISS
FAISS indices are immutable once written, so just back up the .index file. The trickier part is the metadata alongside it - usually a SQLite or Parquet file mapping vector IDs to source documents. Back up both, atomically.
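One way to keep the pair in lockstep is to stage both files into one timestamped directory and back that directory up as a unit. A sketch - the file names are placeholders, and it assumes the application is not rewriting the index at that moment:
#!/usr/bin/env bash
# Stage the FAISS index and its ID-to-document mapping together, back them up as one unit
set -euo pipefail
STAGE=/data/faiss-staging/$(date +%Y%m%d-%H%M)
mkdir -p "$STAGE"
cp /data/faiss/docs.index "$STAGE/"        # the index, immutable once written
cp /data/faiss/metadata.sqlite "$STAGE/"   # the matching ID -> document mapping
restic -r /mnt/backup-disk/ai-repo backup "$STAGE" --tag faiss
rm -rf "$STAGE"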
The Recovery Test
This is the part everyone skips. Periodically restore the vector store to a scratch directory and run a known query. If the top-10 results match what you saved before the backup, the restore is good. I run this monthly.
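Here is roughly what that monthly check looks like for a Qdrant-backed store - the collection name, snapshot filename, and the known-query/golden files are placeholders you would record once from the live system:
#!/usr/bin/env bash
# Restore the latest Qdrant snapshot into a throwaway container and compare a known query
set -euo pipefail
SCRATCH=$(mktemp -d)
restic -r /mnt/backup-disk/ai-repo restore latest \
  --include /qdrant/snapshots --target "$SCRATCH"

docker run -d --rm --name qdrant-restore-check -p 16333:6333 \
  -v "$SCRATCH/qdrant/snapshots:/qdrant/snapshots" qdrant/qdrant
sleep 10

# Recover the collection from the restored snapshot file
curl -s -X PUT http://localhost:16333/collections/my-collection/snapshots/recover \
  -H 'Content-Type: application/json' \
  -d '{"location": "file:///qdrant/snapshots/my-collection/my-collection-latest.snapshot"}'

# Run the saved query and compare the top-10 IDs against the golden copy
curl -s -X POST http://localhost:16333/collections/my-collection/points/search \
  -H 'Content-Type: application/json' -d @known-query.json \
  | jq '[.result[].id]' > "$SCRATCH/top10.json"
diff "$SCRATCH/top10.json" golden-top10.json && echo "restore OK" || echo "restore MISMATCH"

docker stop qdrant-restore-check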
Backing Up Audit Logs and Application State {#backup-audit}
Audit logs are special. They are append-only by design (see our audit trail guide for why), and your backup must preserve that property. The right pattern is continuous WAL shipping, not periodic snapshots.
For PostgreSQL:
# postgresql.conf on the audit log database
# (RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE must be set in PostgreSQL's environment)
archive_mode = on
archive_command = 'restic backup --tag wal --stdin --stdin-filename %f < %p'
For SQLite (which I use for smaller deployments):
# Use Litestream for continuous replication
litestream replicate /var/lib/audit/audit.db s3://audit-backup/audit
Litestream replicates SQLite to S3 with sub-second lag. For audit data this is the closest thing to a free lunch in this space.
The application state database (users, sessions, settings) is more standard. Daily pg_dump plus 15-minute WAL shipping for a typical small SaaS-style deployment. Nothing AI-specific here.
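For completeness, the nightly dump is one cron line - the database name and dump path are placeholders:
# /etc/cron.d/app-db-dump - nightly logical dump of the application database
0 1 * * * postgres pg_dump -Fc -f /var/backups/appdb-$(date +\%F).dump appdb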
Restore Drills: The Part Everyone Skips {#drills}
A backup that has never been restored is not a backup. It is a hopeful gesture.
Schedule quarterly restore drills. Mine looks like:
- Spin up a clean VM (1 hour).
- Install Ollama from scratch (10 minutes).
- Restore from the latest backup:
restic -r s3:.../ai-backup restore latest --target /
systemctl start ollama
- Run the smoke test suite (10 minutes):
- ollama list shows expected models
- Vector store query returns expected results
- Audit log chain verification passes
- Fine-tuned adapter loads and produces expected output
- Document timing. Compare to your RTO target. If you missed it, fix the gap.
- Throw away the VM.
The first drill always reveals something broken. The second drill catches what you fixed in the first. By the fourth or fifth drill, the procedure is muscle memory and recovery from a real outage takes minutes, not days.
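The smoke test itself does not need to be fancy. Mine is a short script along these lines - the model name and the Qdrant health check are placeholders for whatever your stack runs:
#!/usr/bin/env bash
# drill-smoke-test.sh - quick pass/fail checks after a restore
set -euo pipefail

# Expected models are present
ollama list | grep -q "llama3.3:70b" || { echo "FAIL: model missing"; exit 1; }

# The model actually answers
ollama run llama3.3:70b "Reply with the single word OK" | grep -qi "ok" \
  || { echo "FAIL: model not responding"; exit 1; }

# Vector store is up (Qdrant's root endpoint here; swap in your store's health check)
curl -sf http://localhost:6333/ > /dev/null || { echo "FAIL: vector store down"; exit 1; }

echo "Smoke test passed"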
Off-Site and Air-Gapped Copies {#offsite}
The off-site copy goes to S3 (or any equivalent object store). Two things that matter:
1. Bucket-level encryption with your own key. AWS KMS with a CMK you control. Default S3 encryption is not enough for compliance.
2. Object Lock in compliance mode. This makes the bucket genuinely write-once for the retention period. Even an attacker with full AWS console access cannot delete protected objects. For audit data this is required for SOC 2 evidence; for everything else it is a nice-to-have.
aws s3api put-object-lock-configuration \
--bucket my-ai-backup \
--object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":2555}}}'
The air-gapped copy is the boring USB SSD in a drawer. Once a week, plug it in, run restic backup, unplug. Label the drive with the date of the last sync. Rotate two drives so you always have a slightly older copy as a fallback.
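The weekly ritual is short enough to script. A sketch, assuming the USB drive is labeled AIRGAP and the restic repo was initialised on it once:
#!/usr/bin/env bash
# airgap-sync.sh - run weekly with the USB SSD plugged in, then unplug it
set -euo pipefail
mount /dev/disk/by-label/AIRGAP /mnt/airgap
restic -r /mnt/airgap/ai-repo backup \
  ~/.ollama/models /var/lib/ollama /opt/anythingllm/storage /var/lib/audit-db
restic -r /mnt/airgap/ai-repo check --read-data-subset=5%   # spot-check the repo before unplugging
umount /mnt/airgap
echo "Air-gap sync finished $(date +%F) - unplug and relabel the drive."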
For deeper isolation patterns, see our air-gapped AI deployment guide.
Pitfalls {#pitfalls}
What goes wrong:
1. Backups on the same physical drive. I have seen this twice. /data/ollama and /data/backup on the same NVMe. The drive failed, both were gone. Backup destinations must be physically separate.
2. No encryption. Model weights themselves are usually fine to leave unencrypted. Audit logs, training data, and any embedded client documents must be encrypted at rest. Restic does this by default. Borg does not unless you choose an encrypted mode. Choose it.
3. Forgetting the password. Restic and Borg passwords are not recoverable. Lose them, lose the backup. Store passwords in a password manager and a sealed envelope in a safe.
4. Backing up while training is running. A fine-tune in progress will leave half-written checkpoint files. Either pause training during backup or back up only the most recent fully-written checkpoint.
5. No bandwidth budget. A 200-GB backup over a residential 50 Mbps uplink takes 9-10 hours. Test the real upload time, not the marketing number. Schedule overnight if needed.
6. Restore from latest only. Sometimes the corruption you are recovering from happened a week ago. Test restoring from a 30-day-old snapshot too. Make sure the retention policy keeps enough history.
7. No alerting on backup failure. A backup that silently fails for two months is worse than no backup at all - it gives false confidence. Wire the restic exit code into your monitoring stack (a minimal sketch follows this list).
8. Restoring to the wrong path. ~/.ollama/models on Linux, C:\Users\<user>\.ollama\models on Windows. Restoring to the wrong path means Ollama does not see the models. Test the restore path explicitly during drills.
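For pitfall 7, the lowest-effort pattern I know is a dead-man's-switch ping: the monitoring service alerts when the backup job stops checking in. A sketch - the heartbeat URL is a placeholder for whichever service you use, and the /fail suffix assumes it supports explicit failure pings:
#!/usr/bin/env bash
# Report backup success or failure to a heartbeat monitor
set -uo pipefail   # no -e: we want to capture restic's exit code ourselves
HEARTBEAT_URL="https://hc-ping.example.com/ai-backup"   # placeholder endpoint

restic -r /mnt/backup-disk/ai-repo backup ~/.ollama/models /var/lib/ollama
STATUS=$?

if [ "$STATUS" -eq 0 ]; then
  curl -fsS -m 10 "$HEARTBEAT_URL" > /dev/null
else
  curl -fsS -m 10 "$HEARTBEAT_URL/fail" > /dev/null
  exit "$STATUS"
fi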
FAQ {#faq}
The single most common question I get: "How much does this cost?" For a typical small team setup - 200 GB total local AI footprint, daily incrementals, 90-day retention, S3 standard storage in us-east-1 - the monthly cost is $4-7 in S3 storage and another $3-5 in PUT/GET request charges. Less than your team's coffee budget. Less than five minutes of your engineer's time to restore from a real outage.
Where to Take This Next
Three deeper rabbit holes:
- Multi-region replication - if you have customers across regions, replicating S3 backups to a second region (with cross-region replication or restic's multiple-repo support) is cheap insurance against a regional outage.
- Encryption key escrow - for SOC 2 you need a documented procedure for what happens if the engineer who set up the backup leaves. Two-person key access, written runbooks, escrow with corporate IT.
- Backup verification automation - extend the quarterly restore drill into a weekly automated test that spins up a container, restores, runs a smoke test, and reports green/red. We use this pattern for our audit trail verification too.
For the broader operational picture, our Ollama production deployment guide and securing Ollama guide are the natural companions.
Conclusion
The pattern is the same one good operators have used for thirty years. What changes for local AI is the inventory of things to back up. Model files. Adapter weights. Vector indices. Audit logs. Each has its own quirks - mid-write corruption, retention requirements, encryption needs - and a generic backup strategy will quietly miss something that matters.
The setup I described in this guide takes about a day to deploy and a day per quarter to drill. The cost is single-digit dollars per month. The first time you have to restore - and you will, because drives fail and humans run rm -rf - the difference between minutes-long and weeks-long downtime is exactly this work.
Build the simple version this weekend. Run the first restore drill before the end of the month. The day a drive fails and you watch the deployment come back in 20 minutes is the day this guide will have paid for itself a thousand times over.