Local AI Backup and Disaster Recovery: The Playbook Nobody Wrote
Published April 23, 2026 - 20 min read
The first time I lost a local AI deployment, it was 11pm on a Friday in October 2024. A 4TB NVMe drive in our team server failed without warning. Gone with it: 187 GB of pulled Ollama models, three fine-tuned adapters representing maybe 60 hours of training time, a Chroma vector store with 2.4 million chunks of internal documentation, and most painfully, the audit log database we used to satisfy our SOC 2 controls. We had backups of the application database. We did not have backups of any of the AI-specific assets, because at the time none of us thought of them as "data."
That outage cost the team about a week of work. The second time something similar happened - this time a corrupted file system on a different server - we lost twenty minutes. The difference was a backup playbook specifically designed for the way local AI stores state. It is a different shape from typical application backups, and most generic backup tools will quietly miss the parts that matter.
This guide is that playbook. Tested. Drilled. Audited.
Quick Start: Your First Backup in 15 Minutes
If you do nothing else after reading this, do this:
# 1. Install restic (encrypted, deduplicated backup)
brew install restic # or apt install restic
# 2. Create a repository on a separate disk
restic init --repo /mnt/backup-disk/ai-repo
# 3. Back up the four critical paths
restic -r /mnt/backup-disk/ai-repo backup \
~/.ollama/models \
/var/lib/ollama \
/opt/anythingllm/storage \
/var/lib/audit-db
# 4. Verify
restic -r /mnt/backup-disk/ai-repo snapshots
That alone puts you ahead of about 80% of self-hosted AI deployments. The rest of this guide is about doing it right - encryption, off-site copies, retention, and, crucially, the part everyone skips: practicing the restore.
Table of Contents
- Why Local AI Needs Its Own Backup Strategy
- The Six Things You Must Back Up
- Setting RTO and RPO Targets That Are Honest
- The 3-2-1 Rule, Adapted for AI
- Tooling: Restic, BorgBackup, or rsync + ZFS
- Backing Up Ollama Model Files
- Backing Up Fine-Tunes and Adapters
- Backing Up RAG Vector Stores
- Backing Up Audit Logs and Application State
- Restore Drills: The Part Everyone Skips
- Off-Site and Air-Gapped Copies
- Pitfalls
- FAQ
Why Local AI Needs Its Own Backup Strategy {#why-special}
Standard application backups - "back up the database every night, back up uploads weekly" - miss most of what matters in a local AI deployment. The state lives in unusual places:
- Model files are large (5-100 GB each), binary, and identical across many users. Naive backups duplicate them; smart backups dedupe.
- Fine-tuned adapters are small (50-500 MB) but represent hours of GPU time. Losing them is unrecoverable without re-training.
- Vector indices are constantly mutating. A backup taken mid-write can be unrecoverably corrupt.
- Audit logs must be immutable. Backups must preserve the hash chain.
- Prompt templates and system configurations live in dotfiles that are often not in version control.
The blast radius of a single drive failure is therefore much bigger than people expect. Worse, the warning signs are different. A model file that becomes corrupted may load successfully and return subtly wrong outputs - you do not get a clean error.
If you have not yet hardened the rest of your local AI stack, our Ollama production deployment guide is the natural prerequisite.
The Six Things You Must Back Up {#what-to-backup}
In rough order of recovery cost (highest first):
| Asset | Typical Size | If Lost |
|---|---|---|
| Fine-tuned adapters / LoRA weights | 50 MB - 5 GB | Hours-to-weeks of GPU re-training |
| Vector store (Chroma, Qdrant, FAISS) | 1-50 GB | Hours-to-days re-embedding documents |
| Audit log database | 100 MB - 10 GB | Compliance failure, legal exposure |
| Prompt templates and system configs | < 100 MB | Multiple iterations of prompt engineering |
| Application state DB (users, sessions) | 100 MB - 50 GB | Standard application loss |
| Base model files | 5-200 GB total | Re-downloadable, but slow over hotel WiFi |
The base model files are the largest by far but the cheapest to lose - you can re-pull them from the registry. Everything else is irreplaceable or expensive. Your backup strategy should weight by recovery cost, not by storage size.
Setting RTO and RPO Targets That Are Honest {#rto-rpo}
Two metrics, one decision per asset class:
- RTO (Recovery Time Objective) - how long can you tolerate the system being down?
- RPO (Recovery Point Objective) - how much data loss can you tolerate?
For most small-team local AI deployments I have helped with:
| Asset Class | RTO | RPO | Backup Frequency |
|---|---|---|---|
| Audit logs | 4 hours | 5 minutes | Continuous (WAL ship) |
| Fine-tuned adapters | 1 hour | 24 hours | Daily + on-change |
| Vector store | 8 hours | 1 hour | Hourly snapshots |
| Application DB | 4 hours | 15 minutes | 15-min PITR |
| Prompt configs | 8 hours | 24 hours | Daily |
| Base model files | 24 hours | 7 days | Weekly |
If those numbers feel aggressive, write down your actual tolerances. The goal is to be honest now so the backup tooling matches reality. Promising 5-minute RPO and not delivering it is worse than promising 24 hours and over-delivering.
The 3-2-1 Rule, Adapted for AI {#321-rule}
The standard 3-2-1 rule says: 3 copies of your data, on 2 different media, with 1 off-site. For local AI I add a fourth:
3 copies, 2 media, 1 off-site, 1 air-gapped.
The air-gapped copy matters specifically because of ransomware. Crypto-ransomware now targets backup directories. An air-gapped copy - meaning physically disconnected, not just "on a different drive" - is the only reliably recoverable backup if you are hit. We use a USB SSD that gets connected for a one-hour weekly sync and then unplugged.
The full layout I run:
Primary: /var/lib/ollama (NVMe, the running system)
Copy 1: /mnt/backup-nvme (separate physical disk, hourly)
Copy 2: s3://ai-backup-bucket (encrypted, daily)
Copy 3: USB SSD in a drawer (air-gapped, weekly)
Three copies, three media (internal NVMe, S3, USB), off-site (S3 in a different region), air-gapped (USB).
For air-gapped deployments specifically, our air-gapped AI deployment guide covers the operational patterns end to end.
Tooling: Restic, BorgBackup, or rsync + ZFS {#tooling}
Three excellent open-source choices, in order of how I rank them for local AI:
Restic
Restic is the one I default to. Encrypted by default, deduplicated, supports local disk, S3, B2, Azure Blob, SFTP. The deduplication is the killer feature - back up two machines that both have the same llama3.3:70b weights and you store the bytes once.
# Initialize
export RESTIC_PASSWORD_FILE=/root/.restic-password
restic init --repo s3:s3.amazonaws.com/my-ai-backup
# Daily backup script
restic -r s3:s3.amazonaws.com/my-ai-backup backup \
--tag ai \
--exclude='*.log.gz' \
~/.ollama/models \
/var/lib/ollama \
/opt/anythingllm/storage \
/var/lib/audit-db
# Retention - keep 7 daily, 4 weekly, 6 monthly
restic -r s3:s3.amazonaws.com/my-ai-backup forget --prune \
--keep-daily 7 --keep-weekly 4 --keep-monthly 6
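To run this unattended, wrap the backup and forget commands above into a script and schedule it. A minimal cron sketch - the script names and log path are placeholders for whatever you actually use:
# /etc/cron.d/ai-backup - nightly backup at 02:15, retention prune on Sunday mornings
15 2 * * * root /usr/local/bin/ai-backup.sh >> /var/log/ai-backup.log 2>&1
30 3 * * 0 root /usr/local/bin/ai-backup-prune.sh >> /var/log/ai-backup.log 2>&1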
BorgBackup
Borg is restic's older sibling - slightly less polished UX, slightly better dedup ratios on some workloads, and the only tool I know that supports append-only mode reliably. If audit-grade immutability matters, Borg's append-only repository is what you want.
borg init --encryption=repokey-blake2 /mnt/borg-repo
borg create --stats --progress \
/mnt/borg-repo::ai-{now} \
~/.ollama/models /var/lib/ollama
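To actually get the append-only behavior, the repository flag has to be flipped (or enforced server-side for remote repos). A sketch, assuming Borg 1.1 or later:
# Make the local repository append-only
borg config /mnt/borg-repo append_only 1
# For SSH-based repos, enforce it in the backup host's authorized_keys instead:
# command="borg serve --append-only --restrict-to-path /srv/borg" ssh-ed25519 AAAA... backup-key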
rsync + ZFS Snapshots
If you run on ZFS or Btrfs, the snapshot mechanism gives you near-instant point-in-time copies for free. Combine with rsync to a remote box and you have a fast, simple, native solution. The downside is no built-in encryption and no dedup across machines.
SNAP="tank/ai@$(date +%Y%m%d-%H%M)"   # capture the name once so snapshot and send match
zfs snapshot "$SNAP"
zfs send "$SNAP" | \
ssh backup-box zfs receive tank/ai-backup
For most teams, restic + S3 hits the right balance of features, simplicity, and cost.
Backing Up Ollama Model Files {#backup-models}
Ollama stores model files in ~/.ollama/models (Linux/Mac) or C:\Users\<user>\.ollama\models (Windows). The structure is content-addressed - blobs in models/blobs and manifests in models/manifests. This is great for backup because:
- Identical layers across models are stored once.
- Blob filenames are SHA-256 digests, so corruption is detectable.
- Restoring is just "put the files back in the right place."
The catch: Ollama writes to this directory while it runs. Backing up live can produce inconsistent state. Two safe approaches:
Option A: Stop Ollama briefly.
systemctl stop ollama
restic backup ~/.ollama/models
systemctl start ollama
For a 50-GB model directory and a fast disk, restic finishes the incremental in 30-90 seconds. Your AI is offline that long.
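If you go this route, wrap it so Ollama restarts even when restic errors out. A minimal sketch - the repo path is the one from the quick start, adjust to yours:
#!/usr/bin/env bash
# ollama-backup.sh - stop Ollama, back up the model directory, always restart
set -euo pipefail
trap 'systemctl start ollama' EXIT   # runs even if restic fails

systemctl stop ollama
restic -r /mnt/backup-disk/ai-repo backup ~/.ollama/models /var/lib/ollama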
Option B: Use a filesystem snapshot.
# ZFS
zfs snapshot tank/ollama@backup
restic backup /tank/.zfs/snapshot/backup/ollama/models
zfs destroy tank/ollama@backup
# LVM
lvcreate -L1G -s -n ollama-snap /dev/vg0/ollama
mount /dev/vg0/ollama-snap /mnt/snap
restic backup /mnt/snap/ollama/models
umount /mnt/snap
lvremove -f /dev/vg0/ollama-snap
Snapshots take milliseconds. Ollama keeps running during the backup.
Verify Model Integrity After Restore
# The fast check
ollama list
# The thorough check
for blob in ~/.ollama/models/blobs/sha256-*; do
expected=$(basename "$blob" | sed 's/sha256-//')
actual=$(sha256sum "$blob" | awk '{print $1}')
if [ "$expected" != "$actual" ]; then
echo "CORRUPT: $blob"
fi
done
This script has caught two silent corruptions for me - both in cases where the underlying disk had developed a fault. Worth running monthly even without a restore event.
Backing Up Fine-Tunes and Adapters {#backup-finetunes}
Fine-tunes are the most painful asset to lose. A LoRA adapter for a 70B model represents 8-30 hours of GPU time. The training data, the hyperparameter sweeps, the validation runs - none of that survives a drive failure unless you backed up the whole pipeline.
What to back up:
adapter_model.safetensors # The weights
adapter_config.json # The architecture
training_args.bin # Hyperparameters used
trainer_state.json # Final loss, eval metrics
training_data.jsonl # Source dataset (CRITICAL)
training_script.py # The exact code
That last one - the script - is the difference between "we have a backup" and "we can rebuild from scratch if needed." A binary backup of the weights without the script means you can use the fine-tune but cannot iterate on it.
I version-control everything except weights and data. Weights go into restic. Data goes into a separate encrypted bucket because it often contains client information.
# Daily fine-tune backup script
cd /opt/finetunes
git add training_script.py training_args.bin trainer_state.json
git commit -m "checkpoint $(date +%Y%m%d)" || true
git push origin main
restic -r s3:.../finetunes backup ./adapter_model.safetensors
Backing Up RAG Vector Stores {#backup-rag}
Vector stores are the most failure-prone backup target because they are constantly being written. A naive copy of a Chroma directory mid-write produces a corrupt restore that may load successfully and return wrong results.
Chroma
Chroma persists to a SQLite-backed directory. The right approach:
# Use Chroma's built-in snapshot
import chromadb
client = chromadb.PersistentClient(path="/data/chroma")
client.persist()  # flush pending writes (newer Chroma versions persist automatically and may not expose this method)
Then back up /data/chroma with restic. SQLite supports online backup, so even without stopping the application this is generally safe - but persist() first.
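If you want an explicitly consistent copy of the SQLite portion without touching the application, the sqlite3 CLI's online backup works. A sketch - the chroma.sqlite3 filename and staging path are assumptions that may differ by Chroma version, and the rest of /data/chroma still goes through restic as above:
# Consistent point-in-time copy of Chroma's SQLite file while the app keeps running
mkdir -p /data/chroma-staging
sqlite3 /data/chroma/chroma.sqlite3 ".backup '/data/chroma-staging/chroma.sqlite3'"
restic -r /mnt/backup-disk/ai-repo backup /data/chroma-staging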
Qdrant
Qdrant has a first-class snapshot API:
curl -X POST http://localhost:6333/collections/my-collection/snapshots
# Returns a snapshot path, e.g. /qdrant/snapshots/...
restic backup /qdrant/snapshots
FAISS
FAISS indices are immutable once written, so just back up the .index file. The trickier part is the metadata alongside it - usually a SQLite or Parquet file mapping vector IDs to source documents. Back up both, atomically.
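One way to keep the pair in lockstep is to stage both files into one timestamped directory and back that directory up as a unit. A sketch - the file names are placeholders, and it assumes the application is not rewriting the index at that moment:
#!/usr/bin/env bash
# Stage the FAISS index and its ID-to-document mapping together, back them up as one unit
set -euo pipefail
STAGE=/data/faiss-staging/$(date +%Y%m%d-%H%M)
mkdir -p "$STAGE"
cp /data/faiss/docs.index "$STAGE/"        # the index, immutable once written
cp /data/faiss/metadata.sqlite "$STAGE/"   # the matching ID -> document mapping
restic -r /mnt/backup-disk/ai-repo backup "$STAGE" --tag faiss
rm -rf "$STAGE"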
The Recovery Test
This is the part everyone skips. Periodically restore the vector store to a scratch directory and run a known query. If the top-10 results match what you saved before the backup, the restore is good. I run this monthly.
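Here is roughly what that monthly check looks like for a Qdrant-backed store - the collection name, snapshot filename, and the known-query/golden files are placeholders you would record once from the live system:
#!/usr/bin/env bash
# Restore the latest Qdrant snapshot into a throwaway container and compare a known query
set -euo pipefail
SCRATCH=$(mktemp -d)
restic -r /mnt/backup-disk/ai-repo restore latest \
  --include /qdrant/snapshots --target "$SCRATCH"

docker run -d --rm --name qdrant-restore-check -p 16333:6333 \
  -v "$SCRATCH/qdrant/snapshots:/qdrant/snapshots" qdrant/qdrant
sleep 10

# Recover the collection from the restored snapshot file
curl -s -X PUT http://localhost:16333/collections/my-collection/snapshots/recover \
  -H 'Content-Type: application/json' \
  -d '{"location": "file:///qdrant/snapshots/my-collection/my-collection-latest.snapshot"}'

# Run the saved query and compare the top-10 IDs against the golden copy
curl -s -X POST http://localhost:16333/collections/my-collection/points/search \
  -H 'Content-Type: application/json' -d @known-query.json \
  | jq '[.result[].id]' > "$SCRATCH/top10.json"
diff "$SCRATCH/top10.json" golden-top10.json && echo "restore OK" || echo "restore MISMATCH"

docker stop qdrant-restore-check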
Backing Up Audit Logs and Application State {#backup-audit}
Audit logs are special. They are append-only by design (see our audit trail guide for why), and your backup must preserve that property. The right pattern is continuous WAL shipping, not periodic snapshots.
For PostgreSQL:
# postgresql.conf on the audit log database
# (RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE must be set in PostgreSQL's environment)
archive_mode = on
archive_command = 'restic backup --tag wal --stdin --stdin-filename %f < %p'
For SQLite (which I use for smaller deployments):
# Use Litestream for continuous replication
litestream replicate /var/lib/audit/audit.db s3://audit-backup/audit
Litestream replicates SQLite to S3 with sub-second lag. For audit data this is the closest thing to a free lunch in this space.
The application state database (users, sessions, settings) is more standard. Daily pg_dump plus 15-minute WAL shipping for a typical small SaaS-style deployment. Nothing AI-specific here.
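For completeness, the nightly dump is one cron line - the database name and dump path are placeholders:
# /etc/cron.d/app-db-dump - nightly logical dump of the application database
0 1 * * * postgres pg_dump -Fc -f /var/backups/appdb-$(date +\%F).dump appdb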
Restore Drills: The Part Everyone Skips {#drills}
A backup that has never been restored is not a backup. It is a hopeful gesture.
Schedule quarterly restore drills. Mine looks like:
- Spin up a clean VM (1 hour).
- Install Ollama from scratch (10 minutes).
- Restore from the latest backup:
restic -r s3:.../ai-backup restore latest --target /
systemctl start ollama
- Run the smoke test suite (10 minutes):
- ollama list shows expected models
- Vector store query returns expected results
- Audit log chain verification passes
- Fine-tuned adapter loads and produces expected output
- Document timing. Compare to your RTO target. If you missed it, fix the gap.
- Throw away the VM.
The first drill always reveals something broken. The second drill catches what you fixed in the first. By the fourth or fifth drill, the procedure is muscle memory and recovery from a real outage takes minutes, not days.
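The smoke test itself does not need to be fancy. Mine is a short script along these lines - the model name and the Qdrant health check are placeholders for whatever your stack runs:
#!/usr/bin/env bash
# drill-smoke-test.sh - quick pass/fail checks after a restore
set -euo pipefail

# Expected models are present
ollama list | grep -q "llama3.3:70b" || { echo "FAIL: model missing"; exit 1; }

# The model actually answers
ollama run llama3.3:70b "Reply with the single word OK" | grep -qi "ok" \
  || { echo "FAIL: model not responding"; exit 1; }

# Vector store is up (Qdrant's root endpoint here; swap in your store's health check)
curl -sf http://localhost:6333/ > /dev/null || { echo "FAIL: vector store down"; exit 1; }

echo "Smoke test passed"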
Off-Site and Air-Gapped Copies {#offsite}
The off-site copy goes to S3 (or any equivalent object store). Two things that matter:
1. Bucket-level encryption with your own key. AWS KMS with a CMK you control. Default S3 encryption is not enough for compliance.
2. Object Lock in compliance mode. This makes the bucket genuinely write-once for the retention period. Even an attacker with full AWS console access cannot delete protected objects. For audit data this is required for SOC 2 evidence; for everything else it is a nice-to-have.
aws s3api put-object-lock-configuration \
--bucket my-ai-backup \
--object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":2555}}}'
The air-gapped copy is the boring USB SSD in a drawer. Once a week, plug it in, run restic backup, unplug. Label the drive with the date of the last sync. Rotate two drives so you always have a slightly older copy as a fallback.
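The weekly ritual is short enough to script. A sketch, assuming the USB drive is labeled AIRGAP and the restic repo was initialised on it once:
#!/usr/bin/env bash
# airgap-sync.sh - run weekly with the USB SSD plugged in, then unplug it
set -euo pipefail
mount /dev/disk/by-label/AIRGAP /mnt/airgap
restic -r /mnt/airgap/ai-repo backup \
  ~/.ollama/models /var/lib/ollama /opt/anythingllm/storage /var/lib/audit-db
restic -r /mnt/airgap/ai-repo check --read-data-subset=5%   # spot-check the repo before unplugging
umount /mnt/airgap
echo "Air-gap sync finished $(date +%F) - unplug and relabel the drive."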
For deeper isolation patterns, see our air-gapped AI deployment guide.
Pitfalls {#pitfalls}
What goes wrong:
1. Backups on the same physical drive. I have seen this twice. /data/ollama and /data/backup on the same NVMe. The drive failed, both were gone. Backup destinations must be physically separate.
2. No encryption. Model weights themselves are usually fine to leave unencrypted. Audit logs, training data, and any embedded client documents must be encrypted at rest. Restic does this by default. Borg does not unless you choose an encrypted mode. Choose it.
3. Forgetting the password. Restic and Borg passwords are not recoverable. Lose them, lose the backup. Store passwords in a password manager and a sealed envelope in a safe.
4. Backing up while training is running. A fine-tune in progress will leave half-written checkpoint files. Either pause training during backup or back up only the most recent fully-written checkpoint.
5. No bandwidth budget. A 200-GB backup over a residential 50 Mbps uplink takes 9-10 hours. Test the real upload time, not the marketing number. Schedule overnight if needed.
6. Restore from latest only. Sometimes the corruption you are recovering from happened a week ago. Test restoring from a 30-day-old snapshot too. Make sure the retention policy keeps enough history.
7. No alerting on backup failure. A backup that silently fails for two months is worse than no backup at all - it gives false confidence. Wire the restic exit code into your monitoring stack (a minimal sketch follows this list).
8. Restoring to the wrong path. ~/.ollama/models on Linux, C:\Users\<user>\.ollama\models on Windows. Restoring to the wrong path means Ollama does not see the models. Test the restore path explicitly during drills.
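For pitfall 7, the lowest-effort pattern I know is a dead-man's-switch ping: the monitoring service alerts when the backup job stops checking in. A sketch - the heartbeat URL is a placeholder for whichever service you use, and the /fail suffix assumes it supports explicit failure pings:
#!/usr/bin/env bash
# Report backup success or failure to a heartbeat monitor
set -uo pipefail   # no -e: we want to capture restic's exit code ourselves
HEARTBEAT_URL="https://hc-ping.example.com/ai-backup"   # placeholder endpoint

restic -r /mnt/backup-disk/ai-repo backup ~/.ollama/models /var/lib/ollama
STATUS=$?

if [ "$STATUS" -eq 0 ]; then
  curl -fsS -m 10 "$HEARTBEAT_URL" > /dev/null
else
  curl -fsS -m 10 "$HEARTBEAT_URL/fail" > /dev/null
  exit "$STATUS"
fi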
FAQ {#faq}
The single most common question I get: "How much does this cost?" For a typical small team setup - 200 GB total local AI footprint, daily incrementals, 90-day retention, S3 standard storage in us-east-1 - the monthly cost is $4-7 in S3 storage and another $3-5 in PUT/GET request charges. Less than your team's coffee budget. Less than five minutes of your engineer's time to restore from a real outage.
Where to Take This Next
Three deeper rabbit holes:
- Multi-region replication - if you have customers across regions, replicating S3 backups to a second region (with cross-region replication or restic's multiple-repo support) is cheap insurance against a regional outage.
- Encryption key escrow - for SOC 2 you need a documented procedure for what happens if the engineer who set up the backup leaves. Two-person key access, written runbooks, escrow with corporate IT.
- Backup verification automation - extend the quarterly restore drill into a weekly automated test that spins up a container, restores, runs a smoke test, and reports green/red. We use this pattern for our audit trail verification too.
For the broader operational picture, our Ollama production deployment guide and securing Ollama guide are the natural companions.
Conclusion
The pattern is the same one good operators have used for thirty years. What changes for local AI is the inventory of things to back up. Model files. Adapter weights. Vector indices. Audit logs. Each has its own quirks - mid-write corruption, retention requirements, encryption needs - and a generic backup strategy will quietly miss something that matters.
The setup I described in this guide takes about a day to deploy and a day per quarter to drill. The cost is single-digit dollars per month. The first time you have to restore - and you will, because drives fail and humans run rm -rf - the difference between minutes-long and weeks-long downtime is exactly this work.
Build the simple version this weekend. Run the first restore drill before the end of the month. The day a drive fails and you watch the deployment come back in 20 minutes is the day this guide will have paid for itself a thousand times over.