Local AI for Journalists: Protect Sources With Offline AI
Published on April 23, 2026 • 19 min read
A reporter at a regional paper called me in February. She had been working on a story about wage theft at a chain of nursing homes for five months. She had eight hours of recorded interviews with workers — most of them undocumented, all of them terrified — and a stack of internal scheduling spreadsheets a former assistant manager had given her. Her editor wanted a draft by the end of the week. She had been planning to feed the audio into Otter.ai and the spreadsheets into ChatGPT.
I asked her to read me Otter.ai's data retention policy. She got to the part about "may share with affiliates and service providers" and stopped. We spent the next afternoon setting up a laptop that did the entire pipeline offline. The story ran six weeks later. The sources are still safe.
This guide is that setup, written for the reporter who actually has something at stake. Whisper for transcription, Ollama for analysis, AnythingLLM for searching FOIA dumps — all on a machine you can lock in a desk drawer. It is not paranoid. It is the same reasoning behind why you encrypt Signal and use SecureDrop. The AI tier of your stack should match the security tier of the rest of it.
This is technical guidance, not legal advice. Consult your newsroom's lawyer, your jurisdiction's shield laws, and your editor before applying any of this to a live investigation.
Quick Start: A Source-Safe Stack in 30 Minutes {#quick-start}
If you need the working setup right now:
- Air-gap a laptop. Wipe it, install Linux Mint or macOS, do not log into iCloud or Google.
- Install Ollama and Whisper: `brew install ollama whisper-cpp` on Mac, or build from source on Linux.
- Pull offline-capable models: `ollama pull qwen2.5:14b` (8 GB), `ollama pull nomic-embed-text` (274 MB), and Whisper's `large-v3` model (1.5 GB).
- Deploy AnythingLLM in Docker. One command, restricted to localhost.
- Disable the network: `sudo ifconfig en0 down` (Mac) or `nmcli radio wifi off` (Linux). The stack still works.
That's a usable air-gapped investigative workstation. The rest of this guide covers operational security, FOIA workflow, named-entity extraction, and the specific mistakes that get sources outed.
Table of Contents
- Why Cloud AI Is a Source Risk
- Threat Model for Investigative Reporters
- The Stack
- Hardware: The Disposable Investigation Laptop
- Step 1 — Offline Whisper Transcription
- Step 2 — Document RAG With AnythingLLM
- Step 3 — Named-Entity Extraction From FOIA Dumps
- Step 4 — Cross-Document Timeline Building
- Operational Security Rules
- Workflow Templates
- Limitations and Hallucination Risk
Why Cloud AI Is a Source Risk {#why-cloud-ai}
Three concrete problems make hosted AI services dangerous for source-protective reporting:
1. Subpoenas reach where you cannot. OpenAI, Anthropic, Google, and Otter.ai have all received subpoenas for user data. Some honor them; some fight them; some notify users; some are gagged from notifying users. Even the strongest of these services cannot promise that material processed through their systems is unreachable. Local processing eliminates the third party entirely — there is no one to subpoena except you.
2. Retention is not deletion. Most cloud AI services retain prompts and outputs for some window after a "delete" action — for abuse review, model improvement, or legal hold. Otter.ai's documentation confirms transcripts can persist after account deletion under certain conditions. This window does not need to be long; it needs only to overlap with a discovery request.
3. Logs are evidence. Cloud AI services produce logs that include timestamps, IP addresses, and identifiers. A pattern of API calls timed alongside a source meeting can itself become incriminating metadata in a hostile legal or political environment.
Whether your reporting is on labor abuses, policing, or government corruption, the question is not whether your AI vendor's intentions are good. The question is what is technically possible to extract from their systems under legal compulsion. The only architecturally safe answer is: nothing about you exists there.
The Reporters Committee for Freedom of the Press maintains current guidance on subpoena risk for journalism workflows.
Threat Model for Investigative Reporters {#threat-model}
Before you pick tools, know who you are defending against. Most reporters need to defend against at least two of these:
| Threat | Likelihood | Impact | Local AI Helps? |
|---|---|---|---|
| Civil subpoena to your AI vendor | High in litigation-heavy beats | Source identification | Yes — eliminates vendor |
| Federal subpoena (national-security beat) | Medium | Source identification + jail | Yes |
| Foreign government surveillance | Beat-dependent | Severe | Yes, with air gap |
| Hostile sources / company insiders | Medium | Story leak before publication | Partially — depends on opsec |
| Theft of laptop | Always | Catastrophic | Only with disk encryption + screen lock |
| Mistakes by you under deadline | Always | Catastrophic | No — process matters more than tools |
The last row is critical. The single largest cause of source exposure in newsrooms is reporters under deadline pressure pasting things into the wrong window. No tool fixes that. Workflow discipline does.
The Stack {#the-stack}
| Layer | Tool | What it does |
|---|---|---|
| OS | Linux Mint or macOS (FileVault on) | Encrypted, minimal telemetry |
| Model engine | Ollama | Runs LLMs locally, no telemetry |
| Primary model | Qwen 2.5 14B | Strong reasoning, fits in 16 GB RAM |
| Light model | Llama 3.2 3B | Fast triage tasks |
| Transcription | whisper.cpp (large-v3) | Offline speech-to-text |
| Document RAG | AnythingLLM (Docker, localhost) | Searches FOIA dumps and source files |
| Embeddings | nomic-embed-text | Vector search for documents |
| NER | spaCy + custom prompts | People, organizations, dates, locations |
Total software cost: $0. All open source. Total disk footprint: ~30 GB after model downloads.
Hardware: The Disposable Investigation Laptop {#hardware}
Investigative work justifies a dedicated machine — one that does nothing else. Recommended baseline:
| Component | Specification | Why |
|---|---|---|
| Laptop | Refurbished ThinkPad T14 Gen 3 or Mac Mini M2 Pro | Cheap enough to retire after a sensitive investigation |
| RAM | 32 GB | Runs 14B with comfortable headroom |
| Storage | 1 TB NVMe with full-disk encryption | Source material and FOIA dumps add up |
| Screen filter | 3M privacy filter | Cafés exist |
| Network | Disabled by default | Use a separate machine for OSINT |
Buy used or refurbished. Pay cash if your beat warrants it. Do not enroll the device in any organizational MDM unless your newsroom has a privileged "investigations" tier with explicit policy support.
For broader hardware sizing across local AI use cases, see our AI hardware requirements guide.
Step 1 — Offline Whisper Transcription {#whisper}
Transcription is the most important job in this stack, and whisper.cpp handles it by running OpenAI's open-weights Whisper speech model entirely on your CPU or GPU, with no network calls.
```shell
# Build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j

# Download the large-v3 model (best accuracy, 1.5 GB)
bash ./models/download-ggml-model.sh large-v3

# Transcribe an interview
./build/bin/whisper-cli \
  -m models/ggml-large-v3.bin \
  -f interview-2026-04-12.wav \
  -l en \
  -otxt -ovtt -osrt
```
Real benchmarks I measured on three machines:
| Machine | Model | Audio length | Real time |
|---|---|---|---|
| MacBook Air M1 16 GB | large-v3 | 60 min | 11 min |
| MacBook Pro M2 Max 32 GB | large-v3 | 60 min | 4 min |
| ThinkPad T14 (i7, 32 GB CPU only) | large-v3 | 60 min | 38 min |
| ThinkPad T14 (RTX 3050 Ti, CUDA) | large-v3 | 60 min | 9 min |
For most reporters, an M1 or M2 MacBook Air is sufficient. If you have multi-hour interviews, the M2 Max is the productivity unlock.
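Those benchmarks work out to speedups of roughly 5.5× real time on the M1 Air and 15× on the M2 Max. A small helper (illustrative, not part of whisper.cpp) for estimating turnaround once you have benchmarked a single file on your own machine:

```python
# Turnaround estimator from the benchmarks above -- measure one file on your
# own hardware and substitute your own ratio; these numbers are from my tests.
def transcription_minutes(audio_minutes: float, speedup: float) -> float:
    """speedup = audio length divided by wall-clock transcription time."""
    return audio_minutes / speedup

# Eight hours of interviews on an M1 Air (60 min of audio in 11 min, ~5.5x):
print(round(transcription_minutes(8 * 60, 60 / 11)))  # → 88
```

On the slowest machine in the table (CPU-only ThinkPad, ~1.6×), the same eight hours would take roughly five hours, which is worth knowing before you promise your editor a draft.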
Pre-flight checklist before transcribing sensitive audio:
- Wi-Fi disabled (`nmcli radio wifi off` or hardware switch)
- Bluetooth disabled
- Screen recording, dictation, and "improve Siri" all disabled in OS settings
- No backup software running (Time Machine, OneDrive, Dropbox)
- Audio file on the laptop, not on a network share
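On Linux, the first item on that checklist can be verified rather than trusted — a minimal sketch using `iproute2`; interface names vary by machine, and on macOS you would inspect `ifconfig` output instead:

```shell
# Pre-flight air-gap check (Linux): list any non-loopback interface still up.
up_ifaces=$(ip -o link show up | awk -F': ' '{print $2}' | cut -d@ -f1 | grep -v '^lo$' || true)
if [ -n "$up_ifaces" ]; then
  echo "NOT air-gapped; still up: $up_ifaces"
else
  echo "air-gapped: only loopback is up"
fi
```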
For deeper Whisper coverage, see Local AI meeting transcription.
Step 2 — Document RAG With AnythingLLM {#anythingllm}
When you receive a 6,000-page FOIA response or a leak of internal company documents, the bottleneck stops being reading speed and starts being indexing. AnythingLLM is the right tool: it builds a vector index of your documents, then lets you ask questions that get answered with citations back to the exact source page.
```shell
docker run -d \
  -p 127.0.0.1:3001:3001 \
  -v anythingllm-investigation:/app/server/storage \
  --add-host=host.docker.internal:host-gateway \
  -e LLM_PROVIDER=ollama \
  -e OLLAMA_BASE_PATH=http://host.docker.internal:11434 \
  -e OLLAMA_MODEL_PREF=qwen2.5:14b \
  -e EMBEDDING_ENGINE=ollama \
  -e EMBEDDING_MODEL_PREF=nomic-embed-text \
  --name anythingllm-investigation \
  --restart always \
  mintplexlabs/anythingllm
```
The critical flag: `-p 127.0.0.1:3001:3001` binds the service to localhost only. Do not expose this port to your network. If you need to share access with a colleague, use SSH tunneling or a hardware VPN appliance.
Workspace organization for an investigation:
- One workspace per story. Documents from "Story A" should never be retrievable when querying "Story B."
- Subdivide: "Story A — Internal docs," "Story A — FOIA," "Story A — Public records."
- Tag each document with date received and source identifier (without naming the source).
- Use the AnythingLLM access control feature to restrict per-user access if multiple reporters share the workstation.
Practical query patterns that work:
- "Find every mention of [executive name] across all internal emails. Return date, recipient, and one-line summary."
- "What dates does this contract describe in 2024? Quote the exact text."
- "List every unique organization mentioned in the FOIA response, sorted by frequency."
- "Identify discrepancies between the public statement on [date] and the internal documents from the same week."
The model answers with citations. Verify each one by clicking through to the source. Never quote AI-generated text in published reporting; AI generates leads, you verify them in the source documents.
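Part of that verification can be mechanized. A sketch — the `leads` structure here is my own assumption for notes you copy out of the model's answers, not an AnythingLLM export format — that checks each claimed quote appears verbatim (whitespace-normalized) in the named source file:

```python
# Check that each quote the model attributes to a document actually appears
# there -- a lead that fails this check goes back to manual review.
from pathlib import Path

def verify_leads(leads: list[dict], docs_dir: str) -> list[dict]:
    """Each lead: {'file': filename, 'quote': text}. Returns the failures."""
    failures = []
    for lead in leads:
        source = Path(docs_dir) / lead["file"]
        text = source.read_text(errors="ignore") if source.exists() else ""
        # Normalize whitespace so line wrapping doesn't cause false misses.
        if " ".join(lead["quote"].split()) not in " ".join(text.split()):
            failures.append(lead)
    return failures
```

A passing check only proves the words exist in the document; it says nothing about context, so the click-through read still happens.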
For a full RAG walkthrough see the RAG local setup guide and the AnythingLLM setup guide.
Step 3 — Named-Entity Extraction From FOIA Dumps {#ner-foia}
When a FOIA request returns 4,000 pages, the first useful pass is to pull out every person, organization, date, and place mentioned. spaCy does this offline in seconds.
```python
# pip install spacy
# python -m spacy download en_core_web_lg
import spacy
from pathlib import Path
from collections import Counter
import json

nlp = spacy.load("en_core_web_lg")
entities = Counter()
docs_dir = Path("./foia-response-2026-04")

for txt in docs_dir.glob("**/*.txt"):
    text = txt.read_text(errors="ignore")
    doc = nlp(text[:1_000_000])  # cap at 1M characters (spaCy's default max_length)
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "GPE", "DATE", "MONEY"}:
            entities[(ent.text.strip(), ent.label_)] += 1

# Top 200 entities by frequency
top = entities.most_common(200)
Path("entities.json").write_text(json.dumps(top, indent=2))
```
Now feed that list to Qwen for clustering:
```shell
ollama run qwen2.5:14b "I have a list of entities extracted from a FOIA response \
about [topic]. Cluster them into: (1) public officials, (2) private contractors, \
(3) shell companies (anything with LLC/Inc but no clear business purpose), \
(4) places, (5) dollar amounts above \$100K. Output as a Markdown table.

$(cat entities.json)"
```
This single pass typically surfaces three to five names that warrant follow-up — usually the contractors no one had heard of and the LLCs registered to PO boxes.
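Before the model pass, a deterministic pre-filter can surface the LLC-style names on its own — a sketch that assumes the `entities.json` format written by the spaCy script above (a list of `[[text, label], count]` entries):

```python
# Pull ORG entities whose names look like shell-company candidates.
# The keyword pattern is a starting point, not a definition of "shell company".
import json
import re

def llc_candidates(path: str) -> list[tuple[str, int]]:
    with open(path) as f:
        top = json.load(f)  # [[[text, label], count], ...]
    pattern = re.compile(r"\b(LLC|L\.L\.C\.|Inc\.?|Holdings)\b", re.I)
    return [(text, count) for (text, label), count in top
            if label == "ORG" and pattern.search(text)]
```

Anything this flags still needs a business-registration lookup before you call it a shell company in print.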
Step 4 — Cross-Document Timeline Building {#timeline}
The most useful AI task in long investigations is timeline construction. Hand the model a workspace of dated documents and ask:
"Build a chronological timeline of every event mentioned across these documents.
For each entry, include: date, source document filename, one-line description,
and the people involved. Skip entries that are routine business operations."
Qwen 2.5 14B handles this remarkably well across hundreds of documents because AnythingLLM retrieves only the most relevant chunks per question, keeping the context manageable.
Critical post-processing rule: paste the timeline into a real document, then verify every single entry by clicking back to the source. Models will occasionally invent or merge dates. Treat the timeline as a draft outline, not as evidence.
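Part of that verification pass can be scripted. A sketch, assuming you have pasted the timeline into a simple list of entries — the field names here are illustrative, not a format AnythingLLM emits: each entry's source document must exist and must contain the date string the model claims.

```python
# Spot-check a model-built timeline against the source documents.
# A failed check means the entry is a hallucination candidate, not evidence.
from pathlib import Path

def check_timeline(entries: list[dict], docs_dir: str) -> list[str]:
    """Each entry: {'source': filename, 'date': date string as quoted}."""
    problems = []
    for e in entries:
        f = Path(docs_dir) / e["source"]
        if not f.exists():
            problems.append(f"{e['source']}: file not found")
        elif e["date"] not in f.read_text(errors="ignore"):
            problems.append(f"{e['source']}: date {e['date']!r} not in document")
    return problems
```

This only catches invented filenames and dates; merged or misattributed events still require reading the documents.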
Operational Security Rules {#opsec}
These are the rules I have given to every reporter who has set up this stack. Each one comes from a real near-miss.
1. Two laptops minimum. One is the air-gapped investigation machine. One is the everyday machine you use for email and Slack. Never paste from one into the other without thinking. The investigation laptop never gets your work email account.
2. Encrypt everything at rest. FileVault on Mac, LUKS on Linux. Set the laptop to require password after 1 minute of sleep. Lock it whenever you stand up.
3. Disable cloud sync for the investigation folder. No iCloud Drive, no OneDrive, no Dropbox, no Google Drive. The folder lives on the laptop and gets backed up to an external SSD that lives in a safe.
4. Strip metadata from screenshots and exported documents. `exiftool -all= file.jpg` removes EXIF; for PDFs, run `exiftool -all= file.pdf` and then `qpdf --linearize file.pdf stripped.pdf` so the removed metadata cannot be recovered from the rewritten file. Photos taken on your iPhone include GPS coordinates. The DOCX file you exported includes your username.
5. Never type a source's name into the AI. Use a code name in your prompts. The model never needs to know "Maria Lopez" — only "Source A." Maintain a separate, encrypted file mapping code names to real names. That mapping file should not exist on the AI laptop at all.
6. Air-gap before sensitive operations. When transcribing source audio or running entity extraction on leaked documents, disable the network at the OS level. Verify with ifconfig (Mac) or ip a (Linux) that no interface is up. Do this even if you trust your VPN; the goal is verifiable architecture, not trust.
7. Treat AI output as a tip, not a quote. Models hallucinate. Every fact you publish must be verified in the source documents themselves, not in the model's summary of them. This is the same standard you would apply to a tip from a confidential source.
8. Plan for seizure. If the laptop could be physically taken — by police, by a bad-faith civil suit, by a thief — it must be encrypted with a password not stored anywhere except your head. Practice your wipe procedure. If you travel internationally with sensitive material, leave the laptop home and rebuild on arrival.
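Rule 5 can be enforced mechanically on the machine that holds the code-name mapping, so only redacted text ever crosses to the AI laptop. A minimal sketch — the names and code names are invented:

```python
# Replace real names with code names before any text goes near a prompt.
# Run this where the mapping lives, never on the AI laptop itself.
import re

def redact(text: str, mapping: dict[str, str]) -> str:
    """mapping: real name -> code name; longest names are replaced first
    so 'Maria Lopez' is consumed before a bare 'Maria' rule fires."""
    for real in sorted(mapping, key=len, reverse=True):
        text = re.sub(re.escape(real), mapping[real], text, flags=re.I)
    return text

print(redact("Maria Lopez told me Maria saw the ledger.",
             {"Maria Lopez": "Source A", "Maria": "Source A"}))
# → Source A told me Source A saw the ledger.
```

Simple string substitution misses nicknames, misspellings, and identifying details ("the night-shift supervisor at the Elm Street facility"), so treat it as a backstop for the discipline, not a replacement for it.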
The Freedom of the Press Foundation maintains a continuously updated training program covering source protection that pairs well with this technical stack.
Workflow Templates {#workflows}
Interview-day workflow
- Record on a dedicated audio recorder, not your phone.
- Transfer audio via SD card to the air-gapped laptop.
- Wi-Fi off. Verify with `ifconfig`.
- Transcribe: `whisper-cli -m models/ggml-large-v3.bin -f interview.wav -otxt`.
- Skim transcript, note timecodes for important quotes.
- Move audio file to encrypted external SSD. Wipe from laptop's working folder.
FOIA-response-day workflow
- Receive the FOIA response on the everyday laptop.
- Scan for malware on a separate sandbox.
- Transfer to air-gapped laptop via USB drive.
- Run OCR if needed: `ocrmypdf input.pdf output.pdf` (offline).
- Ingest into AnythingLLM workspace.
- Run entity extraction.
- Generate timeline.
- Begin manual verification.
Pre-publication workflow
- Final fact-check pass: query AnythingLLM for every claim in the draft.
- If a claim is not directly supported by source documents, flag it for additional reporting or removal.
- Strip metadata from any screenshots used in the published story.
- Coordinate with editor and counsel on what redactions are needed.
Limitations and Hallucination Risk {#limitations}
Local AI is a powerful research aid. It is not infallible.
Models invent quotes. Especially on long documents with many speakers, Qwen and Llama will occasionally attribute a real quote to the wrong person, or fabricate a quote that fits the context but does not appear anywhere. Verify every quote against the original audio or document.
Models miss negation. "X did not approve the contract" can be summarized as "X approved the contract" if the negation is structurally awkward. Re-read every model summary that hinges on a yes/no.
Models default to plausibility. When asked to identify a "shell company," the model will sometimes label any unfamiliar LLC as such. Verify business registration records before publishing.
Whisper mishears proper nouns. Names of small towns, foreign words, and people with unusual names are routinely transcribed phonetically. Always check the audio when a name appears in the transcript.
Air-gap is only as strong as your discipline. A single moment of "let me just paste this into ChatGPT to double-check" defeats the entire stack. Either your workflow is air-gapped or it is not. There is no halfway.
What to Read Next
Once the air-gapped stack is running:
- Local AI privacy guide — the broader case for self-hosted AI in regulated and sensitive work.
- Local AI document summarizer — handling the 200-page report that arrives at 4 PM.
- Local AI meeting transcription — Whisper deep dive with multi-speaker handling.
There is a longer argument I want to make in closing: protecting sources is not an act of paranoia. It is the central professional obligation of investigative journalism. The tools you use to handle source material need to honor that obligation in their architecture, not just in their marketing copy. Local AI does. Cloud AI cannot, and dressing it up in privacy promises does not change the technical facts.
The story that runs because the source could trust you is worth more than the hour you saved by uploading the audio.