Industry Guide

Local AI for Journalists: Protect Sources With Offline AI

April 23, 2026
19 min read
Local AI Master Research Team



A reporter at a regional paper called me in February. She had been working on a story about wage theft at a chain of nursing homes for five months. She had eight hours of recorded interviews with workers — most of them undocumented, all of them terrified — and a stack of internal scheduling spreadsheets a former assistant manager had given her. Her editor wanted a draft by the end of the week. She had been planning to feed the audio into Otter.ai and the spreadsheets into ChatGPT.

I asked her to read me Otter.ai's data retention policy. She got to the part about "may share with affiliates and service providers" and stopped. We spent the next afternoon setting up a laptop that did the entire pipeline offline. The story ran six weeks later. The sources are still safe.

This guide is that setup, written for the reporter who actually has something at stake. Whisper for transcription, Ollama for analysis, AnythingLLM for searching FOIA dumps — all on a machine you can lock in a desk drawer. It is not paranoid. It is the same reasoning behind using Signal and SecureDrop. The AI tier of your stack should match the security tier of the rest of it.

This is technical guidance, not legal advice. Consult your newsroom's lawyer, your jurisdiction's shield laws, and your editor before applying any of this to a live investigation.

Quick Start: A Source-Safe Stack in 30 Minutes {#quick-start}

If you need the working setup right now:

  1. Air-gap a laptop. Wipe it, install Linux Mint or macOS, and do not log into iCloud or Google.
  2. Install Ollama and Whisper. brew install ollama whisper-cpp on Mac, or build from source on Linux.
  3. Pull the models while you still have a connection. ollama pull qwen2.5:14b (8 GB), ollama pull nomic-embed-text (274 MB), and Whisper's large-v3 model (1.5 GB).
  4. Deploy AnythingLLM in Docker. One command. Restricted to localhost.
  5. Disable the network. sudo ifconfig en0 down (Mac) or nmcli radio wifi off (Linux). The stack still works.

That's a usable air-gapped investigative workstation. The rest of this guide covers operational security, FOIA workflow, named-entity extraction, and the specific mistakes that get sources outed.

Table of Contents

  1. Why Cloud AI Is a Source Risk
  2. Threat Model for Investigative Reporters
  3. The Stack
  4. Hardware: The Disposable Investigation Laptop
  5. Step 1 — Offline Whisper Transcription
  6. Step 2 — Document RAG With AnythingLLM
  7. Step 3 — Named-Entity Extraction From FOIA Dumps
  8. Step 4 — Cross-Document Timeline Building
  9. Operational Security Rules
  10. Workflow Templates
  11. Limitations and Hallucination Risk

Why Cloud AI Is a Source Risk {#why-cloud-ai}

Three concrete problems make hosted AI services dangerous for source-protective reporting:

1. Subpoenas reach where you cannot. OpenAI, Anthropic, Google, and Otter.ai have all received subpoenas for user data. Some honor them; some fight them; some notify users; some are gagged from notifying users. Even the strongest of these services cannot promise that material processed through their systems is unreachable. Local processing eliminates the third party entirely — there is no one to subpoena except you.

2. Retention is not deletion. Most cloud AI services retain prompts and outputs for some window after a "delete" action — for abuse review, model improvement, or legal hold. Otter.ai's documentation confirms transcripts can persist after account deletion under certain conditions. This window does not need to be long; it needs only to overlap with a discovery request.

3. Logs are evidence. Cloud AI services produce logs that include timestamps, IP addresses, and identifiers. A pattern of API calls timed alongside a source meeting can itself become incriminating metadata in a hostile legal or political environment.

Whether your reporting is on labor abuses, policing, or government corruption, the question is not whether your AI vendor's intentions are good. The question is what is technically possible to extract from their systems under legal compulsion. The only architecturally safe answer is: nothing about you exists there.

The Reporters Committee for Freedom of the Press maintains current guidance on subpoena risk for journalism workflows.


Threat Model for Investigative Reporters {#threat-model}

Before you pick tools, know who you are defending against. Most reporters need to defend against at least two of these:

| Threat | Likelihood | Impact | Local AI helps? |
|---|---|---|---|
| Civil subpoena to your AI vendor | High in litigation-heavy beats | Source identification | Yes — eliminates vendor |
| Federal subpoena (national-security beat) | Medium | Source identification + jail | Yes |
| Foreign government surveillance | Beat-dependent | Severe | Yes, with air gap |
| Hostile sources / company insiders | Medium | Story leak before publication | Partially — depends on opsec |
| Theft of laptop | Always | Catastrophic | Only with disk encryption + screen lock |
| Mistakes by you under deadline | Always | Catastrophic | No — process matters more than tools |

The last row is critical. The single largest cause of source exposure in newsrooms is reporters under deadline pressure pasting things into the wrong window. No tool fixes that. Workflow discipline does.


The Stack {#the-stack}

| Layer | Tool | What it does |
|---|---|---|
| OS | Linux Mint or macOS (FileVault on) | Encrypted, minimal telemetry |
| Model engine | Ollama | Runs LLMs locally, no telemetry |
| Primary model | Qwen 2.5 14B | Strong reasoning, fits in 16 GB RAM |
| Light model | Llama 3.2 3B | Fast triage tasks |
| Transcription | whisper.cpp (large-v3) | Offline speech-to-text |
| Document RAG | AnythingLLM (Docker, localhost) | Searches FOIA dumps and source files |
| Embeddings | nomic-embed-text | Vector search for documents |
| NER | spaCy + custom prompts | People, organizations, dates, locations |

Total software cost: $0. All open source. Total disk footprint: ~30 GB after model downloads.


Hardware: The Disposable Investigation Laptop {#hardware}

Investigative work justifies a dedicated machine — one that does nothing else. Recommended baseline:

| Component | Specification | Why |
|---|---|---|
| Laptop | Refurbished ThinkPad T14 Gen 3 or Mac Mini M2 Pro | Cheap enough to retire after a sensitive investigation |
| RAM | 32 GB | Runs 14B with comfortable headroom |
| Storage | 1 TB NVMe with full-disk encryption | Source material and FOIA dumps add up |
| Screen filter | 3M privacy filter | Cafés exist |
| Network | Disabled by default | Use a separate machine for OSINT |

Buy used or refurbished. Pay cash if your beat warrants it. Do not enroll the device in any organizational MDM unless your newsroom has a privileged "investigations" tier with explicit policy support.

For broader hardware sizing across local AI use cases, see our AI hardware requirements guide.


Step 1 — Offline Whisper Transcription {#whisper}

Whisper is the most important tool in this stack. The whisper.cpp implementation runs OpenAI's open-weights speech model entirely on your CPU or GPU, with no network calls.

# Build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j

# Download the large-v3 model (best accuracy, 1.5 GB)
bash ./models/download-ggml-model.sh large-v3

# Transcribe an interview
./build/bin/whisper-cli \
  -m models/ggml-large-v3.bin \
  -f interview-2026-04-12.wav \
  -l en \
  -otxt -ovtt -osrt
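If you batch several interviews, a thin wrapper saves retyping the flags. A sketch in Python, assuming the binary and model paths from the build commands above (the `./interviews` folder name is illustrative):

```python
import subprocess
from pathlib import Path

WHISPER = "./build/bin/whisper-cli"        # binary from the build step above
MODEL = "models/ggml-large-v3.bin"         # large-v3 weights downloaded above

def whisper_cmd(wav: Path) -> list[str]:
    # Build the argument list for one audio file; -otxt writes a .txt
    # transcript next to the input.
    return [WHISPER, "-m", MODEL, "-f", str(wav), "-l", "en", "-otxt"]

def transcribe_all(folder: str) -> None:
    # Transcribe every .wav in a folder, one at a time, fully offline.
    for wav in sorted(Path(folder).glob("*.wav")):
        subprocess.run(whisper_cmd(wav), check=True)
```

Run it with the network already disabled; whisper.cpp makes no network calls, so the loop works identically air-gapped.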

Benchmarks I measured across three machines (four configurations):

| Machine | Model | Audio length | Real time |
|---|---|---|---|
| MacBook Air M1, 16 GB | large-v3 | 60 min | 11 min |
| MacBook Pro M2 Max, 32 GB | large-v3 | 60 min | 4 min |
| ThinkPad T14 (i7, 32 GB, CPU only) | large-v3 | 60 min | 38 min |
| ThinkPad T14 (RTX 3050 Ti, CUDA) | large-v3 | 60 min | 9 min |

For most reporters, an M1 or M2 MacBook Air is sufficient. If you have multi-hour interviews, the M2 Max is the productivity unlock.

Pre-flight checklist before transcribing sensitive audio:

  • Wi-Fi disabled (nmcli radio wifi off or hardware switch)
  • Bluetooth disabled
  • Screen recording, dictation, and "improve Siri" all disabled in OS settings
  • No backup software running (Time Machine, OneDrive, Dropbox)
  • Audio file on the laptop, not on a network share
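The first two checklist items can be verified in code rather than by eye. A minimal sketch that inspects `ifconfig` output for non-loopback interfaces still holding an IPv4 address; interface naming varies by OS, and `ip addr` output needs minor tweaks, so treat this as a starting point:

```python
import re

def active_interfaces(ifconfig_output: str) -> list[str]:
    """Return non-loopback interfaces that still hold an IPv4 address,
    given the text output of `ifconfig`."""
    names = []
    # Each interface block starts at column 0; detail lines are indented.
    for block in re.split(r"\n(?=\S)", ifconfig_output):
        if not block.strip():
            continue
        first = block.split(":", 1)[0].split()[0]
        if first.startswith("lo"):
            continue  # loopback stays up even on an air-gapped machine
        if re.search(r"\binet \d+\.\d+\.\d+\.\d+", block):
            names.append(first)
    return names

# On the laptop: run `ifconfig`, pass its output in, and expect an
# empty list before touching source material.
```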

For deeper Whisper coverage, see Local AI meeting transcription.


Step 2 — Document RAG With AnythingLLM {#anythingllm}

When you receive a 6,000-page FOIA response or a leak of internal company documents, the bottleneck is no longer reading speed; it is finding the right passage. AnythingLLM is the right tool for that: it builds a vector index of your documents, then lets you ask questions that get answered with citations back to the exact source page.

docker run -d \
  -p 127.0.0.1:3001:3001 \
  -v anythingllm-investigation:/app/server/storage \
  --add-host=host.docker.internal:host-gateway \
  -e LLM_PROVIDER=ollama \
  -e OLLAMA_BASE_PATH=http://host.docker.internal:11434 \
  -e OLLAMA_MODEL_PREF=qwen2.5:14b \
  -e EMBEDDING_ENGINE=ollama \
  -e EMBEDDING_MODEL_PREF=nomic-embed-text \
  --name anythingllm-investigation \
  --restart always \
  mintplexlabs/anythingllm

The critical flag: -p 127.0.0.1:3001:3001 binds the service to localhost only. Do not expose this port to your network. If you need to share access with a colleague, use SSH tunneling or a hardware VPN appliance.

Workspace organization for an investigation:

  • One workspace per story. Documents from "Story A" should never be retrievable when querying "Story B."
  • Subdivide: "Story A — Internal docs," "Story A — FOIA," "Story A — Public records."
  • Tag each document with date received and source identifier (without naming the source).
  • Use the AnythingLLM access control feature to restrict per-user access if multiple reporters share the workstation.
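The tagging rule above is easier to keep when it lives in a manifest file next to the documents rather than in your head. A sketch, where the `SRC-A` style code name and the field names are my own convention, not an AnythingLLM feature:

```python
import hashlib
import json
from pathlib import Path

def add_to_manifest(manifest_path: str, doc_path: str,
                    received: str, source_code: str) -> dict:
    """Record a document with its receipt date and source code name.
    The SHA-256 digest lets you later prove exactly which version of a
    file you indexed, without storing anything about the source."""
    p = Path(manifest_path)
    manifest = json.loads(p.read_text()) if p.exists() else {}
    digest = hashlib.sha256(Path(doc_path).read_bytes()).hexdigest()
    manifest[Path(doc_path).name] = {
        "received": received,      # ISO date, e.g. "2026-04-12"
        "source": source_code,     # code name, never a real name
        "sha256": digest,
    }
    p.write_text(json.dumps(manifest, indent=2))
    return manifest
```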

Practical query patterns that work:

- "Find every mention of [executive name] across all internal emails. Return date, recipient, and one-line summary."
- "What dates does this contract describe in 2024? Quote the exact text."
- "List every unique organization mentioned in the FOIA response, sorted by frequency."
- "Identify discrepancies between the public statement on [date] and the internal documents from the same week."

The model answers with citations. Verify each one by clicking through to the source. Never quote AI-generated text in published reporting; AI generates leads, you verify them in the source documents.

For a full RAG walkthrough see the RAG local setup guide and the AnythingLLM setup guide.


Step 3 — Named-Entity Extraction From FOIA Dumps {#ner-foia}

When a FOIA request returns 4,000 pages, the first useful pass is to pull out every person, organization, date, and place mentioned. spaCy does this offline in seconds.

# pip install spacy
# python -m spacy download en_core_web_lg

import spacy
from pathlib import Path
from collections import Counter
import json

nlp = spacy.load("en_core_web_lg")

entities = Counter()
docs_dir = Path("./foia-response-2026-04")

for txt in docs_dir.glob("**/*.txt"):
    text = txt.read_text(errors="ignore")
    doc = nlp(text[:1_000_000])  # cap to 1M characters per doc
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "GPE", "DATE", "MONEY"}:
            entities[(ent.text.strip(), ent.label_)] += 1

# Top 200 entities by frequency
top = entities.most_common(200)
Path("entities.json").write_text(json.dumps(top, indent=2))

Now feed that list to Qwen for clustering:

ollama run qwen2.5:14b "I have a list of entities extracted from a FOIA response \
about [topic]. Cluster them into: (1) public officials, (2) private contractors, \
(3) shell companies (anything with LLC/Inc but no clear business purpose), \
(4) places, (5) dollar amounts above \$100K. Output as a Markdown table.

$(cat entities.json)"

This single pass typically surfaces three to five names that warrant follow-up — usually the contractors no one had heard of and the LLCs registered to PO boxes.
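You can also pre-filter the entity list before the model sees it. A crude heuristic sketch over the `entities.json` written above; it only pattern-matches names, so everything it flags still needs verification against business registration records:

```python
import json
import re

# Suffixes that often (not always) indicate a corporate shell.
SHELL_HINTS = re.compile(r"\b(LLC|L\.L\.C\.|Inc\.?|Holdings?|Ventures?)\b", re.I)

def flag_candidates(entities_file: str, min_mentions: int = 2) -> list[str]:
    """Return ORG entities whose names match shell-company patterns
    and that appear at least min_mentions times."""
    top = json.loads(open(entities_file).read())
    flagged = []
    for (name, label), count in top:
        if label == "ORG" and count >= min_mentions and SHELL_HINTS.search(name):
            flagged.append(name)
    return flagged
```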


Step 4 — Cross-Document Timeline Building {#timeline}

The most useful AI task in long investigations is timeline construction. Hand the model a workspace of dated documents and ask:

"Build a chronological timeline of every event mentioned across these documents.
For each entry, include: date, source document filename, one-line description,
and the people involved. Skip entries that are routine business operations."

Qwen 2.5 14B handles this remarkably well across hundreds of documents because AnythingLLM retrieves only the most relevant chunks per question, keeping the context manageable.

Critical post-processing rule: paste the timeline into a real document, then verify every single entry by clicking back to the source. Models will occasionally invent or merge dates. Treat the timeline as a draft outline, not as evidence.
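Part of that verification can be mechanical. A sketch that flags timeline entries citing a source file that does not exist in the workspace folder; it assumes you asked the model for one `date | filename | description` entry per line, which is my convention, not a model default:

```python
from pathlib import Path

def unverifiable_entries(timeline_text: str, docs_dir: str) -> list[str]:
    """Return timeline lines whose cited source file is missing on disk.
    Expects one entry per line: 'YYYY-MM-DD | filename | description'."""
    known = {p.name for p in Path(docs_dir).rglob("*") if p.is_file()}
    bad = []
    for line in timeline_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 3:
            continue  # not a timeline entry
        if parts[1] not in known:
            bad.append(line)
    return bad
```

Anything this returns was likely invented or mis-cited by the model; everything it does not return still needs a human read of the cited page.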


Operational Security Rules {#opsec}

These are the rules I have given to every reporter who has set up this stack. Each one comes from a real near-miss.

1. Two laptops minimum. One is the air-gapped investigation machine. One is the everyday machine you use for email and Slack. Never paste from one into the other without thinking. The investigation laptop never gets your work email account.

2. Encrypt everything at rest. FileVault on Mac, LUKS on Linux. Set the laptop to require password after 1 minute of sleep. Lock it whenever you stand up.

3. Disable cloud sync for the investigation folder. No iCloud Drive, no OneDrive, no Dropbox, no Google Drive. The folder lives on the laptop and gets backed up to an external SSD that lives in a safe.

4. Strip metadata from screenshots and exported documents. exiftool -all= file.jpg removes EXIF. For PDFs, run exiftool -all= input.pdf and then qpdf --linearize input.pdf output.pdf: exiftool's PDF deletions are reversible until the file is rewritten, which the qpdf pass does. Photos taken on your iPhone include GPS coordinates. The DOCX file you exported includes your username.

5. Never type a source's name into the AI. Use a code name in your prompts. The model never needs to know "Maria Lopez" — only "Source A." Maintain a separate, encrypted file mapping code names to real names. That mapping file should not exist on the AI laptop at all.
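A simple guard helps enforce the code-name rule without keeping any real-name list on the AI laptop: flag anything in a prompt that merely looks like a personal name. A sketch; the allowlist is illustrative and will need the place names from your beat:

```python
import re

# Two capitalized words in a row is the crudest possible "name" detector.
NAME_LIKE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
ALLOWLIST = {"United States", "New York"}  # extend with safe non-person phrases

def possible_names(prompt: str) -> list[str]:
    """Return capitalized word pairs that look like personal names,
    so you can swap them for code names before prompting a model."""
    return [m.group(0) for m in NAME_LIKE.finditer(prompt)
            if m.group(0) not in ALLOWLIST]
```

It will over-flag (sentence-initial phrases trip it), but for this rule a false positive costs seconds and a false negative can cost a source.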

6. Air-gap before sensitive operations. When transcribing source audio or running entity extraction on leaked documents, disable the network at the OS level. Verify with ifconfig (Mac) or ip a (Linux) that no interface is up. Do this even if you trust your VPN; the goal is verifiable architecture, not trust.

7. Treat AI output as a tip, not a quote. Models hallucinate. Every fact you publish must be verified in the source documents themselves, not in the model's summary of them. This is the same standard you would apply to a tip from a confidential source.

8. Plan for seizure. If the laptop could be physically taken — by police, by a bad-faith civil suit, by a thief — it must be encrypted with a password not stored anywhere except your head. Practice your wipe procedure. If you travel internationally with sensitive material, leave the laptop home and rebuild on arrival.

The Freedom of the Press Foundation maintains a continuously updated training program covering source protection that pairs well with this technical stack.


Workflow Templates {#workflows}

Interview-day workflow

  1. Record on a dedicated audio recorder, not your phone.
  2. Transfer audio via SD card to the air-gapped laptop.
  3. Wi-Fi off. Verify with ifconfig.
  4. whisper-cli -m models/ggml-large-v3.bin -f interview.wav -otxt.
  5. Skim transcript, note timecodes for important quotes.
  6. Move audio file to encrypted external SSD. Wipe from laptop's working folder.
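Step 6's "wipe from the working folder" deserves more than a drag to the trash. A best-effort sketch; on SSDs and journaling filesystems an overwrite is not guaranteed to reach the physical blocks, which is why the full-disk encryption in opsec rule 2 is the real protection:

```python
import os
import secrets
from pathlib import Path

def overwrite_and_delete(path: str, passes: int = 1) -> None:
    """Overwrite a file in place with random bytes, then unlink it.
    Best-effort only on SSDs; rely on full-disk encryption regardless."""
    p = Path(path)
    size = p.stat().st_size
    with open(p, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))
            f.flush()
            os.fsync(f.fileno())  # force the overwrite to disk
    p.unlink()
```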

FOIA-response-day workflow

  1. Receive the FOIA response on the everyday laptop.
  2. Scan for malware on a separate sandbox.
  3. Transfer to air-gapped laptop via USB drive.
  4. Run OCR if needed: ocrmypdf input.pdf output.pdf (offline).
  5. Ingest into AnythingLLM workspace.
  6. Run entity extraction.
  7. Generate timeline.
  8. Begin manual verification.
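The OCR step can be batched the same way as transcription. A sketch wrapping `ocrmypdf`; the `--skip-text` flag leaves pages that already contain a text layer untouched:

```python
import subprocess
from pathlib import Path

def ocr_cmd(src: Path, out_dir: Path) -> list[str]:
    # Build the ocrmypdf command for one file; --skip-text avoids
    # re-OCRing pages that are already searchable.
    return ["ocrmypdf", "--skip-text", str(src), str(out_dir / src.name)]

def ocr_folder(src_dir: str, out_dir: str) -> None:
    # OCR every PDF in a FOIA dump into a parallel output folder.
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        subprocess.run(ocr_cmd(pdf, out), check=True)
```

Like everything else in this stack, ocrmypdf runs fully offline.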

Pre-publication workflow

  1. Final fact-check pass: query AnythingLLM for every claim in the draft.
  2. If a claim is not directly supported by source documents, flag it for additional reporting or removal.
  3. Strip metadata from any screenshots used in the published story.
  4. Coordinate with editor and counsel on what redactions are needed.

Limitations and Hallucination Risk {#limitations}

Local AI is a powerful research aid. It is not infallible.

Models invent quotes. Especially on long documents with many speakers, Qwen and Llama will occasionally attribute a real quote to the wrong person, or fabricate a quote that fits the context but does not appear anywhere. Verify every quote against the original audio or document.

Models miss negation. "X did not approve the contract" can be summarized as "X approved the contract" if the negation is structurally awkward. Re-read every model summary that hinges on a yes/no.

Models default to plausibility. When asked to identify a "shell company," the model will sometimes label any unfamiliar LLC as such. Verify business registration records before publishing.

Whisper mishears proper nouns. Names of small towns, foreign words, and people with unusual names are routinely transcribed phonetically. Always check the audio when a name appears in the transcript.

Air-gap is only as strong as your discipline. A single moment of "let me just paste this into ChatGPT to double-check" defeats the entire stack. Either your workflow is air-gapped or it is not. There is no halfway.





There is a longer argument I want to make in closing: protecting sources is not an act of paranoia. It is the central professional obligation of investigative journalism. The tools you use to handle source material need to honor that obligation in their architecture, not just in their marketing copy. Local AI does. Cloud AI cannot, and dressing it up in privacy promises does not change the technical facts.

The story that runs because the source could trust you is worth more than the hour you saved by uploading the audio.

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
