Local AI for Photographers: Auto-Tag 100K Photos in 36 Hours, Privately
Published on April 23, 2026 — 24 min read
A wedding photographer friend asked me last fall why his Lightroom catalog felt like a black hole. He had 187,000 images going back to 2016, only the last 4,000 had decent keywords, and he'd been quoted $0.04/image by an online tagging service — $7,480 to handle the backlog and an unknown ongoing tab. The catch was worse than the cost: every RAW would have been uploaded to a third party. Wedding albums. Boudoir shoots. Confidential corporate portraits.
I built him a local pipeline using CLIP, BLIP-2, and LLaVA. Total cost: the electricity. Total throughput: the full 187,000 images in about 60 wall-clock hours on his existing RTX 4070 (roughly 100K images per 36 hours, in line with the benchmarks below). Final keyword recall on a 500-image audit set: 92% — better than the cloud service's 84% on the same images.
This guide is the complete recipe. It covers automated keywording, descriptive captions, face-grouping, semantic search ("show me golden-hour beach shots with two people"), and the integrations into Lightroom Classic, Capture One, and digiKam. Real models, real benchmarks, real commands you can run tonight.
Quick Start: Tag a Folder of Photos in 10 Minutes {#quick-start}
# 1. Install Ollama and a vision model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llava:13b
# 2. Set up a Python venv with the basics
python3 -m venv ~/photo-ai && source ~/photo-ai/bin/activate
pip install pillow piexif requests tqdm pyexiftool open-clip-torch torch ollama
# pyexiftool drives the exiftool binary; install it too (apt/brew install exiftool)
# 3. Pull the keyworder script (copy from Step 1 of this guide) and point it at a folder
python keyword.py /path/to/photos --model llava:13b --write-iptc
That writes IPTC keywords directly into the JPEGs/sidecars. Lightroom and Capture One pick them up on next library refresh. Done.
The rest of this guide builds the heavy machinery: bulk processing, semantic search across the whole catalog, culling assistants, and face detection.
Table of Contents
- Why Local Matters for Photographers
- Hardware Reality Check
- The Three Vision Models You Actually Need
- Step 1: Auto-Keywording with LLaVA
- Step 2: Semantic Search with CLIP Embeddings
- Step 3: Culling Assistant (Sharp vs Soft, Eyes Open)
- Step 4: Lightroom Classic Integration
- Step 5: Capture One Integration
- Step 6: digiKam Integration (Free)
- Comparison: Local vs Excire vs Adobe Sensei vs Aftershoot
- Pitfalls and How to Sidestep Them
- Benchmark Tables
- FAQs
Why Local Matters for Photographers {#why-local}
Three problems no cloud service solves cleanly:
Confidentiality. Wedding clients, corporate headshots, journalism — large chunks of professional photography come with explicit confidentiality expectations. Sending RAWs to a third party for "AI tagging" is a liability you don't need. The 2024 Adobe Sensei TOS controversy made that real for a lot of working pros.
Cost at archive scale. Excire Foto handles local tagging well but charges €99 per major version with limited model updates. Cloud services charge $0.01-$0.05 per image. At 100K images that's $1,000-$5,000 just to tag what you already shot. A $400 used RTX 3090 pays for itself on the second catalog.
Domain vocabulary. Generic vision models tag a wedding ring as "jewelry," not "wedding band." A senior portrait at golden hour gets "person outdoors" instead of "high school senior, golden hour, shallow DOF." You can't customize the cloud models. You can absolutely customize a local pipeline by injecting style guides into the prompt.
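To make that concrete, style-guide injection can be as simple as prepending house rules to the cataloging prompt from Step 1. A minimal sketch; the rules below are illustrative placeholders, not a tested vocabulary:
# Sketch: steer a generic vision model toward your genre's vocabulary via the prompt.
STYLE_GUIDE = """House vocabulary rules:
- say "wedding band", never "jewelry"
- say "high school senior" for senior portraits
- always name the light (golden hour, overcast, strobe) and the DOF (shallow, deep)"""

BASE_PROMPT = "You are a professional photo cataloger..."  # the full prompt from Step 1

# The tagging call stays identical; only the prompt changes.
PROMPT = STYLE_GUIDE + "\n\n" + BASE_PROMPT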
For more on the privacy posture, our local AI privacy guide covers the broader threat model that drives professional adoption.
Hardware Reality Check {#hardware}
Vision models are heavier than text LLMs. CLIP and BLIP-2 are reasonable on CPU; LLaVA and Llama 3.2 Vision really want a GPU. Real numbers from my testing rig and three friends' working setups:
| Setup | LLaVA 13B time/img | CLIP embed time/img | 100K-image full pass |
|---|---|---|---|
| MacBook Pro M2 16 GB | 4.1s | 0.18s | 5.5 days |
| MacBook Pro M3 Max 64 GB | 1.6s | 0.11s | 2.0 days |
| RTX 3060 12 GB + i5 | 1.9s | 0.09s | 2.4 days |
| RTX 4070 12 GB + Ryzen 7 | 1.1s | 0.06s | 1.5 days |
| RTX 4090 24 GB + i9 | 0.45s | 0.03s | 14 hours |
Three rules of thumb:
- 12 GB VRAM is the sweet spot for LLaVA 13B with 4-bit quantization
- For Apple Silicon, 32 GB+ unified memory makes things noticeably better
- CLIP-only workflows (semantic search, no captions) run comfortably on a 4 GB GPU
If you don't have hardware yet, the budget local AI machine guide walks through builds at $400, $800, and $1,500 price points.
The Three Vision Models You Actually Need {#models}
After testing 14 different vision models on a 1,000-image benchmark set spanning weddings, landscapes, sports, and editorial portraits, three earn their disk space:
LLaVA 1.6 13B — Best general-purpose captioner. Produces sentence-level descriptions you can dump straight into IPTC Description. Ollama: ollama pull llava:13b.
CLIP ViT-L/14 (open_clip) — The semantic search workhorse. Generates 768-dim embeddings per image. You feed the same model a text query at search time. No fine-tuning needed for English-language searches.
BLIP-2 OPT-2.7B — Faster than LLaVA, weaker on detail but excellent at single-keyword extraction. Useful for high-volume keyword passes where you don't need full descriptions.
Honorable mentions: Llama 3.2 Vision 11B (slightly better than LLaVA 13B at compositional reasoning, slightly worse at pure tagging recall), Qwen2-VL-7B (strong on multilingual captions if you shoot for international markets).
Step 1: Auto-Keywording with LLaVA {#keywording}
The keyword pipeline runs two passes per image: a structured LLaVA pass that returns keywords, a one-sentence description, and a category in a single JSON response, plus a CLIP embedding pass (Step 2). The script writes everything into the image's IPTC fields and XMP sidecar so it appears natively in Lightroom and Capture One.
# keyword.py
import json, base64, sys, argparse, ollama
import io
from pathlib import Path
from PIL import Image
import exiftool
from tqdm import tqdm

PROMPT = """You are a professional photo cataloger. Look at the image and return JSON with:
{
  "keywords": [list of 8-15 specific tags — subjects, settings, lighting, mood, equipment cues],
  "description": "one factual sentence describing the photo",
  "category": "one of: wedding, portrait, landscape, sports, editorial, street, product, macro, wildlife, event"
}
Avoid generic words like 'photo' or 'picture'. Prefer specific terms (groomsmen, golden-hour, bokeh, shallow-DOF)."""

def thumb_b64(path, max_side=896):
    # Downsample before sending to the model: tagging quality is unchanged,
    # transfer and inference get dramatically faster.
    with Image.open(path) as im:
        im.thumbnail((max_side, max_side))
        buf = io.BytesIO()
        im.convert("RGB").save(buf, format="JPEG", quality=85)
    return base64.b64encode(buf.getvalue()).decode()

def tag_one(path, model="llava:13b"):
    b64 = thumb_b64(path)
    r = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [b64]}],
        format="json",
        options={"temperature": 0.2},
    )
    return json.loads(r["message"]["content"])

def write_iptc(et, path, data):
    # List-type tags (IPTC:Keywords, XMP-dc:Subject) take one assignment per
    # item; a single comma-joined value would land as one giant keyword.
    params = ["-overwrite_original"]
    for kw in data["keywords"]:
        params.append(f"-IPTC:Keywords={kw}")
        params.append(f"-XMP-dc:Subject={kw}")
    params.append(f"-IPTC:Caption-Abstract={data['description']}")
    params.append(f"-XMP-dc:Description={data['description']}")
    params.append(str(path))
    et.execute(*params)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("folder")
    ap.add_argument("--model", default="llava:13b")
    ap.add_argument("--write-iptc", action="store_true")
    args = ap.parse_args()
    exts = {".jpg", ".jpeg"}  # case-insensitive: catches .JPG from most cameras
    images = [p for p in Path(args.folder).rglob("*") if p.suffix.lower() in exts]
    print(f"Tagging {len(images)} images with {args.model}")
    with exiftool.ExifTool() as et:
        for path in tqdm(images):
            try:
                data = tag_one(path, args.model)
                if args.write_iptc:
                    write_iptc(et, path, data)
                else:
                    print(json.dumps({"file": str(path), **data}, ensure_ascii=False))
            except Exception as e:
                print(f"FAIL {path}: {e}", file=sys.stderr)

if __name__ == "__main__":
    main()
Three details that took me real hours to learn:
- Always thumbnail before sending to the model. A 60 MB raw downsampled to 896px on the long edge is functionally identical for tagging and roughly 200x faster to process.
- JSON mode (format="json") is non-negotiable. Without it, LLaVA occasionally returns prose and breaks the script.
- ExifTool is the only tool that handles every camera vendor's metadata correctly. Don't try to write IPTC with Pillow.
Run it like this for a real catalog:
python keyword.py /Volumes/Photos/2024 --model llava:13b --write-iptc
On an RTX 3060, expect 1,800-2,000 images per hour. Leave it running overnight.
Step 2: Semantic Search with CLIP Embeddings {#semantic-search}
Keywords are great for known queries. Semantic search handles "show me sunset shots with red dresses" — phrasing the keyworder never produced.
# embed.py
import sqlite3
import numpy as np
import torch, open_clip
from pathlib import Path
from PIL import Image
from tqdm import tqdm

MODEL_NAME = "ViT-L-14"
PRETRAINED = "openai"
DEVICE = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

model, _, preprocess = open_clip.create_model_and_transforms(MODEL_NAME, pretrained=PRETRAINED, device=DEVICE)
tokenizer = open_clip.get_tokenizer(MODEL_NAME)

def init_db(path="embeddings.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS images (
        path TEXT PRIMARY KEY,
        embedding BLOB
    )""")
    return db

def embed_image(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(DEVICE)
    with torch.no_grad():
        v = model.encode_image(img)
        v /= v.norm(dim=-1, keepdim=True)  # unit-normalize so dot product == cosine
    return v.cpu().numpy().tobytes()

def index(folder, db):
    images = list(Path(folder).rglob("*.jpg"))
    for p in tqdm(images):
        try:
            db.execute("INSERT OR REPLACE INTO images VALUES (?, ?)", (str(p), embed_image(p)))
        except Exception as e:
            print(f"skip {p}: {e}")
    db.commit()

def search(query, db, top_k=20):
    tokens = tokenizer([query]).to(DEVICE)
    with torch.no_grad():
        qv = model.encode_text(tokens)
        qv /= qv.norm(dim=-1, keepdim=True)
    qv = qv.cpu().numpy().flatten()
    scores = []
    for path, blob in db.execute("SELECT path, embedding FROM images"):
        v = np.frombuffer(blob, dtype=np.float32)
        scores.append((float(np.dot(qv, v)), path))
    scores.sort(reverse=True)
    return scores[:top_k]

if __name__ == "__main__":
    # python embed.py index /path/to/photos
    # python embed.py search "bride laughing during ceremony"
    import sys
    db = init_db()
    if sys.argv[1] == "index":
        index(sys.argv[2], db)
    else:
        for score, path in search(" ".join(sys.argv[2:]), db):
            print(f"{score:.3f}\t{path}")
For 100k images you'll outgrow naive numpy dot-products. Drop in Faiss or LanceDB for sub-second search at any scale. A Faiss flat index holds 1 million 768-dim float32 vectors in about 3 GB of RAM (768 floats × 4 bytes per vector, plus overhead) and answers a query in tens of milliseconds on CPU; IVF and PQ variants trade a little recall for a much smaller footprint.
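Here is what that drop-in might look like (pip install faiss-cpu). A minimal sketch, assuming the embeddings.db and ViT-L-14 setup from embed.py above:
# faiss_search.py — sketch: exact (flat) inner-product search over embeddings.db
import sqlite3
import faiss, numpy as np
import torch, open_clip

DEVICE = "cpu"  # encoding one text query is cheap; CPU is fine here
model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai", device=DEVICE)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Load every stored embedding into one (N, 768) float32 matrix.
rows = sqlite3.connect("embeddings.db").execute("SELECT path, embedding FROM images").fetchall()
paths = [r[0] for r in rows]
mat = np.vstack([np.frombuffer(r[1], dtype=np.float32) for r in rows])

index = faiss.IndexFlatIP(mat.shape[1])  # inner product == cosine on unit-norm vectors
index.add(mat)

def faiss_search(query, top_k=20):
    with torch.no_grad():
        qv = model.encode_text(tokenizer([query]).to(DEVICE))
        qv /= qv.norm(dim=-1, keepdim=True)
    scores, ids = index.search(qv.cpu().numpy().astype(np.float32), top_k)
    return [(float(s), paths[i]) for s, i in zip(scores[0], ids[0])]

print(faiss_search("golden-hour beach, two people")[:5])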
Sample queries that actually work:
- "bride laughing during ceremony"
- "wide-angle landscape with snow and trees"
- "close-up of hands holding rings"
- "documentary moment, candid, unposed"
CLIP doesn't need the keywords from Step 1 to work — it sees the image directly. The two systems are complementary: keywords give you exact-match queries, embeddings give you fuzzy semantic queries.
Step 3: Culling Assistant (Sharp vs Soft, Eyes Open) {#culling}
Culling is where photographers lose the most time. A 1,200-image wedding shoot needs to be culled to maybe 600 picks before retouching. Aftershoot does this in the cloud for $20+/month. Local equivalents:
- Sharpness: Laplacian variance on a 256x256 luma crop. Threshold around 100 separates sharp from soft on most APS-C/full-frame output. No ML needed.
- Eyes open: Use mediapipe for face detection plus a small classifier. Or just run LLaVA with a yes/no prompt — slower but more accurate.
- Duplicates / near-duplicates: Perceptual hash (imagehash library) with a Hamming distance threshold of 6 (a minimal sketch follows at the end of this step).
# cull.py — sharpness check (the easy 90%)
# Requires: pip install opencv-python
import cv2
from pathlib import Path

def sharpness(path):
    img = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    if img is None:  # unreadable or non-image file
        return 0.0
    img = cv2.resize(img, (256, 256))
    return cv2.Laplacian(img, cv2.CV_64F).var()

for p in Path("shoot/").glob("*.jpg"):
    s = sharpness(p)
    flag = "SOFT" if s < 100 else "OK"
    print(f"{flag}\t{s:6.0f}\t{p}")
Pair sharpness with a LLaVA pass that returns JSON {"sharp": true, "eyes_open": true, "subject_in_frame": true}. On 1,200 wedding images, the combined cull takes 22 minutes on a 4070 and matches my friend's manual cull at 91% precision (it doesn't catch the artistic out-of-focus picks he wanted to keep — those need human review).
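And the near-duplicate pass promised in the list above. A minimal sketch with imagehash (pip install imagehash); the shoot/ folder and threshold of 6 mirror the examples above, and the quadratic comparison is fine at single-shoot scale:
# dedupe.py — sketch: near-duplicate grouping with perceptual hashes
import imagehash
from PIL import Image
from pathlib import Path

HAMMING_MAX = 6  # threshold from the list above

seen = []  # (path, hash) of the keepers so far
for p in sorted(Path("shoot/").glob("*.jpg")):
    h = imagehash.phash(Image.open(p))
    # imagehash overloads '-' as Hamming distance between hashes.
    dup_of = next((q for q, hq in seen if h - hq <= HAMMING_MAX), None)
    if dup_of:
        print(f"DUP\t{p}\t(near {dup_of})")
    else:
        seen.append((p, h))
        print(f"KEEP\t{p}")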
Step 4: Lightroom Classic Integration {#lightroom}
Lightroom Classic reads IPTC keywords from JPEGs natively and XMP sidecars from RAWs. Two workflow patterns work:
Pattern A: Tag JPEGs, sync to RAWs. Export proxy JPEGs from Lightroom (3000px, low quality), tag them with the script, then in Lightroom select the JPEGs and use Metadata → Sync Metadata to push keywords to RAWs.
Pattern B: Tag XMP sidecars directly. Run the script against the folder containing XMP sidecars and have ExifTool write to them. Lightroom picks up changes when you Read Metadata from File.
Pattern B is faster (one pass), Pattern A is safer (the script never touches RAWs or XMP sidecars you might have manually edited).
# Pattern B example — direct XMP sidecar writing
python keyword.py /Photos/Wedding-Smith --write-iptc \
    --proxy-mode # generates JPEG proxies in /tmp, reads RAWs by reference
Note that --proxy-mode is not implemented in the minimal keyword.py printed in Step 1; a sidecar-writing variant is sketched below.
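What the sidecar write itself could look like: a hypothetical variant of write_iptc (write_sidecar is not in the script above), assuming Lightroom-style basename.xmp sidecars that already exist (Metadata → Save Metadata to File creates them):
# Hypothetical: write keywords to a Lightroom-style XMP sidecar instead of the image.
def write_sidecar(et, raw_path, data):
    sidecar = raw_path.with_suffix(".xmp")  # IMG_1234.CR2 -> IMG_1234.xmp
    params = ["-overwrite_original"]
    for kw in data["keywords"]:
        params.append(f"-XMP-dc:Subject+={kw}")  # += appends; plain = replaces
    params.append(f"-XMP-dc:Description={data['description']}")
    params.append(str(sidecar))
    et.execute(*params)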
The Adobe team published official IPTC and XMP guidelines that explain Lightroom's exact metadata behavior — worth a read before unleashing a 100k-image script on your master catalog.
Step 5: Capture One Integration {#capture-one}
Capture One reads IPTC similarly. The big difference: Capture One stores its own keyword library separately, so you'll want to import the AI keywords once they're in the IPTC fields. Sessions handle this differently from Catalogs — for Catalogs, use Image → Read Metadata after running the script.
For studio photographers using tethering, the AI tagging step usually runs after the shoot, on the imported library, not during capture.
Step 6: digiKam Integration (Free) {#digikam}
digiKam is the open-source alternative and arguably has the best support for IPTC editing of any photo manager. Run the script against your image folder, then in digiKam: Tools → Maintenance → Read metadata from images. Tags appear in the keyword tree automatically.
digiKam also has its own face recognition that pairs nicely with our keywording — let digiKam handle "Mom" and "Dad," let our pipeline handle "candid moment, evening light, depth of field."
Comparison: Local vs Excire vs Adobe Sensei vs Aftershoot {#comparison}
Real comparisons on the same 500-image audit set, March 2026:
| Tool | Cost | Keyword recall | Caption quality | Privacy | Custom prompts |
|---|---|---|---|---|---|
| This pipeline (LLaVA 13B + CLIP) | $0 + electricity | 92% | Excellent | Local | Full |
| Excire Foto 2024 | €99 one-time | 88% | None | Local | Limited |
| Adobe Sensei (Lightroom AI) | Bundled w/ CC | 79% | Basic | Cloud (US) | None |
| Aftershoot | $20-40/mo | 86% (cull-focused) | Limited | Cloud | None |
| Imagga API | $0.01-0.04/img | 81% | Limited | Cloud | None |
Excire is the closest to this build and a great no-code option if you don't want to touch Python. Aftershoot is best-in-class for culling and worth keeping even if you build the rest of this pipeline locally.
Pitfalls and How to Sidestep Them {#pitfalls}
Pitfall 1: Tagging RAWs instead of JPEGs. RAW files are huge and most vision models don't read them natively. Always thumbnail first. Use rawtherapee-cli or darktable-cli to render proxies if you need to start from RAW.
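A sketch of the proxy render with darktable-cli, assuming darktable is installed and on PATH (--width/--height are its maximum-size options; render_proxy is a hypothetical helper):
# render_proxy.py — sketch: render a JPEG proxy from a RAW via darktable-cli
import subprocess
from pathlib import Path

def render_proxy(raw_path, out_dir="/tmp/proxies", max_side=1600):
    out = Path(out_dir) / (Path(raw_path).stem + ".jpg")
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["darktable-cli", str(raw_path), str(out),
         "--width", str(max_side), "--height", str(max_side)],
        check=True,  # raise if the render fails
    )
    return out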
Pitfall 2: Letting the model invent keywords. Constrain with a controlled vocabulary if you have one. The script accepts a --vocab vocab.txt flag (see companion repo) that limits keyword output to a curated list. This is essential for stock-photo workflows where Getty or Adobe Stock require their own taxonomy.
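If you'd rather not depend on the companion repo, a controlled-vocabulary post-filter is only a few lines. A sketch (load_vocab and filter_keywords are hypothetical names) assuming vocab.txt holds one allowed keyword per line; it would slot in between tag_one and write_iptc in keyword.py:
# Sketch: drop any model-invented keyword that isn't in the controlled vocabulary.
def load_vocab(path="vocab.txt"):
    return {line.strip().lower() for line in open(path, encoding="utf-8") if line.strip()}

def filter_keywords(keywords, vocab):
    return [kw for kw in keywords if kw.lower() in vocab]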
Pitfall 3: Overwriting existing IPTC keywords. A plain assignment (-IPTC:Keywords=...) replaces whatever was there, and -overwrite_original discards the backup copy on top of that. When retagging a library that already has hand-entered keywords, use ExifTool's += syntax to append list items instead of replacing them, and consider leaving the _original backups in place for the first run.
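A sketch of the merge-style write, reusing the pyexiftool session from keyword.py (append_keywords is a hypothetical helper; the += forms are standard ExifTool list-append syntax):
# Sketch: append to existing keyword lists instead of replacing them.
def append_keywords(et, path, keywords):
    params = ["-overwrite_original"]
    for kw in keywords:
        params.append(f"-IPTC:Keywords+={kw}")   # += adds an item to the list tag
        params.append(f"-XMP-dc:Subject+={kw}")
    params.append(str(path))
    et.execute(*params)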
Pitfall 4: GPU thermal throttling on long runs. A 36-hour 100k-image pass will cook a poorly-cooled GPU. Monitor with nvidia-smi and undervolt if your card runs hot. The 4070 is happy at -100mV core; the 4090 wants -75mV.
Pitfall 5: Not validating on a sample first. Run the keyword script on 200 images, audit the output by hand, tweak the prompt. Then run on the full library. The four hours of prompt tuning saves you from re-running 36 hours of compute.
Pitfall 6: Mixing model output styles between batches. If you process 50,000 images with LLaVA 13B and another 50,000 with LLaVA 7B, your library will have inconsistent keyword density. Pick one model for the catalog-wide pass and stick with it.
Benchmark Tables {#benchmarks}
Throughput on the same 1,000-image benchmark set (mixed weddings, landscapes, portraits, JPEG sources at 6000px):
| Model | Hardware | Time/image | Keywords/image | Caption quality |
|---|---|---|---|---|
| LLaVA 1.6 13B | RTX 4070 | 1.1s | 11 avg | 4.6/5 |
| LLaVA 1.6 7B | RTX 4070 | 0.6s | 9 avg | 4.1/5 |
| Llama 3.2 Vision 11B | RTX 4070 | 1.4s | 12 avg | 4.5/5 |
| BLIP-2 OPT-2.7B | RTX 4070 | 0.4s | 6 avg | 3.4/5 |
| Qwen2-VL 7B | RTX 4070 | 0.9s | 10 avg | 4.3/5 |
| LLaVA 1.6 13B | M3 Max 64 GB | 1.6s | 11 avg | 4.6/5 |
| CLIP ViT-L/14 (search only) | RTX 4070 | 0.06s | n/a | n/a |
Caption quality scored 1-5 by the wedding photographer above against his own gold-standard keyworded set. The audit was blind — he didn't know which model produced which output.
Frequently Asked Questions {#faqs}
Common practical answers:
- Yes, this works on Lightroom Catalogs, Capture One Catalogs and Sessions, and digiKam libraries.
- For RAWs, render JPEG proxies first; the pipeline does not read raw sensor data.
- Throughput scales with GPU compute, not VRAM alone — moving from a 3060 to a 4090 cuts per-image time by roughly 75%.
- The keyword vocabulary is fully customizable via prompt or controlled-vocab file.
Closing Thoughts
If you shoot for a living, your library is your business. Letting it leave your machine for "AI tagging" is a tradeoff that doesn't need to exist anymore. Open-source vision models in 2026 are good enough — and on consumer hardware, fast enough — to handle libraries that would have been a budget item three years ago.
Start with Quick Start. Tag a single shoot, audit the results, tweak the prompt. Once the output reads like notes you'd actually write, point the script at your archive and let it run. You'll spend a weekend regaining a decade of metadata, and every byte of it stays on your hardware.