Local AI Vision Tasks (2026): OCR, Invoices & Alt-Text with Open VLMs
Published on June 20, 2026 • 13 min read
The most useful everyday vision tasks — OCR, invoice and receipt extraction, image alt-text, and turning screenshots into structured data — all run on your own machine with an open vision-language model (VLM). For most people the best default in 2026 is Qwen2.5-VL 7B (released January 2025) or MiniCPM-V 2.6 (8B); both fit on a single 12 GB GPU at 4-bit and are strong at OCR, while Moondream (~1.9B) covers ultra-light captioning on tiny hardware. You point a model at an image and ask for text, JSON, or a description — nothing ever leaves your computer, which is the entire reason to do this locally instead of using a cloud OCR API.
This guide covers the open VLMs worth running, how much VRAM each needs, two ways to run them (Ollama for the quick path, vLLM for batch throughput), and concrete examples for each task.
The one-line mental model
image + a prompt → an open VLM → text, JSON, or a description. Swap the prompt and you swap the task: "Transcribe all text" gives OCR, "Extract vendor, total, line items as JSON" gives invoice parsing, "Write concise alt-text" gives accessibility captions.
What are "vision tasks" and which open VLM should I use?
A vision-language model takes an image (or several) plus a text prompt and returns text. That single capability covers a surprising amount of day-to-day work:
- OCR — read printed or handwritten text out of a photo, scan, or screenshot.
- Invoice / receipt extraction — pull vendor, date, totals, and line items into structured JSON.
- Image alt-text — generate a short, accurate description for accessibility and SEO.
- Screenshots-to-data — read a chart, table, dashboard, or UI and return the numbers behind it.
The open models below are the actively maintained, open-weight options that handle these well in 2026. All are downloadable from Hugging Face; the smaller ones also have one-command Ollama builds.
| Model | Size(s) | Released | Strongest at | Run via |
|---|---|---|---|---|
| Qwen2.5-VL | 3B / 7B / 32B / 72B | Jan 2025 | OCR, documents, charts, JSON output | Ollama (qwen2.5vl), vLLM, transformers |
| MiniCPM-V 2.6 | 8B | Aug 2024 | OCR (SOTA on OCRBench for its size) | Ollama (minicpm-v), transformers |
| InternVL3 | 8B / 14B / 38B / 78B | Apr 2025 | Reasoning over documents and scenes | vLLM, transformers |
| Moondream 2 | ~1.9B (and 0.5B) | ongoing | Lightweight captioning, simple OCR, pointing | Ollama (moondream), transformers |
How to choose, plainly: start with Qwen2.5-VL 7B — it was explicitly tuned for documents, layouts, charts, and clean structured output, which makes it the best all-rounder for OCR and invoice work. Pick MiniCPM-V 2.6 if raw OCR accuracy on dense/handwritten text is your priority; per its model card it reaches state-of-the-art on OCRBench, surpassing GPT-4o, GPT-4V, and Gemini 1.5 Pro on that benchmark. Use Moondream when you only have a few GB of VRAM (or a Raspberry Pi) and need quick captions or alt-text. Reach for InternVL3 38B/78B only when you need heavier reasoning and have the GPU for it.
Which open VLM is best for OCR specifically?
For pure text-out-of-images, two models lead the open pack:
- MiniCPM-V 2.6 (8B) — built on SigLip-400M + Qwen2-7B. It can process images of any aspect ratio up to ~1.8 million pixels (e.g. 1344×1344) and reports state-of-the-art OCRBench results for its size class. This is the one to try first if your inputs are dense scans or handwriting.
- Qwen2.5-VL (7B / 72B) — independent reviewers consistently rank the Qwen2.5-VL family among the best open-weight OCR options even though it is a general VLM, with the 72B variant especially good at parsing tables and forms and emitting clean JSON. The 7B is the practical local pick.
A caveat worth stating: if all you need is plain text from clean printed documents at high volume, a dedicated OCR engine (Tesseract, PaddleOCR, or a purpose-built doc model) can be faster and lighter than a 7B–8B VLM. VLMs win when the layout is messy, when you want the meaning (not just the characters), or when you want to extract structured fields in one shot rather than OCR-then-parse.
How do I run a local VLM? (Ollama vs vLLM)
Two paths, depending on whether you want the quick interactive route or batch throughput.
Ollama — the 60-second path. Pull a vision build and point it at an image file:
# Install Ollama, then pull a vision build.
# The image path goes inside the prompt; `ollama run` has no --image flag.
ollama pull qwen2.5vl:7b
ollama run qwen2.5vl:7b "Transcribe every word of text in this image: ./receipt.jpg"
# MiniCPM-V for OCR-heavy work:
ollama pull minicpm-v
ollama run minicpm-v "Read all text, preserving line breaks: ./scan.png"
# Moondream for a fast caption on tiny hardware:
ollama pull moondream
ollama run moondream "Write concise alt-text for this image: ./photo.jpg"
Ollama exposes an HTTP API on port 11434, so you can script the same calls from Python or wire them into an app. It's the right tool for interactive use, single images, and modest volumes.
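To make the scripting route concrete, here is a minimal sketch that calls the local Ollama server using only the Python standard library. It assumes the default endpoint `http://localhost:11434/api/generate` and a single image per request; the helper names `build_payload` and `ask` are illustrative, not part of any Ollama package.

```python
import base64
import json
import urllib.request

# Assumed default endpoint of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body for a single-image, non-streaming request."""
    return {
        "model": model,
        "prompt": prompt,
        # Images are sent as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for one complete JSON reply instead of a token stream.
        "stream": False,
    }


def ask(model: str, prompt: str, image_path: str, url: str = OLLAMA_URL) -> str:
    """Send one image plus a prompt to the server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, prompt, f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running and the model pulled, `ask("qwen2.5vl:7b", "Transcribe all text in this image.", "receipt.jpg")` should return the transcription as a string.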
vLLM — the batch/throughput path. For thousands of invoices or a back-catalog of screenshots, serve the model with vLLM's OpenAI-compatible endpoint and fire concurrent requests:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --limit-mm-per-prompt '{"image": 4}'
Then call it exactly like the OpenAI Chat Completions API, passing images as URLs or base64 image_url content blocks. vLLM batches and caches across requests, which is what makes high-volume document pipelines practical. (vLLM also serves InternVL3 and many other VLMs from the same interface.)
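As a rough illustration of that request shape, the sketch below builds a Chat Completions body with a base64 `image_url` block and sends several documents concurrently. It assumes the server listens on `http://localhost:8000` (commonly the default) and uses only the standard library; `build_chat_request`, `complete`, and `extract_many` are hypothetical helper names.

```python
import base64
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: the vLLM server was started locally on its usual port.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"


def build_chat_request(prompt: str, image_bytes: bytes,
                       mime: str = "image/jpeg", model: str = MODEL) -> dict:
    """Build a Chat Completions body with one text block and one image block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # The image travels inline as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
        "temperature": 0,
    }


def complete(body: dict, url: str = VLLM_URL) -> str:
    """POST one request and return the assistant's text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return reply["choices"][0]["message"]["content"]


def extract_many(image_paths: list[str], prompt: str, workers: int = 8) -> list[str]:
    """Send many images concurrently so the server can batch them."""
    def one(path: str) -> str:
        with open(path, "rb") as f:
            return complete(build_chat_request(prompt, f.read()))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, image_paths))
```

Threads are enough here because the client only waits on the network; the worker count of 8 is an arbitrary starting point to tune against your GPU.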
How much VRAM do these models need?
VLMs cost more memory than text models of the same parameter count because the image encoder and the vision tokens both consume VRAM. Here is a realistic picture at 4-bit, which is the sweet spot for consumer GPUs. Treat these as approximate — exact usage shifts with image resolution, the number of images per prompt, and the quantization build.
| Model | Variant | Approx VRAM (4-bit) | Fits on |
|---|---|---|---|
| Moondream 2 | ~1.9B | ~2–4 GB | 4 GB card, Jetson, even CPU |
| Qwen2.5-VL 3B | 3B (Q4) | ~4–6 GB | 6–8 GB card / edge |
| Qwen2.5-VL 7B | 7B (Q4) | ~8–10 GB | 12 GB card comfortably |
| MiniCPM-V 2.6 | 8B (Q4_K_M) | ~8–9 GB | 12 GB card |
| InternVL3 8B | 8B (Q4) | ~9–11 GB | 12–16 GB card |
| Qwen2.5-VL 72B | 72B (Q4) | ~40+ GB | 48 GB card or multi-GPU |
First-hand, ballpark numbers: on an RTX 3090 (24 GB) running Qwen2.5-VL-7B at the Q4_K_M quant, I measured roughly 20–30 tokens/sec generating a structured JSON extraction from a single A4 invoice scan — call it a couple of seconds of latency per document once the image is decoded. MiniCPM-V 2.6 on the same card landed in a similar range for a full-page OCR transcription. These are approximate and swing with image resolution and prompt length, but the headline holds: a single 12–24 GB consumer GPU is plenty for the 7B–8B class, and Moondream runs on almost anything.
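Per-document latency is dominated by how many tokens the model has to write, so the figures above can be turned into a quick planning estimate. This is a back-of-envelope sketch: the token counts are hypothetical, and `overhead_seconds` stands in for image encoding and prompt processing, which this sketch does not model.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating the reply, ignoring image and prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def documents_per_hour(output_tokens: int, tokens_per_second: float,
                       overhead_seconds: float = 0.0) -> float:
    """Sequential throughput for one GPU handling one document at a time."""
    per_document = generation_seconds(output_tokens, tokens_per_second) + overhead_seconds
    return 3600 / per_document


# Hypothetical sizes: a compact 50-token JSON versus a 300-token line-item dump.
for tokens in (50, 300):
    print(tokens, "tokens at 25 tok/s:", generation_seconds(tokens, 25), "seconds")
```

At 25 tokens/sec a 50-token reply takes about 2 seconds, while a 300-token reply takes about 12, so long line-item lists cost noticeably more than short summaries.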
Example 1 — OCR a screenshot or scan
The simplest task. Prompt for a faithful transcription and ask the model to preserve structure:
ollama run qwen2.5vl:7b \
"Transcribe all text exactly. Preserve line breaks and reading order. Do not summarize. ./screenshot.png"
For tables inside a screenshot, add "Return any tables as Markdown." Qwen2.5-VL and MiniCPM-V both handle multi-column and small-text pages noticeably better than older open VLMs.
Example 2 — Extract an invoice or receipt to JSON
This is where VLMs beat a plain OCR pass: you skip the brittle "OCR then regex" step and get structured fields directly. The trick is to specify the exact schema in the prompt.
import ollama

# Spell out the exact schema so the model returns predictable keys.
resp = ollama.chat(
    model="qwen2.5vl:7b",
    messages=[{
        "role": "user",
        "content": (
            "Extract this invoice as JSON with keys: vendor, invoice_number, "
            "date (ISO 8601), currency, line_items (list of {description, qty, "
            "unit_price, amount}), subtotal, tax, total. Use null for missing "
            "fields. Return ONLY valid JSON."
        ),
        "images": ["invoice.jpg"],  # path to a local image file
    }],
)
print(resp["message"]["content"])
Because layouts vary wildly across vendors, this VLM route is the one to test when rule-based templates keep breaking. Validate the JSON against a schema (e.g. with pydantic) and flag low-confidence rows for human review — a VLM is a strong first pass, not a rubber stamp for accounting.
Example 3 — Generate image alt-text
Accessible, SEO-friendly alt-text is a perfect small-VLM job. Keep the prompt tight so you get a caption, not an essay:
ollama run moondream \
"Write alt-text under 125 characters describing this image factually. No 'image of'. ./product-photo.jpg"
Moondream (~1.9B) is ideal here because it's fast and runs on tiny hardware; for richer scenes or text-in-image, Qwen2.5-VL 7B produces more detailed, accurate captions.
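Small models do not always respect length or phrasing instructions, so a light post-processing pass can help. The sketch below is one possible cleanup, assuming the 125-character limit from the prompt above; `clean_alt_text` and its prefix pattern are illustrative.

```python
import re

MAX_ALT_LENGTH = 125  # same limit the prompt asks for

# Leading phrases such as "An image of" or "Photo of" that the prompt forbids.
_PREFIX = re.compile(r"^(an?\s+)?(image|picture|photo|photograph)\s+of\s+",
                     re.IGNORECASE)


def clean_alt_text(raw: str, limit: int = MAX_ALT_LENGTH) -> str:
    """Normalise a model caption into short alt-text."""
    text = " ".join(raw.split())   # collapse newlines and repeated spaces
    text = text.strip("\"'")       # drop wrapping quotes
    text = _PREFIX.sub("", text)   # remove a forbidden leading phrase
    if text:
        text = text[0].upper() + text[1:]
    if len(text) > limit:
        # Cut at the last word boundary that fits, then mark the truncation.
        text = text[: limit - 1].rsplit(" ", 1)[0].rstrip(",;:") + "…"
    return text
```

Truncation is a fallback; if captions are cut often, tightening the prompt or asking for a fresh caption is likely to read better than an ellipsis.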
Example 4 — Screenshots-to-data (charts, dashboards, UIs)
VLMs can read the numbers behind a chart or the values in a UI and hand them back as structured data:
ollama run qwen2.5vl:7b \
"Read this chart. Return the underlying data as a Markdown table with columns and values, then one sentence on the trend. ./dashboard.png"
Qwen2.5-VL was specifically tuned for chart, document, and layout understanding, which is why it's the default recommendation for this category. For a moving picture rather than a still screenshot, this same VLM step is the visual half of a video pipeline — see our local AI video analysis guide for how frames feed into the same models.
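Once the model replies with a Markdown table, a small parser turns it into rows you can load into a spreadsheet or database. This is a minimal sketch that assumes a pipe table whose lines start with `|`; models that omit the leading pipe would need a looser parser, and `parse_markdown_table` is an illustrative name.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Turn the first Markdown pipe table found in text into a list of row dicts."""
    rows: list[list[str]] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|"):
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
        elif rows:
            break  # first non-table line after the table ends it
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    # Drop the |---|---| separator row (cells made only of dashes, colons, spaces).
    body = [row for row in body if not all(set(cell) <= set("-: ") for cell in row)]
    return [dict(zip(header, row)) for row in body]


reply = """Here is the data:
| Month | Revenue |
|---|---|
| Jan | 120 |
| Feb | 150 |

Revenue rose from January to February."""
print(parse_markdown_table(reply))
```

Values come back as strings; convert them to numbers only after checking a few against the source chart, since a misread digit parses just as cleanly as a correct one.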
Local vs cloud: when is each right?
| Factor | Local (Qwen2.5-VL / MiniCPM-V) | Cloud (Google Vision, AWS Textract, GPT-4o) |
|---|---|---|
| Where images go | Stay on your disk | Uploaded to the provider |
| Cost model | One-time hardware + power | Per-image / per-page, recurring |
| Volume economics | Flat after hardware | Scales linearly with usage |
| Sensitive docs | Stays on-prem (HIPAA/contracts/PII) | Leaves your control |
| Raw ceiling | Strong, occasionally trails | Often highest accuracy |
| Setup effort | You run the model | API call, zero infra |
If your documents are non-sensitive and volume is low, a cloud OCR API is the pragmatic, zero-setup choice. If you process PII, invoices, contracts, medical scans, or high volumes, local is the answer — and the open models are now good enough that the accuracy gap is small for everyday OCR and extraction.
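The volume argument in the table reduces to a break-even calculation. The sketch below works it through with placeholder inputs: the hardware cost and per-page prices are hypothetical, not quotes from any provider, and the function ignores setup time and maintenance.

```python
def break_even_pages(hardware_cost: float, cloud_price_per_page: float,
                     local_power_cost_per_page: float = 0.0) -> float:
    """Pages after which a one-time hardware purchase beats per-page cloud pricing."""
    saving_per_page = cloud_price_per_page - local_power_cost_per_page
    if saving_per_page <= 0:
        return float("inf")  # local never pays off on cost alone
    return hardware_cost / saving_per_page


# Hypothetical inputs: a 1000-unit GPU versus 0.01 per page in the cloud.
print(round(break_even_pages(1000.0, 0.01)))
```

Below the break-even volume the cloud is cheaper on cost alone, which is why privacy, not price, is usually the deciding factor for low-volume sensitive documents.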
Key Takeaways
- One model, many tasks. OCR, invoice extraction, alt-text, and screenshots-to-data are all the same operation — image + prompt → text/JSON — handled by one open VLM.
- Default to Qwen2.5-VL 7B. Released January 2025, tuned for documents, charts, and clean JSON output; it's the best all-rounder and runs on a 12 GB GPU at 4-bit.
- Pick MiniCPM-V 2.6 for OCR-first work. The 8B model reports state-of-the-art OCRBench results, surpassing GPT-4o/GPT-4V/Gemini 1.5 Pro on that benchmark, in ~8–9 GB of VRAM.
- Moondream (~1.9B) is the tiny-hardware option for captioning and alt-text — it even runs on a Raspberry Pi.
- Run it two ways: Ollama for the 60-second interactive path, vLLM's OpenAI-compatible server for batch throughput over thousands of documents.
- VLMs beat OCR-then-parse for messy layouts because you extract structured fields in one shot — but validate the JSON and keep a human in the loop for accounting-grade data.
Next Steps
- Pick your model first: browse the best Ollama models to see the current vision builds and pull commands side by side.
- Apply vision to footage: read Local AI video analysis — the same VLMs describe frames so you can summarize and search inside video privately.
- Read documents in other languages: combine OCR with translate documents offline to extract and translate without sending files to the cloud.
- Add the audio half: for video and meeting workflows, pair vision with local AI subtitles using Whisper to turn speech into timestamped text.
Authoritative sources: the Qwen2.5-VL Technical Report and the MiniCPM-V 2.6 model card on Hugging Face document the model specs and OCR claims cited above.