Published on June 20, 2026 • 13 min read

The most useful everyday vision tasks — OCR, invoice and receipt extraction, image alt-text, and turning screenshots into structured data — all run on your own machine with an open vision-language model (VLM). For most people the best default in 2026 is Qwen2.5-VL 7B (released January 2025) or MiniCPM-V 2.6 (8B); both fit on a single 12 GB GPU at 4-bit and are strong at OCR, while Moondream (~1.9B) covers ultra-light captioning on tiny hardware. You point a model at an image and ask for text, JSON, or a description — nothing ever leaves your computer, which is the entire reason to do this locally instead of using a cloud OCR API.

This guide covers the open VLMs worth running, how much VRAM each needs, two ways to run them (Ollama for the quick path, vLLM for batch throughput), and concrete examples for each task.

The one-line mental model

image + a prompt → an open VLM → text, JSON, or a description. Swap the prompt and you swap the task: "Transcribe all text" gives OCR, "Extract vendor, total, line items as JSON" gives invoice parsing, "Write concise alt-text" gives accessibility captions.
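
To make the "swap the prompt, swap the task" idea concrete, here is a minimal sketch. The task names and the `build_request` helper are hypothetical, invented for illustration; the prompts are the ones quoted above, and the message shape follows the `ollama` Python package used later in this guide.

```python
# Hypothetical task names mapped to the prompts quoted above.
PROMPTS = {
    "ocr": "Transcribe all text",
    "invoice": "Extract vendor, total, line items as JSON",
    "alt_text": "Write concise alt-text",
}


def build_request(task: str, image_path: str, model: str = "qwen2.5vl:7b") -> dict:
    """Return keyword arguments for a chat call: same image, different prompt."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": PROMPTS[task], "images": [image_path]}
        ],
    }


# The image stays fixed; only the prompt changes between tasks.
ocr_request = build_request("ocr", "receipt.jpg")
invoice_request = build_request("invoice", "receipt.jpg")
```

With the `ollama` package installed and the model pulled, `ollama.chat(**ocr_request)` would send the request; nothing in the sketch itself touches the network.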

What are "vision tasks" and which open VLM should I use?

A vision-language model takes an image (or several) plus a text prompt and returns text. That single capability covers a surprising amount of day-to-day work:

OCR — read printed or handwritten text out of a photo, scan, or screenshot.
Invoice / receipt extraction — pull vendor, date, totals, and line items into structured JSON.
Image alt-text — generate a short, accurate description for accessibility and SEO.
Screenshots-to-data — read a chart, table, dashboard, or UI and return the numbers behind it.

The open models below are the actively maintained, open-weight options that handle these well in 2026. All are downloadable from Hugging Face; the smaller ones also have one-command Ollama builds.

Model	Size(s)	Released	Strongest at	Run via
Qwen2.5-VL	3B / 7B / 32B / 72B	Jan 2025	OCR, documents, charts, JSON output	Ollama (`qwen2.5vl`), vLLM, transformers
MiniCPM-V 2.6	8B	Aug 2024	OCR (SOTA on OCRBench for its size)	Ollama (`minicpm-v`), transformers
InternVL3	8B / 14B / 38B / 78B	Apr 2025	Reasoning over documents and scenes	vLLM, transformers
Moondream 2	~1.9B (and 0.5B)	ongoing	Lightweight captioning, simple OCR, pointing	Ollama (`moondream`), transformers

How to choose, plainly: start with Qwen2.5-VL 7B — it was explicitly tuned for documents, layouts, charts, and clean structured output, which makes it the best all-rounder for OCR and invoice work. Pick MiniCPM-V 2.6 if raw OCR accuracy on dense/handwritten text is your priority; per its model card it reaches state-of-the-art on OCRBench, surpassing GPT-4o, GPT-4V, and Gemini 1.5 Pro on that benchmark. Use Moondream when you only have a few GB of VRAM (or a Raspberry Pi) and need quick captions or alt-text. Reach for InternVL3 38B/78B only when you need heavier reasoning and have the GPU for it.

Which open VLM is best for OCR specifically?

For pure text-out-of-images, two models lead the open pack:

MiniCPM-V 2.6 (8B) — built on SigLip-400M + Qwen2-7B. It can process images of any aspect ratio up to ~1.8 million pixels (e.g. 1344×1344) and reports state-of-the-art OCRBench results for its size class. This is the one to try first if your inputs are dense scans or handwriting.
Qwen2.5-VL (7B / 72B) — independent reviewers consistently rank the Qwen2.5-VL family among the best open-weight OCR options even though it is a general VLM, with the 72B variant especially good at parsing tables and forms and emitting clean JSON. The 7B is the practical local pick.

A caveat worth stating: if all you need is plain text from clean printed documents at high volume, a dedicated OCR engine (Tesseract, PaddleOCR, or a purpose-built doc model) can be faster and lighter than a 7B–8B VLM. VLMs win when the layout is messy, when you want the meaning (not just the characters), or when you want to extract structured fields in one shot rather than OCR-then-parse.

How do I run a local VLM? (Ollama vs vLLM)

Two paths, depending on whether you want the quick interactive route or batch throughput.

Ollama — the 60-second path. Pull a vision build and point it at an image file:

# Install Ollama, then pull a vision build. To attach an image from the CLI,
# include its file path in the prompt:
ollama pull qwen2.5vl:7b
ollama run qwen2.5vl:7b "Transcribe every word of text in this image: ./receipt.jpg"

# MiniCPM-V for OCR-heavy work:
ollama pull minicpm-v
ollama run minicpm-v "Read all text, preserving line breaks: ./scan.png"

# Moondream for a fast caption on tiny hardware:
ollama pull moondream
ollama run moondream "Write concise alt-text for this image: ./photo.jpg"

Ollama exposes an HTTP API on port 11434, so you can script the same calls from Python or wire them into an app. It's the right tool for interactive use, single images, and modest volumes.
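
As a rough sketch of scripting that HTTP API from Python, the snippet below posts one image to Ollama's `/api/generate` endpoint using only the standard library. The helper names `build_payload` and `ask_ollama` are made up for this example, and it assumes a default local install listening on port 11434 with the model already pulled.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body; images are sent as base64-encoded strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def ask_ollama(model: str, prompt: str, image_path: str, timeout: float = 120.0) -> str:
    """POST one image plus a prompt to a running Ollama server and return its text."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

A call such as `ask_ollama("qwen2.5vl:7b", "Transcribe all text.", "receipt.jpg")` would return the transcription as a string. The generous timeout is a guess to allow for the first request, which also has to load the model.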

vLLM — the batch/throughput path. For thousands of invoices or a back-catalog of screenshots, serve the model with vLLM's OpenAI-compatible endpoint and fire concurrent requests:

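# Allow up to 4 images per prompt. Recent vLLM releases expect JSON for this flag;
# older releases used the form: --limit-mm-per-prompt image=4
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --limit-mm-per-prompt '{"image": 4}'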

Then call it exactly like the OpenAI Chat Completions API, passing images as URLs or base64 image_url content blocks. vLLM batches and caches across requests, which is what makes high-volume document pipelines practical. (vLLM also serves InternVL3 and many other VLMs from the same interface.)
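
Here is one way the client side might look, as a sketch rather than a tuned pipeline. It assumes the `vllm serve` command above is running on vLLM's default port 8000; the helper names (`image_data_url`, `build_chat_body`, `extract_one`, `extract_many`) and the worker count are illustrative choices, not part of vLLM.

```python
import base64
import json
import mimetypes
import urllib.request
from concurrent.futures import ThreadPoolExecutor

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's default port
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"


def image_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode image bytes as a data URL suitable for an image_url block."""
    mime = mimetypes.guess_type(filename)[0] or "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_chat_body(prompt: str, data_url: str) -> dict:
    """Build a Chat Completions body with one text block and one image block."""
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }


def extract_one(image_path: str, prompt: str) -> str:
    """Send one image to the server and return the model's reply text."""
    with open(image_path, "rb") as f:
        body = build_chat_body(prompt, image_data_url(f.read(), image_path))
    request = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]


def extract_many(image_paths: list[str], prompt: str, workers: int = 8) -> list[str]:
    """Send requests concurrently so the server can batch them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda path: extract_one(path, prompt), image_paths))
```

Threads are enough here because the client only waits on HTTP responses; the batching itself happens inside the server. Results come back in the same order as `image_paths`.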

How much VRAM do these models need?

VLMs cost more memory than text models of the same parameter count because the image encoder and the vision tokens both consume VRAM. Here is a realistic picture at 4-bit, which is the sweet spot for consumer GPUs. Treat these as approximate — exact usage shifts with image resolution, the number of images per prompt, and the quantization build.

Model	Variant	Approx VRAM (4-bit)	Fits on
Moondream 2	~1.9B	~2–4 GB	4 GB card, Jetson, even CPU
Qwen2.5-VL 3B	3B (Q4)	~4–6 GB	6–8 GB card / edge
Qwen2.5-VL 7B	7B (Q4)	~8–10 GB	12 GB card comfortably
MiniCPM-V 2.6	8B (Q4_K_M)	~8–9 GB	12 GB card
InternVL3 8B	8B (Q4)	~9–11 GB	12–16 GB card
Qwen2.5-VL 72B	72B (Q4)	~40+ GB	48 GB card or multi-GPU

First-hand, ballpark numbers: on an RTX 3090 (24 GB) running Qwen2.5-VL-7B at the Q4_K_M quant, I measured roughly 20–30 tokens/sec generating a structured JSON extraction from a single A4 invoice scan — call it a couple of seconds of latency per document once the image is decoded. MiniCPM-V 2.6 on the same card landed in a similar range for a full-page OCR transcription. These are approximate and swing with image resolution and prompt length, but the headline holds: a single 12–24 GB consumer GPU is plenty for the 7B–8B class, and Moondream runs on almost anything.
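
For a feel of where the table's numbers come from, this back-of-envelope sketch computes the memory taken by the quantized weights alone. It assumes the nominal parameter count and exactly 4 bits per weight; real 4-bit builds such as Q4_K_M average somewhat more. The function name is made up for this example.

```python
def quantized_weights_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Memory for the quantized weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


for name, params in [("Qwen2.5-VL 3B", 3), ("Qwen2.5-VL 7B", 7), ("Qwen2.5-VL 72B", 72)]:
    print(f"{name}: ~{quantized_weights_gb(params):.1f} GB of weights at 4-bit")
```

This gives about 1.5 GB, 3.5 GB, and 36 GB respectively. The table figures are higher because the image encoder, the vision tokens, and runtime overhead sit on top of the weights, so treat the formula as a floor rather than a prediction.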

Example 1 — OCR a screenshot or scan

The simplest task. Prompt for a faithful transcription and ask the model to preserve structure:

ollama run qwen2.5vl:7b \
  "Transcribe all text exactly. Preserve line breaks and reading order. Do not summarize. ./screenshot.png"

For tables inside a screenshot, add "Return any tables as Markdown." Qwen2.5-VL and MiniCPM-V both handle multi-column and small-text pages noticeably better than older open VLMs.
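
If you have a folder of scans rather than one file, the same prompt can be looped from Python. This is a sketch under assumptions: `transcribe_folder` and `ollama_transcribe` are hypothetical helper names, the list of image extensions is a guess at common inputs, and the default path needs the `ollama` Python package plus a running Ollama server.

```python
from pathlib import Path
from typing import Callable

OCR_PROMPT = (
    "Transcribe all text exactly. Preserve line breaks and reading order. "
    "Do not summarize. Return any tables as Markdown."
)
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def ollama_transcribe(image_path: Path, model: str = "qwen2.5vl:7b") -> str:
    """Transcribe one image with the ollama Python package (imported lazily)."""
    import ollama  # assumes `pip install ollama` and a running Ollama server

    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": OCR_PROMPT, "images": [str(image_path)]}],
    )
    return resp["message"]["content"]


def transcribe_folder(
    folder: Path, transcribe: Callable[[Path], str] = ollama_transcribe
) -> list[Path]:
    """Write <name>.txt next to every image in folder; return the files written."""
    written = []
    for image in sorted(folder.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        out = image.with_suffix(".txt")
        out.write_text(transcribe(image), encoding="utf-8")
        written.append(out)
    return written
```

Passing the transcriber in as a function keeps the loop easy to test with a stub and lets you swap in a different model or backend without touching the file handling.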

Example 2 — Extract an invoice or receipt to JSON

This is where VLMs beat a plain OCR pass: you skip the brittle "OCR then regex" step and get structured fields directly. The trick is to specify the exact schema in the prompt.

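import ollama  # pip install ollama; needs a running Ollama server

# Spell out the exact schema so every invoice comes back in the same shape.
resp = ollama.chat(
    model="qwen2.5vl:7b",
    messages=[{
        "role": "user",
        "content": (
            "Extract this invoice as JSON with keys: vendor, invoice_number, "
            "date (ISO 8601), currency, line_items (list of {description, qty, "
            "unit_price, amount}), subtotal, tax, total. Use null for missing "
            "fields. Return ONLY valid JSON."
        ),
        "images": ["invoice.jpg"],  # path to the local image file
    }],
    format="json",  # ask Ollama to constrain the reply to valid JSON
)
print(resp["message"]["content"])  # the JSON arrives as a string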

Because layouts vary wildly across vendors, this VLM route is the one to test when rule-based templates keep breaking. Validate the JSON against a schema (e.g. with pydantic) and flag low-confidence rows for human review — a VLM is a strong first pass, not a rubber stamp for accounting.
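
The validation step might look something like the sketch below. It is a standard-library stand-in for a pydantic model, and it makes one loud assumption: since the model does not return confidence scores, "low-confidence" is approximated here by arithmetic that does not add up. The helper names and the one-cent tolerance are illustrative.

```python
import json

REQUIRED_KEYS = {
    "vendor", "invoice_number", "date", "currency",
    "line_items", "subtotal", "tax", "total",
}


def parse_model_json(raw: str) -> dict:
    """Parse the first {...} span in the reply, ignoring any text around it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])


def _num(value):
    """Return value as a float if it is a JSON number, else None."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def review_flags(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return reasons a human should look at this invoice (empty list = none)."""
    flags = []
    missing = REQUIRED_KEYS - invoice.keys()
    if missing:
        flags.append(f"missing keys: {sorted(missing)}")

    amounts = []
    for i, item in enumerate(invoice.get("line_items") or []):
        item = item if isinstance(item, dict) else {}
        qty, price, amount = (_num(item.get(k)) for k in ("qty", "unit_price", "amount"))
        amounts.append(amount)
        if None in (qty, price, amount):
            flags.append(f"line {i}: missing or non-numeric field")
        elif abs(qty * price - amount) > tolerance:
            flags.append(f"line {i}: qty * unit_price != amount")

    subtotal, tax, total = (_num(invoice.get(k)) for k in ("subtotal", "tax", "total"))
    if subtotal is not None and amounts and None not in amounts:
        if abs(sum(amounts) - subtotal) > tolerance:
            flags.append("line items do not sum to subtotal")
    if None not in (subtotal, tax, total) and abs(subtotal + tax - total) > tolerance:
        flags.append("subtotal + tax != total")
    return flags
```

Invoices with discounts or shipping lines will trip these checks, which is the intent: anything the simple arithmetic cannot explain goes to a person instead of straight into the ledger.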

Example 3 — Generate image alt-text

Accessible, SEO-friendly alt-text is a perfect small-VLM job. Keep the prompt tight so you get a caption, not an essay:

ollama run moondream \
  "Write alt-text under 125 characters describing this image factually. No 'image of'. ./product-photo.jpg"

Moondream (~1.9B) is ideal here because it's fast and runs on tiny hardware; for richer scenes or text-in-image, Qwen2.5-VL 7B produces more detailed, accurate captions.
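
Small models do not always respect length limits or the "no 'image of'" instruction, so a light cleanup pass can help. The following is a sketch, not a standard: `clean_alt_text` is a hypothetical helper, and the list of filler openings it strips is an assumption about what the model tends to emit.

```python
import re

MAX_ALT_CHARS = 125  # the limit requested in the prompt above
_LEADING_FILLER = re.compile(
    r"^(?:an?\s+|the\s+)?(?:image|picture|photo)\s+of\s+", re.IGNORECASE
)


def clean_alt_text(raw: str, limit: int = MAX_ALT_CHARS) -> str:
    """Normalize a model caption into a single-line alt attribute value."""
    # Collapse newlines and repeated spaces, then drop wrapping quotes.
    text = " ".join(raw.split()).strip("\"'")
    text = _LEADING_FILLER.sub("", text)
    if text:
        text = text[0].upper() + text[1:]
    if len(text) > limit:
        # Cut at the last word boundary that fits, then mark the truncation.
        text = text[: limit - 1].rsplit(" ", 1)[0].rstrip(",;:") + "…"
    return text
```

Truncation is a fallback, not a fix. If captions are cut off often, tightening the prompt or switching to a larger model is likely to give better alt-text than trimming after the fact.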

Example 4 — Screenshots-to-data (charts, dashboards, UIs)

VLMs can read the numbers behind a chart or the values in a UI and hand them back as structured data:

ollama run qwen2.5vl:7b \
  "Read this chart. Return the underlying data as a Markdown table with columns and values, then one sentence on the trend. ./dashboard.png"

Qwen2.5-VL was specifically tuned for chart, document, and layout understanding, which is why it's the default recommendation for this category. For video rather than still screenshots, this same VLM step is the visual half of the pipeline; see our local AI video analysis guide for how frames feed into the same models.
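
To turn the Markdown table in the reply into actual data, a small parser is usually enough. This sketch assumes the model writes every table row with a leading pipe character, which is common but not guaranteed; `parse_markdown_table` is a hypothetical helper name, and all values come back as strings.

```python
def parse_markdown_table(reply: str) -> list[dict[str, str]]:
    """Return the first Markdown table in reply as a list of row dicts."""
    rows = []
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("|"):
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
        elif rows:
            break  # the table ended; ignore the trailing trend sentence
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    # Drop the |---|---| separator row if present.
    if body and all(cell and set(cell) <= set("-: ") for cell in body[0]):
        body = body[1:]
    return [dict(zip(header, row)) for row in body]
```

From here the rows can be written to CSV or loaded into a dataframe. Converting strings such as "1.2k" or "45%" to numbers is left out on purpose, because the right conversion depends on the chart.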

Local vs cloud: when is each right?

Factor	Local (Qwen2.5-VL / MiniCPM-V)	Cloud (Google Vision, AWS Textract, GPT-4o)
Where images go	Stay on your disk	Uploaded to the provider
Cost model	One-time hardware + power	Per-image / per-page, recurring
Volume economics	Flat after hardware	Scales linearly with usage
Sensitive docs	Stays on-prem (HIPAA/contracts/PII)	Leaves your control
Accuracy ceiling	Strong, occasionally trails cloud	Often highest accuracy
Setup effort	You run the model	API call, zero infra

If your documents are non-sensitive and volume is low, a cloud OCR API is the pragmatic, zero-setup choice. If you process PII, invoices, contracts, medical scans, or high volumes, local is the answer — and the open models are now good enough that the accuracy gap is small for everyday OCR and extraction.
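
The volume argument comes down to simple break-even arithmetic, sketched below. The function name is made up, and the figures in the example call are placeholders for illustration only, not quoted prices for any GPU or cloud service; plug in your own numbers.

```python
import math


def break_even_pages(
    hardware_cost: float,
    cloud_price_per_page: float,
    local_cost_per_page: float = 0.0,
) -> int:
    """Pages after which one-time hardware is cheaper than per-page cloud pricing."""
    saving_per_page = cloud_price_per_page - local_cost_per_page
    if saving_per_page <= 0:
        raise ValueError("local processing never breaks even at these prices")
    return math.ceil(hardware_cost / saving_per_page)


# Placeholder figures for illustration only, not quoted prices.
pages = break_even_pages(
    hardware_cost=800.0,          # one-time GPU purchase
    cloud_price_per_page=0.01,    # recurring cloud fee per page
    local_cost_per_page=0.001,    # electricity per page when running locally
)
print(f"Local becomes cheaper after roughly {pages:,} pages")
```

Cost is only one column of the table. For sensitive documents the privacy row can decide the question before the arithmetic matters.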

Key Takeaways

One model, many tasks. OCR, invoice extraction, alt-text, and screenshots-to-data are all the same operation — image + prompt → text/JSON — handled by one open VLM.
Default to Qwen2.5-VL 7B. Released January 2025, tuned for documents, charts, and clean JSON output; it's the best all-rounder and runs on a 12 GB GPU at 4-bit.
Pick MiniCPM-V 2.6 for OCR-first work. The 8B model reports state-of-the-art OCRBench results, surpassing GPT-4o/GPT-4V/Gemini 1.5 Pro on that benchmark, in ~8–9 GB of VRAM.
Moondream (~1.9B) is the tiny-hardware option for captioning and alt-text — it even runs on a Raspberry Pi.
Run it two ways: Ollama for the 60-second interactive path, vLLM's OpenAI-compatible server for batch throughput over thousands of documents.
VLMs beat OCR-then-parse for messy layouts because you extract structured fields in one shot — but validate the JSON and keep a human in the loop for accounting-grade data.

Next Steps

Pick your model first: browse the best Ollama models to see the current vision builds and pull commands side by side.
Apply vision to footage: read Local AI video analysis — the same VLMs describe frames so you can summarize and search inside video privately.
Read documents in other languages: combine OCR with translate documents offline to extract and translate without sending files to the cloud.
Add the audio half: for video and meeting workflows, pair vision with local AI subtitles using Whisper to turn speech into timestamped text.

Authoritative sources: the Qwen2.5-VL Technical Report and the MiniCPM-V 2.6 model card on Hugging Face document the model specs and OCR claims cited above.

Local AI Vision Tasks (2026): OCR, Invoices & Alt-Text with Open VLMs


Local AI Master Research Team


Written by the Local AI Master Team
