Local AI Vision Tasks (2026): OCR, Invoices & Alt-Text with Open VLMs

June 20, 2026
13 min read
Local AI Master Research Team

The most useful everyday vision tasks — OCR, invoice and receipt extraction, image alt-text, and turning screenshots into structured data — all run on your own machine with an open vision-language model (VLM). For most people the best default in 2026 is Qwen2.5-VL 7B (released January 2025) or MiniCPM-V 2.6 (8B); both fit on a single 12 GB GPU at 4-bit and are strong at OCR, while Moondream (~1.9B) covers ultra-light captioning on tiny hardware. You point a model at an image and ask for text, JSON, or a description — nothing ever leaves your computer, which is the entire reason to do this locally instead of using a cloud OCR API.

This guide covers the open VLMs worth running, how much VRAM each needs, two ways to run them (Ollama for the quick path, vLLM for batch throughput), and concrete examples for each task.

The one-line mental model

image + a prompt → an open VLM → text, JSON, or a description. Swap the prompt and you swap the task: "Transcribe all text" gives OCR, "Extract vendor, total, line items as JSON" gives invoice parsing, "Write concise alt-text" gives accessibility captions.
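As a minimal sketch of that mental model, the task can be treated as nothing more than a lookup from a task name to a prompt. The task names and the `build_request` helper are hypothetical; the prompt strings are the ones quoted above.

```python
# Hypothetical task names mapped to the prompts quoted in the text.
PROMPTS = {
    "ocr": "Transcribe all text",
    "invoice": "Extract vendor, total, line items as JSON",
    "alt_text": "Write concise alt-text",
}


def build_request(task: str, image_path: str) -> dict:
    """Pair an image with the prompt for the chosen task."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return {"prompt": PROMPTS[task], "image": image_path}
```

The model and the image stay the same across tasks; only the `prompt` value changes.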

What are "vision tasks" and which open VLM should I use?

A vision-language model takes an image (or several) plus a text prompt and returns text. That single capability covers a surprising amount of day-to-day work:

  • OCR — read printed or handwritten text out of a photo, scan, or screenshot.
  • Invoice / receipt extraction — pull vendor, date, totals, and line items into structured JSON.
  • Image alt-text — generate a short, accurate description for accessibility and SEO.
  • Screenshots-to-data — read a chart, table, dashboard, or UI and return the numbers behind it.

The open models below are the actively maintained, open-weight options that handle these well in 2026. All are downloadable from Hugging Face; the smaller ones also have one-command Ollama builds.

| Model | Size(s) | Released | Strongest at | Run via |
|---|---|---|---|---|
| Qwen2.5-VL | 3B / 7B / 32B / 72B | Jan 2025 | OCR, documents, charts, JSON output | Ollama (qwen2.5vl), vLLM, transformers |
| MiniCPM-V 2.6 | 8B | Aug 2024 | OCR (SOTA on OCRBench for its size) | Ollama (minicpm-v), transformers |
| InternVL3 | 8B / 14B / 38B / 78B | Apr 2025 | Reasoning over documents and scenes | vLLM, transformers |
| Moondream 2 | ~1.9B (and 0.5B) | ongoing | Lightweight captioning, simple OCR, pointing | Ollama (moondream), transformers |

How to choose, plainly: start with Qwen2.5-VL 7B — it was explicitly tuned for documents, layouts, charts, and clean structured output, which makes it the best all-rounder for OCR and invoice work. Pick MiniCPM-V 2.6 if raw OCR accuracy on dense/handwritten text is your priority; per its model card it reaches state-of-the-art on OCRBench, surpassing GPT-4o, GPT-4V, and Gemini 1.5 Pro on that benchmark. Use Moondream when you only have a few GB of VRAM (or a Raspberry Pi) and need quick captions or alt-text. Reach for InternVL3 38B/78B only when you need heavier reasoning and have the GPU for it.

Which open VLM is best for OCR specifically?

For pure text-out-of-images, two models lead the open pack:

  • MiniCPM-V 2.6 (8B) — built on SigLip-400M + Qwen2-7B. It can process images of any aspect ratio up to ~1.8 million pixels (e.g. 1344×1344) and reports state-of-the-art OCRBench results for its size class. This is the one to try first if your inputs are dense scans or handwriting.
  • Qwen2.5-VL (7B / 72B) — independent reviewers consistently rank the Qwen2.5-VL family among the best open-weight OCR options even though it is a general VLM, with the 72B variant especially good at parsing tables and forms and emitting clean JSON. The 7B is the practical local pick.

A caveat worth stating: if all you need is plain text from clean printed documents at high volume, a dedicated OCR engine (Tesseract, PaddleOCR, or a purpose-built doc model) can be faster and lighter than a 7B–8B VLM. VLMs win when the layout is messy, when you want the meaning (not just the characters), or when you want to extract structured fields in one shot rather than OCR-then-parse.

How do I run a local VLM? (Ollama vs vLLM)

Two paths, depending on whether you want the quick interactive route or batch throughput.

Ollama — the 60-second path. Pull a vision build and point it at an image file:

# Install Ollama, then pull a vision build.
# The CLI picks up an image when its file path appears inside the prompt.
ollama pull qwen2.5vl:7b
ollama run qwen2.5vl:7b "Transcribe every word of text in this image: ./receipt.jpg"

# MiniCPM-V for OCR-heavy work:
ollama pull minicpm-v
ollama run minicpm-v "Read all text, preserving line breaks: ./scan.png"

# Moondream for a fast caption on tiny hardware:
ollama pull moondream
ollama run moondream "Write concise alt-text for this image: ./photo.jpg"

Ollama exposes an HTTP API on port 11434, so you can script the same calls from Python or wire them into an app. It's the right tool for interactive use, single images, and modest volumes.
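Scripting against that HTTP API could look roughly like the sketch below, which uses only the Python standard library. It assumes a default local install listening on `localhost:11434` and uses the `/api/generate` endpoint with a base64-encoded image; the helper names `build_payload` and `ask_ollama` are made up for illustration.

```python
import base64
import json
import urllib.request

# Assumed default address of a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/generate request body with one image."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_ollama(model: str, prompt: str, image_path: str) -> str:
    """Send one image plus a prompt to a running Ollama server."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `ask_ollama("qwen2.5vl:7b", "Transcribe all text.", "receipt.jpg")` should return the model's text as a single string.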

vLLM — the batch/throughput path. For thousands of invoices or a back-catalog of screenshots, serve the model with vLLM's OpenAI-compatible endpoint and fire concurrent requests:

vllm serve Qwen/Qwen2.5-VL-7B-Instruct --limit-mm-per-prompt image=4

Then call it exactly like the OpenAI Chat Completions API, passing images as URLs or base64 image_url content blocks. vLLM batches and caches across requests, which is what makes high-volume document pipelines practical. (vLLM also serves InternVL3 and many other VLMs from the same interface.)
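A request body for that endpoint might be assembled as in this sketch. It assumes the server was started with the command above and is listening on vLLM's default port 8000; `image_message` and `build_chat_body` are hypothetical helper names.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> list:
    """Build a Chat Completions message list carrying one base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]


def build_chat_body(prompt: str, image_bytes: bytes,
                    model: str = "Qwen/Qwen2.5-VL-7B-Instruct") -> dict:
    """Full JSON body to POST to /v1/chat/completions."""
    return {"model": model, "messages": image_message(prompt, image_bytes)}
```

POST the result of `build_chat_body` as JSON to `http://localhost:8000/v1/chat/completions`. For batch work, many such requests can be sent concurrently (for example from a thread pool), leaving the batching to the server.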

How much VRAM do these models need?

VLMs cost more memory than text models of the same parameter count because the image encoder and the vision tokens both consume VRAM. Here is a realistic picture at 4-bit, which is the sweet spot for consumer GPUs. Treat these as approximate — exact usage shifts with image resolution, the number of images per prompt, and the quantization build.

| Model | Variant | Approx. VRAM (4-bit) | Fits on |
|---|---|---|---|
| Moondream 2 | ~1.9B | ~2–4 GB | 4 GB card, Jetson, even CPU |
| Qwen2.5-VL 3B | 3B (Q4) | ~4–6 GB | 6–8 GB card / edge |
| Qwen2.5-VL 7B | 7B (Q4) | ~8–10 GB | 12 GB card comfortably |
| MiniCPM-V 2.6 | 8B (Q4_K_M) | ~8–9 GB | 12 GB card |
| InternVL3 8B | 8B (Q4) | ~9–11 GB | 12–16 GB card |
| Qwen2.5-VL 72B | 72B (Q4) | ~40+ GB | 48 GB card or multi-GPU |
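The quantized weights set a floor under these figures, and that floor is simple arithmetic: parameters times bits per weight, divided by eight. The sketch below computes only that floor, not total usage; the function name is made up, and 4.0 bits per weight is an idealization, since real 4-bit builds often store somewhat more.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Memory for the quantized weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


print(weight_gb(7))   # 3.5
print(weight_gb(72))  # 36.0
```

The gap between this floor and the table (for example 3.5 GB versus ~8–10 GB for the 7B model) is the image encoder, the vision tokens, the KV cache, and runtime overhead.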

First-hand, ballpark numbers: on an RTX 3090 (24 GB) running Qwen2.5-VL-7B at the Q4_K_M quant, I measured roughly 20–30 tokens/sec generating a structured JSON extraction from a single A4 invoice scan — call it a couple of seconds of latency per document once the image is decoded. MiniCPM-V 2.6 on the same card landed in a similar range for a full-page OCR transcription. These are approximate and swing with image resolution and prompt length, but the headline holds: a single 12–24 GB consumer GPU is plenty for the 7B–8B class, and Moondream runs on almost anything.

Example 1 — OCR a screenshot or scan

The simplest task. Prompt for a faithful transcription and ask the model to preserve structure:

# The image path goes inside the prompt text.
ollama run qwen2.5vl:7b \
  "Transcribe all text exactly. Preserve line breaks and reading order. Do not summarize. Image: ./screenshot.png"

For tables inside a screenshot, add "Return any tables as Markdown." Qwen2.5-VL and MiniCPM-V both handle multi-column and small-text pages noticeably better than older open VLMs.

Example 2 — Extract an invoice or receipt to JSON

This is where VLMs beat a plain OCR pass: you skip the brittle "OCR then regex" step and get structured fields directly. The trick is to specify the exact schema in the prompt.

import ollama

resp = ollama.chat(
    model="qwen2.5vl:7b",
    messages=[{
        "role": "user",
        "content": (
            "Extract this invoice as JSON with keys: vendor, invoice_number, "
            "date (ISO 8601), currency, line_items (list of {description, qty, "
            "unit_price, amount}), subtotal, tax, total. Use null for missing "
            "fields. Return ONLY valid JSON."
        ),
        "images": ["invoice.jpg"],
    }],
)
print(resp["message"]["content"])

Because layouts vary wildly across vendors, this VLM route is the one to test when rule-based templates keep breaking. Validate the JSON against a schema (e.g. with pydantic) and flag low-confidence rows for human review — a VLM is a strong first pass, not a rubber stamp for accounting.
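One way to do that validation is sketched below with the standard library only, as a stand-in for a pydantic model. The required keys are the ones named in the prompt above; the arithmetic checks, the tolerance, and the helper names are assumptions chosen for illustration, and the code assumes each line item is a JSON object.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap around JSON

REQUIRED_KEYS = {
    "vendor", "invoice_number", "date", "currency",
    "line_items", "subtotal", "tax", "total",
}


def parse_invoice(raw: str) -> dict:
    """Parse model output into a dict, tolerating a surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def needs_review(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return the reasons, if any, to route this invoice to a person."""
    reasons = []
    subtotal, tax, total = (invoice.get(k) for k in ("subtotal", "tax", "total"))
    amounts = [item.get("amount") for item in invoice.get("line_items") or []]
    if not all(_is_number(a) for a in amounts):
        reasons.append("line item amount missing or not a number")
    elif _is_number(subtotal) and abs(sum(amounts) - subtotal) > tolerance:
        reasons.append("line items do not sum to subtotal")
    if not all(_is_number(v) for v in (subtotal, tax, total)):
        reasons.append("subtotal, tax, or total missing or not a number")
    elif abs(subtotal + tax - total) > tolerance:
        reasons.append("subtotal + tax does not equal total")
    return reasons
```

Feeding `resp["message"]["content"]` through `parse_invoice` and then `needs_review` gives an empty list for invoices whose numbers are internally consistent and a list of reasons otherwise. Consistent numbers can still be misread, so this narrows human review rather than replacing it.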

Example 3 — Generate image alt-text

Accessible, SEO-friendly alt-text is a perfect small-VLM job. Keep the prompt tight so you get a caption, not an essay:

# The image path goes inside the prompt text.
ollama run moondream \
  "Write alt-text under 125 characters describing this image factually. No 'image of'. Image: ./product-photo.jpg"

Moondream (~1.9B) is ideal here because it's fast and runs on tiny hardware; for richer scenes or text-in-image, Qwen2.5-VL 7B produces more detailed, accurate captions.
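Small models do not always respect the constraints in a prompt, so a quick check on the returned caption can help before it is published. This sketch enforces the two rules from the prompt above (under 125 characters, no leading "image of"); the function name and the extra "an image of" prefix are assumptions.

```python
# Assumed list of prefixes to reject; the prompt itself only names "image of".
BANNED_PREFIXES = ("image of", "an image of")


def check_alt_text(text: str, max_chars: int = 125) -> list[str]:
    """Return a list of problems with a generated alt-text (empty if fine)."""
    cleaned = text.strip().strip('"').strip()
    problems = []
    if not cleaned:
        problems.append("empty")
    if len(cleaned) >= max_chars:
        problems.append(f"not under {max_chars} characters")
    if cleaned.lower().startswith(BANNED_PREFIXES):
        problems.append("starts with 'image of'")
    return problems
```

A caption that fails the check can be regenerated or queued for a manual edit.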

Example 4 — Screenshots-to-data (charts, dashboards, UIs)

VLMs can read the numbers behind a chart or the values in a UI and hand them back as structured data:

# The image path goes inside the prompt text.
ollama run qwen2.5vl:7b \
  "Read this chart. Return the underlying data as a Markdown table with columns and values, then one sentence on the trend. Image: ./dashboard.png"

Qwen2.5-VL was specifically tuned for chart, document, and layout understanding, which is why it's the default recommendation for this category. For a moving picture rather than a still screenshot, this same VLM step is the visual half of a video pipeline — see our local AI video analysis guide for how frames feed into the same models.
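To use the chart data downstream, the Markdown table in the reply has to become rows. A minimal parser, assuming the model emits a pipe-delimited table with leading and trailing pipes (models do not always do so), might look like this; `parse_markdown_table` is a hypothetical helper and all values are returned as strings.

```python
def parse_markdown_table(text: str) -> list[dict]:
    """Extract the first pipe-delimited Markdown table as a list of row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if len(line) >= 2 and line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the table ended; ignore any trailing prose
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]

    def is_separator(row: list) -> bool:
        # Separator rows look like |---|:---:|
        return all(cell and set(cell) <= set("-:") for cell in row)

    return [
        dict(zip(header, row))
        for row in body
        if not is_separator(row) and len(row) == len(header)
    ]
```

Rows whose cell count does not match the header are dropped rather than guessed at, which is the conservative choice when the numbers feed a report.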

Local vs cloud: when is each right?

| Factor | Local (Qwen2.5-VL / MiniCPM-V) | Cloud (Google Vision, AWS Textract, GPT-4o) |
|---|---|---|
| Where images go | Stay on your disk | Uploaded to the provider |
| Cost model | One-time hardware + power | Per-image / per-page, recurring |
| Volume economics | Flat after hardware | Scales linearly with usage |
| Sensitive docs | Stays on-prem (HIPAA/contracts/PII) | Leaves your control |
| Raw ceiling | Strong, occasionally trails | Often highest accuracy |
| Setup effort | You run the model | API call, zero infra |

If your documents are non-sensitive and volume is low, a cloud OCR API is the pragmatic, zero-setup choice. If you process PII, invoices, contracts, medical scans, or high volumes, local is the answer — and the open models are now good enough that the accuracy gap is small for everyday OCR and extraction.
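The volume argument can be made concrete with a break-even calculation: a one-time hardware cost divided by the per-page saving over a cloud API. The sketch below is only that arithmetic; the function name is made up, and any prices passed in are placeholders, not quotes from a provider.

```python
import math


def break_even_pages(hardware_cost: float,
                     cloud_price_per_page: float,
                     local_cost_per_page: float = 0.0) -> int:
    """Pages after which a one-time hardware purchase beats per-page cloud pricing."""
    saving = cloud_price_per_page - local_cost_per_page
    if saving <= 0:
        raise ValueError("local is never cheaper per page at these prices")
    return math.ceil(hardware_cost / saving)
```

With a hypothetical 1,000 hardware budget and a hypothetical 0.01 per page in the same currency, `break_even_pages(1000.0, 0.01)` gives 100,000 pages. Setting `local_cost_per_page` to an electricity estimate pushes the break-even point further out.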

Key Takeaways

  1. One model, many tasks. OCR, invoice extraction, alt-text, and screenshots-to-data are all the same operation — image + prompt → text/JSON — handled by one open VLM.
  2. Default to Qwen2.5-VL 7B. Released January 2025, tuned for documents, charts, and clean JSON output; it's the best all-rounder and runs on a 12 GB GPU at 4-bit.
  3. Pick MiniCPM-V 2.6 for OCR-first work. The 8B model reports state-of-the-art OCRBench results, surpassing GPT-4o/GPT-4V/Gemini 1.5 Pro on that benchmark, in ~8–9 GB of VRAM.
  4. Moondream (~1.9B) is the tiny-hardware option for captioning and alt-text — it even runs on a Raspberry Pi.
  5. Run it two ways: Ollama for the 60-second interactive path, vLLM's OpenAI-compatible server for batch throughput over thousands of documents.
  6. VLMs beat OCR-then-parse for messy layouts because you extract structured fields in one shot — but validate the JSON and keep a human in the loop for accounting-grade data.

Next Steps

  • Pick your model first: browse the best Ollama models to see the current vision builds and pull commands side by side.
  • Apply vision to footage: read Local AI video analysis — the same VLMs describe frames so you can summarize and search inside video privately.
  • Read documents in other languages: combine OCR with translate documents offline to extract and translate without sending files to the cloud.
  • Add the audio half: for video and meeting workflows, pair vision with local AI subtitles using Whisper to turn speech into timestamped text.

Authoritative sources: the Qwen2.5-VL Technical Report and the MiniCPM-V 2.6 model card on Hugging Face document the model specs and OCR claims cited above.

Published: June 20, 2026 · Last updated: June 20, 2026 · Manually reviewed

Ready to Go Beyond Tutorials?

20 structured courses with hands-on chapters - build RAG chatbots, AI agents, and ML pipelines on your own hardware.

🎯
AI Learning Path

Go from reading about AI to building with AI

20 structured courses. Hands-on projects. Runs on your machine. Start free.

Or own it for life — Lifetime $149 $599, pay once

Was this helpful?

LM

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor
📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Go from reading about AI to building with AI

20 structured courses. Hands-on projects. Runs on your machine. Start free.

Or own it for life — Lifetime $149 $599, pay once
Free Tools & Calculators