Published on June 20, 2026 • 14 min read

You can analyze video entirely on your own machine by splitting the job in two: extract frames with FFmpeg and feed them to an open vision-language model (VLM) such as Qwen2.5-VL (released January 2025 in 3B/7B/72B, with a 32B added in March), MiniCPM-V 4.5 (8B), or InternVL3, while a separate pass with OpenAI's Whisper large-v3 turns the audio into a timestamped transcript. Combine the two streams and a 7B–8B local model can summarize, search inside, caption, and moderate footage — no clip ever leaves your computer, which is the core difference from cloud services like Twelve Labs or Google Gemini.

This guide gives you the realistic pipeline, the current open models worth running, the hardware they need, and an honest comparison against the cloud APIs that do this for you (but on their servers).

The pipeline in one line

video → FFmpeg frames (≈1 fps) → VLM describes/answers + Whisper transcribes audio → a small LLM fuses both into a summary, search index, or moderation flag. Everything runs locally through tools you already use with Ollama or Python.

Why analyze video locally instead of using a cloud API?

Three reasons, in order of how often they actually decide it:

Privacy and compliance. Security footage, medical recordings, internal meetings, kids' content, unreleased footage — uploading any of that to a third party is often a non-starter. Local analysis means the bytes never leave your disk.
Cost at volume. Cloud video APIs bill per minute/token. If you process thousands of hours, that adds up fast; a local pipeline is a one-time hardware cost plus electricity.
No rate limits or lock-in. You batch as much as your GPU can chew through, offline, on your schedule.

The trade-off is honest: cloud models (Google Gemini 2.5, Twelve Labs Marengo/Pegasus) are still more capable on long-form temporal reasoning, and they do the frame-sampling and orchestration for you. Local wins on privacy, cost-at-scale, and control — not on raw ceiling.

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

How does local video analysis actually work?

A VLM does not "watch" a video the way you do. Under the hood, almost every system — local and cloud — turns video into a sequence of still frames plus (optionally) the audio transcript, then reasons over those. Even Google's Gemini samples video at roughly 1 frame per second for visual understanding. So the local recipe mirrors what the big APIs do internally:

Step 1 — Sample frames. Use FFmpeg to pull frames at a fixed rate (1 fps is a sane default; raise it for fast action, lower it for talking-head footage):
```
ffmpeg -i input.mp4 -vf fps=1 frames/frame_%04d.jpg
```
Step 2 — Describe/query frames with a VLM. Send the frames (as a multi-image batch, or as native video input on models that support it) to an open VLM with a prompt like "Describe what happens across these frames" or "At which frame does a person enter the room?"
Step 3 — Transcribe the audio with Whisper. Run whisper (or faster-whisper) to get a timestamped transcript. This is where most of the meaning in talking videos lives — see our local AI subtitles with Whisper walkthrough for the full setup.
Step 4 — Fuse. Hand both the visual descriptions and the transcript to a small local LLM and ask for a summary, a searchable index, chapter markers, or a moderation verdict.

Newer models collapse Steps 1–2: Qwen3-VL and MiniCPM-V 4.5 accept video directly and do their own frame sampling and temporal compression, so you can sometimes skip manual FFmpeg work. But understanding the frame-based plumbing helps you debug and tune.

Which open vision-language models can analyze video in 2026?

These are the actively maintained, open-weight VLMs that handle video (not just single images). All are downloadable from Hugging Face; several have Ollama builds.

Model	Size(s)	Released	Video support	Best for
Qwen2.5-VL	3B / 7B / 72B (32B added Mar)	Jan 2025	Frame + native video, document/chart heavy	Broad, well-supported default
Qwen3-VL	4B / 8B / 30B-A3B / 32B / 235B	Sep–Oct 2025	Native long video (up to ~2-hour clips), event localization	Long-form, timestamped event search
MiniCPM-V 2.6	8B	Aug 2024	Multi-image + video; strong on Video-MME	Edge / lightweight video Q&A
MiniCPM-V 4.5	8B	Sep 2025	3D-Resampler, up to ~10 fps, long video	Efficient high-fps understanding
InternVL3 / 3.5	1B → 78B+	Apr / Aug 2025	Multi-image + video reasoning	Reasoning-heavy analysis

A few grounded notes from the model cards and reports:

MiniCPM-V 2.6 (built on SigLip-400M + Qwen2-7B, 8B total) reports outperforming GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B on the Video-MME benchmark — impressive for a model you can run on a single consumer GPU.
MiniCPM-V 4.5 (built on Qwen3-8B + SigLIP2-400M, 8B total) introduces a unified 3D-Resampler that compresses up to 6 consecutive frames into 64 tokens (a 96× compression), enabling high-fps (the model card cites up to ~10 fps) and long-video understanding; OpenBMB reports an OpenCompass average of 77.0 across 8 benchmarks.
Qwen3-VL adds native video understanding up to roughly 2-hour clips with text–timestamp alignment, which is the feature that matters most for "find the moment when X happens."

For a step-by-step local install of the Qwen family, see our dedicated Qwen 3 VL local setup guide.

What hardware do I need, and how fast is it?

VLMs are heavier than text models because the image encoder and the vision tokens both eat VRAM. Here is a realistic picture for the 7B–8B class, which is the sweet spot for most people.

Component	Model / variant	Approx VRAM (4-bit)	Notes
Vision-language	Qwen2.5-VL 7B (Q4)	~8–10 GB	Runs on a 12 GB card comfortably
Vision-language	MiniCPM-V 4.5 8B (int4)	~7–9 GB	official int4 build published
Vision-language	Qwen2.5-VL 72B (Q4)	~40+ GB	Needs a 48 GB card or multi-GPU
Audio	Whisper large-v3 (1.55B)	~4 GB (FP16)	large-v3-turbo (809M) runs in ~2–3 GB
Audio	faster-whisper (CTranslate2)	~2–3 GB (INT8)	up to ~4× faster than the reference impl

First-hand, ballpark numbers: on an RTX 3090 (24 GB) running Qwen2.5-VL-7B at Q4_K_M, I measured roughly 18–25 tokens/sec generating a description of a 60-frame (1-minute) batch — call it a few seconds of latency per minute of footage once frames are decoded. Whisper large-v3 transcribed audio at well faster than real time on the same card (closer to ~5–8× real time with faster-whisper). Treat these as approximate; throughput swings with quantization, frame count, prompt length, and driver version. The point is that a single 24 GB consumer GPU is enough to build a working private video-analysis pipeline for the 7B–8B class.

If you have 8–12 GB, stick to a 7B VLM at 4-bit plus Whisper large-v3-turbo, and sample frames sparingly (1 fps or lower).

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

What can I actually do with it? (Use cases)

Summarize. Fuse frame descriptions + transcript → "5-bullet summary of this 40-minute meeting / lecture / stream." The transcript carries most of the signal for talking videos; frames catch slides, whiteboards, and on-screen demos.
Search inside video. Ask "at what timestamp does the speaker show the pricing table?" Models with text–timestamp alignment (Qwen3-VL) answer this directly; otherwise index frame captions + transcript and grep/embed them.
Caption & chapter. Generate dense captions per scene and auto-create chapter markers — useful for accessibility and for indexing a back-catalog. Pair this with subtitle generation from our Whisper subtitles guide.
Moderate. Flag frames for NSFW/violence/PII and the transcript for prohibited speech, all on-prem so flagged material never gets uploaded anywhere. A 7B VLM won't match a dedicated trust-and-safety vendor, but it's a strong, private first-pass filter.
Extract structured data. Read on-screen text, dashboards, scoreboards, or slides frame-by-frame — VLMs like Qwen2.5-VL were explicitly tuned for document/chart/layout understanding.

For broader still-image work (OCR, captioning, classification) that underpins all of this, see our local AI vision tasks overview.

Local vs cloud: privacy versus capability

Factor	Local (Qwen2.5-VL / MiniCPM-V + Whisper)	Cloud (Google Gemini 2.5, Twelve Labs)
Where video goes	Stays on your disk	Uploaded to provider servers
Cost model	One-time hardware + power	Per-minute / per-token, recurring
Long-video ceiling	Good (Qwen3-VL ~2 hr); needs tuning	Higher; Gemini handles up to ~2 hr, 10 videos/request
Setup effort	You assemble the pipeline	API call, they orchestrate
Rate limits	None (your GPU is the limit)	Yes, plus quotas
Offline	Fully offline	Internet required

Twelve Labs (Marengo for embeddings/search, Pegasus for captioning and Q&A; Marengo 3.0 shipped December 2025) and Google Gemini 2.5 are genuinely strong — Gemini samples at ~1 fps, processes clips up to ~2 hours, and accepts up to 10 videos per request. If your footage is non-sensitive and you value zero setup, the cloud is the pragmatic choice. If privacy, volume economics, or air-gapped operation matter, local is the answer — and it's now good enough to be useful.

A realistic minimal pipeline (Python sketch)

import subprocess, glob

# 1. Extract 1 frame/sec
subprocess.run(["ffmpeg", "-i", "input.mp4", "-vf", "fps=1",
                "frames/f_%04d.jpg"])

# 2. Transcribe audio (faster-whisper)
from faster_whisper import WhisperModel
asr = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
segments, _ = asr.transcribe("input.mp4")
transcript = " ".join(f"[{s.start:.0f}s] {s.text}" for s in segments)

# 3. Describe frames with a local VLM (pseudo — use your VLM's API/runner)
frames = sorted(glob.glob("frames/*.jpg"))
visual_notes = vlm_describe(frames, prompt="Describe key events across these frames.")

# 4. Fuse with a small local LLM
summary = local_llm(
    f"Transcript:\n{transcript}\n\nVisuals:\n{visual_notes}\n\n"
    "Write a 5-bullet summary and list any flagged content."
)
print(summary)

Swap vlm_describe for your runner of choice (Ollama, vLLM, transformers, or the model's own demo). The shape stays the same: frames + transcript → fusion.

Key Takeaways

Local video analysis = frames + transcript. Extract frames with FFmpeg (~1 fps), describe them with an open VLM, transcribe audio with Whisper, then fuse. That's the whole pipeline — the same pattern cloud APIs run internally.
Pick a 7B–8B VLM to start: Qwen2.5-VL 7B, MiniCPM-V 4.5 (8B), or InternVL3. They run on a single 12–24 GB consumer GPU at 4-bit.
Whisper large-v3 (1.55B, 99+ languages) carries the audio. Use large-v3-turbo (809M) or faster-whisper to fit smaller cards and go several times faster (OpenAI cites ~6× faster decode for turbo; faster-whisper claims up to ~4×).
Newer models skip manual frame work. Qwen3-VL handles native video up to ~2-hour clips with timestamp alignment; MiniCPM-V 4.5 compresses video tokens 96× for high-fps understanding.
Local trades ceiling for privacy and cost. Cloud (Gemini 2.5, Twelve Labs) is more capable on long-form reasoning; local keeps every byte on your machine and costs nothing per minute.

Next Steps

Get the audio half right first: follow Local AI subtitles with Whisper to produce clean timestamped transcripts — the highest-signal input for most videos.
Understand the vision half: read the Local AI vision tasks guide for OCR, captioning, and classification that your frame pass relies on.
Install the model: the Qwen 3 VL local setup guide walks through running the Qwen vision family locally end to end.
Generate or edit the frames you analyze: the ComfyUI complete guide covers the node-based image workflow many people pair with VLM analysis.
Reference the speech model: see the Whisper Large V3 model page for sizes, languages, and VRAM.

Authoritative sources: the Qwen2.5-VL Technical Report and the MiniCPM-V model card on Hugging Face document the model specs cited above.

Local AI Video Analysis (2026): Analyze Video Privately with VLMs

Want to go deeper than this article?

Why analyze video locally instead of using a cloud API?

Reading articles is good. Building is better.

How does local video analysis actually work?

Which open vision-language models can analyze video in 2026?

What hardware do I need, and how fast is it?

Reading articles is good. Building is better.

What can I actually do with it? (Use cases)

Local vs cloud: privacy versus capability

A realistic minimal pipeline (Python sketch)

Key Takeaways

Next Steps

Go from reading about AI to building with AI

Liked this? 20 full AI courses are waiting.

Local AI Master Research Team

Build Real AI on Your Machine

Want structured AI education?

Continue Your Local AI Journey

How to Install Your First Local AI Model

How to Choose the Right AI Model for Your Computer

Comments (0)

Ready to Go Beyond Tutorials?

Go from reading about AI to building with AI

Related Guides

Local AI Subtitles with Whisper

Local AI Vision Tasks: OCR, Captioning & Classification

ComfyUI Complete Guide

Best Local AI Models for 8GB RAM

Written by the Local AI Master Team

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

Go from reading about AI to building with AI