GLM-4.5V Local Setup Guide (2026): Zhipu's 106B MoE Vision-Language Model with Agentic GUI Control
GLM-4.5V is Zhipu AI's August 2025 open-weight vision-language MoE — 106B total parameters, 12B activated per token, built on the GLM-4.5-Air backbone with a vision tower for images and video. It's the first frontier-class open VLM specifically designed for agentic workloads: GUI control from screenshots, multi-step reasoning over chart/diagram inputs, long-video understanding up to 2 hours, multi-image RAG. MIT licensed for unrestricted commercial use. For self-hosters building agents, multimodal RAG over visual documents, video analysis pipelines, or any vision workload where license cleanliness matters, it's the strongest 2026 choice.
This guide covers what you need to run GLM-4.5V locally — hardware paths from 2x RTX 3090 to single H100, setup with vLLM / SGLang / Transformers, Thinking Mode for complex reasoning, GUI-agent integration patterns, video pipelines, fine-tuning with QLoRA, and benchmarks vs Qwen 2-VL 72B / Llama 3.2 90B Vision / GPT-4o.
Table of Contents
- What GLM-4.5V Is
- Architecture: GLM-4.5-Air MoE + Vision Tower
- Hardware Requirements
- GLM-4.5V vs Qwen 2-VL 72B vs Llama 3.2 90B Vision vs GPT-4o
- vLLM Setup
- SGLang Setup
- Transformers / HF Setup
- Quantization Options
- Thinking Mode
- Multi-Image Reasoning
- Video Understanding (Up to 2 Hours)
- GUI Agent Pattern
- OCR & Document Understanding
- Fine-Tuning with QLoRA
- System Prompts & Sampling
- Real Benchmarks
- Licensing
- Troubleshooting
- FAQ
What GLM-4.5V Is {#what-it-is}
GLM-4.5V (zai-org/GLM-4.5V on HuggingFace) is the August 2025 release of Zhipu AI's flagship open-weight vision-language model. Architecture: GLM-4.5-Air MoE backbone (106B total params, 128 experts, top-8 routing, ~12B activated per token) with a vision encoder (custom CLIP-style ViT-L/14 with ~1B params). Native context: 64K text tokens, image inputs at up to 4K resolution per image, video inputs at 2 fps for up to 2 hours.
Variants:
- GLM-4.5V — flagship 106B/12B-active VLM (most users want this)
- GLM-4.5V-Thinking — same weights, default-on Thinking Mode for complex tasks
- GLM-4.5-Air — text-only base (without vision tower)
License: MIT — unrestricted commercial use.
Architecture: GLM-4.5-Air MoE + Vision Tower {#architecture}
Three architectural components:
Vision Encoder (1B params)
- Custom CLIP-style ViT-L/14 architecture
- Supports dynamic resolution (224x224 to 4096x4096)
- Each image becomes 256-1024 tokens depending on resolution
- Video frames sampled at 2 fps, encoded individually
MoE Language Model (106B / 12B active)
- 128 routed experts + 1 shared expert per layer
- Top-8 routing
- 96 transformer layers
- 64K context window
- Inherits GLM-4.5-Air's training (mostly Chinese + English with code/math)
Vision-Language Projector
- 2-layer MLP projecting vision tokens into language embedding space
- Trained jointly with language model in late-stage fine-tuning
Thinking Mode
- Trained with chain-of-thought data including <think>...</think> tokens
- Can be toggled per request
- Improves multi-step reasoning at the cost of latency
Hardware Requirements {#hardware}
| Setup | Quant | Throughput | Notes |
|---|---|---|---|
| 1x H100 80GB | FP8 | 50-70 tok/s | Recommended production |
| 2x H100 80GB | BF16 | 70-90 tok/s | High-throughput |
| 1x A100 80GB | INT8 | 30-45 tok/s | Budget production |
| 2x RTX 4090 (24GB ea) | INT4 AWQ | 25-40 tok/s | Self-hosted enthusiast |
| 2x RTX 3090 (NVLink) | INT4 AWQ | 20-30 tok/s | Used-market budget |
| Mac Studio M3 Ultra 512GB | Q4_K_M GGUF | 12-18 tok/s | Solo developer |
| 1x RTX 4090 + 64GB RAM | Q3_K_M GGUF (offload) | 4-8 tok/s | Hobbyist limit |
For most self-hosters: 2x RTX 3090 + AWQ INT4 via vLLM. For production single-server: H100 80GB with FP8. For Mac users: M3 Ultra 512GB.
GLM-4.5V vs Qwen 2-VL 72B vs Llama 3.2 90B Vision vs GPT-4o {#comparison}
| Benchmark | GLM-4.5V | Qwen 2-VL 72B | Llama 3.2 90B Vision | GPT-4o |
|---|---|---|---|---|
| MMMU (val) | 65.5 | 64.5 | 60.3 | 69.1 |
| MMBench | 87.2 | 86.5 | 78.0 | 86.0 |
| MMVet | 73.1 | 74.0 | 64.1 | 76.2 |
| DocVQA | 92.5 | 96.5 | 90.1 | 92.8 |
| ChartQA | 86.0 | 88.3 | 85.5 | 85.7 |
| InfoVQA | 78.0 | 82.0 | 73.4 | 79.2 |
| OCRBench | 855 | 877 | 753 | 736 |
| MVBench (video) | 73.5 | 73.6 | n/a | n/a |
| Video-MME (long) | 64.6 | 56.0 | n/a | 64.3 |
| ScreenSpot-Pro (GUI) | 56.4 | 38.2 | n/a | 18.9 |
| OSWorld (agent) | 35.8 | 12.5 | n/a | 22.1 |
| Activated params | 12B (MoE) | 72B (dense) | 90B (dense) | unknown |
GLM-4.5V wins decisively on agentic GUI tasks (ScreenSpot-Pro, OSWorld) and matches frontier on video understanding. Qwen 2-VL leads on document/OCR. Llama 3.2 Vision is the dense alternative with simpler architecture.
vLLM Setup {#vllm}
pip install "vllm>=0.7"
vllm serve zai-org/GLM-4.5V \
--tensor-parallel-size 2 \
--trust-remote-code \
--max-model-len 65536 \
--limit-mm-per-prompt image=10,video=2 \
--gpu-memory-utilization 0.9
For FP8 on H100:
vllm serve zai-org/GLM-4.5V-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e5m2 \
--max-model-len 65536 \
--trust-remote-code
OpenAI-compatible vision endpoint:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "GLM-4.5V",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}]
}'
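The same endpoint also works with the OpenAI Python client. A minimal sketch for sending a local image as base64 — the model name must match whatever the server reports at /v1/models, and the file path is an example:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local screenshot as a data URL
with open("./screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",  # must match the served model name from /v1/models
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)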
See vLLM Complete Setup Guide.
SGLang Setup {#sglang}
pip install --upgrade "sglang[all]"
python -m sglang.launch_server \
--model-path zai-org/GLM-4.5V \
--tp 2 \
--trust-remote-code \
--enable-torch-compile \
--port 30000
SGLang's frontend has native image/video helpers:
import sglang as sgl

# Point the frontend at the server launched above
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def visual_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=512))

state = visual_qa.run(image_path="./screenshot.png", question="What button should I click?")
print(state["answer"])
Transformers / HF Setup {#hf}
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_name = "zai-org/GLM-4.5V"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("./photo.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in detail."},
    ]}
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,      # return input_ids and pixel values together for generate()
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, not the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Quantization Options {#quants}
| Quant | Size | VRAM | Quality Loss | Best Engine |
|---|---|---|---|---|
| BF16 | 212 GB | 240 GB | 0% | vLLM/SGLang (3 H100) |
| FP8 native | 106 GB | 130 GB | <0.5% | vLLM/SGLang (1-2 H100) |
| INT8 W8A8 | 106 GB | 130 GB | <1% | vLLM (2 A100) |
| INT4 AWQ | 55 GB | 70 GB | 1-2% | vLLM (1 H100 / 2 RTX 4090) |
| GGUF Q5_K_M | 75 GB | 90 GB | <1% | llama.cpp (Mac) |
| GGUF Q4_K_M | 55 GB | 70 GB | 1-2% | llama.cpp (Mac) |
For most production: FP8 on H100 80GB. For self-hosted: AWQ INT4 on 2x RTX 4090. For Mac: Q4_K_M GGUF.
Note: as of mid-2025, llama.cpp added GLM-4.5 architecture support; ensure your build is post-Sept 2025 for full GLM-4.5V compatibility including vision.
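A sketch of serving an INT4 AWQ export on 2x 24GB cards. The repo name below is a placeholder for whichever community AWQ quant you pull from HuggingFace, and the shorter context length is a deliberate trade-off to keep KV cache in VRAM:
vllm serve your-org/GLM-4.5V-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --limit-mm-per-prompt image=5 \
  --trust-remote-code \
  --gpu-memory-utilization 0.92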
Thinking Mode {#thinking}
Enable Thinking Mode for complex multi-step reasoning:
# Via system prompt flag
messages = [
    {"role": "system", "content": "You are an expert reasoner. Use thinking mode for complex problems."},
    {"role": "user", "content": [
        {"type": "image", "image": chart_image},
        {"type": "text", "text": "What trend does this chart show? Reason step by step."},
    ]},
]
Output structure:
<think>
The chart shows X axis as time, Y axis as revenue. I see three lines representing
products A, B, and C. Line A trends upward from 2020 to 2024...
</think>
The chart shows that Product A grew steadily by ~30% per year, while Products B and C
stagnated. The key trend is Product A's market share gain.
When to use Thinking Mode:
- Multi-step math involving images (charts, diagrams)
- Complex GUI agent task planning
- Multi-image reasoning where relationships matter
- Spatial reasoning across video frames
Skip for: simple captions, single-image OCR, easy VQA. Thinking adds 2-10s latency.
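If you consume raw completions rather than a pre-parsed response, strip the reasoning block yourself. A minimal post-processing sketch using the <think>/</think> tags shown above:
import re

def strip_thinking(text: str) -> str:
    """Return only the final answer, dropping any <think>...</think> block."""
    if "</think>" in text:
        return text.split("</think>", 1)[1].strip()
    # No closing tag: remove any dangling opening block and keep the rest
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>X axis is time, Y axis is revenue...</think>\nProduct A grew ~30% per year."))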
Multi-Image Reasoning {#multi-image}
GLM-4.5V handles up to 10 images per prompt natively:
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image1},
        {"type": "image", "image": image2},
        {"type": "image", "image": image3},
        {"type": "text", "text": "What changed between these three screenshots?"},
    ]},
]
Use cases:
- Before/after comparison (UI changes, design iterations, medical images)
- Multi-camera scene understanding
- Document page comparison
- Time-series visual analysis
Per-request limit: 10 images when vLLM is launched with --limit-mm-per-prompt image=10 as shown earlier (vLLM's out-of-the-box limit is lower). Each image consumes 256-1024 tokens depending on resolution.
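Since each image costs 256-1024 tokens, downscaling before you send keeps multi-image prompts inside the context budget. A small Pillow helper — the 1344 px cap is an illustrative choice, not a model constant:
from PIL import Image

def downscale(path: str, max_side: int = 1344) -> Image.Image:
    """Resize an image so its longest side is at most max_side, preserving aspect ratio."""
    img = Image.open(path)
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    return img

images = [downscale(p) for p in ["before.png", "during.png", "after.png"]]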
Video Understanding (Up to 2 Hours) {#video}
messages = [
    {"role": "user", "content": [
        {"type": "video", "video": "/path/to/video.mp4", "fps": 2},
        {"type": "text", "text": "Summarize what happens in this video."},
    ]},
]
Frame sampling: 2 fps by default, configurable. A 2-hour video at 2 fps yields 14,400 frames; the model downsamples to ~256 representative frames internally.
Use cases:
- Long-form video summarization (lectures, meetings, films)
- Surveillance / monitoring (event detection across hours)
- Sports analysis (full-game reasoning)
- Educational video Q&A
VRAM impact: long videos add to KV cache linearly. For 2-hour video on a single H100: cap context at 32K to leave room for video tokens.
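If your serving stack doesn't accept video files directly, you can approximate the same pipeline by sampling frames yourself and sending them as a multi-image prompt. A sketch with OpenCV — the fps and frame cap are illustrative:
import cv2

def sample_frames(video_path: str, fps: float = 2.0, max_frames: int = 64):
    """Grab frames at roughly `fps` from a video, capped at max_frames."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(native_fps / fps), 1)
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames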
GUI Agent Pattern {#gui-agent}
The flagship use case — drive a browser or desktop app from screenshots:
from playwright.sync_api import sync_playwright

def gui_agent_step(screenshot_path, goal, history):
    """Call GLM-4.5V to decide the next action from the current screenshot."""
    messages = [
        {"role": "system", "content": "You are a GUI agent. Analyze the screenshot and output the next action as JSON: {'action': 'click|type|scroll|done', 'target': ..., 'value': ...}"},
        {"role": "user", "content": [
            {"type": "image", "image": screenshot_path},
            {"type": "text", "text": f"Goal: {goal}\nHistory: {history}\nNext action:"},
        ]},
    ]
    # call_vllm / parse_json_action are your own helpers: send the messages to the
    # vLLM endpoint, then parse the JSON action out of the response text.
    return parse_json_action(call_vllm(messages))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    history = []
    for step in range(20):
        page.screenshot(path=f"step_{step}.png")
        action = gui_agent_step(f"step_{step}.png", "Book a flight to Tokyo", history)
        if action["action"] == "done":
            break
        execute_action(page, action)  # see the sketch below for one possible implementation
        history.append(action)
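execute_action above is left to you. One possible mapping from the JSON action schema onto Playwright calls — whether the model returns CSS selectors or pixel coordinates depends on your prompt contract, so treat this as a sketch:
def execute_action(page, action):
    """Map the model's JSON action onto Playwright calls."""
    if action["action"] == "click":
        # target may be a CSS selector or an (x, y) pair, per your prompt contract
        if isinstance(action["target"], (list, tuple)):
            page.mouse.click(*action["target"])
        else:
            page.click(action["target"])
    elif action["action"] == "type":
        page.fill(action["target"], action["value"])
    elif action["action"] == "scroll":
        page.mouse.wheel(0, int(action.get("value", 600)))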
For production agentic stacks see AI Agents Local Guide and AI Agent Frameworks.
OCR & Document Understanding {#ocr}
GLM-4.5V handles printed and handwritten text in 30+ languages:
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": invoice_image},
        {"type": "text", "text": "Extract all line items as JSON: [{description, qty, unit_price, total}]"},
    ]},
]
For pure OCR / document workflows, Qwen 2-VL 72B is slightly stronger (DocVQA 96.5 vs GLM 92.5). Pick GLM-4.5V when you need OCR + reasoning + agent capability in one model; pick Qwen 2-VL for OCR-only pipelines.
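To keep extraction machine-parseable, recent vLLM builds accept a JSON schema via guided decoding. A sketch using the OpenAI client's extra_body — the field names mirror the prompt above, and guided_json is a vLLM-specific parameter that may change between versions:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Schema for the line items requested in the prompt above
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "qty": {"type": "number"},
            "unit_price": {"type": "number"},
            "total": {"type": "number"},
        },
        "required": ["description", "qty", "unit_price", "total"],
    },
}

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract all line items as JSON."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]}],
    temperature=0.1,
    extra_body={"guided_json": schema},  # vLLM guided decoding
)
print(response.choices[0].message.content)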
Fine-Tuning with QLoRA {#fine-tuning}
QLoRA on the language model side, freezing the vision encoder:
# Axolotl config
base_model: zai-org/GLM-4.5V
trust_remote_code: true
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
datasets:
  - path: ./vision_train.jsonl
    type: chat_template
# Freeze vision tower
unfrozen_parameters:
  - language_model.*
sequence_len: 4096
sample_packing: false # vision examples don't pack well
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 1e-4
optimizer: paged_adamw_8bit
bf16: auto
output_dir: ./glm-4.5v-lora
Time: ~6-12 hours on 2x RTX 3090 NVLink for 1K image+text examples. See QLoRA Fine-Tuning Guide vision section.
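With the config saved as, say, glm45v-qlora.yml (the filename is yours to choose), training and adapter merging go through Axolotl's usual CLI; exact entry points vary slightly between Axolotl versions:
accelerate launch -m axolotl.cli.train glm45v-qlora.yml
python -m axolotl.cli.merge_lora glm45v-qlora.yml --lora_model_dir ./glm-4.5v-lora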
System Prompts & Sampling {#prompting}
Chat template (GLM-specific):
[gMASK]<sop><|system|>
[system message]
<|user|>
[user message with vision tokens]
<|assistant|>
Most engines auto-handle via tokenizer config (trust_remote_code=True).
Recommended sampling:
- Vision QA: temperature 0.6, top-p 0.95
- OCR / structured extraction: temperature 0.1, top-p 0.9
- Agentic GUI tasks: temperature 0.3, top-p 0.9 (use Thinking Mode for complex)
- Creative caption / description: temperature 0.85, top-p 0.95
- Video summarization: temperature 0.5, top-p 0.95
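For convenience, the same presets as a small Python dict you can splat into OpenAI-compatible calls (the key names are arbitrary):
SAMPLING_PRESETS = {
    "vision_qa":     {"temperature": 0.6,  "top_p": 0.95},
    "ocr":           {"temperature": 0.1,  "top_p": 0.9},
    "gui_agent":     {"temperature": 0.3,  "top_p": 0.9},
    "caption":       {"temperature": 0.85, "top_p": 0.95},
    "video_summary": {"temperature": 0.5,  "top_p": 0.95},
}
# e.g. client.chat.completions.create(..., **SAMPLING_PRESETS["gui_agent"])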
Real Benchmarks {#benchmarks}
Single H100 80GB, FP8 via vLLM:
| Workload | Throughput |
|---|---|
| Single-image VQA (low-res) | 65 tok/s |
| Single-image VQA (high-res, 1024 tokens) | 50 tok/s |
| Multi-image (5 images) | 35 tok/s |
| Video (2 min, 2 fps) | 25 tok/s |
| TTFT (single image, 512 tokens vision) | 850 ms |
| TTFT (5 images) | 2.4 s |
| TTFT (60 min video at 2 fps) | 8 s |
2x RTX 3090 NVLink, AWQ INT4 via vLLM:
| Workload | Throughput |
|---|---|
| Single-image VQA | 28 tok/s |
| Multi-image (5 images) | 18 tok/s |
| TTFT (single image) | 1.4 s |
Licensing {#licensing}
GLM-4.5V weights: MIT license — unrestricted.
You can:
- Use commercially without limits (no MAU threshold)
- Modify and redistribute
- Bundle into proprietary products
- Sell as paid service
- Train derivative VLMs without restriction
- Use in regulated industries (defense, EU public sector)
The MIT license puts GLM-4.5V in the same legal-cleanliness tier as OLMo 2 and Phi-4 — and currently the strongest open VLM at that tier. Compare to Llama 3.2 Vision (Meta Community License with 700M MAU) and Qwen 2-VL (Tongyi Qianwen with EU restrictions).
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Custom code error | Missing trust_remote_code | Add --trust-remote-code to vLLM/SGLang |
| Image tokens missing | Wrong processor | Use AutoProcessor not AutoTokenizer |
| OOM with images | Vision tokens uncapped | Set --limit-mm-per-prompt image=5 |
| Video frames missing | fps too high | Lower fps to 1-2 for long videos |
| Thinking tokens leak | No <think> stripping | Post-process to extract content after </think> |
| GUI agent unreliable | No system prompt structure | Force JSON action output via prompt + xgrammar |
| AWQ load fails | Outdated vLLM build | Upgrade to a current vLLM release with GLM-4.5 architecture support (see the install command above) |
FAQ {#faq}
See answers to common GLM-4.5V questions below.
Sources: GLM-4.5V on HuggingFace | GLM-4.5 technical report | Zhipu AI GitHub | ScreenSpot-Pro paper | OSWorld benchmark | Internal benchmarks H100 + 2x RTX 3090.