GLM-4.5V Local Setup Guide (2026): Zhipu's 106B MoE Vision-Language Model with Agentic GUI Control
GLM-4.5V is Zhipu AI's August 2025 open-weight vision-language MoE — 106B total parameters, 12B activated per token, built on the GLM-4.5-Air backbone with a vision tower for images and video. It's the first frontier-class open VLM specifically designed for agentic workloads: GUI control from screenshots, multi-step reasoning over chart/diagram inputs, long-video understanding up to 2 hours, multi-image RAG. MIT licensed for unrestricted commercial use. For self-hosters building agents, multimodal RAG over visual documents, video analysis pipelines, or any vision workload where license cleanliness matters, it's the strongest 2026 choice.
This guide covers what you need to run GLM-4.5V locally — hardware paths from 2x RTX 3090 to single H100, setup with vLLM / SGLang / Transformers, Thinking Mode for complex reasoning, GUI-agent integration patterns, video pipelines, fine-tuning with QLoRA, and benchmarks vs Qwen 2-VL 72B / Llama 3.2 90B Vision / GPT-4o.
Table of Contents
- What GLM-4.5V Is
- Architecture: GLM-4.5-Air MoE + Vision Tower
- Hardware Requirements
- GLM-4.5V vs Qwen 2-VL 72B vs Llama 3.2 90B Vision vs GPT-4o
- vLLM Setup
- SGLang Setup
- Transformers / HF Setup
- Quantization Options
- Thinking Mode
- Multi-Image Reasoning
- Video Understanding (Up to 2 Hours)
- GUI Agent Pattern
- OCR & Document Understanding
- Fine-Tuning with QLoRA
- System Prompts & Sampling
- Real Benchmarks
- Licensing
- Troubleshooting
- FAQ
What GLM-4.5V Is {#what-it-is}
GLM-4.5V (zai-org/GLM-4.5V on HuggingFace) is the August 2025 release of Zhipu AI's flagship open-weight vision-language model. Architecture: GLM-4.5-Air MoE backbone (106B total params, 128 experts, top-8 routing, ~12B activated per token) with a vision encoder (custom CLIP-style ViT-L/14 with ~1B params). Native context: 64K text tokens, image inputs at up to 4K resolution per image, video inputs at 2 fps for up to 2 hours.
Variants:
- GLM-4.5V — flagship 106B/12B-active VLM (most users want this)
- GLM-4.5V-Thinking — same weights, default-on Thinking Mode for complex tasks
- GLM-4.5-Air — text-only base (without vision tower)
License: MIT — unrestricted commercial use.
Architecture: GLM-4.5-Air MoE + Vision Tower {#architecture}
Three architectural components:
Vision Encoder (1B params)
- Custom CLIP-style ViT-L/14 architecture
- Supports dynamic resolution (224x224 to 4096x4096)
- Each image becomes 256-1024 tokens depending on resolution
- Video frames sampled at 2 fps, encoded individually
MoE Language Model (106B / 12B active)
- 128 routed experts + 1 shared expert per layer
- Top-8 routing
- 96 transformer layers
- 64K context window
- Inherits GLM-4.5-Air's training (mostly Chinese + English with code/math)
Vision-Language Projector
- 2-layer MLP projecting vision tokens into language embedding space
- Trained jointly with language model in late-stage fine-tuning
Thinking Mode
- Trained with chain-of-thought data including <think>...</think> tokens
- Can be toggled per request
- Improves multi-step reasoning at the cost of latency
Hardware Requirements {#hardware}
| Setup | Quant | Throughput | Notes |
|---|---|---|---|
| 1x H100 80GB | FP8 | 50-70 tok/s | Recommended production |
| 2x H100 80GB | BF16 | 70-90 tok/s | High-throughput |
| 1x A100 80GB | INT8 | 30-45 tok/s | Budget production |
| 2x RTX 4090 (24GB ea) | INT4 AWQ | 25-40 tok/s | Self-hosted enthusiast |
| 2x RTX 3090 (NVLink) | INT4 AWQ | 20-30 tok/s | Used-market budget |
| Mac Studio M3 Ultra 512GB | Q4_K_M GGUF | 12-18 tok/s | Solo developer |
| 1x RTX 4090 + 64GB RAM | Q3_K_M GGUF (offload) | 4-8 tok/s | Hobbyist limit |
For most self-hosters: 2x RTX 3090 + AWQ INT4 via vLLM. For production single-server: H100 80GB with FP8. For Mac users: M3 Ultra 512GB.
GLM-4.5V vs Qwen 2-VL 72B vs Llama 3.2 90B Vision vs GPT-4o {#comparison}
| Benchmark | GLM-4.5V | Qwen 2-VL 72B | Llama 3.2 90B Vision | GPT-4o |
|---|---|---|---|---|
| MMMU (val) | 65.5 | 64.5 | 60.3 | 69.1 |
| MMBench | 87.2 | 86.5 | 78.0 | 86.0 |
| MMVet | 73.1 | 74.0 | 64.1 | 76.2 |
| DocVQA | 92.5 | 96.5 | 90.1 | 92.8 |
| ChartQA | 86.0 | 88.3 | 85.5 | 85.7 |
| InfoVQA | 78.0 | 82.0 | 73.4 | 79.2 |
| OCRBench | 855 | 877 | 753 | 736 |
| MVBench (video) | 73.5 | 73.6 | n/a | n/a |
| Video-MME (long) | 64.6 | 56.0 | n/a | 64.3 |
| ScreenSpot-Pro (GUI) | 56.4 | 38.2 | n/a | 18.9 |
| OSWorld (agent) | 35.8 | 12.5 | n/a | 22.1 |
| Activated params | 12B (MoE) | 72B (dense) | 90B (dense) | unknown |
GLM-4.5V wins decisively on agentic GUI tasks (ScreenSpot-Pro, OSWorld) and matches frontier on video understanding. Qwen 2-VL leads on document/OCR. Llama 3.2 Vision is the dense alternative with simpler architecture.
vLLM Setup {#vllm}
pip install "vllm>=0.7"
vllm serve zai-org/GLM-4.5V \
--tensor-parallel-size 2 \
--trust-remote-code \
--max-model-len 65536 \
--limit-mm-per-prompt image=10,video=2 \
--gpu-memory-utilization 0.9
For FP8 on H100:
vllm serve zai-org/GLM-4.5V-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e5m2 \
--max-model-len 65536 \
--trust-remote-code
OpenAI-compatible vision endpoint:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "GLM-4.5V",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}]
}'
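The same endpoint also works with the OpenAI Python client. A minimal sketch for sending a local image as base64 — the model name must match whatever the server reports at /v1/models, and the file path is an example:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local screenshot as a data URL
with open("./screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",  # must match the served model name from /v1/models
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)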
See vLLM Complete Setup Guide.
SGLang Setup {#sglang}
pip install --upgrade "sglang[all]"
python -m sglang.launch_server \
--model-path zai-org/GLM-4.5V \
--tp 2 \
--trust-remote-code \
--enable-torch-compile \
--port 30000
SGLang's frontend has native image/video helpers:
import sglang as sgl

# Point the frontend at the server launched above
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def visual_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=512))

state = visual_qa.run(image_path="./screenshot.png", question="What button should I click?")
print(state["answer"])
Transformers / HF Setup {#hf}
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_name = "zai-org/GLM-4.5V"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("./photo.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in detail."},
    ]}
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,      # return input_ids and pixel values together for generate()
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, not the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Quantization Options {#quants}
| Quant | Size | VRAM | Quality Loss | Best Engine |
|---|---|---|---|---|
| BF16 | 212 GB | 240 GB | 0% | vLLM/SGLang (3 H100) |
| FP8 native | 106 GB | 130 GB | <0.5% | vLLM/SGLang (1-2 H100) |
| INT8 W8A8 | 106 GB | 130 GB | <1% | vLLM (2 A100) |
| INT4 AWQ | 55 GB | 70 GB | 1-2% | vLLM (1 H100 / 2 RTX 4090) |
| GGUF Q5_K_M | 75 GB | 90 GB | <1% | llama.cpp (Mac) |
| GGUF Q4_K_M | 55 GB | 70 GB | 1-2% | llama.cpp (Mac) |
For most production: FP8 on H100 80GB. For self-hosted: AWQ INT4 on 2x RTX 4090. For Mac: Q4_K_M GGUF.
Note: as of mid-2025, llama.cpp added GLM-4.5 architecture support; ensure your build is post-Sept 2025 for full GLM-4.5V compatibility including vision.
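A sketch of serving an INT4 AWQ export on 2x 24GB cards. The repo name below is a placeholder for whichever community AWQ quant you pull from HuggingFace, and the shorter context length is a deliberate trade-off to keep KV cache in VRAM:
vllm serve your-org/GLM-4.5V-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --limit-mm-per-prompt image=5 \
  --trust-remote-code \
  --gpu-memory-utilization 0.92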
Thinking Mode {#thinking}
Enable Thinking Mode for complex multi-step reasoning:
# Via system prompt flag
messages = [
    {"role": "system", "content": "You are an expert reasoner. Use thinking mode for complex problems."},
    {"role": "user", "content": [
        {"type": "image", "image": chart_image},
        {"type": "text", "text": "What trend does this chart show? Reason step by step."},
    ]},
]
Output structure:
<think>
The chart shows X axis as time, Y axis as revenue. I see three lines representing
products A, B, and C. Line A trends upward from 2020 to 2024...
</think>
The chart shows that Product A grew steadily by ~30% per year, while Products B and C
stagnated. The key trend is Product A's market share gain.
When to use Thinking Mode:
- Multi-step math involving images (charts, diagrams)
- Complex GUI agent task planning
- Multi-image reasoning where relationships matter
- Spatial reasoning across video frames
Skip for: simple captions, single-image OCR, easy VQA. Thinking adds 2-10s latency.
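If you consume raw completions rather than a pre-parsed response, strip the reasoning block yourself. A minimal post-processing sketch using the <think>/</think> tags shown above:
import re

def strip_thinking(text: str) -> str:
    """Return only the final answer, dropping any <think>...</think> block."""
    if "</think>" in text:
        return text.split("</think>", 1)[1].strip()
    # No closing tag: remove any dangling opening block and keep the rest
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>X axis is time, Y axis is revenue...</think>\nProduct A grew ~30% per year."))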
Multi-Image Reasoning {#multi-image}
GLM-4.5V handles up to 10 images per prompt natively:
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image1},
        {"type": "image", "image": image2},
        {"type": "image", "image": image3},
        {"type": "text", "text": "What changed between these three screenshots?"},
    ]},
]
Use cases:
- Before/after comparison (UI changes, design iterations, medical images)
- Multi-camera scene understanding
- Document page comparison
- Time-series visual analysis
Per-request limit: 10 images when vLLM is launched with --limit-mm-per-prompt image=10 as shown earlier (vLLM's out-of-the-box limit is lower). Each image consumes 256-1024 tokens depending on resolution.
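Since each image costs 256-1024 tokens, downscaling before you send keeps multi-image prompts inside the context budget. A small Pillow helper — the 1344 px cap is an illustrative choice, not a model constant:
from PIL import Image

def downscale(path: str, max_side: int = 1344) -> Image.Image:
    """Resize an image so its longest side is at most max_side, preserving aspect ratio."""
    img = Image.open(path)
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    return img

images = [downscale(p) for p in ["before.png", "during.png", "after.png"]]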
Video Understanding (Up to 2 Hours) {#video}
messages = [
    {"role": "user", "content": [
        {"type": "video", "video": "/path/to/video.mp4", "fps": 2},
        {"type": "text", "text": "Summarize what happens in this video."},
    ]},
]
Frame sampling: 2 fps by default, configurable. A 2-hour video at 2 fps yields 14,400 frames; the model downsamples to ~256 representative frames internally.
Use cases:
- Long-form video summarization (lectures, meetings, films)
- Surveillance / monitoring (event detection across hours)
- Sports analysis (full-game reasoning)
- Educational video Q&A
VRAM impact: long videos add to KV cache linearly. For 2-hour video on a single H100: cap context at 32K to leave room for video tokens.
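If your serving stack doesn't accept video files directly, you can approximate the same pipeline by sampling frames yourself and sending them as a multi-image prompt. A sketch with OpenCV — the fps and frame cap are illustrative:
import cv2

def sample_frames(video_path: str, fps: float = 2.0, max_frames: int = 64):
    """Grab frames at roughly `fps` from a video, capped at max_frames."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(native_fps / fps), 1)
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames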
GUI Agent Pattern {#gui-agent}
The flagship use case — drive a browser or desktop app from screenshots:
from playwright.sync_api import sync_playwright

def gui_agent_step(screenshot_path, goal, history):
    """Call GLM-4.5V to decide the next action from the current screenshot."""
    messages = [
        {"role": "system", "content": "You are a GUI agent. Analyze the screenshot and output the next action as JSON: {'action': 'click|type|scroll|done', 'target': ..., 'value': ...}"},
        {"role": "user", "content": [
            {"type": "image", "image": screenshot_path},
            {"type": "text", "text": f"Goal: {goal}\nHistory: {history}\nNext action:"},
        ]},
    ]
    # call_vllm / parse_json_action are your own helpers: send the messages to the
    # vLLM endpoint, then parse the JSON action out of the response text.
    return parse_json_action(call_vllm(messages))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    history = []
    for step in range(20):
        page.screenshot(path=f"step_{step}.png")
        action = gui_agent_step(f"step_{step}.png", "Book a flight to Tokyo", history)
        if action["action"] == "done":
            break
        execute_action(page, action)  # see the sketch below for one possible implementation
        history.append(action)
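execute_action above is left to you. One possible mapping from the JSON action schema onto Playwright calls — whether the model returns CSS selectors or pixel coordinates depends on your prompt contract, so treat this as a sketch:
def execute_action(page, action):
    """Map the model's JSON action onto Playwright calls."""
    if action["action"] == "click":
        # target may be a CSS selector or an (x, y) pair, per your prompt contract
        if isinstance(action["target"], (list, tuple)):
            page.mouse.click(*action["target"])
        else:
            page.click(action["target"])
    elif action["action"] == "type":
        page.fill(action["target"], action["value"])
    elif action["action"] == "scroll":
        page.mouse.wheel(0, int(action.get("value", 600)))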
For production agentic stacks see AI Agents Local Guide and AI Agent Frameworks.
OCR & Document Understanding {#ocr}
GLM-4.5V handles printed and handwritten text in 30+ languages:
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": invoice_image},
        {"type": "text", "text": "Extract all line items as JSON: [{description, qty, unit_price, total}]"},
    ]},
]
For pure OCR / document workflows, Qwen 2-VL 72B is slightly stronger (DocVQA 96.5 vs GLM 92.5). Pick GLM-4.5V when you need OCR + reasoning + agent capability in one model; pick Qwen 2-VL for OCR-only pipelines.
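To keep extraction machine-parseable, recent vLLM builds accept a JSON schema via guided decoding. A sketch using the OpenAI client's extra_body — the field names mirror the prompt above, and guided_json is a vLLM-specific parameter that may change between versions:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Schema for the line items requested in the prompt above
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "qty": {"type": "number"},
            "unit_price": {"type": "number"},
            "total": {"type": "number"},
        },
        "required": ["description", "qty", "unit_price", "total"],
    },
}

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract all line items as JSON."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]}],
    temperature=0.1,
    extra_body={"guided_json": schema},  # vLLM guided decoding
)
print(response.choices[0].message.content)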
Fine-Tuning with QLoRA {#fine-tuning}
QLoRA on the language model side, freezing the vision encoder:
# Axolotl config
base_model: zai-org/GLM-4.5V
trust_remote_code: true
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
datasets:
  - path: ./vision_train.jsonl
    type: chat_template
# Freeze vision tower
unfrozen_parameters:
  - language_model.*
sequence_len: 4096
sample_packing: false # vision examples don't pack well
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 1e-4
optimizer: paged_adamw_8bit
bf16: auto
output_dir: ./glm-4.5v-lora
Time: ~6-12 hours on 2x RTX 3090 NVLink for 1K image+text examples. See QLoRA Fine-Tuning Guide vision section.
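With the config saved as, say, glm45v-qlora.yml (the filename is yours to choose), training and adapter merging go through Axolotl's usual CLI; exact entry points vary slightly between Axolotl versions:
accelerate launch -m axolotl.cli.train glm45v-qlora.yml
python -m axolotl.cli.merge_lora glm45v-qlora.yml --lora_model_dir ./glm-4.5v-lora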
System Prompts & Sampling {#prompting}
Chat template (GLM-specific):
[gMASK]<sop><|system|>
[system message]
<|user|>
[user message with vision tokens]
<|assistant|>
Most engines auto-handle via tokenizer config (trust_remote_code=True).
Recommended sampling:
- Vision QA: temperature 0.6, top-p 0.95
- OCR / structured extraction: temperature 0.1, top-p 0.9
- Agentic GUI tasks: temperature 0.3, top-p 0.9 (use Thinking Mode for complex)
- Creative caption / description: temperature 0.85, top-p 0.95
- Video summarization: temperature 0.5, top-p 0.95
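For convenience, the same presets as a small Python dict you can splat into OpenAI-compatible calls (the key names are arbitrary):
SAMPLING_PRESETS = {
    "vision_qa":     {"temperature": 0.6,  "top_p": 0.95},
    "ocr":           {"temperature": 0.1,  "top_p": 0.9},
    "gui_agent":     {"temperature": 0.3,  "top_p": 0.9},
    "caption":       {"temperature": 0.85, "top_p": 0.95},
    "video_summary": {"temperature": 0.5,  "top_p": 0.95},
}
# e.g. client.chat.completions.create(..., **SAMPLING_PRESETS["gui_agent"])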
Real Benchmarks {#benchmarks}
Single H100 80GB, FP8 via vLLM:
| Workload | Throughput |
|---|---|
| Single-image VQA (low-res) | 65 tok/s |
| Single-image VQA (high-res, 1024 tokens) | 50 tok/s |
| Multi-image (5 images) | 35 tok/s |
| Video (2 min, 2 fps) | 25 tok/s |
| TTFT (single image, 512 tokens vision) | 850 ms |
| TTFT (5 images) | 2.4 s |
| TTFT (60 min video at 2 fps) | 8 s |
2x RTX 3090 NVLink, AWQ INT4 via vLLM:
| Workload | Throughput |
|---|---|
| Single-image VQA | 28 tok/s |
| Multi-image (5 images) | 18 tok/s |
| TTFT (single image) | 1.4 s |
Licensing {#licensing}
GLM-4.5V weights: MIT license — unrestricted.
You can:
- Use commercially without limits (no MAU threshold)
- Modify and redistribute
- Bundle into proprietary products
- Sell as paid service
- Train derivative VLMs without restriction
- Use in regulated industries (defense, EU public sector)
The MIT license puts GLM-4.5V in the same legal-cleanliness tier as OLMo 2 and Phi-4 — and currently the strongest open VLM at that tier. Compare to Llama 3.2 Vision (Meta Community License with 700M MAU) and Qwen 2-VL (Tongyi Qianwen with EU restrictions).
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Custom code error | Missing trust_remote_code | Add --trust-remote-code to vLLM/SGLang |
| Image tokens missing | Wrong processor | Use AutoProcessor not AutoTokenizer |
| OOM with images | Vision tokens uncapped | Set --limit-mm-per-prompt image=5 |
| Video frames missing | fps too high | Lower fps to 1-2 for long videos |
| Thinking tokens leak | No <think> stripping | Post-process to extract content after </think> |
| GUI agent unreliable | No system prompt structure | Force JSON action output via prompt + xgrammar |
| AWQ load fails | Outdated vLLM build | Upgrade to a current vLLM release with GLM-4.5 architecture support (see the install command above) |
FAQ {#faq}
See answers to common GLM-4.5V questions below.
Sources: GLM-4.5V on HuggingFace | GLM-4.5 technical report | Zhipu AI GitHub | ScreenSpot-Pro paper | OSWorld benchmark | Internal benchmarks H100 + 2x RTX 3090.