
GLM-4.5V Local Setup Guide (2026): Zhipu's 106B MoE Vision-Language Model with Agentic GUI Control

May 2, 2026
26 min read
LocalAimaster Research Team


GLM-4.5V is Zhipu AI's August 2025 open-weight vision-language MoE — 106B total parameters, 12B activated per token, built on the GLM-4.5-Air backbone with a vision tower for images and video. It's the first frontier-class open VLM specifically designed for agentic workloads: GUI control from screenshots, multi-step reasoning over chart/diagram inputs, long-video understanding up to 2 hours, multi-image RAG. MIT licensed for unrestricted commercial use. For self-hosters building agents, multimodal RAG over visual documents, video analysis pipelines, or any vision workload where license cleanliness matters, it's the strongest 2026 choice.

This guide covers what you need to run GLM-4.5V locally — hardware paths from 2x RTX 3090 to single H100, setup with vLLM / SGLang / Transformers, Thinking Mode for complex reasoning, GUI-agent integration patterns, video pipelines, fine-tuning with QLoRA, and benchmarks vs Qwen 2-VL 72B / Llama 3.2 90B Vision / GPT-4o.

Table of Contents

  1. What GLM-4.5V Is
  2. Architecture: GLM-4.5-Air MoE + Vision Tower
  3. Hardware Requirements
  4. GLM-4.5V vs Qwen 2-VL 72B vs Llama 3.2 90B Vision vs GPT-4o
  5. vLLM Setup
  6. SGLang Setup
  7. Transformers / HF Setup
  8. Quantization Options
  9. Thinking Mode
  10. Multi-Image Reasoning
  11. Video Understanding (Up to 2 Hours)
  12. GUI Agent Pattern
  13. OCR & Document Understanding
  14. Fine-Tuning with QLoRA
  15. System Prompts & Sampling
  16. Real Benchmarks
  17. Licensing
  18. Troubleshooting
  19. FAQ


What GLM-4.5V Is {#what-it-is}

GLM-4.5V (zai-org/GLM-4.5V on HuggingFace) is the August 2025 release of Zhipu AI's flagship open-weight vision-language model. Architecture: GLM-4.5-Air MoE backbone (106B total params, 128 experts, top-8 routing, ~12B activated per token) with a vision encoder (custom CLIP-style ViT-L/14 with ~1B params). Native context: 64K text tokens, image inputs at up to 4K resolution per image, video inputs at 2 fps for up to 2 hours.

Variants:

  • GLM-4.5V — flagship 106B/12B-active VLM (most users want this)
  • GLM-4.5V-Thinking — same weights, default-on Thinking Mode for complex tasks
  • GLM-4.5-Air — text-only base (without vision tower)

License: MIT — unrestricted commercial use.


Architecture: GLM-4.5-Air MoE + Vision Tower {#architecture}

Three architectural components:

Vision Encoder (1B params)

  • Custom CLIP-style ViT-L/14 architecture
  • Supports dynamic resolution (224x224 to 4096x4096)
  • Each image becomes 256-1024 tokens depending on resolution
  • Video frames sampled at 2 fps, encoded individually

MoE Language Model (106B / 12B active)

  • 128 routed experts + 1 shared expert per layer
  • Top-8 routing
  • 96 transformer layers
  • 64K context window
  • Inherits GLM-4.5-Air's training (mostly Chinese + English with code/math)

Vision-Language Projector

  • 2-layer MLP projecting vision tokens into language embedding space
  • Trained jointly with language model in late-stage fine-tuning

Thinking Mode

  • Trained with chain-of-thought data including <think>...</think> tokens
  • Can be toggled per request
  • Improves multi-step reasoning at the cost of latency

Hardware Requirements {#hardware}

| Setup | Quant | Throughput | Notes |
| --- | --- | --- | --- |
| 1x H100 80GB | FP8 | 50-70 tok/s | Recommended production |
| 2x H100 80GB | BF16 | 70-90 tok/s | High-throughput |
| 1x A100 80GB | INT8 | 30-45 tok/s | Budget production |
| 2x RTX 4090 (24GB ea) | INT4 AWQ | 25-40 tok/s | Self-hosted enthusiast |
| 2x RTX 3090 (NVLink) | INT4 AWQ | 20-30 tok/s | Used-market budget |
| Mac Studio M3 Ultra 512GB | Q4_K_M GGUF | 12-18 tok/s | Solo developer |
| 1x RTX 4090 + 64GB RAM | Q3_K_M GGUF (offload) | 4-8 tok/s | Hobbyist limit |

For most self-hosters: 2x RTX 3090 + AWQ INT4 via vLLM. For production single-server: H100 80GB with FP8. For Mac users: M3 Ultra 512GB.



GLM-4.5V vs Qwen 2-VL 72B vs Llama 3.2 90B Vision vs GPT-4o {#comparison}

| Benchmark | GLM-4.5V | Qwen 2-VL 72B | Llama 3.2 90B Vision | GPT-4o |
| --- | --- | --- | --- | --- |
| MMMU (val) | 65.5 | 64.5 | 60.3 | 69.1 |
| MMBench | 87.2 | 86.5 | 78.0 | 86.0 |
| MMVet | 73.1 | 74.0 | 64.1 | 76.2 |
| DocVQA | 92.5 | 96.5 | 90.1 | 92.8 |
| ChartQA | 86.0 | 88.3 | 85.5 | 85.7 |
| InfoVQA | 78.0 | 82.0 | 73.4 | 79.2 |
| OCRBench | 855 | 877 | 753 | 736 |
| MVBench (video) | 73.5 | 73.6 | n/a | n/a |
| Video-MME (long) | 64.6 | 56.0 | n/a | 64.3 |
| ScreenSpot-Pro (GUI) | 56.4 | 38.2 | n/a | 18.9 |
| OSWorld (agent) | 35.8 | 12.5 | n/a | 22.1 |
| Activated params | 12B (MoE) | 72B (dense) | 90B (dense) | unknown |

GLM-4.5V wins decisively on agentic GUI tasks (ScreenSpot-Pro, OSWorld) and matches frontier on video understanding. Qwen 2-VL leads on document/OCR. Llama 3.2 Vision is the dense alternative with simpler architecture.


vLLM Setup {#vllm}

pip install "vllm>=0.7"
vllm serve zai-org/GLM-4.5V \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 65536 \
    --limit-mm-per-prompt image=10,video=2 \
    --gpu-memory-utilization 0.9

For FP8 on H100:

vllm serve zai-org/GLM-4.5V-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e5m2 \
    --max-model-len 65536 \
    --trust-remote-code

OpenAI-compatible vision endpoint:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "GLM-4.5V",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]
      }]
    }'
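
The same endpoint works from the official openai Python client — a minimal sketch, assuming the server above on localhost:8000 (the served model name defaults to the HF repo id unless you pass --served-model-name):

import base64
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("screenshot.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
    temperature=0.6,
)
print(resp.choices[0].message.content)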

See vLLM Complete Setup Guide.


SGLang Setup {#sglang}

pip install --upgrade sglang
python -m sglang.launch_server \
    --model-path zai-org/GLM-4.5V \
    --tp 2 \
    --trust-remote-code \
    --enable-torch-compile \
    --port 30000

SGLang's frontend has native image/video helpers:

import sglang as sgl

@sgl.function
def visual_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=512))

state = visual_qa.run(image_path="./screenshot.png", question="What button should I click?")
print(state["answer"])

Transformers / HF Setup {#hf}

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_name = "zai-org/GLM-4.5V"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("./photo.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in detail."},
    ]}
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,   # needed so **inputs unpacks correctly into generate()
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, not the echoed prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Quantization Options {#quants}

| Quant | Size | VRAM | Quality Loss | Best Engine |
| --- | --- | --- | --- | --- |
| BF16 | 212 GB | 240 GB | 0% | vLLM/SGLang (3x H100) |
| FP8 native | 106 GB | 130 GB | <0.5% | vLLM/SGLang (1-2x H100) |
| INT8 W8A8 | 106 GB | 130 GB | <1% | vLLM (2x A100) |
| INT4 AWQ | 55 GB | 70 GB | 1-2% | vLLM (1x H100 / 2x RTX 4090) |
| GGUF Q5_K_M | 75 GB | 90 GB | <1% | llama.cpp (Mac) |
| GGUF Q4_K_M | 55 GB | 70 GB | 1-2% | llama.cpp (Mac) |

For most production: FP8 on H100 80GB. For self-hosted: AWQ INT4 on 2x RTX 4090. For Mac: Q4_K_M GGUF.

Note: as of mid-2025, llama.cpp added GLM-4.5 architecture support; ensure your build is post-Sept 2025 for full GLM-4.5V compatibility including vision.
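
A hedged llama.cpp invocation for the Mac path — the GGUF and mmproj filenames below are hypothetical, and vision models in llama.cpp ship a separate multimodal projector file that must be passed alongside the main weights:

llama-server -m GLM-4.5V-Q4_K_M.gguf \
    --mmproj GLM-4.5V-mmproj-f16.gguf \
    -c 32768 -ngl 99 --port 8080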


Thinking Mode {#thinking}

Enable Thinking Mode for complex multi-step reasoning:

# Via system prompt flag
messages = [
    {"role": "system", "content": "You are an expert reasoner. Use thinking mode for complex problems."},
    {"role": "user", "content": [
        {"type": "image", "image": chart_image},
        {"type": "text", "text": "What trend does this chart show? Reason step by step."},
    ]}
]

Output structure:

<think>
The chart shows X axis as time, Y axis as revenue. I see three lines representing
products A, B, and C. Line A trends upward from 2020 to 2024...
</think>

The chart shows that Product A grew steadily by ~30% per year, while Products B and C
stagnated. The key trend is Product A's market share gain.
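
Downstream consumers usually want only the final answer. A minimal post-processing sketch:

import re

def strip_thinking(text: str) -> str:
    """Drop <think>...</think> blocks, keep only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()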

When to use Thinking Mode:

  • Multi-step math involving images (charts, diagrams)
  • Complex GUI agent task planning
  • Multi-image reasoning where relationships matter
  • Spatial reasoning across video frames

Skip for: simple captions, single-image OCR, easy VQA. Thinking adds 2-10s latency.
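
To skip it per request when serving with vLLM, recent GLM chat templates expose a thinking toggle via chat-template kwargs — treat the exact kwarg name as an assumption and check the template shipped with your checkpoint:

resp = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=messages,
    # assumption: GLM-4.5-family templates accept enable_thinking
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)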


Multi-Image Reasoning {#multi-image}

GLM-4.5V handles up to 10 images per prompt natively:

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image1},
        {"type": "image", "image": image2},
        {"type": "image", "image": image3},
        {"type": "text", "text": "What changed between these three screenshots?"},
    ]}
]

Use cases:

  • Before/after comparison (UI changes, design iterations, medical images)
  • Multi-camera scene understanding
  • Document page comparison
  • Time-series visual analysis

Limit per request: 10 images at the default vLLM config above (--limit-mm-per-prompt image=10). Each image consumes 256-1024 tokens depending on resolution, so ten high-res images cost up to ~10K tokens of the 64K window, leaving ample room for text and generation.


Video Understanding (Up to 2 Hours) {#video}

messages = [
    {"role": "user", "content": [
        {"type": "video", "video": "/path/to/video.mp4", "fps": 2},
        {"type": "text", "text": "Summarize what happens in this video."},
    ]}
]

Frame sampling: 2 fps by default, configurable. A 2-hour video at 2 fps yields 14,400 frames; the model downsamples these to ~256 representative frames internally.

Use cases:

  • Long-form video summarization (lectures, meetings, films)
  • Surveillance / monitoring (event detection across hours)
  • Sports analysis (full-game reasoning)
  • Educational video Q&A

VRAM impact: long videos add to KV cache linearly. For 2-hour video on a single H100: cap context at 32K to leave room for video tokens.
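
A hedged launch command along those lines, reusing the FP8 repo from the vLLM section:

vllm serve zai-org/GLM-4.5V-FP8 \
    --quantization fp8 \
    --max-model-len 32768 \
    --limit-mm-per-prompt video=1 \
    --trust-remote-code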


GUI Agent Pattern {#gui-agent}

The flagship use case — drive a browser or desktop app from screenshots:

from playwright.sync_api import sync_playwright

def gui_agent_step(screenshot_path, goal, history):
    """Call GLM-4.5V to decide next action."""
    messages = [
        {"role": "system", "content": "You are a GUI agent. Analyze the screenshot and output the next action as JSON: {'action': 'click|type|scroll|done', 'target': ..., 'value': ...}"},
        {"role": "user", "content": [
            {"type": "image", "image": screenshot_path},
            {"type": "text", "text": f"Goal: {goal}\nHistory: {history}\nNext action:"},
        ]}
    ]
    # Call GLM-4.5V via vLLM and parse the JSON action (helper sketches below)
    return parse_json_action(call_vllm(messages))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    history = []
    for step in range(20):
        page.screenshot(path=f"step_{step}.png")
        action = gui_agent_step(f"step_{step}.png", "Book a flight to Tokyo", history)
        if action["action"] == "done": break
        execute_action(page, action)
        history.append(action)
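
The three helpers are left abstract above; here is a minimal sketch of one way to fill them in, assuming the vLLM server from earlier and treating the action schema as illustrative:

import base64, json, re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def call_vllm(messages):
    """Convert local-path image parts to OpenAI image_url parts, then query vLLM."""
    converted = []
    for m in messages:
        if isinstance(m["content"], str):
            converted.append(m)
            continue
        parts = []
        for part in m["content"]:
            if part["type"] == "image":
                with open(part["image"], "rb") as f:
                    b64 = base64.b64encode(f.read()).decode()
                parts.append({"type": "image_url",
                              "image_url": {"url": f"data:image/png;base64,{b64}"}})
            else:
                parts.append(part)
        converted.append({"role": m["role"], "content": parts})
    resp = client.chat.completions.create(
        model="zai-org/GLM-4.5V", messages=converted, temperature=0.3)
    return resp.choices[0].message.content

def parse_json_action(text):
    """Pull the first JSON object out of the reply; fall back to done."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0)) if match else {"action": "done"}

def execute_action(page, action):
    """Map the illustrative action schema onto Playwright calls."""
    if action["action"] == "click":
        page.click(action["target"])      # target assumed to be a CSS selector
    elif action["action"] == "type":
        page.fill(action["target"], action["value"])
    elif action["action"] == "scroll":
        page.mouse.wheel(0, 600)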

For production agentic stacks see AI Agents Local Guide and AI Agent Frameworks.


OCR & Document Understanding {#ocr}

GLM-4.5V handles printed and handwritten text in 30+ languages:

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": invoice_image},
        {"type": "text", "text": "Extract all line items as JSON: [{description, qty, unit_price, total}]"},
    ]}
]
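
To make the JSON reliable, pair the prompt with guided decoding — vLLM's OpenAI-compatible server accepts a JSON schema via extra_body (the messages above must first be converted to image_url parts as shown in the vLLM section):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

line_items_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "qty": {"type": "number"},
            "unit_price": {"type": "number"},
            "total": {"type": "number"},
        },
        "required": ["description", "qty", "unit_price", "total"],
    },
}

resp = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=messages,  # invoice prompt, with image_url parts
    temperature=0.1,
    extra_body={"guided_json": line_items_schema},  # vLLM guided decoding
)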

For pure OCR / document workflows, Qwen 2-VL 72B is slightly stronger (DocVQA 96.5 vs GLM 92.5). Pick GLM-4.5V when you need OCR + reasoning + agent capability in one model; pick Qwen 2-VL for OCR-only pipelines.


Fine-Tuning with QLoRA {#fine-tuning}

QLoRA on the language model side, freezing the vision encoder:

# Axolotl config
base_model: zai-org/GLM-4.5V
trust_remote_code: true
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: ./vision_train.jsonl
    type: chat_template

# Freeze vision tower
unfrozen_parameters:
  - language_model.*

sequence_len: 4096
sample_packing: false  # vision examples don't pack well
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 1e-4
optimizer: paged_adamw_8bit
bf16: auto
output_dir: ./glm-4.5v-lora

Time: ~6-12 hours on 2x RTX 3090 NVLink for 1K image+text examples. See QLoRA Fine-Tuning Guide vision section.
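
Launching uses the standard Axolotl entry point; the config filename here is hypothetical:

accelerate launch -m axolotl.cli.train glm-4.5v-qlora.yml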


System Prompts & Sampling {#prompting}

Chat template (GLM-specific):

[gMASK]<sop><|system|>
[system message]
<|user|>
[user message with vision tokens]
<|assistant|>

Most engines auto-handle via tokenizer config (trust_remote_code=True).

Recommended sampling:

  • Vision QA: temperature 0.6, top-p 0.95
  • OCR / structured extraction: temperature 0.1, top-p 0.9
  • Agentic GUI tasks: temperature 0.3, top-p 0.9 (use Thinking Mode for complex)
  • Creative caption / description: temperature 0.85, top-p 0.95
  • Video summarization: temperature 0.5, top-p 0.95
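
If you route multiple workloads through one client, a small preset dict keeps these consistent — illustrative values taken from the list above:

presets = {
    "vision_qa": {"temperature": 0.6, "top_p": 0.95},
    "ocr": {"temperature": 0.1, "top_p": 0.9},
    "gui_agent": {"temperature": 0.3, "top_p": 0.9},
    "caption": {"temperature": 0.85, "top_p": 0.95},
    "video_summary": {"temperature": 0.5, "top_p": 0.95},
}

resp = client.chat.completions.create(
    model="zai-org/GLM-4.5V", messages=messages, **presets["ocr"])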

Real Benchmarks {#benchmarks}

Single H100 80GB, FP8 via vLLM:

| Workload | Throughput / latency |
| --- | --- |
| Single-image VQA (low-res) | 65 tok/s |
| Single-image VQA (high-res, 1024 vision tokens) | 50 tok/s |
| Multi-image (5 images) | 35 tok/s |
| Video (2 min at 2 fps) | 25 tok/s |
| TTFT (single image, 512 vision tokens) | 850 ms |
| TTFT (5 images) | 2.4 s |
| TTFT (60 min video at 2 fps) | 8 s |

2x RTX 3090 NVLink, AWQ INT4 via vLLM:

| Workload | Throughput / latency |
| --- | --- |
| Single-image VQA | 28 tok/s |
| Multi-image (5 images) | 18 tok/s |
| TTFT (single image) | 1.4 s |

Licensing {#licensing}

GLM-4.5V weights: MIT license — unrestricted.

You can:

  • Use commercially without limits (no MAU threshold)
  • Modify and redistribute
  • Bundle into proprietary products
  • Sell as paid service
  • Train derivative VLMs without restriction
  • Use in regulated industries (defense, EU public sector)

The MIT license puts GLM-4.5V in the same legal-cleanliness tier as OLMo 2 and Phi-4 — and currently the strongest open VLM at that tier. Compare to Llama 3.2 Vision (Meta Community License with 700M MAU) and Qwen 2-VL (Tongyi Qianwen with EU restrictions).


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| Custom code error | Missing trust_remote_code | Add --trust-remote-code to vLLM/SGLang |
| Image tokens missing | Wrong processor class | Use AutoProcessor, not AutoTokenizer |
| OOM with images | Vision tokens uncapped | Set --limit-mm-per-prompt image=5 |
| Video frames missing | fps too high | Lower fps to 1-2 for long videos |
| Thinking tokens leak | No <think> stripping | Post-process to extract content after </think> |
| GUI agent unreliable | No system prompt structure | Force JSON action output via prompt + guided decoding (xgrammar) |
| AWQ load fails | vLLM predates GLM-4.5 support | Upgrade to vLLM 0.7+, as in the setup section |

FAQ {#faq}

See answers to common GLM-4.5V questions below.


Sources: GLM-4.5V on HuggingFace | GLM-4.5 technical report | Zhipu AI GitHub | ScreenSpot-Pro paper | OSWorld benchmark | Internal benchmarks H100 + 2x RTX 3090.
