Mistral · Open-Weight · Modified MIT
Mistral Medium 3.5: 128B Dense, 4-GPU Open-Weight
Mistral Medium 3.5 (April 30, 2026) is the unified flagship from French AI lab Mistral: 128 billion dense parameters, a 256K-token context window, 77.6% on SWE-Bench Verified, and a modified MIT license. The big design choice is a single model that handles general reasoning, vision, and coding. It has replaced Devstral 2 as the default model in Mistral's Vibe CLI, although Mistral still ships Devstral 2 as a separate open-weight coding line. It runs on 4× H100 at full precision (BF16), or on 1× H100 or 2× RTX 5090 at Q4 quantization, which makes it a realistic open-weight choice for prosumer hardware.
Key takeaways
- → 128B dense: no MoE complexity, predictable VRAM, simpler deployment.
- → Unified generalist: one model for general reasoning, vision, and coding; now the Vibe CLI default (Devstral 2 still ships separately).
- → 77.6% SWE-Bench Verified: competitive with DeepSeek V4-Flash (78.4%).
- → 256K context: bigger than most prosumer-tier alternatives.
- → Runs on 1× H100 at Q4: accessible without an 8× H100 cluster.
Quick verdict
Mistral Medium 3.5 is the right pick when you want a unified general, coding, and vision model on prosumer infrastructure. Its dense architecture is simpler to deploy than DeepSeek V4-Flash's MoE, and running on a single H100 at Q4 makes it viable without cluster-grade hardware.
Where it loses: size and coding focus against Qwen3-Coder-Next (a smaller, coding-specialized model), and context length against DeepSeek V4 (1M tokens, about 4× longer). For pure coding workloads on a single GPU, Qwen3-Coder-Next or Qwen3.6-27B may be better. For mixed coding, research, and vision work, Mistral Medium 3.5 is the cleanest single-model option.
Specs at a glance
| Spec | Value |
|---|---|
| Vendor | Mistral AI |
| Architecture | Dense transformer (no MoE) |
| Parameters | 128 billion |
| Context window | 256,000 tokens |
| Modalities | Text · Code · Vision |
| License | Modified MIT |
| Storage (BF16) | ~256 GB |
| Storage (Q4_K_M) | ~80 GB |
| Hugging Face | mistralai/Mistral-Medium-3.5 |
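The two storage rows follow from simple arithmetic on the parameter count. The sketch below reproduces them, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only; the function names are illustrative and not part of any Mistral tooling.

```python
def weight_storage_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(storage_gb: float, num_params: float) -> float:
    """Back out the average bits per weight implied by a storage figure."""
    return storage_gb * 1e9 * 8 / num_params


PARAMS = 128e9  # 128 billion parameters, from the specs table

bf16_gb = weight_storage_gb(PARAMS, 16)        # BF16 stores 16 bits per weight
q4_bits = implied_bits_per_weight(80, PARAMS)  # from the ~80 GB Q4_K_M row

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"Q4_K_M row implies ~{q4_bits:.1f} bits per weight")
```

The Q4_K_M row works out to roughly 5 bits per weight on average, not exactly 4. That is plausible for a mixed-precision 4-bit format that keeps some tensors at higher precision, but treat the figure as approximate.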
Hardware & setup
| Hardware | Quant | Context | Tokens/sec |
|---|---|---|---|
| 4× H100 80GB | BF16 | 256K | 80-130 tok/s |
| 1× H100 80GB | Q4_K_M | 256K | 35-55 tok/s |
| 2× RTX 5090 (32GB each) | Q4_K_M | 128K (reduced) | 25-40 tok/s |
| 1× M3 Ultra (192GB) | Q5_K_M | 256K | 15-28 tok/s |
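A quick way to sanity-check rows like these is to subtract the weight storage from the total device memory; whatever remains has to hold the KV cache, activations, and runtime overhead. The sketch below does that using the article's approximate BF16 (~256 GB) and Q4_K_M (~80 GB) weight sizes. The `Setup` class and `headroom_gb` helper are hypothetical names for illustration, and the result is arithmetic on rounded figures, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    name: str
    num_devices: int
    memory_per_device_gb: float
    weights_gb: float  # approximate weight storage for the chosen quant


def headroom_gb(setup: Setup) -> float:
    """Total device memory minus weight storage, in GB.

    The remainder must cover KV cache, activations, and runtime overhead.
    """
    return setup.num_devices * setup.memory_per_device_gb - setup.weights_gb


# Weight sizes are the article's rounded figures, so results are rough.
setups = [
    Setup("4x H100 80GB, BF16", 4, 80, 256),
    Setup("1x H100 80GB, Q4_K_M", 1, 80, 80),
    Setup("2x RTX 5090 32GB, Q4_K_M", 2, 32, 80),
]

for s in setups:
    print(f"{s.name}: {headroom_gb(s):+.0f} GB headroom")
```

On these rounded figures the BF16 row leaves about 64 GB across four GPUs, while the single-H100 and 2× RTX 5090 rows come out at or below zero before any KV cache is allocated. Those Q4_K_M rows are worth verifying on your own hardware, since they may depend on a smaller quant, a shorter context, or partial CPU offload. The M3 Ultra row is left out because no Q5_K_M size is given.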
Ollama (single-GPU prosumer)
ollama pull mistral-medium-3.5
ollama run mistral-medium-3.5
vLLM (production, 4× H100)
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Medium-3.5 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 --port 8000
Benchmarks
| Benchmark | Mistral Medium 3.5 | DeepSeek V4-Flash | Qwen3-Coder-Next | GLM-5 |
|---|---|---|---|---|
| SWE-Bench Verified | 77.6% | 78.4% | 70.6% | 77.8% |
| MMLU-Pro | 85.2% | 83.8% | 81.4% | 84.6% |
| GPQA Diamond | 76.4% | 76.9% | 71.2% | 79.4% |
| AIME 2025 | 81.6% | 82.4% | 76.8% | 85.2% |
| Vision-MME (image QA) | 73.4% | N/A | N/A | 68.7% |
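To read the table at a glance, it can help to compute how far Mistral Medium 3.5 sits from the best-scoring rival on each benchmark. The sketch below copies the scores from the table (with `None` for N/A entries); the helper name `gap_to_best_rival` is illustrative.

```python
from typing import Optional

MODEL = "Mistral Medium 3.5"

# Scores in percent, copied from the benchmark table; None marks an N/A entry.
scores: dict[str, dict[str, Optional[float]]] = {
    "SWE-Bench Verified": {MODEL: 77.6, "DeepSeek V4-Flash": 78.4,
                           "Qwen3-Coder-Next": 70.6, "GLM-5": 77.8},
    "MMLU-Pro": {MODEL: 85.2, "DeepSeek V4-Flash": 83.8,
                 "Qwen3-Coder-Next": 81.4, "GLM-5": 84.6},
    "GPQA Diamond": {MODEL: 76.4, "DeepSeek V4-Flash": 76.9,
                     "Qwen3-Coder-Next": 71.2, "GLM-5": 79.4},
    "AIME 2025": {MODEL: 81.6, "DeepSeek V4-Flash": 82.4,
                  "Qwen3-Coder-Next": 76.8, "GLM-5": 85.2},
    "Vision-MME (image QA)": {MODEL: 73.4, "DeepSeek V4-Flash": None,
                              "Qwen3-Coder-Next": None, "GLM-5": 68.7},
}


def gap_to_best_rival(row: dict[str, Optional[float]], model: str = MODEL):
    """Return (best rival name, model score minus that rival's score)."""
    rivals = {name: s for name, s in row.items()
              if name != model and s is not None}
    best = max(rivals, key=rivals.get)
    return best, round(row[model] - rivals[best], 1)


for benchmark, row in scores.items():
    rival, gap = gap_to_best_rival(row)
    print(f"{benchmark}: {gap:+.1f} pts vs {rival}")
```

On the table's numbers, Mistral Medium 3.5 leads the listed rivals on MMLU-Pro and Vision-MME and trails the best of them by 0.8 to 3.6 points on the other three benchmarks.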
When to pick Mistral Medium 3.5
- ✓ You want one generalist model for general work, coding, and vision, rather than running Mistral's separate Devstral 2 coding line.
- ✓ You prefer a dense architecture, which is simpler than MoE: predictable VRAM and no expert routing.
- ✓ Your hardware is a single H100 or 2× RTX 5090, running Q4 quantization.
- ✓ EU sovereignty matters: Mistral is Paris-based, with GDPR-aligned governance.
FAQ
What is Mistral Medium 3.5?
How much VRAM does Mistral Medium 3.5 need?
Mistral Medium 3.5 vs DeepSeek V4-Flash: which to pick?
How do I install Mistral Medium 3.5?
What does the unified design (general + vision + coding) mean?
Why does “modified MIT” license matter?
Related models
- → Mistral Large 123B — predecessor
- → DeepSeek V4 — frontier MoE alternative, 1M context
- → Qwen3-Coder-Next — smaller, coding-specialized
- → Qwen3.6-27B — single-GPU dense alternative
- → GLM-5 — frontier open weight, 4× H100
- → Best AI models May 2026 — pillar comparison
Written by the Local AI Master Team