Mistral · Open-Weight · Modified MIT

Mistral Medium 3.5: 128B Dense, 4-GPU Open-Weight

Mistral Medium 3.5 (April 30, 2026) is the French AI lab Mistral's unified flagship — 128 billion dense parameters, 256K context, 77.6% SWE-Bench Verified, modified MIT licensed. The big design choice: it replaces three previously-separate Mistral models (Magistral / Pixtral / Devstral) with one model that handles general reasoning, vision, and coding equally well. Runs on 4× H100 at full precision, or 1× H100 / 2× RTX 5090 at Q4 quantization. This is the realistic open-weight choice for prosumer hardware.

📅 Published: May 9, 2026🔄 Last Updated: May 9, 2026✓ Manually Reviewed

Key takeaways

→128B dense — no MoE complexity, predictable VRAM, simpler deployment.
→Unified model — replaces Magistral, Pixtral, Devstral with one model handling all three.
→77.6% SWE-Bench Verified — competitive with DeepSeek V4-Flash (78.4%).
→256K context — bigger than most prosumer-tier alternatives.
→Runs on 1× H100 at Q4 — accessible without an 8× H100 cluster.

Quick verdict

Mistral Medium 3.5 is the right pick when you want a unified general/coding/vision model on prosumer infrastructure. Dense architecture means simpler deployment than DeepSeek V4-Flash's MoE. Single H100 at Q4 makes it viable without cluster-grade hardware.

Where it loses: peak coding quality vs Qwen3-Coder-Next (smaller and slightly higher SWE-Bench), 1M context vs DeepSeek V4 (4× longer). For pure coding workloads on a single GPU, Qwen3-Coder-Next or Qwen3.6-27B may be better. For mixed coding + research + vision, Mistral Medium 3.5 is the cleanest single-model option.

Specs at a glance

Vendor	Mistral AI
Architecture	Dense transformer (no MoE)
Parameters	128 billion
Context window	256,000 tokens
Modalities	Text · Code · Vision
License	Modified MIT
Storage (BF16)	~256 GB
Storage (Q4_K_M)	~80 GB
Hugging Face	`mistralai/Mistral-Medium-3.5`

Hardware & setup

Hardware	Quant	Context	Tokens/sec
4× H100 80GB	BF16	256K	80-130 tok/s
1× H100 80GB	Q4_K_M	256K	35-55 tok/s
2× RTX 5090 (32GB each)	Q4_K_M	128K (reduced)	25-40 tok/s
1× M3 Ultra (192GB)	Q5_K_M	256K	15-28 tok/s

Ollama (single-GPU prosumer)

ollama pull mistral-medium-3.5
ollama run mistral-medium-3.5

vLLM (production, 4× H100)

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Medium-3.5 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 --port 8000

Benchmarks

Benchmark	Mistral Medium 3.5	DeepSeek V4-Flash	Qwen3-Coder-Next	GLM-5
SWE-Bench Verified	77.6%	78.4%	70.6%	77.8%
MMLU-Pro	85.2%	83.8%	81.4%	84.6%
GPQA Diamond	76.4%	76.9%	71.2%	79.4%
AIME 2025	81.6%	82.4%	76.8%	85.2%
Vision-MME (image QA)	73.4%	N/A	N/A	68.7%

When to pick Mistral Medium 3.5

✓You want one model for general work + coding + vision (replaces 3 Mistral models).
✓Dense architecture preference (simpler than MoE — predictable VRAM, no expert routing).
✓Single H100 / 2× RTX 5090 hardware (Q4 quantization).
✓EU sovereignty matters — Mistral is Paris-based, GDPR-aligned governance.

FAQ

What is Mistral Medium 3.5?

Mistral Medium 3.5 is the French AI lab Mistral's flagship open-weight model released April 30, 2026. It's a 128-billion-parameter dense transformer (no MoE), 256K context window, scores 77.6% on SWE-Bench Verified, and ships under a modified MIT license that permits commercial use. Mistral Medium 3.5 unifies what were previously three separate models (Magistral for general, Pixtral for vision, Devstral for coding) into one — all three are now retired in favor of the unified Medium 3.5.

How much VRAM does Mistral Medium 3.5 need?

At BF16 (full precision), Mistral Medium 3.5 weights total ~256 GB — needs 4× H100 (80 GB each, 320 GB total) for stable inference. Q4_K_M quantization brings it to ~80 GB, which fits on 1× H100 80GB or 2× RTX 5090 (32 GB each, 64 GB total — tight, requires reduced context). Q5_K_M is ~96 GB. For most teams, 4× H100 with BF16 is the sweet spot. For prosumer/consumer hardware, Q4_K_M on 2× RTX 5090 with 128K context (instead of full 256K) is the realistic config.

Mistral Medium 3.5 vs DeepSeek V4-Flash: which to pick?

Both are accessible open-weight options for prosumer infrastructure. Hardware: Mistral Medium 3.5 dense ~80 GB Q4 (1× H100 or 2× RTX 5090) vs DeepSeek V4-Flash ~150 GB Q4 (2× H100). Benchmarks: Medium 3.5 77.6% SWE-Bench Verified vs V4-Flash 78.4% — essentially tied on coding. V4-Flash wins on context length (1M vs 256K). Mistral wins on simplicity (dense, no MoE complexity). For most teams: pick Mistral Medium 3.5 if hardware budget caps at 1-2 GPUs; pick V4-Flash if you have 2× H100 and need the 1M context.

How do I install Mistral Medium 3.5?

Ollama: `ollama pull mistral-medium-3.5` (default Q4_K_M, ~80 GB) then `ollama run mistral-medium-3.5`. For vLLM serving: `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Medium-3.5 --tensor-parallel-size 4 --max-model-len 262144 --port 8000`. Apple Silicon: llama.cpp with Metal backend works on M3 Max/Ultra (~25-40 tok/s at Q4). Cursor/Continue/Aider integration: point any tool at the OpenAI-compatible endpoint with model name `mistral-medium-3.5`.

What does the unified design (Magistral + Pixtral + Devstral) mean?

Before Medium 3.5, Mistral shipped three specialized models: Magistral (general reasoning), Pixtral (vision), Devstral (coding). Operationally a pain — different APIs, different fine-tunes, different licenses. Medium 3.5 unifies all three into one model with strong performance across all domains. Vision: handles image input natively (no separate Pixtral). Coding: matches old Devstral on SWE-Bench. Reasoning: matches old Magistral on math benchmarks. The benefit is operational simplicity — one model, one deployment, one fine-tuning workflow.

Why does “modified MIT” license matter?

Mistral's modified MIT permits unlimited commercial use, modification, and redistribution. The "modification" adds a clause prohibiting use for training competitive models above a certain scale. In practice this affects almost no one — only AI labs trying to clone Mistral's model would hit the restriction. Day-to-day commercial use, fine-tuning, distillation for product-specific purposes, embedding in SaaS, and self-hosting are all fully permitted with no royalties. Compare to Apache 2.0 (no restrictions) or Llama 4 (modified license with usage thresholds and attribution) — Mistral's license is in between but lenient for typical use cases.

Related models

→ Mistral Large 123B — predecessor
→ DeepSeek V4 — frontier MoE alternative, 1M context
→ Qwen3-Coder-Next — smaller, coding-specialized
→ Qwen3.6-27B — single-GPU dense alternative
→ GLM-5 — frontier open weight, 4× H100
→ Best AI models May 2026 — pillar comparison