★ Reading this for free? Get 20 structured AI courses + per-chapter AI tutor — the first chapter of every course free, no card.Start free in 30 seconds

Mistral · Open-Weight · Modified MIT

Mistral Medium 3.5: 128B Dense, 4-GPU Open-Weight

Mistral Medium 3.5 (April 30, 2026) is the French AI lab Mistral's unified flagship — 128 billion dense parameters, 256K context, 77.6% SWE-Bench Verified, modified MIT licensed. The big design choice: a single model that handles general reasoning, vision, and coding well, and it took over as the default model in Mistral's Vibe CLI (replacing Devstral 2). Mistral still ships its separate Devstral 2 open-weight coding line alongside it. Runs on 4× H100 at full precision, or 1× H100 / 2× RTX 5090 at Q4 quantization. This is the realistic open-weight choice for prosumer hardware.

📅 Published: May 9, 2026🔄 Last Updated: May 9, 2026✓ Manually Reviewed

Key takeaways

  • 128B dense — no MoE complexity, predictable VRAM, simpler deployment.
  • Unified generalist — one model for general reasoning, vision, and coding; now the Vibe CLI default (Devstral 2 still ships separately).
  • 77.6% SWE-Bench Verified — competitive with DeepSeek V4-Flash (78.4%).
  • 256K context — bigger than most prosumer-tier alternatives.
  • Runs on 1× H100 at Q4 — accessible without an 8× H100 cluster.

Quick verdict

Mistral Medium 3.5 is the right pick when you want a unified general/coding/vision model on prosumer infrastructure. Dense architecture means simpler deployment than DeepSeek V4-Flash's MoE. Single H100 at Q4 makes it viable without cluster-grade hardware.

Where it loses: peak coding quality vs Qwen3-Coder-Next (smaller and slightly higher SWE-Bench), 1M context vs DeepSeek V4 (4× longer). For pure coding workloads on a single GPU, Qwen3-Coder-Next or Qwen3.6-27B may be better. For mixed coding + research + vision, Mistral Medium 3.5 is the cleanest single-model option.

Specs at a glance

VendorMistral AI
ArchitectureDense transformer (no MoE)
Parameters128 billion
Context window256,000 tokens
ModalitiesText · Code · Vision
LicenseModified MIT
Storage (BF16)~256 GB
Storage (Q4_K_M)~80 GB
Hugging Facemistralai/Mistral-Medium-3.5

Hardware & setup

HardwareQuantContextTokens/sec
4× H100 80GBBF16256K80-130 tok/s
1× H100 80GBQ4_K_M256K35-55 tok/s
2× RTX 5090 (32GB each)Q4_K_M128K (reduced)25-40 tok/s
1× M3 Ultra (192GB)Q5_K_M256K15-28 tok/s

Ollama (single-GPU prosumer)

ollama pull mistral-medium-3.5
ollama run mistral-medium-3.5

vLLM (production, 4× H100)

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Medium-3.5 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 --port 8000

Benchmarks

BenchmarkMistral Medium 3.5DeepSeek V4-FlashQwen3-Coder-NextGLM-5
SWE-Bench Verified77.6%78.4%70.6%77.8%
MMLU-Pro85.2%83.8%81.4%84.6%
GPQA Diamond76.4%76.9%71.2%79.4%
AIME 202581.6%82.4%76.8%85.2%
Vision-MME (image QA)73.4%N/AN/A68.7%

When to pick Mistral Medium 3.5

  • You want one generalist model for general work + coding + vision (vs running Mistral's separate Devstral 2 coding line).
  • Dense architecture preference (simpler than MoE — predictable VRAM, no expert routing).
  • Single H100 / 2× RTX 5090 hardware (Q4 quantization).
  • EU sovereignty matters — Mistral is Paris-based, GDPR-aligned governance.

FAQ

What is Mistral Medium 3.5?
Mistral Medium 3.5 is the French AI lab Mistral's flagship open-weight model released April 30, 2026. It's a 128-billion-parameter dense transformer (no MoE), 256K context window, scores 77.6% on SWE-Bench Verified, and ships under a modified MIT license that permits commercial use. Medium 3.5 is a single model that handles general reasoning, vision, and coding well, and it replaced Devstral 2 as the default model in Mistral's Vibe CLI. Note that Mistral's dedicated coding line, Devstral 2 (123B) and Devstral Small 2 (24B), is still a separate, actively available open-weight stack you can download and run — Medium 3.5 is the new generalist default, not a retirement of Devstral.
How much VRAM does Mistral Medium 3.5 need?
At BF16 (full precision), Mistral Medium 3.5 weights total ~256 GB — needs 4× H100 (80 GB each, 320 GB total) for stable inference. Q4_K_M quantization brings it to ~80 GB, which fits on 1× H100 80GB or 2× RTX 5090 (32 GB each, 64 GB total — tight, requires reduced context). Q5_K_M is ~96 GB. For most teams, 4× H100 with BF16 is the sweet spot. For prosumer/consumer hardware, Q4_K_M on 2× RTX 5090 with 128K context (instead of full 256K) is the realistic config.
Mistral Medium 3.5 vs DeepSeek V4-Flash: which to pick?
Both are accessible open-weight options for prosumer infrastructure. Hardware: Mistral Medium 3.5 dense ~80 GB Q4 (1× H100 or 2× RTX 5090) vs DeepSeek V4-Flash ~150 GB Q4 (2× H100). Benchmarks: Medium 3.5 77.6% SWE-Bench Verified vs V4-Flash 78.4% — essentially tied on coding. V4-Flash wins on context length (1M vs 256K). Mistral wins on simplicity (dense, no MoE complexity). For most teams: pick Mistral Medium 3.5 if hardware budget caps at 1-2 GPUs; pick V4-Flash if you have 2× H100 and need the 1M context.
How do I install Mistral Medium 3.5?
Ollama: `ollama pull mistral-medium-3.5` (default Q4_K_M, ~80 GB) then `ollama run mistral-medium-3.5`. For vLLM serving: `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Medium-3.5 --tensor-parallel-size 4 --max-model-len 262144 --port 8000`. Apple Silicon: llama.cpp with Metal backend works on M3 Max/Ultra (~25-40 tok/s at Q4). Cursor/Continue/Aider integration: point any tool at the OpenAI-compatible endpoint with model name `mistral-medium-3.5`.
What does the unified design (general + vision + coding) mean?
Before Medium 3.5, running Mistral across general reasoning, vision, and coding meant juggling different specialized models — separate APIs, fine-tunes, and licenses. Medium 3.5 folds all three capabilities into one model with strong performance across domains. Vision: handles image input natively. Coding: it became the Vibe CLI default, replacing Devstral 2. Reasoning: strong on math benchmarks. The benefit is operational simplicity — one generalist model, one deployment, one fine-tuning workflow. For dedicated coding work, Mistral still ships and maintains the separate open-weight Devstral 2 (123B) and Devstral Small 2 (24B), so this is consolidation of the default, not a discontinuation of the coding line.
Why does “modified MIT” license matter?
Mistral's modified MIT permits unlimited commercial use, modification, and redistribution. The "modification" adds a clause prohibiting use for training competitive models above a certain scale. In practice this affects almost no one — only AI labs trying to clone Mistral's model would hit the restriction. Day-to-day commercial use, fine-tuning, distillation for product-specific purposes, embedding in SaaS, and self-hosting are all fully permitted with no royalties. Compare to Apache 2.0 (no restrictions) or Llama 4 (modified license with usage thresholds and attribution) — Mistral's license is in between but lenient for typical use cases.

Related models

🎯
AI Learning Path

Go from reading about AI to building with AI

20 structured courses. Hands-on projects. Runs on your machine. Start free.

Or own it for life — Lifetime $149 $599, pay once
LM

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor
More on AI Models Directory
See the full AI Models Directory guide.
📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Found your model? Now build something with it.

20 hands-on courses — RAG, agents, fine-tuning — all running locally. First chapter free, no card.

Or own it for life — Lifetime $149 $599, pay once
Free Tools & Calculators