MiniMax M3: The Open-Weight 1M-Context Frontier Model
MiniMax M3 is an open-weight frontier model from MiniMax (Shanghai), released June 1, 2026 with weights published to Hugging Face on June 7. It is a 229.9B-parameter Mixture-of-Experts model (9.8B active per token across 256 experts) that combines three things that usually live in separate models: a 1M-token context window, native multimodal input (image + video), and a new MiniMax Sparse Attention (MSA) design that MiniMax says delivers ~15.6x faster decode and ~9.7x faster prefill at 1M context versus MiniMax M2. Verdict: it's the most capable model in this class you can actually download and self-host today — strong on agentic/coding tasks, with the headline benchmark numbers being MiniMax's own (not yet independently verified).
Open weight, not fully open source. The parameters are downloadable from Hugging Face, but commercial use of M3 or its derivatives requires a separate license agreement with MiniMax beyond the default MiniMax Community License. For other self-hostable frontier models, see DeepSeek V4 and Kimi K2.6.
Key takeaways
- Open weight + frontier: 229.9B MoE (9.8B active), downloadable from Hugging Face, with a full 1M-token context.
- MiniMax Sparse Attention: per-token compute ~1/20 of the prior generation; MiniMax reports >15x faster decode and >9x faster prefill at 1M context, ~100 tok/s output.
- Natively multimodal: ingests image and video, and can operate a desktop computer for agentic tasks.
- SWE-Bench Pro 59.0%: MiniMax claims this edges GPT-5.5 (58.6%) and beats Gemini 3.1 Pro (54.2%). Vendor-run; not independently confirmed.
- Cheap API: ~$0.60/M in, $2.40/M out (launch promo $0.30/$1.20), or run it yourself for zero per-token cost.
Quick verdict
MiniMax M3 matters because of what it bundles into a model you can download. Until now, a 1M-token context, native video understanding, and frontier-tier agentic coding tended to be three different products — usually closed ones. M3 puts them in one 229.9B MoE and ships the weights. If you want a self-hostable model for long-document analysis, computer-use agents, or multimodal pipelines, it's the standout pick of mid-2026.
The honest caveat: the headline benchmark numbers are MiniMax's own, run on their infrastructure with baselines they chose. At least one outlet flagged that the figures have not been independently verified, and Scale AI, which maintains the official SWE-Bench Pro benchmark, has not confirmed them. Treat the "beats GPT-5.5 / Gemini 3.1 Pro" framing as a vendor claim until third parties replicate it. For privacy-first, local-first work, M3 is still a genuine step forward on multimodal + long-context throughput, both over its predecessor (MiniMax M2.7) and over DeepSeek V4 and Kimi K2.6.
Specs at a glance
| Spec | Value |
|---|---|
| Developer | MiniMax (Shanghai, China) |
| Release date | June 1, 2026 (weights on Hugging Face June 7; MSA technical report June 11) |
| Architecture | Mixture-of-Experts (256 fine-grained experts) |
| Total parameters | 229.9B |
| Active per token | 9.8B |
| Context window | 1,000,000 tokens |
| Attention | MiniMax Sparse Attention (MSA) |
| Modalities | Text · Image (input) · Video (input) · Computer use |
| License | MiniMax Community License — open weight; commercial use needs a separate agreement |
| Self-hostable? | Yes — weights on Hugging Face |
| Predecessor | MiniMax M2.7 (April 2026, 229B) |
| API price | ~$0.60/M in · $2.40/M out (launch promo $0.30 / $1.20) |
| Output speed | ~100 tokens/sec (MiniMax-reported) |
MiniMax Sparse Attention (MSA): the real headline
Dense attention is what has historically made 1M-token context impractical to run locally: every token attends to every earlier token, so prefill compute grows quadratically with context length and each decoded token has to read the entire KV cache. MSA attacks that directly. A lightweight index branch scans incoming tokens and picks which blocks of past tokens actually need attention, then runs attention only on those blocks. The result, per MiniMax: at 1M-token context, per-token compute drops to roughly 1/20 of the previous generation.
| Stage | Speedup at 1M context vs MiniMax M2 | What it affects |
|---|---|---|
| Decode | ~15.6x (>15x reported) | Output generation — tokens/sec you feel during a response. |
| Prefill | ~9.7x (>9x reported) | Reading the prompt — time-to-first-token on long inputs. |
| Per-token compute | ~1/20 | Overall cost-to-run at full 1M context. |
In practical terms, MSA is what makes a 1M-context model viable to actually serve. MiniMax reports output of roughly 100 tokens/sec at this context length. The technical report landed on arXiv on June 11, 2026, so the mechanism is now documented rather than just announced.
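To make the "index branch picks blocks, attention runs only on those blocks" idea concrete, here is a toy sketch in plain Python. It is not MiniMax's implementation: the scoring rule (dot product against each block's mean key), the block size, and every function name below are assumptions chosen for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def select_blocks(q, keys, block_size, top_k):
    """Hypothetical index branch: score each block by its mean key, keep top_k."""
    n_blocks = len(keys) // block_size
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / block_size for col in zip(*block)]
        scores.append(dot(q, mean_key))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def block_sparse_attend(q, keys, values, block_size=4, top_k=2):
    """Run attention only over the tokens inside the selected blocks."""
    chosen = select_blocks(q, keys, block_size, top_k)
    idx = [b * block_size + i for b in chosen for i in range(block_size)]
    out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, chosen
```

In this sketch each query scores one summary per block plus the tokens in the `top_k` chosen blocks, instead of every cached token. When `top_k` equals the number of blocks, the result matches dense attention exactly, which is a handy sanity check for any block-selection scheme.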
Benchmarks (vendor-reported)
Read this section with one caveat in mind: every number below is MiniMax-reported, run on MiniMax's own infrastructure with baselines they selected. None have been independently confirmed by Scale AI (which maintains SWE-Bench Pro), and at least one outlet flagged the comparison as not independently verified. They are encouraging, not settled.
| Benchmark | MiniMax M3 | MiniMax's framing |
|---|---|---|
| SWE-Bench Pro (agentic coding) | 59.0% | MiniMax claims it edges GPT-5.5 (58.6%) and beats Gemini 3.1 Pro (54.2%); trails Claude Opus 4.7 (~64.3%). |
| Terminal-Bench 2.1 | 66.0% | Strong terminal/agent task performance. |
| MCP Atlas | 74.2% | Tool / Model-Context-Protocol use. |
| SWE-fficiency | 34.8% | Efficiency-weighted coding metric. |
| Output speed | ~100 tok/s | Reported ~3x faster than Claude Opus at long context. |
Sources: MiniMax M3 launch blog and technical report (June 2026); VentureBeat launch coverage. All scores are vendor-run; the SWE-Bench Pro comparison to GPT-5.5 and Gemini 3.1 Pro is MiniMax's claim and has not been independently verified. We attribute rather than assert it.
Native multimodality
M3 is natively multimodal — it ingests image and video input directly rather than relying on an upstream frame-extraction or OCR step, and it can operate a desktop computer for agentic workflows. Combined with the 1M-token context, that makes it a single model for tasks that previously needed a chain of specialized tools: read a long video, hold an entire codebase or document corpus in context, and act on it.
- Image & video input: direct image and video understanding in the same model, with no separate vision pipeline to wire up.
- Computer use: M3 can operate a desktop computer, which is the foundation for the agentic/browsing tasks MiniMax highlights.
- 1M-token working memory: whole-repository or whole-corpus prompts fit in one context, and MSA is what keeps that practical to serve.
Pricing
If you use the hosted API (MiniMax or third-party gateways like OpenRouter), M3 is priced aggressively, at a fraction of what frontier closed models charge. Note the tiered structure: the standard rate applies to requests up to ~512K tokens; above that, the whole request is billed at roughly double the rate.
| Tier | Input / 1M | Output / 1M |
|---|---|---|
| Standard (≤512K) | ~$0.60 | ~$2.40 |
| Launch promo (50% off) | $0.30 | $1.20 |
| Long-context (>512K) | ~2× standard | ~2× standard |
The launch promo is temporary — don't budget on $0.30 / $1.20 as a permanent baseline. And because M3 is open weight, the real long-run lever is self-hosting: one-time hardware replaces the per-token bill entirely for high-volume workloads.
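The tiering is easy to get wrong in a budget spreadsheet, so here is a rough estimator built from the approximate rates in the table above. Several details are assumptions rather than documented billing rules: that the 512K threshold is measured on input tokens, that the long-context multiplier is exactly 2, and that the promo discount simply halves whichever rate applies. Check the provider's current price sheet before relying on it.

```python
# Approximate list prices in USD per 1M tokens, taken from the table above.
STANDARD_IN = 0.60
STANDARD_OUT = 2.40
LONG_CONTEXT_THRESHOLD = 512_000   # assumption: measured on input tokens
LONG_CONTEXT_MULTIPLIER = 2.0      # "roughly double"; the exact factor is an assumption
PROMO_DISCOUNT = 0.5               # launch promo, temporary

def estimate_cost(input_tokens, output_tokens, promo=False):
    """Return an estimated USD cost for a single request."""
    rate_in, rate_out = STANDARD_IN, STANDARD_OUT
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        # The whole request is billed at the higher rate, not just the overflow.
        rate_in *= LONG_CONTEXT_MULTIPLIER
        rate_out *= LONG_CONTEXT_MULTIPLIER
    if promo:
        # Assumption: the promo applies uniformly to both tiers.
        rate_in *= PROMO_DISCOUNT
        rate_out *= PROMO_DISCOUNT
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

if __name__ == "__main__":
    for tokens_in in (100_000, 500_000, 800_000):
        print(tokens_in, round(estimate_cost(tokens_in, 2_000), 4))
```

Because the higher rate covers the whole request, a prompt just over the threshold costs about twice as much as one just under it. Under these assumptions, trimming a 520K-token prompt to 510K roughly halves the bill.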
Running MiniMax M3 locally
This is the part that matters most on a local-AI site: you can actually download M3. The weights are on Hugging Face. But be realistic about the hardware — a 229.9B-parameter model is a data-center / multi-GPU workload, not a laptop one. Some honest expectations:
- Full precision is out of reach for most. At 229.9B parameters and 2 bytes per weight, BF16 weights alone come to roughly 460GB, which means a many-GPU, data-center-class deployment.
- Quantization is the path to single-box. A 4-bit quant brings the weight footprint down toward ~115–130GB, which is the realistic target for a high-VRAM multi-GPU server (e.g. several 48–80GB cards) or a large unified-memory machine.
- Only 9.8B params activate per token. The MoE design means inference compute is far lighter than the total size suggests, but you still need enough memory to hold all 229.9B parameters.
- MSA helps the most at long context. The sparse-attention speedups are what make a 1M-token prompt tractable on your own hardware rather than purely theoretical.
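The arithmetic behind those bullets is simple enough to check yourself. The sketch below computes the raw weight footprint at different precisions and a lower bound on GPU count. The `usable_fraction` of 0.9 is an assumption, and the estimate ignores the KV cache, activations, and quantization metadata, which is why real 4-bit files land above the raw 115GB figure.

```python
import math

TOTAL_PARAMS = 229.9e9   # total parameters, from the spec table
ACTIVE_PARAMS = 9.8e9    # parameters active per token

def weight_footprint_gb(params, bits_per_weight):
    """Raw weight size in GB (1 GB = 1e9 bytes), excluding any overhead."""
    return params * bits_per_weight / 8 / 1e9

def min_gpus(footprint_gb, gpu_memory_gb, usable_fraction=0.9):
    """Lower bound on GPU count for the weights alone.

    usable_fraction is an assumed headroom factor; KV cache and
    activations are not included, so real deployments need more.
    """
    return math.ceil(footprint_gb / (gpu_memory_gb * usable_fraction))

if __name__ == "__main__":
    for label, bits in (("BF16", 16), ("INT8", 8), ("4-bit", 4)):
        gb = weight_footprint_gb(TOTAL_PARAMS, bits)
        print(f"{label}: {gb:.1f} GB, at least {min_gpus(gb, 80)} x 80GB GPUs")
    print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The active fraction works out to a little over 4% of the total. That is why per-token compute is modest even though the memory requirement is set by the full 229.9B parameters.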
New to quantization and VRAM math? Start with our quantization guide and the best local AI models for programming roundup. If your hardware can't hold a 230B model, a smaller open-weight coder is the pragmatic choice — and the hosted API is a cheap stopgap while you size a rig.
When to pick MiniMax M3 — and when not to
| Your situation | Best pick | Why |
|---|---|---|
| Self-hostable multimodal + long context | MiniMax M3 | Only open-weight model that bundles 1M context, native video, and frontier coding. |
| Computer-use / agentic browsing | MiniMax M3 | Native computer use + strong agentic scores (MiniMax-reported). |
| General open-weight workhorse | DeepSeek V4 | Broadly capable, widely supported, permissive licensing. |
| Agentic + coding at huge scale | Kimi K2.6 | 1T MoE; strong agentic and coding profile. |
| Modest hardware (laptop / single GPU) | Smaller coders | A 230B model won't fit; pick a 7B–32B coder you can actually run. |
For a full side-by-side of the current open-weight field, see our best open-source LLMs of 2026 roundup and the model comparisons hub.
Want to actually run open weights like M3?
MiniMax M3 proves open-weight models have caught up to the frontier — but only if you can deploy them. The Local AI Master deployment course walks you through quantization, multi-GPU serving, and long-context inference so you can run open models on your own hardware: unlimited inference, full data privacy, zero per-token cost.
See the deployment course →
Related models & reading
- DeepSeek V4: open-weight frontier workhorse you can self-host
- Kimi K2.6: 1T MoE, strong agentic + coding
- Best open-source LLMs 2026: full open-weight field compared
- Best local AI models for programming: picks by hardware tier
- Quantization explained: how to fit big models in less VRAM
- Model comparisons hub: side-by-side across vendors
- MiniMax M3 official announcement (minimax.io)
- MiniMax model weights on Hugging Face
Written by the Local AI Master Team
The team behind Local AI Master
We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.