Google · Closed-API Model
Gemini 3.1 Pro Review: 1M Context, MoE Thinking, and the Open-Weight Alternatives
Gemini 3.1 Pro is Google DeepMind's current frontier multimodal model, shipped in February 2026. It runs a Mixture-of-Experts architecture with a three-tier “thinking” system, a 1-million-token context window, and the highest verified ARC-AGI-2 score (77.1%) of any production model. This review covers the real specs, pricing ($2/$12 per million tokens), benchmark performance, and — critically for our audience — which open-weight models you can self-host that come closest to its capabilities.
Note: Gemini 3.1 Pro is API-only — it cannot be downloaded or run locally. For self-hostable frontier-class models with comparable performance, see our reviews of DeepSeek V4, GLM-5, and Qwen3-Coder-Next.
Key takeaways
- →1M-token context — process entire codebases, hour-long videos, or full book series in one call.
- →77.1% on ARC-AGI-2 — the highest verified score of any production model on the general-reasoning benchmark.
- →Three thinking tiers — fast / Thinking Mode / Deep Think; trade compute for reasoning depth.
- →Pricing of $2/$12 per Mtok — about half of GPT-5.5's rates; competitive with Claude Sonnet 5.
- →API-only — for local hosting, the closest open-weight match is DeepSeek V4-Pro (MIT licensed).
Quick verdict
Gemini 3.1 Pro is the strongest general-purpose closed model available right now if your workload involves large context or hard reasoning. It is the only model that ranks first in multiple categories at once: longest context (1M tokens), highest ARC-AGI-2 score, and lowest cost per output token among the top three frontier models.
For coding-only workloads, Claude Sonnet 5 still leads at 92.4% SWE-Bench Verified vs Gemini's ~88%. For local-first deployment where the model must run on your hardware, look at DeepSeek V4-Pro — it's the closest open-weight equivalent, MIT licensed, with similar 1M context support.
Specs at a glance
| Spec | Detail |
|---|---|
| Vendor | Google DeepMind |
| Release date | February 19, 2026 |
| Architecture | Mixture-of-Experts with three-tier thinking |
| Context window | 1,000,000 tokens (input) |
| Max output | 65,536 tokens |
| Modalities | Text · Code · Image · Audio · Video |
| License | Proprietary (API only) |
| Local self-hostable? | No |
| API endpoint | generativelanguage.googleapis.com / Vertex AI |
| Knowledge cutoff | January 2026 |
Benchmarks (verified May 2026)
Where Gemini 3.1 Pro lands across the benchmarks frontier teams actually quote. All scores are vendor-published results verified against third-party leaderboards where available (Vellum, Artificial Analysis, BenchLM).
| Benchmark | Gemini 3.1 Pro | Claude Sonnet 5 | GPT-5.5 | DeepSeek V4-Pro |
|---|---|---|---|---|
| ARC-AGI-2 (general reasoning) | 77.1% | 68.4% | 71.3% | 59.8% |
| SWE-Bench Verified (coding) | 87.9% | 92.4% | 85.1% | 82.6% |
| MMLU-Pro (knowledge) | 89.4% | 87.9% | 90.1% | 86.3% |
| GPQA Diamond (PhD science) | 88.2% | 85.7% | 86.0% | 81.4% |
| MathArena AIME 2025 | 94.0% | 91.5% | 95.2% | 88.7% |
| Video-MME (video QA) | 82.6% | N/A | 79.4% | N/A |
Sources: Google DeepMind Gemini 3.1 Pro model card, Anthropic Sonnet 5 announcement, OpenAI GPT-5.5 release notes, DeepSeek V4 technical report, Artificial Analysis leaderboard.
Pricing & access
API pricing
- Input: $2.00 per 1M tokens
- Output: $12.00 per 1M tokens
- Cached input: $0.50 per 1M tokens (75% off)
- Thinking tokens: Billed at output rate
- Free tier: 50 requests/day via AI Studio
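To see how the line items combine, here is a hedged back-of-envelope for one large request; the token counts are invented for illustration.

```python
# Back-of-envelope cost for a single large request at the listed rates.
# The token counts below are illustrative assumptions, not measurements.
INPUT_RATE, CACHED_RATE, OUTPUT_RATE = 2.00, 0.50, 12.00  # USD per 1M tokens

fresh_input = 200_000    # new prompt tokens
cached_input = 600_000   # prompt tokens served from cache (75% off)
thinking = 30_000        # thinking tokens, billed at the output rate
output = 8_000           # visible answer tokens

cost = (fresh_input * INPUT_RATE
        + cached_input * CACHED_RATE
        + (thinking + output) * OUTPUT_RATE) / 1_000_000
print(f"~${cost:.2f} for this request")  # about $1.16
```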
Subscription access
- Google AI Pro: $19.99/month — Gemini 3.1 Pro app + 2 TB storage
- Google AI Ultra: $124.99/month — Deep Think + Veo + early features
- Vertex AI: Same per-token pricing, enterprise SLA
- Workspace: Bundled with Gemini for Workspace Enterprise
For comparison, a heavy user running Gemini 3.1 Pro through the API for about four hours a day typically pays $80-300/month. Self-hosting an open-weight alternative on a one-time $3-5K rig pays for itself in 12-24 months and gives you unlimited inference plus full data privacy.
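A rough sketch of that break-even math; the rig price, power draw, and electricity rate are assumptions you should replace with your own numbers.

```python
# Rough break-even estimate: self-hosted rig vs. continued API spend.
# Every input here is an assumption; plug in your own usage and rates.
api_monthly = 200.0                       # toward the heavy end of the $80-300/month range
rig_cost = 4_000.0                        # one-time hardware spend in the $3-5K range
power_kw, hours_per_month, kwh_rate = 0.8, 120, 0.15
electricity_monthly = power_kw * hours_per_month * kwh_rate  # about $14/month

months_to_break_even = rig_cost / (api_monthly - electricity_monthly)
print(f"break-even in ~{months_to_break_even:.0f} months")  # about 22 months at these numbers
```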
How thinking mode works (and when to use each tier)
Gemini 3.1 Pro's three-tier thinking system is the major architectural change from 2.5. Each tier controls how much compute the model spends on a single response — more compute means deeper reasoning but higher latency and cost.
Tier 1 · Standard (default)
Fast inference, no extended reasoning. Use for chat, summaries, simple Q&A, and code completion. Latency: ~500ms first token. Cost: standard $2/$12 per Mtok.
Tier 2 · Thinking Mode
Model produces “thinking” tokens before answering — visible in the API response as a separate thoughts field. Use for code generation, multi-step math, ambiguous-spec tasks. Latency: 5-30 seconds. Cost: thinking tokens billed at output rate.
Tier 3 · Deep Think
Extended reasoning over minutes for the hardest problems — research-grade math, novel algorithm design, complex agent workflows. Available only in Google AI Ultra ($124.99/mo) and through the API on higher rate-limit tiers. Latency: 1-15 minutes. Cost: 3-5× standard output rate.
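In the current google-genai Python SDK, the tier is effectively a per-request thinking budget. A minimal sketch, assuming Gemini 3.1 Pro keeps the ThinkingConfig interface exposed for the 2.5-era models; the model ID and budget values are assumptions, not documented constants.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # or set GOOGLE_API_KEY in your environment

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed model ID
    contents="Prove or refute: every planar graph is 5-colorable.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=8192,   # small budgets approximate the Standard tier; larger ones buy deeper reasoning
            include_thoughts=True,  # return thought summaries alongside the final answer
        )
    ),
)

# Thought summaries and the answer come back as separate parts of the response.
for part in response.candidates[0].content.parts:
    label = "[thought] " if getattr(part, "thought", False) else ""
    print(label + (part.text or ""))
```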
Open-weight alternatives you can run locally
If you want frontier-class capabilities but cannot send your data to Google's servers — or if you need predictable monthly costs — these three open-weight models come closest to Gemini 3.1 Pro on the benchmarks that matter, and you can self-host them.
| Open-weight alternative | License | Active params | Hardware floor |
|---|---|---|---|
| DeepSeek V4-Pro | MIT | 49B (1.6T total MoE) | 8× H100 / 4× B200 |
| GLM-5 | MIT | 44B (745B total MoE) | 4× H100 / 2× B200 |
| Qwen3-Coder-Next | Apache 2.0 | 3B (80B total MoE) | 2× RTX 5090 (consumer) |
For coding-specific workloads on a single high-end consumer GPU, see Qwen3.6-27B — a dense 27B model that beats its 397B MoE sibling on agentic coding benchmarks.
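Client code barely changes once one of these is running locally: most serving stacks (vLLM and friends) expose an OpenAI-compatible endpoint. A minimal sketch, with the localhost URL and model ID as placeholders for whatever you actually deploy:

```python
from openai import OpenAI

# Point the standard OpenAI client at your local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # placeholder; use the name your server registers
    messages=[{"role": "user", "content": "Refactor this recursive function to be iterative."}],
)
print(response.choices[0].message.content)
```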
When to pick Gemini 3.1 Pro
- ✓You need to fit a whole codebase, book, or long video into one call (1M context wins; see the sketch after this list).
- ✓Your workload depends on hard reasoning where ARC-AGI-2 score matters (research, novel problem-solving).
- ✓You're already on Google Cloud and want first-class Vertex AI integration.
- ✓You need video understanding — Gemini 3.1 Pro is the only frontier model with strong native video.
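To make the first point concrete, here is a rough sketch of packing a small repository into a single request. The file filter and model ID are assumptions, and very large repos may still call for the Files API or context caching.

```python
from pathlib import Path
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
repo = Path("./my-project")  # placeholder path

# Naively concatenate source files; stay well under the 1M-token input limit.
source = "\n\n".join(
    f"### {path}\n{path.read_text(errors='ignore')}"
    for path in sorted(repo.rglob("*.py"))  # restrict to the files you actually care about
)

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed model ID
    contents=f"{source}\n\nReview this codebase for concurrency bugs and suggest fixes.",
)
print(response.text)
```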
When to pick a local model instead
- →Data privacy is non-negotiable (healthcare, legal, finance, or any regulated industry).
- →You need offline operation or air-gapped deployment.
- →Predictable monthly costs matter more than absolute peak quality.
- →You have sub-100ms latency requirements (network round-trip dominates API calls).
Frequently asked questions
Can I run Gemini 3.1 Pro locally?
No. Gemini 3.1 Pro is proprietary and API-only. The closest self-hostable open-weight alternative is DeepSeek V4-Pro (MIT licensed).
How much does Gemini 3.1 Pro cost?
API pricing is $2.00 per 1M input tokens and $12.00 per 1M output tokens, with cached input at $0.50 per 1M. Subscription access starts at $19.99/month (Google AI Pro); Deep Think requires Google AI Ultra at $124.99/month.
What is Gemini 3.1 Pro's context window?
1,000,000 input tokens, with a maximum output of 65,536 tokens.
How does Gemini 3.1 Pro's thinking mode work?
Three tiers trade compute for reasoning depth: Standard (fast, no extended reasoning), Thinking Mode (visible thinking tokens billed at the output rate), and Deep Think (minutes of extended reasoning at 3-5× the standard output rate).
Gemini 3.1 Pro vs Claude Sonnet 5: which is better for coding?
Claude Sonnet 5 leads on SWE-Bench Verified (92.4% vs 87.9%); Gemini 3.1 Pro wins on context length, general reasoning, and price.
What is the ARC-AGI-2 score and why does it matter?
ARC-AGI-2 measures general reasoning on novel problems designed to resist memorization; Gemini 3.1 Pro's 77.1% is the highest verified score of any production model.
When should I use Gemini 3.1 Pro vs an open-weight model?
Use Gemini 3.1 Pro when you need 1M-token context, the hardest reasoning, native video understanding, or Vertex AI integration. Pick an open-weight model when data privacy, offline operation, predictable costs, or sub-100ms latency matter more.
Is Gemini 3.1 Pro better than GPT-5.5?
On the verified benchmarks above, it leads GPT-5.5 on ARC-AGI-2, SWE-Bench Verified, GPQA Diamond, and Video-MME at roughly half the price; GPT-5.5 edges it on MMLU-Pro and AIME 2025 math.
Want to run frontier-class AI on your own hardware?
Local AI Master's Local AI Deployment course walks through running open-weight alternatives like DeepSeek V4 and GLM-5 on consumer and prosumer hardware. Real production code, full GitHub repo, no cloud fees.
See the deployment course →
Related models
- → Claude Sonnet 5 — current SWE-Bench Verified leader at 92.4%
- → GPT-5.5 — current ChatGPT default, strongest math benchmarks
- → DeepSeek V4 — open-weight frontier alternative, MIT licensed
- → GLM-5 — 745B/44B MoE open weight from Zhipu
- → Qwen3-Coder-Next — best local coding model
- → Best AI models May 2026: complete comparison