Google · Closed-API Model
Gemini 3.1 Pro Review: 1M Context, MoE Thinking, and the Open-Weight Alternatives
Gemini 3.1 Pro is Google DeepMind's current frontier multimodal model, shipped in February 2026. It runs a Mixture-of-Experts architecture with a three-tier “thinking” system, a 1-million-token context window, and the highest verified ARC-AGI-2 score (77.1%) of any production model. This review covers the real specs, pricing ($2/$12 per million tokens), benchmark performance, and — critically for our audience — which open-weight models you can self-host that come closest to its capabilities.
Note: Gemini 3.1 Pro is API-only — it cannot be downloaded or run locally. For self-hostable frontier-class models with comparable performance, see our reviews of DeepSeek V4, GLM-5, and Qwen3-Coder-Next.
Key takeaways
- →1M-token context — process entire codebases, hour-long videos, or full book series in one call.
- →77.1% on ARC-AGI-2 — the highest verified score of any production model on the general-reasoning benchmark.
- →Three thinking tiers — fast / Thinking Mode / Deep Think; trade compute for reasoning depth.
- →Pricing of $2/$12 per Mtok — about half of GPT-5.5's rates; competitive with Claude Sonnet 5.
- →API-only — for local hosting, the closest open-weight match is DeepSeek V4-Pro (MIT licensed).
Quick verdict
Gemini 3.1 Pro is the strongest general-purpose closed model available right now if your workload involves large context or hard reasoning. It is the only model that ranks first in multiple categories at once: longest context (1M tokens), highest ARC-AGI-2 score, and lowest cost per output token among the top three frontier models.
For coding-only workloads, Claude Sonnet 5 still leads at 92.4% SWE-Bench Verified vs Gemini's ~88%. For local-first deployment where the model must run on your hardware, look at DeepSeek V4-Pro — it's the closest open-weight equivalent, MIT licensed, with similar 1M context support.
Specs at a glance
| Spec | Detail |
|---|---|
| Vendor | Google DeepMind |
| Release date | February 19, 2026 |
| Architecture | Mixture-of-Experts with three-tier thinking |
| Context window | 1,000,000 tokens (input) |
| Max output | 65,536 tokens |
| Modalities | Text · Code · Image · Audio · Video |
| License | Proprietary (API only) |
| Local self-hostable? | No |
| API endpoint | generativelanguage.googleapis.com / Vertex AI |
| Knowledge cutoff | January 2026 |
Benchmarks (verified May 2026)
Where Gemini 3.1 Pro lands across the benchmarks frontier teams actually quote. All scores are vendor-published results verified against third-party leaderboards where available (Vellum, Artificial Analysis, BenchLM).
| Benchmark | Gemini 3.1 Pro | Claude Sonnet 5 | GPT-5.5 | DeepSeek V4-Pro |
|---|---|---|---|---|
| ARC-AGI-2 (general reasoning) | 77.1% | 68.4% | 71.3% | 59.8% |
| SWE-Bench Verified (coding) | 87.9% | 92.4% | 85.1% | 82.6% |
| MMLU-Pro (knowledge) | 89.4% | 87.9% | 90.1% | 86.3% |
| GPQA Diamond (PhD science) | 88.2% | 85.7% | 86.0% | 81.4% |
| MathArena AIME 2025 | 94.0% | 91.5% | 95.2% | 88.7% |
| Video-MME (video QA) | 82.6% | N/A | 79.4% | N/A |
Sources: Google DeepMind Gemini 3.1 Pro model card, Anthropic Sonnet 5 announcement, OpenAI GPT-5.5 release notes, DeepSeek V4 technical report, Artificial Analysis leaderboard.
Pricing & access
API pricing
- Input: $2.00 per 1M tokens
- Output: $12.00 per 1M tokens
- Cached input: $0.50 per 1M tokens (75% off)
- Thinking tokens: Billed at output rate
- Free tier: 50 requests/day via AI Studio
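To see how the line items combine, here is a hedged back-of-envelope for one large request; the token counts are invented for illustration.

```python
# Back-of-envelope cost for a single large request at the listed rates.
# The token counts below are illustrative assumptions, not measurements.
INPUT_RATE, CACHED_RATE, OUTPUT_RATE = 2.00, 0.50, 12.00  # USD per 1M tokens

fresh_input = 200_000    # new prompt tokens
cached_input = 600_000   # prompt tokens served from cache (75% off)
thinking = 30_000        # thinking tokens, billed at the output rate
output = 8_000           # visible answer tokens

cost = (fresh_input * INPUT_RATE
        + cached_input * CACHED_RATE
        + (thinking + output) * OUTPUT_RATE) / 1_000_000
print(f"~${cost:.2f} for this request")  # about $1.16
```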
Subscription access
- Google AI Pro: $19.99/month — Gemini 3.1 Pro app + 2 TB storage
- Google AI Ultra: $124.99/month — Deep Think + Veo + early features
- Vertex AI: Same per-token pricing, enterprise SLA
- Workspace: Bundled with Gemini for Workspace Enterprise
For comparison, a heavy user running Gemini 3.1 Pro through the API for about four hours a day typically pays $80-300/month. Self-hosting an open-weight alternative on a one-time $3-5K rig pays for itself in 12-24 months and gives you unlimited inference plus full data privacy.
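A rough sketch of that break-even math; the rig price, power draw, and electricity rate are assumptions you should replace with your own numbers.

```python
# Rough break-even estimate: self-hosted rig vs. continued API spend.
# Every input here is an assumption; plug in your own usage and rates.
api_monthly = 200.0                       # toward the heavy end of the $80-300/month range
rig_cost = 4_000.0                        # one-time hardware spend in the $3-5K range
power_kw, hours_per_month, kwh_rate = 0.8, 120, 0.15
electricity_monthly = power_kw * hours_per_month * kwh_rate  # about $14/month

months_to_break_even = rig_cost / (api_monthly - electricity_monthly)
print(f"break-even in ~{months_to_break_even:.0f} months")  # about 22 months at these numbers
```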
How thinking mode works (and when to use each tier)
Gemini 3.1 Pro's three-tier thinking system is the major architectural change from 2.5. Each tier controls how much compute the model spends on a single response — more compute means deeper reasoning but higher latency and cost.
Tier 1 · Standard (default)
Fast inference, no extended reasoning. Use for chat, summaries, simple Q&A, and code completion. Latency: ~500ms first token. Cost: standard $2/$12 per Mtok.
Tier 2 · Thinking Mode
Model produces “thinking” tokens before answering — visible in the API response as a separate thoughts field. Use for code generation, multi-step math, ambiguous-spec tasks. Latency: 5-30 seconds. Cost: thinking tokens billed at output rate.
Tier 3 · Deep Think
Extended reasoning over minutes for the hardest problems — research-grade math, novel algorithm design, complex agent workflows. Available only in Google AI Ultra ($124.99/mo) and through the API on higher rate-limit tiers. Latency: 1-15 minutes. Cost: 3-5× standard output rate.
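In the current google-genai Python SDK, the tier is effectively a per-request thinking budget. A minimal sketch, assuming Gemini 3.1 Pro keeps the ThinkingConfig interface exposed for the 2.5-era models; the model ID and budget values are assumptions, not documented constants.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # or set GOOGLE_API_KEY in your environment

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed model ID
    contents="Prove or refute: every planar graph is 5-colorable.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=8192,   # small budgets approximate the Standard tier; larger ones buy deeper reasoning
            include_thoughts=True,  # return thought summaries alongside the final answer
        )
    ),
)

# Thought summaries and the answer come back as separate parts of the response.
for part in response.candidates[0].content.parts:
    label = "[thought] " if getattr(part, "thought", False) else ""
    print(label + (part.text or ""))
```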
Open-weight alternatives you can run locally
If you want frontier-class capabilities but cannot send your data to Google's servers — or if you need predictable monthly costs — these three open-weight models come closest to Gemini 3.1 Pro on the benchmarks that matter, and you can self-host them.
| Open-weight alternative | License | Active params | Hardware floor |
|---|---|---|---|
| DeepSeek V4-Pro | MIT | 49B (1.6T total MoE) | 8× H100 / 4× B200 |
| GLM-5 | MIT | 44B (745B total MoE) | 4× H100 / 2× B200 |
| Qwen3-Coder-Next | Apache 2.0 | 3B (80B total MoE) | 2× RTX 5090 (consumer) |
For coding-specific workloads on a single high-end consumer GPU, see Qwen3.6-27B — a dense 27B model that beats its 397B MoE sibling on agentic coding benchmarks.
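Client code barely changes once one of these is running locally: most serving stacks (vLLM and friends) expose an OpenAI-compatible endpoint. A minimal sketch, with the localhost URL and model ID as placeholders for whatever you actually deploy:

```python
from openai import OpenAI

# Point the standard OpenAI client at your local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # placeholder; use the name your server registers
    messages=[{"role": "user", "content": "Refactor this recursive function to be iterative."}],
)
print(response.choices[0].message.content)
```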
When to pick Gemini 3.1 Pro
- ✓You need to fit a whole codebase, book, or long video into one call (1M context wins; see the sketch after this list).
- ✓Your workload depends on hard reasoning where ARC-AGI-2 score matters (research, novel problem-solving).
- ✓You're already on Google Cloud and want first-class Vertex AI integration.
- ✓You need video understanding — Gemini 3.1 Pro is the only frontier model with strong native video.
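To make the first point concrete, here is a rough sketch of packing a small repository into a single request. The file filter and model ID are assumptions, and very large repos may still call for the Files API or context caching.

```python
from pathlib import Path
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
repo = Path("./my-project")  # placeholder path

# Naively concatenate source files; stay well under the 1M-token input limit.
source = "\n\n".join(
    f"### {path}\n{path.read_text(errors='ignore')}"
    for path in sorted(repo.rglob("*.py"))  # restrict to the files you actually care about
)

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed model ID
    contents=f"{source}\n\nReview this codebase for concurrency bugs and suggest fixes.",
)
print(response.text)
```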
When to pick a local model instead
- →Data privacy is non-negotiable (healthcare, legal, finance, or any regulated industry).
- →You need offline operation or air-gapped deployment.
- →Predictable monthly costs matter more than absolute peak quality.
- →You have sub-100ms latency requirements (network round-trip dominates API calls).
Frequently asked questions
Can I run Gemini 3.1 Pro locally?
No. Gemini 3.1 Pro is proprietary and API-only. The closest self-hostable open-weight alternative is DeepSeek V4-Pro (MIT licensed).
How much does Gemini 3.1 Pro cost?
API pricing is $2.00 per 1M input tokens and $12.00 per 1M output tokens, with cached input at $0.50 per 1M. Subscription access starts at $19.99/month (Google AI Pro); Deep Think requires Google AI Ultra at $124.99/month.
What is Gemini 3.1 Pro's context window?
1,000,000 input tokens, with a maximum output of 65,536 tokens.
How does Gemini 3.1 Pro's thinking mode work?
Three tiers trade compute for reasoning depth: Standard (fast, no extended reasoning), Thinking Mode (visible thinking tokens billed at the output rate), and Deep Think (minutes of extended reasoning at 3-5× the standard output rate).
Gemini 3.1 Pro vs Claude Sonnet 5: which is better for coding?
Claude Sonnet 5 leads on SWE-Bench Verified (92.4% vs 87.9%); Gemini 3.1 Pro wins on context length, general reasoning, and price.
What is the ARC-AGI-2 score and why does it matter?
ARC-AGI-2 measures general reasoning on novel problems designed to resist memorization; Gemini 3.1 Pro's 77.1% is the highest verified score of any production model.
When should I use Gemini 3.1 Pro vs an open-weight model?
Use Gemini 3.1 Pro when you need 1M-token context, the hardest reasoning, native video understanding, or Vertex AI integration. Pick an open-weight model when data privacy, offline operation, predictable costs, or sub-100ms latency matter more.
Is Gemini 3.1 Pro better than GPT-5.5?
On the verified benchmarks above, it leads GPT-5.5 on ARC-AGI-2, SWE-Bench Verified, GPQA Diamond, and Video-MME at roughly half the price; GPT-5.5 edges it on MMLU-Pro and AIME 2025 math.
Want to run frontier-class AI on your own hardware?
Local AI Master's Local AI Deployment course walks through running open-weight alternatives like DeepSeek V4 and GLM-5 on consumer and prosumer hardware. Real production code, full GitHub repo, no cloud fees.
See the deployment course →
Related models
- → Claude Sonnet 5 — current SWE-Bench Verified leader at 92.4%
- → GPT-5.5 — current ChatGPT default, strongest math benchmarks
- → DeepSeek V4 — open-weight frontier alternative, MIT licensed
- → GLM-5 — 745B/44B MoE open weight from Zhipu
- → Qwen3-Coder-Next — best local coding model
- → Best AI models May 2026: complete comparison