Free · Sortable · Live Data
AI Model Leaderboard 2026
The 30 most capable AI models of 2026, ranked by verified benchmarks that matter for production work: SWE-Bench Verified (real coding), MMLU-Pro (reasoning), ARC-AGI-2 (general intelligence), AIME 2025 (math), and HumanEval+ (programming). Click any column to sort, or filter by open-weight vs. API models.
| # | Model | Vendor | Type | SWE-Bench Verified | MMLU-Pro | ARC-AGI-2 | AIME 2025 | HumanEval+ | Context | Cost (in/out) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 5 Best agentic coding (92.4% SWE-Bench) | Anthropic | API | 92.4% | 91.5% | 72.3% | 88.5% | 96.1% | 1M | $3/$15 |
| 2 | Claude Opus 4.7 Adaptive Thinking, top reasoning | Anthropic | API | 91.0% | 92.8% | 81.5% | 95.2% | 95.4% | 1M | $15/$75 |
| 3 | GPT-5.5 Pro AIME 96.7% — top math/competition | OpenAI | API | 87.6% | 92.1% | 78.4% | 96.7% | 95.2% | 400K | $12/$48 |
| 4 | Gemini 3.1 Pro 1M context, 77.1% ARC-AGI-2 | Google | API | 84.2% | 91.0% | 77.1% | 93.5% | 94.5% | 1M | $2.5/$10 |
| 5 | DeepSeek V4-Pro 1.6T MoE / 49B active, MIT licensed | DeepSeek | Open | 76.4% | 87.9% | 64.2% | 88.1% | 92.0% | 256K | self-host |
| 6 | Qwen3-Coder-Next 80B/3B active — best open coder | Alibaba | Open | 70.6% | 84.1% | 51.2% | 79.4% | 91.8% | 256K | self-host |
| 7 | Kimi K2.6 1T MoE / 32B active | Moonshot | Open | 68.1% | 86.4% | 59.3% | 84.7% | 89.5% | 200K | self-host |
| 8 | GLM-5 745B/44B active, runs on Huawei Ascend | Zhipu | Open | 65.4% | 85.2% | 54.1% | 81.2% | 87.6% | 128K | self-host |
| 9 | Mistral Medium 3.5 Unified Magistral+Pixtral+Devstral | Mistral | Open | 64.2% | 85.7% | 50.8% | 80.5% | 88.1% | 256K | $2/$6 |
| 10 | Qwen3.6-27B Dense 27B, beats older 397B MoE | Alibaba | Open | 58.7% | 81.4% | 44.5% | 73.8% | 84.2% | 128K | self-host |
| 11 | Llama 4 405B Dense 405B, multilingual stronghold | Meta | Open | 55.3% | 88.2% | 47.0% | 76.9% | 86.4% | 256K | self-host |
| 12 | Gemini 3 Flash 1M context at fastest API price | Google | API | 53.8% | 84.1% | 38.2% | 71.3% | 82.6% | 1M | $0.15/$0.6 |
| 13 | GPT-5.5 Standard Cost-balanced GPT-5.5 tier | OpenAI | API | 52.4% | 86.7% | 41.9% | 78.5% | 88.1% | 400K | $3/$12 |
| 14 | DeepSeek V4-Flash Smaller DeepSeek V4 variant | DeepSeek | Open | 49.7% | 78.4% | 36.5% | 67.4% | 79.8% | 128K | self-host |
| 15 | Claude Haiku 4.5 Fastest Claude tier | Anthropic | API | 47.8% | 81.2% | 33.4% | 64.8% | 81.5% | 200K | $0.8/$4 |
| 16 | Llama 4 70B Best 70B dense for self-host | Meta | Open | 45.2% | 80.1% | 28.7% | 60.4% | 78.9% | 128K | self-host |
| 17 | Phi-4 14B Best small reasoning model | Microsoft | Open | 38.4% | 77.3% | 22.1% | 52.6% | 75.4% | 128K | self-host |
| 18 | Gemma 3 27B Strong general-purpose 27B | Google | Open | 36.5% | 76.4% | 20.8% | 49.1% | 73.2% | 128K | self-host |
| 19 | Qwen3-Coder Older Qwen coder — superseded by Next | Alibaba | Open | 35.1% | 74.2% | 18.5% | 46.7% | 71.6% | 128K | self-host |
| 20 | DeepSeek V3.1 Strong general 671B MoE — superseded by V4 | DeepSeek | Open | 33.7% | 75.8% | 17.4% | 44.3% | 70.2% | 128K | self-host |
| 21 | Mistral Large 2 123B dense, multilingual | Mistral | Open | 31.2% | 76.9% | 16.0% | 41.8% | 69.5% | 128K | $2/$6 |
| 22 | GPT-4.5 Conversational flagship — superseded | OpenAI | API | 29.4% | 81.7% | 23.5% | 58.4% | 75.8% | 128K | $75/$150 |
| 23 | Llama 3.3 70B Most-deployed open model in production | Meta | Open | 27.8% | 73.1% | 12.4% | 36.5% | 65.2% | 128K | self-host |
| 24 | Qwen3-32B Solid mid-size dense | Alibaba | Open | 26.5% | 72.4% | 11.2% | 33.7% | 63.8% | 128K | self-host |
| 25 | Phi-4 Mini 3.8B Edge / CPU-friendly reasoning | Microsoft | Open | 22.1% | 65.4% | 8.4% | 28.9% | 58.3% | 128K | self-host |
| 26 | Llama 3.2 11B Vision Multimodal at 11B scale | Meta | Open | 19.4% | 67.8% | 7.1% | 24.6% | 56.2% | 128K | self-host |
| 27 | Gemma 3 9B Tightest 9B for laptop inference | Google | Open | 18.7% | 65.1% | 6.5% | 22.4% | 54.8% | 128K | self-host |
| 28 | Mistral 7B v0.3 Battle-tested 7B baseline | Mistral | Open | 12.4% | 60.3% | 4.2% | 16.8% | 47.5% | 32K | self-host |
| 29 | Llama 3.2 3B Edge devices, mobile, Raspberry Pi 5 | Meta | Open | 9.8% | 58.1% | 3.5% | 14.2% | 41.6% | 128K | self-host |
| 30 | Phi-3 Mini 3.8B Edge-grade compact reasoning | Microsoft | Open | 8.5% | 56.4% | 2.8% | 11.7% | 38.2% | 128K | self-host |
Cost columns show USD per 1M input/output tokens. ARC-AGI-2 scores are taken from public Anthropic, OpenAI, Google, DeepSeek, and Moonshot model cards. SWE-Bench scores are SWE-Bench Verified. Updated May 2026.
Go from reading about AI to building with AI
10 structured courses. Hands-on projects. Runs on your machine. Start free.
How we rank — methodology
The default rank is a weighted aggregate across five benchmarks. SWE-Bench Verified gets the heaviest weight (30%) because it most closely resembles what production developers actually do — fixing real bugs in real codebases. MMLU-Pro (25%) and ARC-AGI-2 (20%) cover broad reasoning and novel-problem capability. AIME 2025 (15%) tracks advanced mathematical reasoning. HumanEval+ (10%) is the legacy Python-coding sanity check.
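To make the weighting concrete, here is a minimal sketch of the aggregate in Python. The weights are the ones stated above; the function and variable names, and the example row, are purely illustrative, not the leaderboard's actual scoring code.

```python
# Minimal sketch of the weighted aggregate described above.
# Weights follow the stated methodology; names are illustrative only.
WEIGHTS = {
    "swe_bench_verified": 0.30,
    "mmlu_pro": 0.25,
    "arc_agi_2": 0.20,
    "aime_2025": 0.15,
    "humaneval_plus": 0.10,
}

def aggregate_score(scores: dict[str, float]) -> float:
    """Weighted average of the five benchmark scores (each 0-100)."""
    return sum(WEIGHTS[b] * scores[b] for b in WEIGHTS)

# Example: Claude Sonnet 5's row from the table above.
claude_sonnet_5 = {
    "swe_bench_verified": 92.4,
    "mmlu_pro": 91.5,
    "arc_agi_2": 72.3,
    "aime_2025": 88.5,
    "humaneval_plus": 96.1,
}
print(round(aggregate_score(claude_sonnet_5), 1))  # ~87.9
```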
We deliberately exclude saturated benchmarks (GSM8K, classic MMLU, original HumanEval) where every top model scores 95%+ and the metric loses discriminative power. We also exclude proprietary composite scores (Scale SEAL, Artificial Analysis Index) because their methodology is closed.
Sources for every score
- SWE-Bench Verified — official leaderboard at swebench.com
- MMLU-Pro — TIGER-Lab MMLU-Pro repository, 14 disciplines, 12K questions
- ARC-AGI-2 — official ARC Prize site (arcprize.org), 2026 edition
- AIME 2025 — published AIME 2025 evaluations across labs
- HumanEval+ — EvalPlus extended HumanEval suite
For closed-API models we use vendor-published numbers (Anthropic, OpenAI, Google, DeepSeek, Moonshot AI, Zhipu, Alibaba, Mistral, Microsoft, Meta). When community evaluations disagree with vendor claims by more than 3 percentage points, we publish the lower number and document the discrepancy on that model's detail page.
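As a sketch of how that reconciliation rule could be applied in practice (the function name and threshold constant are ours; the 3-point policy is the one stated above):

```python
# Sketch of the score-reconciliation policy described above.
# Names are illustrative; only the 3-point threshold comes from the text.
DISCREPANCY_THRESHOLD = 3.0  # percentage points

def published_score(vendor_score: float, community_score: float) -> tuple[float, bool]:
    """Return the score to publish and whether to flag a discrepancy.

    If vendor and community evaluations disagree by more than the
    threshold, publish the lower number and flag the model's detail page.
    """
    if abs(vendor_score - community_score) > DISCREPANCY_THRESHOLD:
        return min(vendor_score, community_score), True
    return vendor_score, False

# Example: vendor claims 92.4, community measures 88.0 -> publish 88.0 and flag it.
print(published_score(92.4, 88.0))  # (88.0, True)
```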
The five benchmarks explained
SWE-Bench Verified
A human-validated subset of 500 real GitHub issues (filtered from the original 2,294-issue SWE-Bench) drawn from 12 popular Python repos. The model gets the issue and must produce a patch that passes the project's own test suite. Full explainer →
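For intuition, here is a heavily simplified sketch of the grading loop: apply the model's patch, then run the project's tests. The real harness evaluates inside containerized per-repo environments and runs a curated set of fail-to-pass and pass-to-pass tests; the helper below is an illustration, not the official evaluation code.

```python
# Simplified sketch of SWE-Bench-style grading: apply the model's patch
# to the repo, then run the project's own tests. The official harness is
# containerized and runs curated test subsets; this is only an illustration.
import subprocess

def resolves_issue(repo_dir: str, patch_file: str) -> bool:
    apply = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q"], cwd=repo_dir, capture_output=True
    )
    return tests.returncode == 0  # resolved only if the tests pass
```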
MMLU-Pro
12,000 multiple-choice questions across 14 disciplines. The Pro version has 10 answer choices instead of the original MMLU's 4, plus harder questions, restoring discriminative power.
ARC-AGI-2
Visual abstract reasoning puzzles that resist pattern memorization. Humans score ~85%; until 2025 every LLM scored under 30%. The 2026 jump to 70%+ marks real generalization progress.
AIME 2025
15 problems from the American Invitational Mathematics Examination, each requiring multi-step competition-level math. Top models now solve 90%+; this benchmark replaced GSM8K, which all top models had saturated.
HumanEval+ (EvalPlus)
164 hand-written Python programming problems plus 80× more test cases than the original HumanEval. Catches models that pass weak tests but produce buggy code. Less discriminative than SWE-Bench at the top end but useful as a sanity check across the full leaderboard.
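Scores on HumanEval-style suites are usually reported as pass@k, the probability that at least one of k sampled completions passes every test. For reference, a sketch of the standard unbiased estimator from the original Codex evaluation paper (the leaderboard's exact reporting choice is not specified here):

```python
# Standard unbiased pass@k estimator (Chen et al., 2021), commonly used
# for HumanEval-style suites: n samples per problem, c of which pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled solutions passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 12 passing -> pass@1 = 12/20.
print(pass_at_k(20, 12, 1))  # 0.6
```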
Frequently asked questions
- How is the AI model leaderboard ranked?
- Where do the benchmark numbers come from?
- How often is the leaderboard updated?
- Why is Claude Sonnet 5 ranked above Claude Opus 4.7?
- Are the open-weight models really competitive with the closed APIs?
- Which model should I actually pick from this leaderboard?
- What benchmarks do you NOT include and why?
- How does this leaderboard differ from LMSYS Chatbot Arena?
From benchmarks to production
Pick a model from the leaderboard. Now actually run it.
Our 17-course AI Learning Path covers Local AI Deployment, RAG, Agents, MLOps, and Fine-Tuning — everything you need to take a model from this leaderboard and ship it. First chapter of every course is free, no card required.
Related tools & resources
- AI Model Finder — match GPU + use case → recommendation
- VRAM Calculator — exact VRAM at any quantization
- Best AI models 2026 — pillar comparison
- SWE-Bench explained — what the coding benchmark actually measures
- All 160+ AI models — full database
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.