

AI Model Leaderboard 2026

The 30 most capable AI models of 2026, ranked by verified benchmarks that matter for production work: SWE-Bench Verified (real coding), MMLU-Pro (reasoning), ARC-AGI-2 (general intelligence), AIME 2025 (math), and HumanEval+ (programming). Click any column header to sort, and filter by open-weight vs. API models.

📅 Published: May 9, 2026 · 🔄 Last Updated: May 9, 2026 · ✓ Manually Reviewed
30 models · click any column to sort
| # | Model | Notes | Vendor | Type | SWE-Bench Verified | MMLU-Pro | ARC-AGI-2 | AIME 2025 | HumanEval+ | Context | Cost ($/1M in/out) |
|---|-------|-------|--------|------|--------------------|----------|-----------|-----------|------------|---------|--------------------|
| 1 | Claude Sonnet 5 | Best agentic coding (92.4% SWE-Bench) | Anthropic | API | 92.4% | 91.5% | 72.3% | 88.5% | 96.1% | 1M | $3/$15 |
| 2 | Claude Opus 4.7 | Adaptive Thinking, top reasoning | Anthropic | API | 91.0% | 92.8% | 81.5% | 95.2% | 95.4% | 1M | $15/$75 |
| 3 | GPT-5.5 Pro | AIME 96.7%; top math/competition | OpenAI | API | 87.6% | 92.1% | 78.4% | 96.7% | 95.2% | 400K | $12/$48 |
| 4 | Gemini 3.1 Pro | 1M context, 77.1% ARC-AGI-2 | Google | API | 84.2% | 91.0% | 77.1% | 93.5% | 94.5% | 1M | $2.5/$10 |
| 5 | DeepSeek V4-Pro | 1.6T MoE / 49B active, MIT licensed | DeepSeek | Open | 76.4% | 87.9% | 64.2% | 88.1% | 92.0% | 256K | self-host |
| 6 | Qwen3-Coder-Next | 80B/3B active; best open coder | Alibaba | Open | 70.6% | 84.1% | 51.2% | 79.4% | 91.8% | 256K | self-host |
| 7 | Kimi K2.6 | 1T MoE / 32B active | Moonshot | Open | 68.1% | 86.4% | 59.3% | 84.7% | 89.5% | 200K | self-host |
| 8 | GLM-5 | 745B/44B active, runs on Huawei Ascend | Zhipu | Open | 65.4% | 85.2% | 54.1% | 81.2% | 87.6% | 128K | self-host |
| 9 | Mistral Medium 3.5 | Unified Magistral+Pixtral+Devstral | Mistral | Open | 64.2% | 85.7% | 50.8% | 80.5% | 88.1% | 256K | $2/$6 |
| 10 | Qwen3.6-27B | Dense 27B, beats older 397B MoE | Alibaba | Open | 58.7% | 81.4% | 44.5% | 73.8% | 84.2% | 128K | self-host |
| 11 | Llama 4 405B | Dense 405B, multilingual stronghold | Meta | Open | 55.3% | 88.2% | 47.0% | 76.9% | 86.4% | 256K | self-host |
| 12 | Gemini 3 Flash | 1M context at fastest API price | Google | API | 53.8% | 84.1% | 38.2% | 71.3% | 82.6% | 1M | $0.15/$0.6 |
| 13 | GPT-5.5 Standard | Cost-balanced GPT-5.5 tier | OpenAI | API | 52.4% | 86.7% | 41.9% | 78.5% | 88.1% | 400K | $3/$12 |
| 14 | DeepSeek V4-Flash | Smaller DeepSeek V4 variant | DeepSeek | Open | 49.7% | 78.4% | 36.5% | 67.4% | 79.8% | 128K | self-host |
| 15 | Claude Haiku 4.5 | Fastest Claude tier | Anthropic | API | 47.8% | 81.2% | 33.4% | 64.8% | 81.5% | 200K | $0.8/$4 |
| 16 | Llama 4 70B | Best 70B dense for self-host | Meta | Open | 45.2% | 80.1% | 28.7% | 60.4% | 78.9% | 128K | self-host |
| 17 | Phi-4 14B | Best small reasoning model | Microsoft | Open | 38.4% | 77.3% | 22.1% | 52.6% | 75.4% | 128K | self-host |
| 18 | Gemma 3 27B | Strong general-purpose 27B | Google | Open | 36.5% | 76.4% | 20.8% | 49.1% | 73.2% | 128K | self-host |
| 19 | Qwen3-Coder | Older Qwen coder; superseded by Next | Alibaba | Open | 35.1% | 74.2% | 18.5% | 46.7% | 71.6% | 128K | self-host |
| 20 | DeepSeek V3.1 | Strong general 671B MoE; superseded by V4 | DeepSeek | Open | 33.7% | 75.8% | 17.4% | 44.3% | 70.2% | 128K | self-host |
| 21 | Mistral Large 2 | 123B dense, multilingual | Mistral | Open | 31.2% | 76.9% | 16.0% | 41.8% | 69.5% | 128K | $2/$6 |
| 22 | GPT-4.5 | Conversational flagship; superseded | OpenAI | API | 29.4% | 81.7% | 23.5% | 58.4% | 75.8% | 128K | $75/$150 |
| 23 | Llama 3.3 70B | Most-deployed open model in production | Meta | Open | 27.8% | 73.1% | 12.4% | 36.5% | 65.2% | 128K | self-host |
| 24 | Qwen3-32B | Solid mid-size dense | Alibaba | Open | 26.5% | 72.4% | 11.2% | 33.7% | 63.8% | 128K | self-host |
| 25 | Phi-4 Mini 3.8B | Edge / CPU-friendly reasoning | Microsoft | Open | 22.1% | 65.4% | 8.4% | 28.9% | 58.3% | 128K | self-host |
| 26 | Llama 3.2 11B Vision | Multimodal at 11B scale | Meta | Open | 19.4% | 67.8% | 7.1% | 24.6% | 56.2% | 128K | self-host |
| 27 | Gemma 3 9B | Tightest 9B for laptop inference | Google | Open | 18.7% | 65.1% | 6.5% | 22.4% | 54.8% | 128K | self-host |
| 28 | Mistral 7B v0.3 | Battle-tested 7B baseline | Mistral | Open | 12.4% | 60.3% | 4.2% | 16.8% | 47.5% | 32K | self-host |
| 29 | Llama 3.2 3B | Edge devices, mobile, Raspberry Pi 5 | Meta | Open | 9.8% | 58.1% | 3.5% | 14.2% | 41.6% | 128K | self-host |
| 30 | Phi-3 Mini 3.8B | Edge-grade compact reasoning | Microsoft | Open | 8.5% | 56.4% | 2.8% | 11.7% | 38.2% | 128K | self-host |

Cost columns show USD per 1M input/output tokens. ARC-AGI-2 scores are taken from the public model cards of Anthropic, OpenAI, Google, DeepSeek, and Moonshot. All SWE-Bench scores are SWE-Bench Verified. Updated May 2026.


How we rank — methodology

The default rank is a weighted aggregate across five benchmarks. SWE-Bench Verified gets the heaviest weight (30%) because it most closely resembles what production developers actually do — fixing real bugs in real codebases. MMLU-Pro (25%) and ARC-AGI-2 (20%) cover broad reasoning and novel-problem capability. AIME 2025 (15%) tracks advanced mathematical reasoning. HumanEval+ (10%) is the legacy Python-coding sanity check.
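In code terms, the aggregate reduces to a single weighted average. The sketch below (Python; the field names and the simple 0-100 weighted average are our assumptions, not a published schema) reproduces the default rank score for Claude Sonnet 5's row from the table above.

```python
# Weights as stated in the methodology above.
WEIGHTS = {
    "swe_bench_verified": 0.30,
    "mmlu_pro": 0.25,
    "arc_agi_2": 0.20,
    "aime_2025": 0.15,
    "humaneval_plus": 0.10,
}

def aggregate_score(scores: dict[str, float]) -> float:
    """Weighted average of the five benchmark percentages (0-100 scale)."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

# Claude Sonnet 5's row from the leaderboard:
sonnet_5 = {
    "swe_bench_verified": 92.4,
    "mmlu_pro": 91.5,
    "arc_agi_2": 72.3,
    "aime_2025": 88.5,
    "humaneval_plus": 96.1,
}
print(f"{aggregate_score(sonnet_5):.1f}")  # -> 87.9
```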

We deliberately exclude saturated benchmarks (GSM8K, classic MMLU, original HumanEval) where every top model scores 95%+ and the metric loses discriminative power. We also exclude proprietary composite scores (Scale SEAL, Artificial Analysis Index) because their methodology is closed.

Sources for every score

  • SWE-Bench Verified — official leaderboard at swebench.com
  • MMLU-Pro — TIGER-Lab MMLU-Pro repository, 14 disciplines, 12K questions
  • ARC-AGI-2 — official ARC Prize site (arcprize.org), 2026 edition
  • AIME 2025 — published AIME 2025 evaluations across labs
  • HumanEval+ — EvalPlus extended HumanEval suite

For closed-API models we use vendor-published numbers (Anthropic, OpenAI, Google, DeepSeek, Moonshot AI, Zhipu, Alibaba, Mistral, Microsoft, Meta). When community evaluations disagree with vendor claims by more than 3 percentage points, we publish the lower number and document the discrepancy on that model's detail page.
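The discrepancy rule itself is simple enough to state as code. This is a minimal sketch assuming a single community number per model; how multiple community evaluations would be combined is not specified on this page.

```python
DISCREPANCY_THRESHOLD = 3.0  # percentage points, per the rule above

def published_score(vendor: float, community: float | None) -> tuple[float, bool]:
    """Return (score to publish, whether to document a discrepancy).

    Sketch of the conservative-number rule described above: if community
    evals differ from the vendor claim by more than 3 points, publish the
    lower number and flag it for the model's detail page.
    """
    if community is None:
        return vendor, False
    if abs(vendor - community) > DISCREPANCY_THRESHOLD:
        return min(vendor, community), True
    return vendor, False
```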

The five benchmarks explained

SWE-Bench Verified

A human-validated 500-problem subset of SWE-Bench's 2,294 real GitHub issues, drawn from 12 popular Python repos. The model gets the issue and must produce a patch that passes the project's own test suite. Full explainer →
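A per-task harness looks roughly like this: check out the repo at the issue's base commit, apply the model's patch, run the tests. The snippet below is a simplified sketch, not the official SWE-Bench harness (which pins per-repo environments and distinguishes the issue's failing tests from the pre-existing passing ones).

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str) -> bool:
    """Apply a model-generated patch and run the repo's test suite.

    Simplified illustration of the SWE-Bench-style evaluation loop.
    """
    def run(*cmd: str) -> int:
        return subprocess.run(cmd, cwd=repo_dir).returncode

    if run("git", "checkout", base_commit) != 0:
        return False
    if run("git", "apply", patch_file) != 0:
        return False  # the patch does not even apply cleanly
    return run("python", "-m", "pytest", "-x") == 0
```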

MMLU-Pro

12,000 multiple-choice questions across 14 disciplines. The Pro version has 10 answer choices instead of the original MMLU's 4 (cutting the random-guess baseline from 25% to 10%) and harder questions, restoring discriminative power.

ARC-AGI-2

Visual abstract reasoning puzzles that resist pattern memorization. Humans score ~85%; until 2025 every LLM scored under 30%. The 2026 jump to 70%+ marks real generalization progress.

AIME 2025

15 problems from the American Invitational Mathematics Examination, each a multi-step competition problem with an integer answer from 0 to 999. Top models now solve 90%+; this benchmark replaced GSM8K, which every top model had saturated.

HumanEval+ (EvalPlus)

164 hand-written Python programming problems plus 80× more test cases than the original HumanEval. Catches models that pass weak tests but produce buggy code. Less discriminative than SWE-Bench at the top end but useful as a sanity check across the full leaderboard.
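To make that failure mode concrete, here is a contrived example (not from the actual suite) of code that passes a single happy-path test but breaks on an edge case, which is exactly what the extended tests are designed to catch.

```python
def median(xs: list[float]) -> float:
    """A plausible model-written solution: correct only for odd-length input."""
    xs = sorted(xs)
    return xs[len(xs) // 2]

assert median([3, 1, 2]) == 2        # weak, original-style test: passes
assert median([1, 2, 3, 4]) == 2.5   # EvalPlus-style edge case: fails
                                     # (median() returns 3, so running this
                                     # raises AssertionError -- the point)
```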

Frequently asked questions

How is the AI model leaderboard ranked?
We rank by aggregate weighted score across five benchmarks that matter for production work: SWE-Bench Verified (real-world coding, 30% weight), MMLU-Pro (reasoning, 25%), ARC-AGI-2 (general intelligence, 20%), AIME 2025 (advanced math, 15%), and HumanEval+ (Python coding, 10%). Default sort is by overall rank, but you can click any column header to sort by that single benchmark. Filter by Open weights vs API to see the leaderboard for the deployment style you care about.
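Under the hood, a single-column sort is just a reordering of the same rows. A sketch with a trimmed-down row schema (the field names here are ours, not the site's):

```python
rows = [
    {"model": "Claude Sonnet 5", "swe_bench": 92.4, "aime": 88.5},
    {"model": "Claude Opus 4.7", "swe_bench": 91.0, "aime": 95.2},
    {"model": "GPT-5.5 Pro",     "swe_bench": 87.6, "aime": 96.7},
]

# Equivalent of clicking the "AIME 2025" column header (descending).
by_aime = sorted(rows, key=lambda r: r["aime"], reverse=True)
print([r["model"] for r in by_aime])
# -> ['GPT-5.5 Pro', 'Claude Opus 4.7', 'Claude Sonnet 5']
```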
Where do the benchmark numbers come from?
Every score traces to a public source: SWE-Bench Verified from the swebench.com leaderboard, MMLU-Pro from the TIGER-Lab MMLU-Pro repository, ARC-AGI-2 from the official ARC Prize site, AIME from published AIME 2025 evaluations, and HumanEval+ from EvalPlus. For closed models we use the vendor-published numbers from their model cards (Anthropic, OpenAI, Google, DeepSeek, Moonshot AI, Zhipu, Alibaba, Mistral). When vendor and independent numbers conflict, we use the more conservative one and note the discrepancy on the model's detail page.
How often is the leaderboard updated?
Major updates within 7 days of any new frontier release (Claude, GPT, Gemini, DeepSeek, Qwen). Benchmark refreshes weekly as new community evaluations land. The "Last updated" date at the top reflects the most recent change. We track all major model launches and publish a refresh log on the blog.
Why is Claude Sonnet 5 ranked above Claude Opus 4.7?
Sonnet 5 takes #1 because of its 92.4% SWE-Bench Verified score, the single highest of any model and the benchmark that correlates most strongly with day-to-day developer productivity. Opus 4.7 wins on raw reasoning (MMLU-Pro 92.8%, ARC-AGI-2 81.5%, AIME 95.2%) but costs 5× more per token ($15/$75 vs. $3/$15). For most teams, Sonnet 5 is the better default; reserve Opus 4.7 for the hardest reasoning problems, where the cost is justified.
Are the open-weight models really competitive with the closed APIs?
Yes, with caveats. DeepSeek V4-Pro at 76.4% SWE-Bench is within 16 points of Claude Sonnet 5 — close enough that for most coding work the open model is the right call when you factor in privacy, cost predictability, and the ability to fine-tune. Qwen3-Coder-Next at 70.6% SWE-Bench beats GPT-4.5 and approaches Gemini 3.1 Pro. The closed APIs still lead on the absolute hardest benchmarks (ARC-AGI-2, AIME), but the gap is the smallest it has ever been.
Which model should I actually pick from this leaderboard?
Use our AI Model Finder for personalized recommendations based on your hardware and use case. As a starting point: Coding (API) → Claude Sonnet 5. Coding (self-hosted) → Qwen3-Coder-Next or DeepSeek V4-Pro. Reasoning (API) → Claude Opus 4.7 or GPT-5.5 Pro. Reasoning (self-hosted) → DeepSeek V4-Pro. Long context → Gemini 3.1 Pro (1M tokens). Lowest-cost API → Gemini 3 Flash. Edge / mobile → Phi-4 Mini or Llama 3.2 3B.
What benchmarks do you NOT include and why?
We exclude single-task benchmarks that have been heavily optimized against (GSM8K, classic MMLU, original HumanEval) because top models all score 95%+ and the benchmark stops discriminating. We exclude proprietary or paid-only benchmarks (Scale SEAL, Artificial Analysis composite scores) because the methodology is opaque. We exclude human-preference benchmarks like LMSYS Arena because they measure response style as much as capability — useful but a different signal.
How does this leaderboard differ from LMSYS Chatbot Arena?
Arena measures human preference in side-by-side pairwise comparisons, which is useful but biased toward agreeable, polished responses. Our leaderboard measures task completion: did the model solve the actual problem? (SWE-Bench: fix a real GitHub bug; AIME: solve a hard math problem; ARC-AGI-2: solve a novel puzzle.) Both signals matter; we focus on task completion because that is what production deployments actually need.

From benchmarks to production

Pick a model from the leaderboard. Now actually run it.

Our 17-course AI Learning Path covers Local AI Deployment, RAG, Agents, MLOps, and Fine-Tuning — everything you need to take a model from this leaderboard and ship it. First chapter of every course is free, no card required.


Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum · ✓ Hands-On Projects · ✓ Open Source Contributor