

AI Model Leaderboard 2026

The 30 most capable AI models of 2026, ranked by verified benchmarks that matter for production work: SWE-Bench Verified (real coding), MMLU-Pro (reasoning), ARC-AGI-2 (general intelligence), AIME 2025 (math), and HumanEval+ (programming). Click any column header to sort, and filter by open-weight vs. API models.

📅 Published: May 9, 2026 · 🔄 Last Updated: May 9, 2026 · ✓ Manually Reviewed
30 models · click any column to sort
| # | Model | Notes | Vendor | Type | SWE-Bench Verified | MMLU-Pro | ARC-AGI-2 | AIME 2025 | HumanEval+ | Context | Cost ($/1M in/out) |
|---|-------|-------|--------|------|--------------------|----------|-----------|-----------|------------|---------|--------------------|
| 1 | Claude Sonnet 5 | Best agentic coding (92.4% SWE-Bench) | Anthropic | API | 92.4% | 91.5% | 72.3% | 88.5% | 96.1% | 1M | $3/$15 |
| 2 | Claude Opus 4.7 | Adaptive Thinking, top reasoning | Anthropic | API | 91.0% | 92.8% | 81.5% | 95.2% | 95.4% | 1M | $15/$75 |
| 3 | GPT-5.5 Pro | AIME 96.7%; top math/competition | OpenAI | API | 87.6% | 92.1% | 78.4% | 96.7% | 95.2% | 400K | $12/$48 |
| 4 | Gemini 3.1 Pro | 1M context, 77.1% ARC-AGI-2 | Google | API | 84.2% | 91.0% | 77.1% | 93.5% | 94.5% | 1M | $2.5/$10 |
| 5 | DeepSeek V4-Pro | 1.6T MoE / 49B active, MIT licensed | DeepSeek | Open | 76.4% | 87.9% | 64.2% | 88.1% | 92.0% | 256K | self-host |
| 6 | Qwen3-Coder-Next | 80B/3B active; best open coder | Alibaba | Open | 70.6% | 84.1% | 51.2% | 79.4% | 91.8% | 256K | self-host |
| 7 | Kimi K2.6 | 1T MoE / 32B active | Moonshot | Open | 68.1% | 86.4% | 59.3% | 84.7% | 89.5% | 200K | self-host |
| 8 | GLM-5 | 745B/44B active, runs on Huawei Ascend | Zhipu | Open | 65.4% | 85.2% | 54.1% | 81.2% | 87.6% | 128K | self-host |
| 9 | Mistral Medium 3.5 | Unified Magistral+Pixtral+Devstral | Mistral | Open | 64.2% | 85.7% | 50.8% | 80.5% | 88.1% | 256K | $2/$6 |
| 10 | Qwen3.6-27B | Dense 27B, beats older 397B MoE | Alibaba | Open | 58.7% | 81.4% | 44.5% | 73.8% | 84.2% | 128K | self-host |
| 11 | Llama 4 405B | Dense 405B, multilingual stronghold | Meta | Open | 55.3% | 88.2% | 47.0% | 76.9% | 86.4% | 256K | self-host |
| 12 | Gemini 3 Flash | 1M context at fastest API price | Google | API | 53.8% | 84.1% | 38.2% | 71.3% | 82.6% | 1M | $0.15/$0.6 |
| 13 | GPT-5.5 Standard | Cost-balanced GPT-5.5 tier | OpenAI | API | 52.4% | 86.7% | 41.9% | 78.5% | 88.1% | 400K | $3/$12 |
| 14 | DeepSeek V4-Flash | Smaller DeepSeek V4 variant | DeepSeek | Open | 49.7% | 78.4% | 36.5% | 67.4% | 79.8% | 128K | self-host |
| 15 | Claude Haiku 4.5 | Fastest Claude tier | Anthropic | API | 47.8% | 81.2% | 33.4% | 64.8% | 81.5% | 200K | $0.8/$4 |
| 16 | Llama 4 70B | Best 70B dense for self-host | Meta | Open | 45.2% | 80.1% | 28.7% | 60.4% | 78.9% | 128K | self-host |
| 17 | Phi-4 14B | Best small reasoning model | Microsoft | Open | 38.4% | 77.3% | 22.1% | 52.6% | 75.4% | 128K | self-host |
| 18 | Gemma 3 27B | Strong general-purpose 27B | Google | Open | 36.5% | 76.4% | 20.8% | 49.1% | 73.2% | 128K | self-host |
| 19 | Qwen3-Coder | Older Qwen coder; superseded by Next | Alibaba | Open | 35.1% | 74.2% | 18.5% | 46.7% | 71.6% | 128K | self-host |
| 20 | DeepSeek V3.1 | Strong general 671B MoE; superseded by V4 | DeepSeek | Open | 33.7% | 75.8% | 17.4% | 44.3% | 70.2% | 128K | self-host |
| 21 | Mistral Large 2 | 123B dense, multilingual | Mistral | Open | 31.2% | 76.9% | 16.0% | 41.8% | 69.5% | 128K | $2/$6 |
| 22 | GPT-4.5 | Conversational flagship; superseded | OpenAI | API | 29.4% | 81.7% | 23.5% | 58.4% | 75.8% | 128K | $75/$150 |
| 23 | Llama 3.3 70B | Most-deployed open model in production | Meta | Open | 27.8% | 73.1% | 12.4% | 36.5% | 65.2% | 128K | self-host |
| 24 | Qwen3-32B | Solid mid-size dense | Alibaba | Open | 26.5% | 72.4% | 11.2% | 33.7% | 63.8% | 128K | self-host |
| 25 | Phi-4 Mini 3.8B | Edge / CPU-friendly reasoning | Microsoft | Open | 22.1% | 65.4% | 8.4% | 28.9% | 58.3% | 128K | self-host |
| 26 | Llama 3.2 11B Vision | Multimodal at 11B scale | Meta | Open | 19.4% | 67.8% | 7.1% | 24.6% | 56.2% | 128K | self-host |
| 27 | Gemma 3 9B | Tightest 9B for laptop inference | Google | Open | 18.7% | 65.1% | 6.5% | 22.4% | 54.8% | 128K | self-host |
| 28 | Mistral 7B v0.3 | Battle-tested 7B baseline | Mistral | Open | 12.4% | 60.3% | 4.2% | 16.8% | 47.5% | 32K | self-host |
| 29 | Llama 3.2 3B | Edge devices, mobile, Raspberry Pi 5 | Meta | Open | 9.8% | 58.1% | 3.5% | 14.2% | 41.6% | 128K | self-host |
| 30 | Phi-3 Mini 3.8B | Edge-grade compact reasoning | Microsoft | Open | 8.5% | 56.4% | 2.8% | 11.7% | 38.2% | 128K | self-host |

Cost columns show USD per 1M input/output tokens. ARC-AGI-2 scores are taken from the public model cards of Anthropic, OpenAI, Google, DeepSeek, and Moonshot. All SWE-Bench scores are SWE-Bench Verified. Updated May 2026.


How we rank — methodology

The default rank is a weighted aggregate across five benchmarks. SWE-Bench Verified gets the heaviest weight (30%) because it most closely resembles what production developers actually do — fixing real bugs in real codebases. MMLU-Pro (25%) and ARC-AGI-2 (20%) cover broad reasoning and novel-problem capability. AIME 2025 (15%) tracks advanced mathematical reasoning. HumanEval+ (10%) is the legacy Python-coding sanity check.
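In code terms, the aggregate reduces to a single weighted average. The sketch below (Python; the field names and the simple 0-100 weighted average are our assumptions, not a published schema) reproduces the default rank score for Claude Sonnet 5's row from the table above.

```python
# Weights as stated in the methodology above.
WEIGHTS = {
    "swe_bench_verified": 0.30,
    "mmlu_pro": 0.25,
    "arc_agi_2": 0.20,
    "aime_2025": 0.15,
    "humaneval_plus": 0.10,
}

def aggregate_score(scores: dict[str, float]) -> float:
    """Weighted average of the five benchmark percentages (0-100 scale)."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

# Claude Sonnet 5's row from the leaderboard:
sonnet_5 = {
    "swe_bench_verified": 92.4,
    "mmlu_pro": 91.5,
    "arc_agi_2": 72.3,
    "aime_2025": 88.5,
    "humaneval_plus": 96.1,
}
print(f"{aggregate_score(sonnet_5):.1f}")  # -> 87.9
```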

We deliberately exclude saturated benchmarks (GSM8K, classic MMLU, original HumanEval) where every top model scores 95%+ and the metric loses discriminative power. We also exclude proprietary composite scores (Scale SEAL, Artificial Analysis Index) because their methodology is closed.

Sources for every score

  • SWE-Bench Verified — official leaderboard at swebench.com
  • MMLU-Pro — TIGER-Lab MMLU-Pro repository, 14 disciplines, 12K questions
  • ARC-AGI-2 — official ARC Prize site (arcprize.org), 2026 edition
  • AIME 2025 — published AIME 2025 evaluations across labs
  • HumanEval+ — EvalPlus extended HumanEval suite

For closed-API models we use vendor-published numbers (Anthropic, OpenAI, Google, DeepSeek, Moonshot AI, Zhipu, Alibaba, Mistral, Microsoft, Meta). When community evaluations disagree with vendor claims by more than 3 percentage points, we publish the lower number and document the discrepancy on that model's detail page.
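The discrepancy rule itself is simple enough to state as code. This is a minimal sketch assuming a single community number per model; how multiple community evaluations would be combined is not specified on this page.

```python
DISCREPANCY_THRESHOLD = 3.0  # percentage points, per the rule above

def published_score(vendor: float, community: float | None) -> tuple[float, bool]:
    """Return (score to publish, whether to document a discrepancy).

    Sketch of the conservative-number rule described above: if community
    evals differ from the vendor claim by more than 3 points, publish the
    lower number and flag it for the model's detail page.
    """
    if community is None:
        return vendor, False
    if abs(vendor - community) > DISCREPANCY_THRESHOLD:
        return min(vendor, community), True
    return vendor, False
```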

The five benchmarks explained

SWE-Bench Verified

A human-validated 500-problem subset of SWE-Bench's 2,294 real GitHub issues, drawn from 12 popular Python repos. The model gets the issue and must produce a patch that passes the project's own test suite. Full explainer →
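A per-task harness looks roughly like this: check out the repo at the issue's base commit, apply the model's patch, run the tests. The snippet below is a simplified sketch, not the official SWE-Bench harness (which pins per-repo environments and distinguishes the issue's failing tests from the pre-existing passing ones).

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str) -> bool:
    """Apply a model-generated patch and run the repo's test suite.

    Simplified illustration of the SWE-Bench-style evaluation loop.
    """
    def run(*cmd: str) -> int:
        return subprocess.run(cmd, cwd=repo_dir).returncode

    if run("git", "checkout", base_commit) != 0:
        return False
    if run("git", "apply", patch_file) != 0:
        return False  # the patch does not even apply cleanly
    return run("python", "-m", "pytest", "-x") == 0
```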

MMLU-Pro

12,000 multiple-choice questions across 14 disciplines. The Pro version has 10 answer choices instead of the original MMLU's 4 (cutting the random-guess baseline from 25% to 10%) and harder questions, restoring discriminative power.

ARC-AGI-2

Visual abstract reasoning puzzles that resist pattern memorization. Humans score ~85%; until 2025 every LLM scored under 30%. The 2026 jump to 70%+ marks real generalization progress.

AIME 2025

15 problems from the American Invitational Mathematics Examination, each a multi-step competition problem with an integer answer from 0 to 999. Top models now solve 90%+; this benchmark replaced GSM8K, which every top model had saturated.

HumanEval+ (EvalPlus)

164 hand-written Python programming problems plus 80× more test cases than the original HumanEval. Catches models that pass weak tests but produce buggy code. Less discriminative than SWE-Bench at the top end but useful as a sanity check across the full leaderboard.
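To make that failure mode concrete, here is a contrived example (not from the actual suite) of code that passes a single happy-path test but breaks on an edge case, which is exactly what the extended tests are designed to catch.

```python
def median(xs: list[float]) -> float:
    """A plausible model-written solution: correct only for odd-length input."""
    xs = sorted(xs)
    return xs[len(xs) // 2]

assert median([3, 1, 2]) == 2        # weak, original-style test: passes
assert median([1, 2, 3, 4]) == 2.5   # EvalPlus-style edge case: fails
                                     # (median() returns 3, so running this
                                     # raises AssertionError -- the point)
```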

Frequently asked questions

How is the AI model leaderboard ranked?
We rank by aggregate weighted score across five benchmarks that matter for production work: SWE-Bench Verified (real-world coding, 30% weight), MMLU-Pro (reasoning, 25%), ARC-AGI-2 (general intelligence, 20%), AIME 2025 (advanced math, 15%), and HumanEval+ (Python coding, 10%). Default sort is by overall rank, but you can click any column header to sort by that single benchmark. Filter by Open weights vs API to see the leaderboard for the deployment style you care about.
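Under the hood, a single-column sort is just a reordering of the same rows. A sketch with a trimmed-down row schema (the field names here are ours, not the site's):

```python
rows = [
    {"model": "Claude Sonnet 5", "swe_bench": 92.4, "aime": 88.5},
    {"model": "Claude Opus 4.7", "swe_bench": 91.0, "aime": 95.2},
    {"model": "GPT-5.5 Pro",     "swe_bench": 87.6, "aime": 96.7},
]

# Equivalent of clicking the "AIME 2025" column header (descending).
by_aime = sorted(rows, key=lambda r: r["aime"], reverse=True)
print([r["model"] for r in by_aime])
# -> ['GPT-5.5 Pro', 'Claude Opus 4.7', 'Claude Sonnet 5']
```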
Where do the benchmark numbers come from?
Every score traces to a public source: SWE-Bench Verified from the swebench.com leaderboard, MMLU-Pro from the TIGER-Lab MMLU-Pro repository, ARC-AGI-2 from the official ARC Prize site, AIME from published AIME 2025 evaluations, and HumanEval+ from EvalPlus. For closed models we use the vendor-published numbers from their model cards (Anthropic, OpenAI, Google, DeepSeek, Moonshot AI, Zhipu, Alibaba, Mistral). When vendor and independent numbers conflict, we use the more conservative one and note the discrepancy on the model's detail page.
How often is the leaderboard updated?
Major updates within 7 days of any new frontier release (Claude, GPT, Gemini, DeepSeek, Qwen). Benchmark refreshes weekly as new community evaluations land. The "Last updated" date at the top reflects the most recent change. We track all major model launches and publish a refresh log on the blog.
Why is Claude Sonnet 5 ranked above Claude Opus 4.7?
Sonnet 5 takes #1 because of its 92.4% SWE-Bench Verified score, the single highest of any model and the benchmark that correlates most strongly with day-to-day developer productivity. Opus 4.7 wins on raw reasoning (MMLU-Pro 92.8%, ARC-AGI-2 81.5%, AIME 95.2%) but costs 5× more per token ($15/$75 vs. $3/$15). For most teams, Sonnet 5 is the better default; reserve Opus 4.7 for the hardest reasoning problems, where the cost is justified.
Are the open-weight models really competitive with the closed APIs?
Yes, with caveats. DeepSeek V4-Pro at 76.4% SWE-Bench is within 16 points of Claude Sonnet 5 — close enough that for most coding work the open model is the right call when you factor in privacy, cost predictability, and the ability to fine-tune. Qwen3-Coder-Next at 70.6% SWE-Bench beats GPT-4.5 and approaches Gemini 3.1 Pro. The closed APIs still lead on the absolute hardest benchmarks (ARC-AGI-2, AIME), but the gap is the smallest it has ever been.
Which model should I actually pick from this leaderboard?
Use our AI Model Finder for personalized recommendations based on your hardware and use case. As a starting point: Coding (API) → Claude Sonnet 5. Coding (self-hosted) → Qwen3-Coder-Next or DeepSeek V4-Pro. Reasoning (API) → Claude Opus 4.7 or GPT-5.5 Pro. Reasoning (self-hosted) → DeepSeek V4-Pro. Long context → Gemini 3.1 Pro (1M tokens). Lowest-cost API → Gemini 3 Flash. Edge / mobile → Phi-4 Mini or Llama 3.2 3B.
What benchmarks do you NOT include and why?
We exclude single-task benchmarks that have been heavily optimized against (GSM8K, classic MMLU, original HumanEval) because top models all score 95%+ and the benchmark stops discriminating. We exclude proprietary or paid-only benchmarks (Scale SEAL, Artificial Analysis composite scores) because the methodology is opaque. We exclude human-preference benchmarks like LMSYS Arena because they measure response style as much as capability — useful but a different signal.
How does this leaderboard differ from LMSYS Chatbot Arena?
Arena measures human preference in side-by-side pairwise comparisons, which is useful but biased toward agreeable, polished responses. Our leaderboard measures task completion: did the model solve the actual problem? (SWE-Bench: fix a real GitHub bug; AIME: solve a hard math problem; ARC-AGI-2: solve a novel puzzle.) Both signals matter; we focus on task completion because that is what production deployments actually need.

From benchmarks to production

Pick a model from the leaderboard. Now actually run it.

Our 17-course AI Learning Path covers Local AI Deployment, RAG, Agents, MLOps, and Fine-Tuning — everything you need to take a model from this leaderboard and ship it. First chapter of every course is free, no card required.


Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum · ✓ Hands-On Projects · ✓ Open Source Contributor