LMArena 2026: Live AI Model Rankings (Claude vs GPT-5 vs Gemini)
Looking for LMArena? The official live leaderboard is at lmarena.ai (now branded "Arena," same project). This page is an independent guide, not the official site. As of June 2026, the current top 5 on the Overall leaderboard are: 1. Claude Opus 4.8 (Anthropic, ~1510 ELO), 2. GPT-5.5 Pro (OpenAI), 3. Gemini 3.1 Pro Preview (Google), 4. Claude Opus 4.7 (Anthropic), 5. GPT-5.5 (OpenAI) — the top tier is clustered within ~55 ELO points, the tightest spread on record. Full top-10 and what the scores mean are below; live numbers are always at lmarena.ai/leaderboard.
When someone asks "what's the best AI model?" the honest answer is: check the Arena leaderboard. With over 6.8 million blind human votes across 360+ models, it's the closest thing we have to a definitive, unbiased ranking of AI capabilities.
Heads up — the name changed. As of January 28, 2026, LMArena rebranded to simply "Arena" (still at arena.ai). It's the same project — formerly LMSYS Chatbot Arena — now run as an independent company that raised a $150M Series A (led by Felicis and UC Investments, at a ~$1.7B valuation). You'll still see "LMArena" and "Chatbot Arena" used interchangeably across the web.
But reading the Arena correctly requires understanding what the numbers mean, where the rankings fall short, and how to use them for your specific needs. This guide breaks it all down.
What is LMArena? {#what-is-lmarena}
Arena (formerly LMSYS Chatbot Arena, then LMArena, now hosted at arena.ai) is an open platform for crowdsourced AI benchmarking. Created by researchers from UC Berkeley SkyLab in May 2023, it has grown into the most trusted public benchmark for comparing AI models.
How the Blind Voting Works
- You submit a prompt — any question, task, or conversation
- Two anonymous models respond side-by-side (you don't know which is which)
- You vote for the better response (or tie)
- Model names are revealed after voting
- ELO ratings update based on the outcome
This blind comparison is critical. When users know they're evaluating ChatGPT vs Claude, brand loyalty biases the results. LMArena removes that brand bias by keeping model identities hidden until after the vote.
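The voting flow above can be sketched in a few lines. This is an illustrative model of the process only, not Arena's actual code: the `Battle` class, the function names, and the placeholder model names are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Battle:
    """One blind comparison (hypothetical structure, for illustration)."""
    prompt: str
    model_a: str       # hidden from the voter until a vote is cast
    model_b: str
    winner: str = ""   # "a", "b", or "tie" once the vote is recorded


def start_battle(prompt, models, rng):
    # Sample two distinct models; random ordering avoids a fixed left/right slot.
    model_a, model_b = rng.sample(models, 2)
    return Battle(prompt=prompt, model_a=model_a, model_b=model_b)


def cast_vote(battle, choice):
    # Record the vote first, and only then reveal which model was which.
    if choice not in {"a", "b", "tie"}:
        raise ValueError("choice must be 'a', 'b', or 'tie'")
    battle.winner = choice
    return battle.model_a, battle.model_b


models = ["model-x", "model-y", "model-z"]  # placeholder names
battle = start_battle("Explain recursion.", models, random.Random(0))
revealed = cast_vote(battle, "a")
print("Vote recorded:", battle.winner, "| models were:", revealed)
```

The key design point is ordering: identities are returned only by `cast_vote`, so nothing the voter sees before voting depends on the model names.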
Scale and Credibility
| Statistic | Value |
|---|---|
| Total votes | 6.8M+ (as of June 2026) |
| Models evaluated | 360+ |
| Categories | 9 (Overall, Expert, Coding, Math, Creative Writing, etc.) |
| Created by | UC Berkeley, Stanford, UCSD, CMU researchers |
| Update frequency | Continuous (new models within 1-2 weeks of release) |
| Methodology | Bradley-Terry model (closely related to chess ELO) |
No other public benchmark comes close to this volume of human evaluation data. This makes it extremely difficult to manipulate — you'd need to coordinate millions of votes to meaningfully shift a model's ranking.
Current Rankings (June 2026) {#current-rankings}
Overall Text Leaderboard — Top Models
| Rank | Model | Provider | ELO Score | Notes |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | ~1510+ | Top accessible model; also #1 on coding |
| 2 | GPT-5.5 Pro | OpenAI | ~1510 | Broad frontier capabilities |
| 3 | Gemini 3.1 Pro Preview | Google | ~1505 | Strong reasoning + multimodal |
| 4 | Claude Opus 4.7 | Anthropic | ~1505 | Prior-gen flagship |
| 5 | GPT-5.5 | OpenAI | ~1506 | Fast frontier tier |
| 6 | Grok 4.3 | xAI | ~1496 | Rapid improvement |
| 7 | Claude Opus 4.6 | Anthropic | ~1490 | Still highly capable |
| 8 | Qwen 3.7 Max | Alibaba | ~1488 | Top proprietary open-lab model |
| 9 | Gemini 3.1 Flash | Google | ~1473 | Fast + capable |
| 10 | DeepSeek V4.1 Pro | DeepSeek | ~1410 | Top open-weight model |
⚠️ Not on this list — Claude Fable 5 (and Mythos 5). Fable 5 launched June 9, 2026 and briefly topped the board (~1525 ELO), but on June 12, 2026 Anthropic suspended both Fable 5 and Mythos 5 worldwide under a U.S. export-control order restricting foreign-national access. They are not currently usable, so we don't rank or recommend them. Claude Opus 4.8 and all other Anthropic models stayed online.
Scores approximate and shift daily as new votes accumulate; model version names vary slightly across the frontier tier. The top models are clustered within ~55 ELO points — the tightest spread on record. Visit arena.ai/leaderboard for live rankings.
What the ELO Gap Means
Understanding ELO differences helps you decide if a ranking difference actually matters:
| ELO Difference | Win Rate for Higher Model | Practical Meaning |
|---|---|---|
| 0-20 points | 50-53% | Essentially tied — pick either |
| 20-50 points | 53-57% | Slight edge — barely noticeable |
| 50-100 points | 57-64% | Meaningful difference — you'll notice |
| 100-200 points | 64-76% | Significant gap — clear quality difference |
| 200+ points | 76%+ | Major gap — different tier entirely |
Key insight: The top 5 models are often within 20-30 ELO points of each other. This means choosing between them often comes down to pricing, speed, and specific task performance rather than overall quality.
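The win rates in the table follow from the standard ELO expected-score formula. Here is a minimal sketch, assuming the conventional 400-point scale with base 10 and ignoring ties; the function name is ours, not part of any Arena API.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Probability that the higher-rated model wins, ignoring ties.

    Assumes the conventional ELO scale (base 10, 400 points).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


for gap in (20, 50, 100, 200):
    print(f"{gap:>3} points -> {expected_win_rate(gap):.1%}")
# Prints:
#  20 points -> 52.9%
#  50 points -> 57.1%
# 100 points -> 64.0%
# 200 points -> 76.0%
```

Plugging in the gap between any two leaderboard scores gives a quick sense of whether the difference would be noticeable in practice.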
How ELO Ratings Work {#how-elo-works}
LMArena ranks models with the Bradley-Terry model, a pairwise-comparison method closely related to the ELO system used in chess.
The Math Behind It
When Model A faces Model B in a blind comparison:
- Expected outcome is calculated from their current rating difference
- If the favorite wins: small rating change (expected result)
- If the underdog wins: large rating change (surprising result)
- Ties: small adjustment based on rating gap
After thousands of matchups per model, ratings converge to reflect true relative quality. The system is self-correcting — a few bad votes don't meaningfully shift rankings when millions of votes exist.
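To make the favorite/underdog asymmetry concrete, here is a sketch of a classic online ELO update. The K-factor of 32 is an illustrative assumption, not a value taken from Arena, and Arena's published ratings come from a Bradley-Terry fit rather than this per-vote update.

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k is a hypothetical K-factor chosen for illustration.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


favorite, underdog = 1500.0, 1400.0

# Favorite wins (the expected result): it gains about 11.5 points.
print(elo_update(favorite, underdog, 1.0))

# Underdog wins (a surprise): it gains about 20.5 points.
print(elo_update(favorite, underdog, 0.0))

# Tie: the favorite loses a few points because it was expected to win.
print(elo_update(favorite, underdog, 0.5))
```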
Why Bradley-Terry Over Raw ELO
LMArena switched from traditional ELO to Bradley-Terry because:
- Maximum likelihood estimation — finds the most statistically likely true ranking
- Better confidence intervals — shows how certain the ranking is
- Handles asymmetric matchups — some model pairs get more votes than others
- More stable — less sensitive to vote ordering
The practical effect is the same: higher number = better model, and the gap between numbers predicts win rates.
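For readers who want to see what a maximum-likelihood fit looks like, below is a small sketch of a Bradley-Terry estimate. It uses the textbook minorization-maximization (MM) iteration on win/loss pairs. The function names, the 1000-point anchor, and the toy data are assumptions for illustration, and tie handling and confidence intervals are left out, so treat it as a sketch of the idea rather than Arena's pipeline.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update: p_i = wins_i / sum_j n_ij / (p_i + p_j).
    Ties are ignored in this sketch.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0.0)
                if n:
                    denom += n / (strength[m] + strength[other])
            # The floor keeps a model with zero wins from collapsing to 0.
            updated[m] = max(wins[m] / denom, 1e-9)
        # Rescale so the geometric mean is 1 (strengths are only relative).
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1000.0):
    # Map strengths onto an ELO-like scale; the anchor is arbitrary.
    return {m: anchor + 400.0 * math.log10(s) for m, s in strength.items()}


# Toy data: A beats B 7-3, B beats C 6-4, A beats C 8-2.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
ratings = to_elo_scale(fit_bradley_terry(battles))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Because the fit uses all battles at once, shuffling the vote order leaves the result unchanged, which is the stability property listed above.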
Category Leaderboards {#category-leaderboards}
LMArena breaks down rankings into specialized categories. A model's overall rank often differs significantly from its category rank.
Available Categories
| Category | What It Measures | Best For |
|---|---|---|
| Overall | General chat quality | Default model choice |
| Expert | Top 5.5% hardest prompts | Complex reasoning, technical work |
| Coding | Code generation, debugging | Developer tools, AI coding |
| Math | Mathematical reasoning | Data analysis, science |
| Creative Writing | Story, poetry, content | Content creation |
| Instruction Following | Following complex instructions | Workflow automation |
| Multi-Turn | Extended conversations | Chatbots, tutoring |
| Hard Prompts | Top 33% difficult prompts | General but challenging tasks |
| Occupational | Job-specific tasks | Professional applications |
Why Category Rankings Matter
A model can rank #1 overall but #5 in coding. For example:
- Claude models tend to gain points on Expert and Coding
- Gemini models tend to excel on Vision and Multi-Turn
- Older general-chat models drop significantly on Expert despite decent overall scores
- DeepSeek V4.1 Pro punches above its weight on Math and Reasoning
If you have a specific use case, check the category leaderboard, not just the overall ranking.
Coding Leaderboard Highlights
For developers, the coding leaderboard is especially relevant:
| Rank | Model | Coding ELO | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.8 | ~1500+ | #1 on coding and overall |
| 2-3 | GPT-5.5 Pro / Gemini 3.1 Pro | ~1490+ | Trade the next spots |
| 4 | Claude Opus 4.7 | ~1480 | Strong code review |
| 5 | DeepSeek V4.1 Pro | ~1465 | Best open-weight for code |
Coding ELO and overall ELO are separate scales. The coding category has well over 100K votes as of June 2026.
For objective coding benchmarks (not just user preference), see our SWE-bench leaderboard guide.
LMArena vs Other Benchmarks {#lmarena-vs-benchmarks}
Comparison of Major AI Benchmarks
| Benchmark | Type | What It Measures | Strengths | Weaknesses |
|---|---|---|---|---|
| LMArena | Human votes | User preference | Real-world relevance, blind voting | Style bias, English-heavy |
| SWE-bench | Automated | Code correctness | Objective, real bugs | Python-only, scaffold-dependent |
| HumanEval | Automated | Algorithm coding | Clean measurement | Synthetic, single-function |
| MMLU | Automated | Knowledge breadth | Wide coverage | Multiple-choice, memorizable |
| GPQA | Automated | Expert reasoning | PhD-level difficulty | Small test set |
| Arena Hard | Automated | Hard prompt quality | Cheap to run | Approximation of human pref |
When to Use Each Benchmark
- Choosing a general assistant: LMArena Overall
- Choosing a coding agent: SWE-bench Verified
- Choosing a coding assistant: LMArena Coding
- Choosing a reasoning model: LMArena Expert + GPQA
- Choosing a local model: Our Ollama ranking (considers VRAM, speed, quality)
Open-Source Model Rankings {#open-source-rankings}
A major value of LMArena is tracking how open-source models compare to proprietary ones.
Open-Source vs Proprietary Gap (June 2026)
| Metric | Top Proprietary | Top Open-Weight | Gap |
|---|---|---|---|
| Overall ELO | ~1510 (Claude Opus 4.8) | ~1410 (DeepSeek V4.1 Pro) | ~55 points (~58% win rate) |
| Coding ELO | ~1500 (Claude Opus 4.8) | ~1465 (DeepSeek V4.1 Pro) | ~35 points (~55% win rate) |
| Math ELO | ~1490 (varies) | ~1460 (DeepSeek V4.1 Pro) | ~30 points (~54% win rate) |
The gap is closing fast. In 2024, the gap was 100-150 ELO points. By June 2026, the best open-weight model (DeepSeek V4.1 Pro) sits within ~55 points of the top closed-source model — meaning the proprietary advantage is only a 54-58% win rate. For many tasks, open models are effectively equivalent.
Best Open Models on LMArena
| Model | ELO Range | Can Run Locally? | Hardware Needed |
|---|---|---|---|
| DeepSeek V4.1 Pro | ~1410 | Yes (via API or local) | 48GB+ for full model |
| Qwen 3.7 Max | ~1430 | Partially (quantized) | 32-64GB+ |
| Llama 4 Maverick | ~1420 | Yes (quantized) | 48GB+ |
| Qwen3-Coder-Next | ~1400 (coding) | Yes | 46GB (Q4) |
| Llama 4 Scout | ~1390 | Yes | 24GB (1.78-bit) |
For running these locally, see our best Ollama models guide and RTX 5090 vs 5080 comparison.
How to Use LMArena Rankings {#how-to-use}
Step 1: Identify Your Use Case
Don't look at overall rankings unless you need a general-purpose assistant. Use category leaderboards:
- Developer? → Coding leaderboard + SWE-bench
- Writer? → Creative Writing leaderboard
- Researcher? → Expert leaderboard + Math
- Building a chatbot? → Multi-Turn leaderboard
- General use? → Overall leaderboard
Step 2: Check the ELO Gap
If two models are within 20 ELO points, they're essentially tied. Pick based on:
- Price — GPT-OSS is free vs $20/month for ChatGPT Plus
- Speed — Flash variants are 3-5x faster
- Privacy — Local models keep data on your device
- Context window — ranges from 32K to 10M tokens
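The tie-break rule in this step can be expressed as a small helper: treat everything within 20 points of the leader as tied, then sort the tied group by a secondary criterion. The `Candidate` fields, model names, scores, and prices below are hypothetical placeholders, and price is used as the tie-breaker only as an example.

```python
from dataclasses import dataclass

TIE_THRESHOLD = 20  # ELO points treated as "essentially tied"


@dataclass
class Candidate:
    name: str
    elo: float
    cost_per_month: float  # hypothetical price in USD


def shortlist(candidates, tie_threshold=TIE_THRESHOLD):
    """Keep models within the tie threshold of the leader, cheapest first."""
    best = max(c.elo for c in candidates)
    tied = [c for c in candidates if best - c.elo <= tie_threshold]
    return sorted(tied, key=lambda c: c.cost_per_month)


candidates = [
    Candidate("model-a", elo=1510, cost_per_month=20.0),
    Candidate("model-b", elo=1495, cost_per_month=0.0),
    Candidate("model-c", elo=1440, cost_per_month=0.0),
]
for c in shortlist(candidates):
    print(c.name, c.elo, c.cost_per_month)
# model-b is listed first: it is within 20 points of the leader and cheaper.
# model-c is dropped: it trails the leader by 70 points.
```

Swapping the sort key for latency, context window, or a privacy flag adapts the same helper to the other criteria in the list.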
Step 3: Test on Your Actual Tasks
LMArena reflects average user preference. Your specific workflow might differ:
- Pick 2-3 top-ranked models from the relevant category
- Test them on 5-10 real tasks from your work
- Track quality, speed, and cost over 1-2 weeks
- Choose based on your experience, not just rankings
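When tallying your own head-to-head tests, it helps to see how much uncertainty a small sample carries. The sketch below computes a win rate with a Wilson score interval; the function name and the 7-3 example tally are illustrative assumptions, and ties are simply excluded.

```python
import math


def win_rate_with_interval(wins, losses, z=1.96):
    """Return (win_rate, low, high) using a Wilson score interval.

    z=1.96 corresponds to roughly 95% confidence. Ties are excluded.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one decided comparison")
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return p, centre - half, centre + half


# Example: model A preferred on 7 of 10 of your own tasks.
rate, low, high = win_rate_with_interval(wins=7, losses=3)
print(f"win rate {rate:.0%}, plausible range {low:.0%} to {high:.0%}")
```

With only ten comparisons the interval still includes 50%, so a 7-3 result does not yet separate the two models from a coin flip. That is a reason to keep tracking results over the full 1-2 weeks rather than deciding after a handful of prompts.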
Step 4: Revisit Quarterly
The leaderboard shifts every few months as new models launch. Set a quarterly reminder to check arena.ai/leaderboard and re-evaluate if a significantly better model has appeared.
Sources {#sources}
- Arena (formerly LMArena) Official Leaderboard — Live rankings with 6.8M+ votes
- LMSYS Blog: Chatbot Arena Launch — Original methodology paper
- "LMArena is now Arena" — rebrand announcement (Jan 28, 2026) — Name change details
- Arena Leaderboard Changelog — Recent model additions and updates
- How to Read ELO Ratings — Statology — ELO rating interpretation guide
- LMArena on Hugging Face — Alternative leaderboard interface