Benchmarks

LMArena Leaderboard Explained: How AI Models Are Ranked

March 18, 2026
14 min read
LocalAimaster Research Team

When someone asks "what's the best AI model?" the honest answer is: check the LMArena leaderboard. With over 6 million blind human votes across 327+ models, it's the closest thing we have to a definitive, unbiased ranking of AI capabilities.

But reading LMArena correctly requires understanding what the numbers mean, where the rankings fall short, and how to use them for your specific needs. This guide breaks it all down.


What is LMArena? {#what-is-lmarena}

LMArena (formerly LMSYS Chatbot Arena, now hosted at arena.ai) is an open platform for crowdsourced AI benchmarking. Created by researchers from UC Berkeley SkyLab in May 2023, it has grown into the most trusted public benchmark for comparing AI models.

How the Blind Voting Works

  1. You submit a prompt — any question, task, or conversation
  2. Two anonymous models respond side-by-side (you don't know which is which)
  3. You vote for the better response (or tie)
  4. Model names are revealed after voting
  5. ELO ratings update based on the outcome

This blind comparison is critical. When users know they're evaluating ChatGPT vs Claude, brand loyalty biases the results. LMArena eliminates this by keeping everything anonymous until after the vote.

Scale and Credibility

| Statistic | Value |
|---|---|
| Total votes | 6M+ (as of March 2026) |
| Models evaluated | 327+ |
| Categories | 9 (Overall, Expert, Coding, Math, Creative Writing, etc.) |
| Created by | UC Berkeley, Stanford, UCSD, CMU researchers |
| Update frequency | Continuous (new models within 1-2 weeks of release) |
| Methodology | Bradley-Terry model (evolved from chess ELO) |

No other public benchmark comes close to this volume of human evaluation data. This makes it extremely difficult to manipulate — you'd need to coordinate millions of votes to meaningfully shift a model's ranking.


Current Rankings (March 2026) {#current-rankings}

Overall Text Leaderboard — Top Models

| Rank | Model | Provider | ELO Score | Notes |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | ~1504 | Current #1 overall |
| 2 | Gemini 3.1 Pro Preview | Google | ~1500 | Strong reasoning |
| 3 | Claude Opus 4.6 (Thinking) | Anthropic | ~1500 | Extended reasoning mode |
| 4 | Grok 4.20 Beta | xAI | ~1493 | Rapid improvement |
| 5 | Gemini 3 Pro | Google | ~1485 | Multimodal strength |
| 6 | GPT-5.1 | OpenAI | ~1480 | Broad capabilities |
| 7 | Gemini 3 Flash | Google | ~1473 | Fast + capable |
| 8 | Grok 4.1 (Thinking) | xAI | ~1473 | Reasoning variant |
| 9 | Claude Sonnet 4.6 | Anthropic | ~1465 | Best value tier |
| 10 | DeepSeek R1 | DeepSeek | ~1450 | Top open-weight model |

Scores are approximate, based on available data. Visit arena.ai/leaderboard for live rankings.

What the ELO Gap Means

Understanding ELO differences helps you decide if a ranking difference actually matters:

| ELO Difference | Win Rate for Higher Model | Practical Meaning |
|---|---|---|
| 0-20 points | 50-53% | Essentially tied — pick either |
| 20-50 points | 53-57% | Slight edge — barely noticeable |
| 50-100 points | 57-64% | Meaningful difference — you'll notice |
| 100-200 points | 64-76% | Significant gap — clear quality difference |
| 200+ points | 76%+ | Major gap — different tier entirely |

Key insight: The top 5 models are often within 20-30 ELO points of each other. This means choosing between them often comes down to pricing, speed, and specific task performance rather than overall quality.
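The win rates in the table follow from the standard logistic ELO formula; a quick sketch (the formula is generic, not LMArena's exact code):

```python
def elo_win_rate(elo_gap: float) -> float:
    """Expected win rate for the higher-rated model given an ELO gap.

    Standard logistic ELO formula; LMArena's Bradley-Terry fit yields
    the same gap-to-win-rate relationship.
    """
    return 1 / (1 + 10 ** (-elo_gap / 400))

for gap in (20, 50, 100, 200):
    print(f"{gap:>3}-point gap -> {elo_win_rate(gap):.0%} expected win rate")
# 20 -> 53%, 50 -> 57%, 100 -> 64%, 200 -> 76%
```

Note how flat the curve is near zero: a 20-point gap only moves the win rate about 3 points above a coin flip.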


How ELO Ratings Work {#how-elo-works}

LMArena adapted the Bradley-Terry model (evolved from the ELO system used in chess) for AI model ranking.

The Math Behind It

When Model A faces Model B in a blind comparison:

  • Expected outcome is calculated from their current rating difference
  • If the favorite wins: small rating change (expected result)
  • If the underdog wins: large rating change (surprising result)
  • Ties: small adjustment based on rating gap

After thousands of matchups per model, ratings converge to reflect true relative quality. The system is self-correcting — a few bad votes don't meaningfully shift rankings when millions of votes exist.
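The update logic above can be sketched with a generic ELO update rule (the K-factor here is illustrative; LMArena's Bradley-Terry recomputation differs in detail):

```python
K = 32  # illustrative step size; real rating systems tune this

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic ELO model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    delta = K * (score_a - expected_score(r_a, r_b))  # small if expected, large if surprising
    return r_a + delta, r_b - delta

# Favorite (1500) beats underdog (1400): favorite gains only ~11.5 points.
# If the underdog wins instead, the swing is ~20.5 points.
```

Points are conserved: whatever the winner gains, the loser drops, which is why a few bad votes wash out over millions of matchups.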

Why Bradley-Terry Over Raw ELO

LMArena switched from traditional ELO to Bradley-Terry because:

  1. Maximum likelihood estimation — finds the most statistically likely true ranking
  2. Better confidence intervals — shows how certain the ranking is
  3. Handles asymmetric matchups — some model pairs get more votes than others
  4. More stable — less sensitive to vote ordering

The practical effect is the same: higher number = better model, and the gap between numbers predicts win rates.
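To make the difference concrete, here is a minimal Bradley-Terry fit using the classic MM (Zermelo) update, a textbook sketch on toy vote counts rather than LMArena's actual pipeline:

```python
import math

def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from wins[i][j] = number of times
    model i beat model j, using the classic MM (Zermelo) update."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)  # matchups vs each rival
            new_p.append(w_i / denom)
        mean = sum(new_p) / n
        p = [x / mean for x in new_p]  # strengths are scale-free; normalize
    return p

# Toy vote counts between three models:
wins = [[0, 70, 80],
        [30, 0, 60],
        [20, 40, 0]]
strengths = fit_bradley_terry(wins)
elos = [1500 + 400 * math.log10(s) for s in strengths]  # ELO-like scale
```

The fit maximizes the likelihood of all pairwise outcomes jointly, which is why it handles asymmetric matchup counts gracefully.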


Category Leaderboards {#category-leaderboards}

LMArena breaks down rankings into specialized categories. A model's overall rank often differs significantly from its category rank.

Available Categories

| Category | What It Measures | Best For |
|---|---|---|
| Overall | General chat quality | Default model choice |
| Expert | Top 5.5% hardest prompts | Complex reasoning, technical work |
| Coding | Code generation, debugging | Developer tools, AI coding |
| Math | Mathematical reasoning | Data analysis, science |
| Creative Writing | Story, poetry, content | Content creation |
| Instruction Following | Following complex instructions | Workflow automation |
| Multi-Turn | Extended conversations | Chatbots, tutoring |
| Hard Prompts | Top 33% difficult prompts | General but challenging tasks |
| Occupational | Job-specific tasks | Professional applications |

Why Category Rankings Matter

A model can rank #1 overall but #5 in coding. For example:

  • Claude models tend to gain points on Expert and Coding
  • Gemini models tend to excel on Vision and Multi-Turn
  • GPT-4o drops significantly on Expert despite strong overall scores
  • DeepSeek R1 punches above its weight on Math and Reasoning

If you have a specific use case, check the category leaderboard, not just the overall ranking.

Coding Leaderboard Highlights

For developers, the coding leaderboard is especially relevant:

| Rank | Model | Coding ELO | Notes |
|---|---|---|---|
| 1-2 | GPT-5.1 / Gemini 3 Pro | ~1480+ | Trade #1 spot |
| 3 | Claude Opus 4.6 | ~1470 | Strong code review |
| 4 | DeepSeek R1 | ~1455 | Best open-weight for code |
| 5 | Grok 4.20 | ~1450 | Rapidly improving |

Coding ELO and overall ELO are separate scales; the coding category had logged 118K+ votes as of January 2026.

For objective coding benchmarks (not just user preference), see our SWE-bench leaderboard guide.


LMArena vs Other Benchmarks {#lmarena-vs-benchmarks}

Comparison of Major AI Benchmarks

| Benchmark | Type | What It Measures | Strengths | Weaknesses |
|---|---|---|---|---|
| LMArena | Human votes | User preference | Real-world relevance, unbiased | Style bias, English-heavy |
| SWE-bench | Automated | Code correctness | Objective, real bugs | Python-only, scaffold-dependent |
| HumanEval | Automated | Algorithm coding | Clean measurement | Synthetic, single-function |
| MMLU | Automated | Knowledge breadth | Wide coverage | Multiple-choice, memorizable |
| GPQA | Automated | Expert reasoning | PhD-level difficulty | Small test set |
| Arena Hard | Automated | Hard prompt quality | Cheap to run | Approximation of human preference |

When to Use Each Benchmark

  • Choosing a general assistant: LMArena Overall
  • Choosing a coding agent: SWE-bench Verified
  • Choosing a coding assistant: LMArena Coding
  • Choosing a reasoning model: LMArena Expert + GPQA
  • Choosing a local model: Our Ollama ranking (considers VRAM, speed, quality)

Open-Source Model Rankings {#open-source-rankings}

A major value of LMArena is tracking how open-source models compare to proprietary ones.

Open-Source vs Proprietary Gap (March 2026)

| Metric | Top Proprietary | Top Open-Source | Gap |
|---|---|---|---|
| Overall ELO | ~1504 (Claude Opus 4.6) | ~1450 (DeepSeek R1) | ~54 points (~58% win rate) |
| Coding ELO | ~1480 (GPT-5.1) | ~1455 (DeepSeek R1) | ~25 points (~54% win rate) |
| Math ELO | ~1490 (varies) | ~1460 (DeepSeek R1) | ~30 points (~54% win rate) |

The gap is closing fast. In 2024, the gap was 100-150 ELO points. By March 2026, the best open models are within 25-55 points — meaning the proprietary advantage is only a 54-58% win rate. For many tasks, open models are effectively equivalent.

Best Open Models on LMArena

| Model | ELO Range | Can Run Locally? | Hardware Needed |
|---|---|---|---|
| DeepSeek R1 | ~1450 | Yes (via API or local) | 48GB+ for full model |
| Qwen 3 Max | ~1430 | Partially (quantized) | 32-64GB+ |
| Llama 4 Maverick | ~1420 | Yes (quantized) | 48GB+ |
| Qwen3-Coder-Next | ~1400 (coding) | Yes | 46GB (Q4) |
| Llama 4 Scout | ~1390 | Yes | 24GB (1.78-bit) |

For running these locally, see our best Ollama models guide and RTX 5090 vs 5080 comparison.


How to Use LMArena Rankings {#how-to-use}

Step 1: Identify Your Use Case

Don't look at overall rankings unless you need a general-purpose assistant. Use category leaderboards:

  • Developer? → Coding leaderboard + SWE-bench
  • Writer? → Creative Writing leaderboard
  • Researcher? → Expert leaderboard + Math
  • Building a chatbot? → Multi-Turn leaderboard
  • General use? → Overall leaderboard

Step 2: Check the ELO Gap

If two models are within 20 ELO points, they're essentially tied. Pick based on:

  • Price — GPT-OSS is free vs $20/month for ChatGPT Plus
  • Speed — Flash variants are 3-5x faster
  • Privacy — Local models keep data on your device
  • Context window — ranges from 32K to 10M tokens

Step 3: Test on Your Actual Tasks

LMArena reflects average user preference. Your specific workflow might differ:

  1. Pick 2-3 top-ranked models from the relevant category
  2. Test them on 5-10 real tasks from your work
  3. Track quality, speed, and cost over 1-2 weeks
  4. Choose based on your experience, not just rankings
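You can keep that personal test honest by blinding yourself the same way LMArena does. A minimal sketch, where the responder callables and the `choose` judge are placeholders for your own model wrappers and (normally human) judgment:

```python
import random

def blind_vote(prompt, responders, choose):
    """Run one blind A/B round, LMArena-style.

    responders: {name: callable(prompt) -> str} for exactly two models
                (e.g. thin wrappers around whatever APIs you test).
    choose: sees the two responses in shuffled, unlabeled order and
            returns 0, 1, or "tie" (normally a human judgment).
    Returns the winning model's name, revealed only after the vote.
    """
    names = list(responders)
    random.shuffle(names)  # hide which model produced which response
    responses = [responders[name](prompt) for name in names]
    pick = choose(responses)
    return "tie" if pick == "tie" else names[pick]

# Stand-in "models" for illustration:
responders = {
    "model_a": lambda p: p.upper(),
    "model_b": lambda p: p.lower(),
}
winner = blind_vote("Hello", responders,
                    choose=lambda rs: 0 if rs[0] == "HELLO" else 1)
```

Tally the winners over your 5-10 real tasks and you have a personal mini-leaderboard free of brand bias.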

Step 4: Revisit Quarterly

The leaderboard shifts every few months as new models launch. Set a quarterly reminder to check arena.ai/leaderboard and re-evaluate if a significantly better model has appeared.



FAQ {#faq}


What is LMArena (Chatbot Arena)?

LMArena (formerly LMSYS Chatbot Arena, now at arena.ai) is the largest crowdsourced AI model benchmark. Created by UC Berkeley researchers, it ranks AI models using blind A/B voting: users chat with two anonymous models and vote for the better response. With over 6 million votes across 327+ models, it's considered the most reliable measure of real-world AI quality because it reflects what actual users prefer, not just synthetic benchmark scores.

How does the LMArena ELO rating system work?

LMArena uses the Bradley-Terry model (evolved from chess ELO). When two models face off in a blind comparison, the winner gains rating points and the loser drops. The exchange depends on expected outcomes: if a top model beats a weak one, few points change. If an underdog wins, the swing is large. A 100-point ELO gap means the higher model wins ~64% of the time. A 200-point gap means ~76% win rate. Ratings stabilize after thousands of votes per model.

What is the difference between LMArena and SWE-bench?

LMArena measures human preference through blind voting — which model "feels" better to use for chat, writing, reasoning, and coding. SWE-bench measures objective coding correctness — can the model actually fix real GitHub bugs? A model might rank high on LMArena (clear explanations, good formatting) but lower on SWE-bench (struggles with complex codebases). Use LMArena rankings for choosing a chat/assistant model. Use SWE-bench for choosing a coding agent. Both together give the complete picture.

Which AI model has the highest LMArena ELO rating?

As of March 2026, Claude Opus 4.6 holds the top position at ~1504 ELO, followed by Gemini 3.1 Pro Preview (~1500), Claude Opus 4.6 Thinking (~1500), Grok 4.20 Beta (~1493), and Gemini 3 Pro (~1485). Rankings change frequently as new models are added. The coding leaderboard has different rankings — GPT-5.1 and Gemini 3 Pro trade the #1 spot for code-specific tasks.

Are LMArena rankings reliable?

Yes, LMArena is considered the most reliable public AI benchmark because: (1) Over 6 million real human votes — impossible to game at scale, (2) Blind comparison — users don't know which model they're judging, (3) Diverse user base — not just researchers, but real users with real tasks, (4) Statistical confidence — models need thousands of votes before ranking stabilizes. Limitations: voting skews toward English speakers, chat-style interactions, and users may prefer style over substance. Category leaderboards (coding, math, creative writing) address some of this.

How do open-source models rank on LMArena?

Open-source models have improved dramatically. As of early 2026, DeepSeek R1 achieves reasoning ELO rivaling proprietary models. Qwen 3 Max scores competitively on the Expert leaderboard. Llama 4 Maverick performs well on general tasks. However, proprietary models (Claude, Gemini, GPT-5) still hold the top 5-10 positions on most category leaderboards. The gap has narrowed significantly — top open models are within 25-55 ELO points of the leaders, meaning the win rate difference is often under 10%.

What are Arena Hard and Arena Expert?

Arena Hard filters for the toughest third of all LMArena prompts, producing wider score spreads between models. Arena Expert (launched November 2025) is even stricter — only the top 5.5% of prompts by reasoning depth and specificity. Expert rankings often differ from overall rankings: models like Claude Opus 4.1 and Qwen 3 Max gain significant points on Expert, while simpler models like GPT-4o drop. Use the Expert leaderboard if your use case involves complex reasoning, multi-step problems, or technical work.

How often is the LMArena leaderboard updated?

The leaderboard updates continuously as new votes come in, but major ranking changes are announced in their changelog at news.lmarena.ai. New models are typically added within 1-2 weeks of public release. As of March 2026, the platform evaluates 327 models across categories: Overall, Expert, Coding, Math, Creative Writing, Instruction Following, Multi-Turn, Hard Prompts, and Occupational. Visit arena.ai/leaderboard for the latest live rankings.

Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.
