LMArena 2026: Live AI Model Rankings (Claude vs GPT-5 vs Gemini)

March 18, 2026
14 min read
LocalAimaster Research Team

Looking for LMArena? The official live leaderboard is at arena.ai (the project is now branded "Arena" and was previously hosted at lmarena.ai). This page is an independent guide, not the official site. As of June 2026, the top 5 on the Overall leaderboard are:

  1. Claude Opus 4.8 (Anthropic, ~1510 ELO)
  2. GPT-5.5 Pro (OpenAI)
  3. Gemini 3.1 Pro Preview (Google)
  4. Claude Opus 4.7 (Anthropic)
  5. GPT-5.5 (OpenAI)

The top tier is clustered within ~55 ELO points, the tightest spread on record. The full top 10 and an explanation of the scores are below; live numbers are always at arena.ai/leaderboard.

When someone asks "what's the best AI model?" the honest answer is: check the Arena leaderboard. With over 6.8 million blind human votes across 360+ models, it's the closest thing we have to a definitive, unbiased ranking of AI capabilities.

Heads up: the name changed. As of January 28, 2026, LMArena rebranded to simply "Arena" (now at arena.ai). It's the same project, formerly LMSYS Chatbot Arena, now run as an independent company that raised a $150M Series A (led by Felicis and UC Investments, at a ~$1.7B valuation). You'll still see "LMArena" and "Chatbot Arena" used interchangeably across the web.

But reading the Arena correctly requires understanding what the numbers mean, where the rankings fall short, and how to use them for your specific needs. This guide breaks it all down.


What is LMArena? {#what-is-lmarena}

Arena (formerly LMSYS Chatbot Arena, then LMArena, now hosted at arena.ai) is an open platform for crowdsourced AI benchmarking. Created by researchers from UC Berkeley SkyLab in May 2023, it has grown into the most trusted public benchmark for comparing AI models.

How the Blind Voting Works

  1. You submit a prompt — any question, task, or conversation
  2. Two anonymous models respond side-by-side (you don't know which is which)
  3. You vote for the better response (or tie)
  4. Model names are revealed after voting
  5. ELO ratings update based on the outcome

This blind comparison is critical. When users know they're evaluating ChatGPT vs Claude, brand loyalty biases the results. LMArena eliminates this by keeping everything anonymous until after the vote.

Scale and Credibility

| Statistic | Value |
|---|---|
| Total votes | 6.8M+ (as of June 2026) |
| Models evaluated | 360+ |
| Categories | 9 (Overall, Expert, Coding, Math, Creative Writing, etc.) |
| Created by | UC Berkeley, Stanford, UCSD, CMU researchers |
| Update frequency | Continuous (new models within 1-2 weeks of release) |
| Methodology | Bradley-Terry model (evolved from chess ELO) |

No other public benchmark comes close to this volume of human evaluation data. This makes it extremely difficult to manipulate — you'd need to coordinate millions of votes to meaningfully shift a model's ranking.


Current Rankings (June 2026) {#current-rankings}

Overall Text Leaderboard — Top Models

| Rank | Model | Provider | ELO Score | Notes |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | ~1510+ | Top accessible model; also #1 on coding |
| 2 | GPT-5.5 Pro | OpenAI | ~1510 | Broad frontier capabilities |
| 3 | Gemini 3.1 Pro Preview | Google | ~1505 | Strong reasoning + multimodal |
| 4 | Claude Opus 4.7 | Anthropic | ~1505 | Prior-gen flagship |
| 5 | GPT-5.5 | OpenAI | ~1506 | Fast frontier tier |
| 6 | Grok 4.3 | xAI | ~1496 | Rapid improvement |
| 7 | Claude Opus 4.6 | Anthropic | ~1490 | Still highly capable |
| 8 | Qwen 3.7 Max | Alibaba | ~1488 | Top proprietary open-lab model |
| 9 | Gemini 3.1 Flash | Google | ~1473 | Fast + capable |
| 10 | DeepSeek V4.1 Pro | DeepSeek | ~1410 | Top open-weight model |

⚠️ Not on this list — Claude Fable 5 (and Mythos 5). Fable 5 launched June 9, 2026 and briefly topped the board (~1525 ELO), but on June 12, 2026 Anthropic suspended both Fable 5 and Mythos 5 worldwide under a U.S. export-control order restricting foreign-national access. They are not currently usable, so we don't rank or recommend them. Claude Opus 4.8 and all other Anthropic models stayed online.

Scores are approximate and shift daily as new votes accumulate; model version names vary slightly across the frontier tier. The top models are clustered within ~55 ELO points, the tightest spread on record. Visit arena.ai/leaderboard for live rankings.

What the ELO Gap Means

Understanding ELO differences helps you decide if a ranking difference actually matters:

| ELO Difference | Win Rate for Higher Model | Practical Meaning |
|---|---|---|
| 0-20 points | 50-53% | Essentially tied; pick either |
| 20-50 points | 53-57% | Slight edge; barely noticeable |
| 50-100 points | 57-64% | Meaningful difference; you'll notice |
| 100-200 points | 64-76% | Significant gap; clear quality difference |
| 200+ points | 76%+ | Major gap; different tier entirely |

Key insight: The top 5 models are often within 20-30 ELO points of each other. This means choosing between them often comes down to pricing, speed, and specific task performance rather than overall quality.
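
The win rates in the table above are consistent with the standard logistic expected-score formula from chess ELO. The sketch below reproduces them; the 400-point scale is the usual chess convention and is assumed here rather than taken from Arena's published code, and the function name is illustrative. Ties are ignored.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Probability that the higher-rated model wins, given its rating lead.

    Uses the logistic curve from chess ELO: 1 / (1 + 10^(-gap / scale)).
    The default scale of 400 is an assumption (the chess convention).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Print the win rate at each boundary used in the table.
for gap in (20, 50, 100, 200):
    print(f"{gap:>3} points -> {expected_win_rate(gap):.1%}")
```

Under these assumptions a 100-point lead works out to roughly a 64% win rate and a 200-point lead to roughly 76%, which matches the table boundaries.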


How ELO Ratings Work {#how-elo-works}

LMArena adapted the Bradley-Terry model (evolved from the ELO system used in chess) for AI model ranking.

The Math Behind It

When Model A faces Model B in a blind comparison:

  • Expected outcome is calculated from their current rating difference
  • If the favorite wins: small rating change (expected result)
  • If the underdog wins: large rating change (surprising result)
  • Ties: small adjustment based on rating gap

After thousands of matchups per model, ratings converge to reflect true relative quality. The system is self-correcting — a few bad votes don't meaningfully shift rankings when millions of votes exist.
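
The update behaviour described above can be sketched with the classic online ELO rule. This is an illustration of the general idea, not Arena's actual implementation: the K-factor of 32 and the 400-point scale are assumed values borrowed from chess, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B (1.0 = certain win, 0.5 = coin flip)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one matchup.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote can move a rating.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


favorite, underdog = 1500.0, 1400.0
print("favorite wins:", elo_update(favorite, underdog, 1.0))  # small change
print("underdog wins:", elo_update(favorite, underdog, 0.0))  # large change
print("tie:          ", elo_update(favorite, underdog, 0.5))  # small shift toward the underdog
```

With a 100-point gap, the favorite gains about 11.5 points for an expected win but loses about 20.5 for an upset, which is why surprising results move ratings more.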

Why Bradley-Terry Over Raw ELO

LMArena switched from traditional ELO to Bradley-Terry because:

  1. Maximum likelihood estimation — finds the most statistically likely true ranking
  2. Better confidence intervals — shows how certain the ranking is
  3. Handles asymmetric matchups — some model pairs get more votes than others
  4. More stable — less sensitive to vote ordering

The practical effect is the same: higher number = better model, and the gap between numbers predicts win rates.
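
To make the contrast with online ELO concrete, here is a minimal sketch of fitting a Bradley-Terry model by maximum likelihood over a whole batch of votes, so vote order does not matter. It uses a simple iterative (minorization-maximization) update; the toy battle data, the 1000-point anchor, and all function names are assumptions for illustration, and Arena's production pipeline may differ. Ties are left out, and every model needs at least one win for the fit to be defined.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate a strength per model from a list of (winner, loser) pairs."""
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of matchups per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0.0)
                if n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Rescale so the geometric mean is 1; only strength ratios matter.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def to_rating(strength, anchor=1000.0, scale=400.0):
    """Map a strength onto an ELO-like scale (anchor and scale are assumed)."""
    return anchor + scale * math.log10(strength)


# Hypothetical votes between three made-up models.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
strengths = fit_bradley_terry(battles)
for model in sorted(strengths, key=strengths.get, reverse=True):
    print(model, round(to_rating(strengths[model])))
```

Because the fit uses all votes at once, shuffling `battles` gives the same ratings, which is the stability property listed above.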


Category Leaderboards {#category-leaderboards}

LMArena breaks down rankings into specialized categories. A model's overall rank often differs significantly from its category rank.

Available Categories

| Category | What It Measures | Best For |
|---|---|---|
| Overall | General chat quality | Default model choice |
| Expert | Top 5.5% hardest prompts | Complex reasoning, technical work |
| Coding | Code generation, debugging | Developer tools, AI coding |
| Math | Mathematical reasoning | Data analysis, science |
| Creative Writing | Story, poetry, content | Content creation |
| Instruction Following | Following complex instructions | Workflow automation |
| Multi-Turn | Extended conversations | Chatbots, tutoring |
| Hard Prompts | Top 33% difficult prompts | General but challenging tasks |
| Occupational | Job-specific tasks | Professional applications |

Why Category Rankings Matter

A model can rank #1 overall but #5 in coding. For example:

  • Claude models tend to gain points on Expert and Coding
  • Gemini models tend to excel on Vision and Multi-Turn
  • Older general-chat models drop significantly on Expert despite decent overall scores
  • DeepSeek V4.1 Pro punches above its weight on Math and Reasoning

If you have a specific use case, check the category leaderboard, not just the overall ranking.

Coding Leaderboard Highlights

For developers, the coding leaderboard is especially relevant:

| Rank | Model | Coding ELO | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.8 | ~1500+ | #1 on coding and overall |
| 2-3 | GPT-5.5 Pro / Gemini 3.1 Pro | ~1490+ | Trade the next spots |
| 4 | Claude Opus 4.7 | ~1480 | Strong code review |
| 5 | DeepSeek V4.1 Pro | ~1465 | Best open-weight for code |

Coding ELO and overall ELO are separate scales. The coding category has well over 100K votes as of June 2026.

For objective coding benchmarks (not just user preference), see our SWE-bench leaderboard guide.


LMArena vs Other Benchmarks {#lmarena-vs-benchmarks}

Comparison of Major AI Benchmarks

| Benchmark | Type | What It Measures | Strengths | Weaknesses |
|---|---|---|---|---|
| LMArena | Human votes | User preference | Real-world relevance, unbiased | Style bias, English-heavy |
| SWE-bench | Automated | Code correctness | Objective, real bugs | Python-only, scaffold-dependent |
| HumanEval | Automated | Algorithm coding | Clean measurement | Synthetic, single-function |
| MMLU | Automated | Knowledge breadth | Wide coverage | Multiple-choice, memorizable |
| GPQA | Automated | Expert reasoning | PhD-level difficulty | Small test set |
| Arena Hard | Automated | Hard prompt quality | Cheap to run | Approximation of human pref |

When to Use Each Benchmark

  • Choosing a general assistant: LMArena Overall
  • Choosing a coding agent: SWE-bench Verified
  • Choosing a coding assistant: LMArena Coding
  • Choosing a reasoning model: LMArena Expert + GPQA
  • Choosing a local model: Our Ollama ranking (considers VRAM, speed, quality)

Open-Source Model Rankings {#open-source-rankings}

A major value of LMArena is tracking how open-source models compare to proprietary ones.

Open-Source vs Proprietary Gap (June 2026)

| Metric | Top Proprietary | Top Open-Weight | Gap |
|---|---|---|---|
| Overall ELO | ~1510 (Claude Opus 4.8) | ~1410 (DeepSeek V4.1 Pro) | ~55 points (~58% win rate) |
| Coding ELO | ~1500 (Claude Opus 4.8) | ~1465 (DeepSeek V4.1 Pro) | ~35 points (~55% win rate) |
| Math ELO | ~1490 (varies) | ~1460 (DeepSeek V4.1 Pro) | ~30 points (~54% win rate) |

The gap is closing fast. In 2024, the gap was 100-150 ELO points. By June 2026, the best open-weight model (DeepSeek V4.1 Pro) sits within ~55 points of the top closed-source model — meaning the proprietary advantage is only a 54-58% win rate. For many tasks, open models are effectively equivalent.

Best Open Models on LMArena

| Model | ELO Range | Can Run Locally? | Hardware Needed |
|---|---|---|---|
| DeepSeek V4.1 Pro | ~1410 | Yes (via API or local) | 48GB+ for full model |
| Qwen 3.7 Max | ~1430 | Partially (quantized) | 32-64GB+ |
| Llama 4 Maverick | ~1420 | Yes (quantized) | 48GB+ |
| Qwen3-Coder-Next | ~1400 (coding) | Yes | 46GB (Q4) |
| Llama 4 Scout | ~1390 | Yes | 24GB (1.78-bit) |

For running these locally, see our best Ollama models guide and RTX 5090 vs 5080 comparison.


How to Use LMArena Rankings {#how-to-use}

Step 1: Identify Your Use Case

Don't look at overall rankings unless you need a general-purpose assistant. Use category leaderboards:

  • Developer? → Coding leaderboard + SWE-bench
  • Writer? → Creative Writing leaderboard
  • Researcher? → Expert leaderboard + Math
  • Building a chatbot? → Multi-Turn leaderboard
  • General use? → Overall leaderboard

Step 2: Check the ELO Gap

If two models are within 20 ELO points, they're essentially tied. Pick based on:

  • Price: GPT-OSS is free vs $20/month for ChatGPT Plus
  • Speed: Flash variants are 3-5x faster
  • Privacy: local models keep data on your device
  • Context window: ranges from 32K to 10M tokens

Step 3: Test on Your Actual Tasks

LMArena reflects average user preference. Your specific workflow might differ:

  1. Pick 2-3 top-ranked models from the relevant category
  2. Test them on 5-10 real tasks from your work
  3. Track quality, speed, and cost over 1-2 weeks
  4. Choose based on your experience, not just rankings
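
One lightweight way to run the comparison in the steps above is to log each trial and summarise per model. The sketch below is only an illustration: the model names, tasks, 1-5 quality scale, and all numbers are made-up placeholders, and you would replace them with your own observations.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial log: (model, task, quality 1-5, seconds, cost in USD).
trials = [
    ("model-a", "refactor module", 4, 12.0, 0.03),
    ("model-a", "write unit tests", 5, 9.5, 0.02),
    ("model-b", "refactor module", 3, 4.0, 0.00),
    ("model-b", "write unit tests", 4, 3.5, 0.00),
]


def summarise(trials):
    """Group trials by model and report average quality, speed, and total cost."""
    by_model = defaultdict(list)
    for model, _task, quality, seconds, cost in trials:
        by_model[model].append((quality, seconds, cost))
    return {
        model: {
            "tasks": len(rows),
            "avg_quality": mean(r[0] for r in rows),
            "avg_seconds": mean(r[1] for r in rows),
            "total_cost": sum(r[2] for r in rows),
        }
        for model, rows in by_model.items()
    }


for model, stats in summarise(trials).items():
    print(model, stats)
```

With only 5-10 tasks per model the averages are noisy, so treat small differences as a tie and let price, speed, or privacy decide.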

Step 4: Revisit Quarterly

The leaderboard shifts every few months as new models launch. Set a quarterly reminder to check arena.ai/leaderboard and re-evaluate if a significantly better model has appeared.


FAQ {#faq}

What is LMArena (Chatbot Arena)?

Arena (formerly LMSYS Chatbot Arena, then LMArena — rebranded to "Arena" on January 28, 2026, now at arena.ai) is the largest crowdsourced AI model benchmark. Created by UC Berkeley researchers, it ranks AI models using blind A/B voting: users chat with two anonymous models and vote for the better response. With over 6.8 million votes across 360+ models, it's considered the most reliable measure of real-world AI quality because it reflects what actual users prefer, not just synthetic benchmark scores.

How does the LMArena ELO rating system work?

LMArena uses the Bradley-Terry model (evolved from chess ELO). When two models face off in a blind comparison, the winner gains rating points and the loser drops. The exchange depends on expected outcomes: if a top model beats a weak one, few points change. If an underdog wins, the swing is large. A 100-point ELO gap means the higher model wins ~64% of the time. A 200-point gap means ~76% win rate. Ratings stabilize after thousands of votes per model.

What is the difference between LMArena and SWE-bench?

LMArena measures human preference through blind voting — which model "feels" better to use for chat, writing, reasoning, and coding. SWE-bench measures objective coding correctness — can the model actually fix real GitHub bugs? A model might rank high on LMArena (clear explanations, good formatting) but lower on SWE-bench (struggles with complex codebases). Use LMArena rankings for choosing a chat/assistant model. Use SWE-bench for choosing a coding agent. Both together give the complete picture.

Which AI model has the highest LMArena ELO rating?

As of June 2026, Claude Opus 4.8 holds the top accessible position at ~1510+ ELO (also #1 on the coding leaderboard), followed closely by GPT-5.5 Pro, Gemini 3.1 Pro Preview, Claude Opus 4.7, and GPT-5.5 — the top models are clustered within ~55 ELO points, the tightest spread on record. Note: Claude Fable 5 briefly topped the board after its June 9 launch, but Anthropic suspended Fable 5 and Mythos 5 on June 12, 2026 under a U.S. export-control order, so neither is currently usable. Rankings change frequently as new models are added.

Are LMArena rankings reliable?

Yes, Arena (LMArena) is considered the most reliable public AI benchmark because: (1) Over 6.8 million real human votes — impossible to game at scale, (2) Blind comparison — users don't know which model they're judging, (3) Diverse user base — not just researchers, but real users with real tasks, (4) Statistical confidence — models need thousands of votes before ranking stabilizes. Limitations: voting skews toward English speakers, chat-style interactions, and users may prefer style over substance. Category leaderboards (coding, math, creative writing) address some of this.

How do open-source models rank on LMArena?

Open-source (open-weight) models have improved dramatically. As of June 2026, DeepSeek V4.1 Pro is the highest-ranked open-weight model and sits within ~55 ELO points of the top closed-source model. Qwen 3.7 Max scores competitively on the Expert leaderboard, and Llama 4 Maverick performs well on general tasks. However, proprietary models (Claude, Gemini, GPT-5) still hold the top 5-10 positions on most category leaderboards. The gap has narrowed significantly — meaning the win-rate difference is often under 10%.

What are Arena Hard and Arena Expert?

Arena Hard filters for the toughest third of all Arena prompts, producing wider score spreads between models. Arena Expert (launched November 2025) is even stricter — only the top 5.5% of prompts by reasoning depth and specificity. Expert rankings often differ from overall rankings: top reasoning models like Claude Opus 4.8 and Qwen 3.7 Max gain significant points on Expert, while older general-chat models drop. Use the Expert leaderboard if your use case involves complex reasoning, multi-step problems, or technical work.

How often is the LMArena leaderboard updated?

The leaderboard updates continuously as new votes come in, but major ranking changes are announced in the changelog at arena.ai/blog/leaderboard-changelog. New models are typically added within 1-2 weeks of public release. As of June 2026, the platform evaluates 360+ models across categories: Overall, Expert, Coding, Math, Creative Writing, Instruction Following, Multi-Turn, Hard Prompts, and Occupational. Visit arena.ai/leaderboard for the latest live rankings.

Published: March 18, 2026 · Last updated: June 19, 2026 · Manually reviewed