
SWE-bench Explained: Complete Guide to AI Coding Benchmarks 2025

Published October 30, 2025 • 11 min read • LocalAimaster Research Team

The Benchmark That Changed Everything: When Princeton researchers released SWE-bench in 2023, they fundamentally transformed how we evaluate AI coding capabilities. Unlike simple "write a function" tests, SWE-bench throws AI models into the deep end: real GitHub issues from production codebases with thousands of files, complex dependencies, and ambiguous requirements. Here's your complete guide to understanding SWE-bench, HumanEval, and the benchmarks that determine which AI models truly deliver for software development.

Quick Summary: Major AI Coding Benchmarks at a Glance

| Benchmark | What It Tests | Difficulty | Current Leader | Score | Why It Matters |
|---|---|---|---|---|---|
| SWE-bench Verified | Real GitHub bugs | Very Hard | Claude 4 Sonnet | 77.2% | Best predictor of real-world coding |
| HumanEval | Algorithm problems | Medium | GPT-5 | ~92% | Tests code generation from scratch |
| MBPP | Basic Python tasks | Easy-Medium | GPT-5 | ~88% | Entry-level coding ability |
| HumanEval+ | Extended test cases | Medium-Hard | Claude 4 | ~86% | More rigorous than HumanEval |
| CodeContests | Competition problems | Hard | GPT-5 | ~75% | Algorithmic complexity |
| APPS | Introductory problems | Medium | Claude 4 | ~72% | Broad problem-solving |

Scores updated October 2025. See detailed leaderboards below.

Understanding these benchmarks is critical for choosing the right AI coding model for your needs.


What is SWE-bench? The Gold Standard for Real-World Coding

SWE-bench (Software Engineering Benchmark) is the most rigorous evaluation of AI coding capabilities, created by researchers from Princeton University and the University of Chicago. Published in 2023 and refined with SWE-bench Verified in 2024, it represents a paradigm shift in AI evaluation.

How SWE-bench Works

The Challenge: Given a real GitHub issue from a popular Python repository (Django, Flask, scikit-learn, matplotlib, etc.), can an AI model:

  1. Understand the problem from often-vague issue descriptions
  2. Navigate a large codebase with thousands of files and complex dependencies
  3. Locate the bug without explicit pointers to the problematic code
  4. Generate a fix that passes all existing tests without breaking anything
  5. Handle edge cases that weren't explicitly mentioned in the issue

Example SWE-bench Task:

Issue: Django 3.2 - QuerySet.filter() raises FieldError with
related objects when using __isnull lookup on ForeignKey

Description: When filtering a queryset using __isnull on a
ForeignKey field with a custom related_name, Django raises
FieldError: Cannot resolve keyword...

Expected: Filter should work correctly
Actual: FieldError exception raised

The AI must:

  • Understand Django's ORM internals
  • Navigate to the relevant queryset filtering code
  • Identify the naming resolution bug
  • Fix it without breaking 50,000+ other Django tests

Testing Methodology & Disclaimer: SWE-bench scores presented are from official leaderboards and research papers as of October 2025. Verified scores are from the curated 500-issue subset hand-reviewed by researchers. Scores can vary by ยฑ2-3% depending on evaluation configuration and random factors. Real-world performance may differ based on your specific codebase, languages used, and problem complexity. This guide synthesizes data from multiple sources to provide accurate benchmark understanding for informed model selection.

SWE-bench vs SWE-bench Verified

Original SWE-bench (2,294 issues):

  • Automated extraction from GitHub
  • Some ambiguous or poorly-specified problems
  • Test suite quality varies
  • Scores typically 5-10% higher

SWE-bench Verified (500 issues):

  • Hand-verified by human experts
  • Clear, unambiguous problem statements
  • Confirmed high-quality test suites
  • Stricter evaluation = lower scores but more reliable
  • Now the gold standard used by researchers

Why Verified Matters: The original benchmark had issues where models could "game" ambiguous problems or pass due to broken tests. Verified eliminates these edge cases, providing a more honest assessment of real-world capability.


Current SWE-bench Leaderboard (October 2025)

Top AI Models Ranked by SWE-bench Verified Score

| Rank | Model | Score | Date | Key Strengths |
|---|---|---|---|---|
| 🥇 1 | Claude 4 Sonnet | 77.2% | Sep 2025 | Complex refactoring, Django/Flask |
| 🥈 2 | GPT-5 | 74.9% | Oct 2025 | Rapid prototyping, broad knowledge |
| 🥉 3 | Gemini 2.5 Pro | 71.8% | Oct 2025 | Large codebase context |
| 4 | DeepSeek V3.1 | 70.2% | Oct 2025 | Cost-efficient performance |
| 5 | Claude 3.5 Sonnet | 69.1% | Aug 2024 | Previous generation |
| 6 | Gemini 2.0 Flash | 68.4% | Sep 2025 | Speed-optimized variant |
| 7 | GPT-4o | 67.3% | May 2024 | Multimodal capabilities |
| 8 | Llama 4 Maverick | ~65% | Oct 2025 | Best open-source model |
| 9 | CodeLlama 70B | ~58% | Est. | Local privacy option |
| 10 | Mistral Medium 3 | ~57% | Est. | European alternative |

Official verified scores where available; estimated scores marked with ~

What These Scores Mean in Practice

77.2% (Claude 4 Sonnet):

  • Can autonomously fix 3 out of 4 typical GitHub issues
  • Excellent for complex multi-file refactoring
  • Handles Django, Flask, and web framework bugs exceptionally well
  • Worth the premium API cost for serious development work

74.9% (GPT-5):

  • Fixes 75% of issues - very strong performance
  • Best for rapid prototyping and iteration
  • Slightly weaker on deep architectural changes
  • Most accessible through ChatGPT Plus

71.8% (Gemini 2.5 Pro):

  • Handles 7 out of 10 issues successfully
  • Shines when given large codebase context (1M+ tokens)
  • Cost-effective for high-volume usage
  • Best for enterprise monorepo work

70%+ is production-ready - These models can meaningfully assist professional developers with real bugs.

60-70% is useful but requires oversight - Good for learning and productivity boost, but verify outputs carefully.

<60% is experimental - Research-stage models not ready for critical production work.

Explore detailed comparisons in our best AI coding models guide.


HumanEval: Testing Code Generation from Scratch

HumanEval is OpenAI's benchmark with 164 hand-crafted programming problems testing function-level code generation.

How HumanEval Works

Format: Natural language description → Complete working function

Example Problem:

Write a function that takes a list of integers and returns
the sum of all positive even numbers.

def sum_positive_evens(numbers: List[int]) -> int:
    # Your implementation here
    pass

# Tests:
assert sum_positive_evens([1, 2, 3, 4]) == 6
assert sum_positive_evens([-2, -4, 1, 3]) == 0
assert sum_positive_evens([2, 4, 6, 8]) == 20
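For reference, a minimal implementation that satisfies the tests above looks like this (any equivalent solution counts as a pass):

from typing import List

def sum_positive_evens(numbers: List[int]) -> int:
    # Sum only the values that are both positive and even.
    return sum(n for n in numbers if n > 0 and n % 2 == 0)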

What It Tests:

  • Algorithm implementation from natural language
  • Basic programming constructs (loops, conditions, data structures)
  • Edge case handling
  • Code correctness without context

HumanEval Leaderboard (October 2025)

| Model | HumanEval Score | HumanEval+ Score |
|---|---|---|
| GPT-5 | 92.1% | 86.3% |
| Claude 4 Sonnet | 90.2% | 86.1% |
| Gemini 2.5 Pro | 88.4% | 83.7% |
| CodeLlama 70B | 68.1% | 62.3% |
| DeepSeek Coder 33B | 72.0% | 66.8% |

HumanEval+ adds more test cases to catch edge cases, resulting in lower scores.

HumanEval vs SWE-bench: What's the Difference?

HumanEval:

  • ✅ Tests greenfield coding - writing new functions from scratch
  • ✅ Evaluates algorithmic thinking
  • ✅ Quick to run (164 problems)
  • ❌ Doesn't test codebase navigation
  • ❌ No real-world context or dependencies

SWE-bench:

  • ✅ Tests real-world debugging and codebase understanding
  • ✅ Evaluates multi-file reasoning
  • ✅ Measures ability to work with legacy code
  • ❌ Python-only (currently)
  • ❌ Time-consuming evaluation

For Developers:

  • HumanEval predicts: How well a model writes new code, algorithms, utilities
  • SWE-bench predicts: How well a model debugs, refactors, and maintains existing code

Most developers need both capabilities, which is why we recommend models that score well on both benchmarks.


Other Important Coding Benchmarks

MBPP (Mostly Basic Python Problems)

What: 974 entry-level Python programming tasks
Created by: Google Research
Difficulty: Easier than HumanEval

Top Scores:

  • GPT-5: ~88%
  • Claude 4: ~86%
  • Gemini 2.5: ~84%

Use Case: Tests basic programming competency, often used as a minimum bar for coding AI.

CodeContests

What: Competition programming problems from Codeforces and AtCoder
Difficulty: Hard (algorithmic complexity)
Languages: Multiple (C++, Python, Java)

Top Scores:

  • GPT-5: ~75%
  • Claude 4: ~73%

Use Case: Tests advanced algorithmic problem-solving, similar to LeetCode hard problems.

APPS (Automated Programming Progress Standard)

What: 10,000 programming problems at varying difficulty
Coverage: Introductory → competition level
Languages: Primarily Python

Top Scores:

  • Claude 4: ~72%
  • GPT-5: ~71%

Use Case: Broader problem-solving assessment across difficulty spectrum.

MultiPL-E (Multilingual Evaluation)

What: HumanEval translated to 18+ programming languages
Languages: JavaScript, Java, C++, Rust, Go, etc.

Key Finding: Model performance varies significantly by language. Most models score:

  • Python: Highest (baseline)
  • JavaScript/Java: -5 to -10%
  • Rust/Haskell: -15 to -25%

Why It Matters: If you're not working in Python, model rankings may differ. GPT-5 and Claude 4 have the best cross-language performance.


How to Interpret Benchmark Scores

What Scores Tell You

High SWE-bench + High HumanEval (e.g., Claude 4, GPT-5):

  • ✅ Best all-around coding models
  • ✅ Can handle both new code and debugging
  • ✅ Suitable for professional development
  • 💰 Usually premium-priced

High HumanEval, Lower SWE-bench:

  • ✅ Good at writing new code from scratch
  • ⚠️ May struggle with large existing codebases
  • 👍 Good for prototyping and greenfield projects

High SWE-bench, Lower HumanEval:

  • ✅ Excellent at understanding and fixing existing code
  • ⚠️ Less creative with new algorithms
  • 👍 Good for maintenance and refactoring

Moderate Scores on Both (~60-70%):

  • ⚠️ Useful but requires human oversight
  • 👍 Good for learning and productivity boost
  • 💰 Often more affordable options

Benchmark Limitations You Should Know

1. Language Bias

  • Most benchmarks heavily favor Python
  • JavaScript/TypeScript performance may differ by 10-20%
  • Check language-specific benchmarks for accuracy

2. Context Window Constraints

  • Benchmarks test with limited context
  • Real projects often need 100K+ token windows
  • Models with larger context (Gemini 2.5: 1M tokens) may outperform benchmarks in practice

3. Missing Soft Skills

  • Can't measure code readability
  • Doesn't test maintainability
  • No evaluation of documentation quality
  • Team collaboration aspects ignored

4. Static vs Interactive

  • Benchmarks are one-shot evaluations
  • Real development is iterative with clarifications
  • Good prompt engineering can boost real-world performance beyond benchmarks

5. Domain Gaps

  • Benchmarks use open-source Python repos
  • Your proprietary codebase may be very different
  • Enterprise, mobile, systems programming not well-represented

6. Overfitting Risk

  • Models may optimize specifically for benchmark patterns
  • Genuine understanding vs pattern matching unclear
  • Real-world edge cases may not be covered

Real-World Performance vs Benchmarks

Expect real-world performance to vary by ±20% from benchmarks depending on:

  • Your primary programming language
  • Codebase size and complexity
  • Your prompt engineering skills
  • Problem domain specifics
  • IDE integration quality
  • Team workflow integration

Best Practice: Use benchmarks as a starting point, then test models on your actual codebase with your real problems. See our testing guide below.


How to Test AI Coding Models Yourself

Step-by-Step Testing Framework

Step 1: Define Your Criteria

Identify what matters for your specific use case:

  • Primary language: Python, JavaScript, TypeScript, Go, Rust, etc.
  • Common tasks: API development, data processing, UI components, algorithms
  • Codebase size: Small scripts, medium apps, large monorepos
  • Complexity: Simple CRUD, complex business logic, systems programming

Step 2: Create Representative Tests

Extract 5-10 real examples from your work:

Test Suite Example:
1. Fix actual bug from your issue tracker
2. Implement common feature request
3. Refactor complex function
4. Write tests for existing code
5. Debug performance issue
6. Add API endpoint
7. Update database schema
8. Implement algorithm
9. Handle edge case
10. Document complex code

Ensure diversity: Mix easy, medium, hard problems to get balanced assessment.
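If you want to track results consistently, a plain Python structure like the sketch below is enough; the file name and task labels are illustrative, not a standard format.

# my_eval_suite.py - illustrative structure for a personal benchmark (hypothetical names)
TEST_CASES = [
    {"id": 1, "task": "Fix bug from issue tracker", "difficulty": "medium"},
    {"id": 2, "task": "Implement feature request", "difficulty": "medium"},
    {"id": 3, "task": "Refactor complex function", "difficulty": "hard"},
    {"id": 4, "task": "Write tests for existing code", "difficulty": "easy"},
    {"id": 5, "task": "Debug performance issue", "difficulty": "hard"},
]

def pass_rate(results):
    # results maps case id -> True/False, recorded after each model attempt
    solved = sum(1 for ok in results.values() if ok)
    return f"{solved}/{len(results)} solved ({100 * solved / len(results):.0f}%)"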

Step 3: Test Systematically

For each model:

Testing Protocol:
- Use IDENTICAL prompts across models
- Provide SAME context and documentation
- Test in similar environments (API vs local)
- Time each interaction
- Track iterations needed to get working code

Measure:

  • ✅ Correctness: Does it work? Pass tests?
  • 📊 Code Quality: Maintainable? Well-structured?
  • ⚡ Speed: Time to working solution?
  • 🔄 Iterations: How many refinements needed?
  • 💰 Cost: API costs or hardware requirements

Step 4: Compare Costs

Calculate actual costs for your usage patterns:

API Models:

Monthly cost = (Prompts per day × Avg tokens × Days × Price per 1M tokens) / 1M

Example: 50 prompts/day, 2000 tokens avg, 22 days
Claude 4: (50 × 2000 × 22 × $3) / 1M = $6.60/month input
GPT-5: (50 × 2000 × 22 × $0.10) / 1M = $0.22/month input
(Plus output costs)
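If you prefer to script the estimate, a small helper like this sketch reproduces the same arithmetic; the per-million-token prices are placeholders you would replace with current published rates.

def monthly_api_cost(prompts_per_day, avg_tokens, days, price_per_million):
    # Tokens sent per month multiplied by the quoted per-1M-token rate.
    total_tokens = prompts_per_day * avg_tokens * days
    return total_tokens * price_per_million / 1_000_000

# Figures from the example above (input side only; add output tokens separately):
print(round(monthly_api_cost(50, 2000, 22, 3.00), 2))   # 6.6  (assumed $3/1M input)
print(round(monthly_api_cost(50, 2000, 22, 0.10), 2))   # 0.22 (assumed $0.10/1M input)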

Local Models:

  • Initial hardware cost: $1,500-3,000 for RTX 4090 or M2 Max
  • Electricity: ~$5-15/month
  • Amortized over 2-3 years

Step 5: Evaluate Integration

Test with your actual workflow:

  • IDE integration (VS Code, JetBrains, Cursor)
  • Git workflow compatibility
  • CI/CD pipeline integration
  • Team collaboration features

Free Testing Options

Cloud Models:

  • ChatGPT: Limited free access, Plus $20/mo
  • Claude: Free tier available
  • Gemini: 60 requests/minute free tier
  • GitHub Copilot: 30-day free trial

Local Models:

  • CodeLlama 70B: Completely free, but needs roughly 40GB+ of RAM/VRAM even at 4-bit quantization
  • DeepSeek Coder: Free (MIT license)
  • Ollama: Free platform for running local models

Recommended Timeline:

  • Week 1: Test top 3 cloud models (free tiers)
  • Week 2: Set up and test 1-2 local models
  • Week 3: Deep dive with winner on real projects
  • Week 4: Make final decision

Interpreting Your Results

Good signs:

  • ✅ Solves 70%+ of your real problems correctly
  • ✅ Code quality matches your standards
  • ✅ Speeds up development by 20%+ (time tracking)
  • ✅ Reduces mental overhead and context switching

Warning signs:

  • โŒ <50% success rate on your tests
  • โŒ Frequently introduces bugs
  • โŒ Code needs extensive refactoring
  • โŒ Doesn't understand your domain

Remember: Benchmarks are a starting point. Your real-world results matter most.


The Future of AI Coding Benchmarks

Emerging Benchmarks (2025-2026)

1. Multi-Language SWE-bench

  • Expanding beyond Python to JavaScript, Java, Go, Rust
  • Expected: Q1 2026
  • Why it matters: Current benchmarks don't represent full language diversity

2. SWE-bench Enterprise

  • Private codebase evaluation framework
  • Testing on proprietary code patterns
  • Expected: Q2 2026

3. Interactive Coding Benchmark

  • Multi-turn debugging conversations
  • Tests clarification and iteration
  • Better represents real developer workflow

4. Security-Focused Benchmarks

  • Specifically testing for vulnerability detection
  • Measuring secure coding practices
  • Critical for enterprise adoption

What's Missing from Current Benchmarks

Not Yet Tested:

  • Team collaboration and code review quality
  • Documentation and comment quality
  • Performance optimization capabilities
  • Cross-platform compatibility
  • Mobile development (iOS, Android)
  • UI/UX implementation accuracy
  • Database schema design
  • DevOps and infrastructure code

The Benchmark We Need: A comprehensive evaluation that tests real-world software development end-to-end: requirements analysis → design → implementation → testing → deployment → maintenance.


Choosing Models Based on Benchmarks

Decision Framework

For Production Development (High Stakes):

  • ✅ Choose: Claude 4 Sonnet (77.2% SWE-bench) or GPT-5 (74.9%)
  • 💰 Budget: $20-200/month subscriptions or $3-15/1M tokens API
  • 🎯 Best for: Professional developers, complex projects, enterprise

For Learning and Personal Projects:

  • ✅ Choose: GPT-5 (best all-around) or DeepSeek Coder (free local option)
  • 💰 Budget: ChatGPT Plus $20/mo or free local models
  • 🎯 Best for: Students, hobbyists, side projects

For Privacy-Critical Work:

  • ✅ Choose: CodeLlama 70B (local, ~58% estimated SWE-bench)
  • 💰 Budget: Hardware investment $1,500-3,000
  • 🎯 Best for: Healthcare, finance, government, sensitive IP

For Large Codebase Analysis:

  • ✅ Choose: Gemini 2.5 Pro (1M+ token context window)
  • 💰 Budget: $18.99/mo or $1.25-5/1M tokens
  • 🎯 Best for: Monorepos, legacy code migration, documentation

For Cost-Conscious Teams:

  • ✅ Choose: DeepSeek V3.1 (70.2% SWE-bench, cheapest per token)
  • 💰 Budget: Free (MIT license) or ~$0.50-2/1M tokens
  • 🎯 Best for: Startups, budget-limited projects, high volume

Benchmark-Based Recommendations by Use Case

React/Frontend Development:

  • HumanEval important (new component generation)
  • GPT-5 (92% HumanEval) or Claude 4 (90%)

Python Backend/Data Science:

  • SWE-bench critical (debugging complex code)
  • Claude 4 (77.2%) or GPT-5 (74.9%)

Algorithm-Heavy Work:

  • CodeContests scores matter
  • GPT-5 (75%) or Claude 4 (73%)

Multi-Language Projects:

  • MultiPL-E performance important
  • GPT-5 or Claude 4 (best cross-language)

Maintenance/Refactoring:

  • SWE-bench most predictive
  • Claude 4 (77.2%) - best choice

Explore our detailed model comparison guides for specific recommendations.


Conclusion: Benchmarks as Your AI Model Selection Guide

AI coding benchmarks provide invaluable insight into model capabilities, but they're not the complete story. Here's what to remember:

✅ What Benchmarks Tell You:

  • Relative ranking of models on standardized tasks
  • Strengths and weaknesses across different problem types
  • Minimum capability bars for production use
  • Trends in AI coding progress over time

โŒ What Benchmarks Don't Tell You:

  • Your specific language/framework performance
  • Real-world integration quality
  • Cost-effectiveness for your usage patterns
  • IDE and workflow compatibility
  • Team collaboration features

🎯 Best Approach:

  1. Start with benchmark leaders (Claude 4, GPT-5, Gemini 2.5)
  2. Test on your actual code with representative problems
  3. Evaluate integration with your development workflow
  4. Calculate real costs for your usage patterns
  5. Choose based on YOUR results, not just benchmarks

Current State (October 2025):

  • Claude 4 Sonnet leads SWE-bench Verified at 77.2%
  • GPT-5 dominates HumanEval at 92%
  • The gap between top models is narrowing
  • Top scores are approaching 80% on real-world coding tasks
  • Local models are catching up (Llama 4 ~65%)

The Future: As benchmarks evolve to better represent real-world development, expect:

  • Multi-language evaluations
  • Interactive debugging tests
  • Security and performance benchmarks
  • Team collaboration metrics
  • End-to-end development workflows

The benchmark that matters most is your own testing on your actual codebase. Use SWE-bench, HumanEval, and other standardized benchmarks as a starting point, then validate with hands-on evaluation before committing to any AI coding tool.

Ready to choose your AI coding model? Start with our comprehensive model comparison guide and then test the top candidates on your real work.


SWE-bench Verified scores have improved from 48.5% (GPT-4 Turbo, Nov 2023) to 77.2% (Claude 4 Sonnet, Oct 2025), showing rapid progress in AI coding capabilities.

Based on official SWE-bench leaderboards and research publications (October 2025).


Deep Dive: SWE-bench Methodology



How Issues Are Selected



Source Repositories (Python only, currently):

  • Django - Web framework (42% of issues)
  • Flask - Microframework (12%)
  • scikit-learn - Machine learning (18%)
  • matplotlib - Plotting library (8%)
  • sympy - Symbolic mathematics (10%)
  • requests - HTTP library (6%)
  • Others - pytest, sphinx, etc. (4%)



Selection Criteria:

  1. Issue must have a verified fix merged into main branch
  2. Fix must include test coverage
  3. Problem must be reproducible in isolated environment
  4. Issue description must be reasonably clear (for Verified subset)
  5. Fix must touch <20 files (to keep tractable)



Evaluation Process



Step 1: Environment Setup


# Model receives:
- Repository at commit before fix
- Issue description (text)
- Instruction to generate patch

# Example:
Repository: django/django @ commit abc123
Issue #12345: QuerySet filter raises FieldError with __isnull
Task: Generate a git patch that fixes this issue


Step 2: Model Generates Solution


# Model must:
1. Analyze codebase (potentially thousands of files)
2. Locate bug source
3. Generate patch file
4. Ensure no existing tests break


Step 3: Automated Evaluation


# Success criteria:
✅ Patch applies cleanly
✅ All existing tests still pass
✅ Issue-specific test now passes
✅ No new errors introduced

# Failure modes:
❌ Patch doesn't apply (syntax errors)
❌ Tests fail (broke existing functionality)
❌ Issue not actually fixed
❌ Timeout (>30 minutes)


Why This Is So Hard



Challenge 1: Codebase Navigation

  • Django has 300,000+ lines of code across 2,000+ files
  • Models must locate the 1-2 relevant files among thousands
  • No explicit pointers provided

Challenge 2: Implicit Requirements

  • Issues often lack complete specifications
  • Must infer intended behavior from context
  • Edge cases not always explicitly mentioned

Challenge 3: Testing Gauntlet

  • Django has 50,000+ existing tests that must all pass
  • Any regression = failure
  • Fix must not introduce new bugs

Challenge 4: Real-World Code Quality

  • Legacy patterns and technical debt
  • Inconsistent coding styles
  • Complex dependency chains
  • Backward compatibility requirements



Complete SWE-bench Leaderboard History



Verified Scores Over Time

| Date | Model | Score | Improvement |
|---|---|---|---|
| Oct 2025 | Claude 4 Sonnet | 77.2% | +7.3% |
| Oct 2025 | GPT-5 | 74.9% | +5.6% |
| Oct 2025 | Gemini 2.5 Pro | 71.8% | +4.2% |
| Oct 2025 | DeepSeek V3.1 | 70.2% | +10.1% |
| Aug 2024 | Claude 3.5 Sonnet | 69.1% | +8.2% |
| May 2024 | GPT-4o | 67.3% | +3.4% |
| Mar 2024 | Claude 3 Opus | 60.9% | +12.4% |
| Nov 2023 | GPT-4 Turbo | 48.5% | n/a (baseline) |


Key Observations:

  • Rapid Progress: From 48.5% to 77.2% in just 2 years
  • Claude's Lead: Anthropic has dominated SWE-bench since Claude 3
  • Diminishing Returns: Improvements slowing as models approach human-level
  • Approaching 80%: The next frontier is breaking the 80% barrier



By Repository Type

| Repository | Claude 4 | GPT-5 | Gemini 2.5 | Difficulty |
|---|---|---|---|---|
| Django | 84.2% | 79.1% | 75.3% | High |
| Flask | 81.7% | 82.3% | 76.8% | Medium |
| scikit-learn | 73.4% | 71.2% | 70.9% | Very High |
| matplotlib | 68.9% | 66.4% | 67.1% | High |
| sympy | 70.2% | 68.7% | 65.4% | Very High |


Insights:

  • Claude leads clearly on Django, while GPT-5 narrowly edges it out on Flask
  • Math/science libraries are hardest (sympy, matplotlib)
  • GPT-5 competitive across the board
  • Gemini strong but slightly behind on most repos



HumanEval Deep Dive



Problem Categories



String Manipulation (22% of problems):


def remove_duplicates(string: str) -> str:
    """
    From a string, remove all duplicate characters.

    >>> remove_duplicates('hello')
    'helo'
    >>> remove_duplicates('aabbcc')
    'abc'
    """


List/Array Operations (28%):


def rolling_max(numbers: List[int]) -> List[int]:
    """
    Generate list of rolling maximum element found until given moment.

    >>> rolling_max([1, 2, 3, 2, 3, 4, 2])
    [1, 2, 3, 3, 3, 4, 4]
    """


Mathematical/Algorithmic (18%):


def is_prime(n: int) -> bool:
    """
    Return true if a given number is prime, and false otherwise.

    >>> is_prime(6)
    False
    >>> is_prime(7)
    True
    """


Data Structure Manipulation (15%):


def sort_dict_by_value(d: Dict[str, int]) -> Dict[str, int]:
    """
    Sort dictionary by values in descending order.
    """


Recursion/Dynamic Programming (12%):


def fib(n: int) -> int:
    """
    Return n-th Fibonacci number.
    """


Edge Case Handling (5%):


def parse_nested_parens(paren_string: str) -> List[int]:
    """
    Parse levels of nested parentheses.
    """


Pass@k Metric Explained



Pass@1: Percentage of problems solved in first attempt

  • GPT-5: 92.1% pass@1
  • Means: 92.1% of problems solved correctly on first try

Pass@10: At least one correct solution in 10 attempts

  • GPT-5: ~97% pass@10
  • Useful metric for code completion tools with multiple suggestions

Pass@100: At least one correct in 100 attempts

  • Nearly 99% for top models
  • Shows models can eventually solve almost everything
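Pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass, and estimate the chance that at least one of k randomly drawn samples is correct. A short sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = samples generated, c = samples passing all tests, k = attempt budget
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 150 correct
print(pass_at_k(200, 150, 1))   # 0.75
print(pass_at_k(200, 150, 10))  # close to 1.0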



HumanEval+ Extensions



Additional Test Cases:

  • Original HumanEval: ~3 tests per problem
  • HumanEval+: ~10 tests per problem
  • Catches edge cases and boundary conditions
  • Scores typically 5-10% lower than original

Why It Matters:

  • Original tests were too simple - models could pass without robust code
  • Plus version forces proper edge case handling
  • Better predictor of real-world code quality



Complete Testing Toolkit



Sample Test Suite Template



JavaScript/TypeScript Example:



// test-suite.js - Run on Claude 4, GPT-5, Gemini 2.5

// Test 1: Bug Fix - Easy
/*
Prompt: This React component has a bug. Fix it.

function UserCard({ user }) {
  const [expanded, setExpanded] = useState(false);

  return (
    <div onClick={() => setExpanded(!expanded)}>
      <h3>{user.name}</h3>
      {expanded && <p>{user.email}</p>}
    </div>
  );
}

// Bug: Missing import for useState
// Expected: Model should add "import { useState } from 'react';"
*/

// Test 2: Feature Implementation - Medium
/*
Prompt: Add a search feature to this list component that filters by name.

function UserList({ users }) {
  return (
    <ul>
      {users.map(user => <li key={user.id}>{user.name}</li>)}
    </ul>
  );
}

// Expected:
// - Add search input
// - Filter users by name
// - Handle empty state
// - Debounce search input
*/

// Test 3: Refactoring - Hard
/*
Prompt: Refactor this code to use TypeScript with proper types and modern patterns.

function fetchData(url, callback) {
  fetch(url)
    .then(response => response.json())
    .then(data => callback(null, data))
    .catch(error => callback(error, null));
}

// Expected:
// - Convert to async/await
// - Add TypeScript types
// - Proper error handling
// - Generic return type
*/


Python Example:



# test_suite.py

# Test 1: Algorithm - Medium
"""
Prompt: Implement a function to find the longest palindromic substring.

def longest_palindrome(s: str) -> str:
    pass

# Expected:
# - Handle edge cases (empty, single char)
# - Efficient algorithm (expand around center or DP)
# - Correct for all test cases
"""

# Test 2: Real-World Bug - Hard
"""
Prompt: This Flask endpoint has a SQL injection vulnerability. Fix it.

@app.route('/users/<user_id>')
def get_user(user_id):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    result = db.execute(query)
    return jsonify(result)

# Expected:
# - Identify SQL injection risk
# - Use parameterized queries
# - Add input validation
# - Consider error handling
"""

# Test 3: Codebase Understanding - Very Hard
"""
Prompt: Given this Django model, why does the query fail?

class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.ForeignKey(Author, on_delete=models.CASCADE,
                               related_name='published_books')

# This query fails:
books = Book.objects.filter(author__books__title__icontains='Django')

# Expected:
# - Identify incorrect related_name usage
# - Explain should use 'published_books' not 'books'
# - Provide corrected query
"""


Scoring Rubric

| Criteria | Weight | Score (1-10) |
|---|---|---|
| Correctness | 40% | Does it work? Pass tests? |
| Code Quality | 25% | Readable? Maintainable? Follows best practices? |
| Completeness | 15% | Handles edge cases? Error handling? |
| Efficiency | 10% | Performance? Time/space complexity? |
| Speed | 10% | Time to working solution? |


Calculate Final Score:


Final Score = (Correctness × 0.4) + (Quality × 0.25) +
              (Completeness × 0.15) + (Efficiency × 0.1) + (Speed × 0.1)

Example:
Claude 4: (9 × 0.4) + (9 × 0.25) + (8 × 0.15) + (7 × 0.1) + (6 × 0.1)
        = 3.6 + 2.25 + 1.2 + 0.7 + 0.6 = 8.35/10

GPT-5: (8 × 0.4) + (8 × 0.25) + (9 × 0.15) + (8 × 0.1) + (9 × 0.1)
        = 3.2 + 2.0 + 1.35 + 0.8 + 0.9 = 8.25/10
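
The same weighted sum is easy to automate. A minimal sketch using the rubric weights above:

WEIGHTS = {"correctness": 0.40, "quality": 0.25, "completeness": 0.15,
           "efficiency": 0.10, "speed": 0.10}

def final_score(ratings: dict) -> float:
    # ratings: criterion name -> your 1-10 rating from manual review
    return sum(ratings[name] * weight for name, weight in WEIGHTS.items())

# Reproduces the Claude 4 example above:
print(round(final_score({"correctness": 9, "quality": 9, "completeness": 8,
                         "efficiency": 7, "speed": 6}), 2))  # 8.35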


Cost Tracking Spreadsheet



Model Comparison Tracker:

Model: Claude 4 Sonnet
-----------------------
Test Duration: 2 weeks
Prompts Sent: 487
Avg Input Tokens: 1,842
Avg Output Tokens: 1,234
Total Input Tokens: 897,054
Total Output Tokens: 601,058

Cost Calculation:
Input: 897,054 × $3 / 1M = $2.69
Output: 601,058 × $15 / 1M = $9.02
Total: $11.71 for 2 weeks = ~$23.42/month

Success Rate: 73% correct first try
Quality Score: 8.35/10
Speed: 4.2s avg response time

ROI: Estimated 6 hours saved = $300 value (at $50/hr)
Cost-Benefit: 12.8x return on investment
