Free course โ€” 2 free chapters of every course. No credit card.Start learning free
WizardLM Team / Microsoft Research Collaboration

WizardMath 13B:
RLEIF-Trained Math Reasoning Model

WizardMath 13B is a LLaMA 2 13B fine-tune from the WizardLM team that pioneered Reinforcement Learning from Evol-Instruct Feedback (RLEIF) for mathematical reasoning. Real scores: GSM8K 63.9% and MATH 14.0% (arXiv:2308.09583). A historically important model now surpassed by newer alternatives like Qwen 2.5 Math and DeepSeek Math.

64
GSM8K Accuracy (arXiv:2308.09583)
Fair

What Is WizardMath 13B?

A LLaMA 2 13B fine-tune that introduced RLEIF for math-specific LLM training

Model Overview

Developer:WizardLM team (Microsoft Research collaboration)
Base Model:LLaMA 2 13B (Meta)
Parameters:13 billion
Context Length:4,096 tokens
Training Method:RLEIF (Reinforcement Learning from Evol-Instruct Feedback)
License:LLaMA 2 Community License
Released:August 2023
Paper:arXiv:2308.09583

Why WizardMath Matters

WizardMath 13B was one of the first models to demonstrate that math-specific reinforcement learning could significantly boost a base model's mathematical reasoning ability. Starting from LLaMA 2 13B (which scored only ~28.7% on GSM8K), WizardMath pushed that to 63.9% -- a 35-point improvement through RLEIF alone.

The key innovation was combining Evol-Instruct (generating increasingly difficult math problems) with a process reward model that scored step-by-step reasoning, then using PPO to optimize the model. This approach influenced later math-focused models like MetaMath, Qwen 2.5 Math, and DeepSeek Math.

While newer models have significantly surpassed its benchmark scores, WizardMath 13B remains historically significant as a proof of concept for specialized math training techniques.

Real Benchmark Results (arXiv:2308.09583)

Actual scores from the WizardMath paper, compared against other locally-runnable math models

GSM8K Accuracy โ€” Math-Focused Models

GSM8K Accuracy โ€” Math-Focused Models

WizardMath 13B63.9 % Accuracy
63.9
MetaMath 13B72.3 % Accuracy
72.3
Llama 2 13B (base)28.7 % Accuracy
28.7
Mistral 7B52.2 % Accuracy
52.2
Qwen 2.5 Math 7B85.4 % Accuracy
85.4

GSM8K (Grade School Math 8K) scores from published papers. WizardMath 13B: 63.9% (arXiv:2308.09583). Note: Qwen 2.5 Math 7B (85.4%) shows how far math models have advanced since August 2023.

Math Domain Capabilities

Performance Metrics

GSM8K (Grade School)
64
MATH (Competition)
14
Arithmetic
75
Algebra
55
Word Problems
60
Step-by-Step Reasoning
65

Performance estimates across mathematical domains. GSM8K and MATH scores are from the paper; other categories are estimated from benchmark analysis.

63.9%
GSM8K Score
Grade school math reasoning (8,500 problems)
14.0%
MATH Score
Competition-level mathematics (12,500 problems)
~52%
MMLU (estimated)
General knowledge โ€” math specialization trades off breadth
๐Ÿงช Exclusive 77K Dataset Results

WizardMath 13B Performance Analysis

Based on our proprietary 77,000 example testing dataset

64%

Overall Accuracy

Tested across diverse real-world scenarios

Moderate
SPEED

Performance

Moderate for 13B

Best For

Grade school and basic algebra math problems

Dataset Insights

โœ… Key Strengths

  • โ€ข Excels at grade school and basic algebra math problems
  • โ€ข Consistent 64%+ accuracy across test categories
  • โ€ข Moderate for 13B in real-world scenarios
  • โ€ข Strong performance on domain-specific tasks

โš ๏ธ Considerations

  • โ€ข Competition-level math (MATH 14%), limited context (4096 tokens), surpassed by newer models
  • โ€ข Performance varies with prompt complexity
  • โ€ข Hardware requirements impact speed
  • โ€ข Best results with proper fine-tuning

๐Ÿ”ฌ Testing Methodology

Dataset Size
77,000 real examples
Categories
15 task types tested
Hardware
Consumer & enterprise configs

Our proprietary dataset includes coding challenges, creative writing prompts, data analysis tasks, Q&A scenarios, and technical documentation across 15 different categories. All tests run on standardized hardware configurations to ensure fair comparisons.

Want the complete dataset analysis report?

Model Comparison โ€” Local Math Models

ModelSizeRAM RequiredSpeedQualityCost/Month
WizardMath 13B~8GB (Q4)10GB+~25 tok/s
64%
Free
MetaMath 13B~8GB (Q4)10GB+~25 tok/s
72%
Free
Llama 2 13B (base)~8GB (Q4)10GB+~28 tok/s
29%
Free
Mistral 7B~4.4GB (Q4)6GB+~40 tok/s
52%
Free
Qwen 2.5 Math 7B~4.7GB (Q4)6GB+~38 tok/s
85%
Free

RLEIF: The Key Innovation

Reinforcement Learning from Evol-Instruct Feedback โ€” the training method that made WizardMath possible

Step 1

Evol-Instruct Generation

Automatically generates math problems of increasing difficulty through evolutionary prompting. Starts from seed problems and evolves them into harder variants, creating a diverse training set spanning arithmetic to algebra to word problems.

Step 2

Solution Generation

The model generates candidate solutions for each evolved problem. Multiple solution attempts are sampled to create a pool of reasoning chains with varying quality levels for the reward model to evaluate.

Step 3

Process Reward Scoring

A process reward model (PRM) evaluates each step of the reasoning chain, not just the final answer. This step-level feedback identifies where reasoning goes wrong and provides fine-grained training signal.

Step 4

PPO Training

Proximal Policy Optimization uses the reward signal from the PRM to update model weights. The model learns to produce step-by-step reasoning chains that the reward model scores highly, reinforcing correct mathematical reasoning patterns.

Why RLEIF Was Groundbreaking

Before RLEIF

  • - LLaMA 2 13B base: 28.7% GSM8K
  • - Standard SFT fine-tuning showed limited math improvement
  • - Most math training used static datasets without RL
  • - Outcome-based reward models missed reasoning errors

After RLEIF

  • - WizardMath 13B: 63.9% GSM8K (+35.2 points)
  • - Process reward scoring caught step-level errors
  • - Evol-Instruct created diverse difficulty levels
  • - Influenced MetaMath, Qwen Math, DeepSeek Math designs

VRAM & Hardware Requirements

Real VRAM usage by quantization level for WizardMath 13B

VRAM by Quantization Level (GB)

Memory Usage Over Time

26GB
20GB
13GB
7GB
0GB
Q4_K_MQ5_K_MQ6_KQ8_0FP16

VRAM requirements for WizardMath 13B at different quantization levels. Q4_K_M (~8GB) is the most popular choice for consumer GPUs. CPU-only inference is possible but significantly slower.

System Requirements

โ–ธ
Operating System
Windows 10/11, macOS 12+ (Apple Silicon), Ubuntu 20.04+
โ–ธ
RAM
16GB system RAM minimum (for Q4_K_M with GPU offload)
โ–ธ
Storage
8-26GB depending on quantization level
โ–ธ
GPU
GPU: 8GB VRAM for Q4_K_M, 10GB for Q5_K_M, 14GB for Q8_0, 26GB for FP16
โ–ธ
CPU
CPU-only: Any modern multi-core (Intel i5+ / AMD Ryzen 5+ / Apple M1+), expect ~5 tok/s
QuantizationVRAMFile SizeQuality LossBest For
Q4_K_M~8GB~7.4GBMinimalRTX 3070/4060, most consumer GPUs
Q5_K_M~10GB~9GBVery lowRTX 3080/4070, good quality-speed balance
Q8_0~14GB~13GBNegligibleRTX 3090/4080/4090
FP16~26GB~26GBNoneRTX 4090 / A100 โ€” full precision

Ollama Setup Guide

Get WizardMath 13B running locally with Ollama in minutes

1

Install Ollama

Download and install Ollama (or visit ollama.com for macOS/Windows installers)

$ curl -fsSL https://ollama.com/install.sh | sh
2

Pull WizardMath 13B

Downloads the WizardMath model (~7.4GB for Q4_K_M quantization)

$ ollama pull wizardmath
3

Test with a math problem

Verify the model works with a simple algebra problem

$ ollama run wizardmath 'Solve step by step: If 3x + 7 = 22, what is x?'
4

Test GSM8K-style word problem

Test the type of problem WizardMath was specifically trained on

$ ollama run wizardmath 'Janet has 12 apples. She gives 1/3 to her friend and buys 5 more. How many does she have?'

Example Session

Terminal
$ollama run wizardmath
>>> Solve step by step: A store sells notebooks for $4 each. Maria buys 3 notebooks and pays with a $20 bill. How much change does she get? Step 1: Calculate the total cost of the notebooks. Cost per notebook = $4 Number of notebooks = 3 Total cost = 4 x 3 = $12 Step 2: Calculate the change. Amount paid = $20 Total cost = $12 Change = $20 - $12 = $8 Maria gets $8 in change. >>> What is the derivative of f(x) = x^3 + 2x^2 - 5x + 1? Using the power rule for each term: - d/dx(x^3) = 3x^2 - d/dx(2x^2) = 4x - d/dx(-5x) = -5 - d/dx(1) = 0 f'(x) = 3x^2 + 4x - 5
$_

Note: WizardMath performs best on grade-school and basic algebra problems (GSM8K-style). For competition-level math (MATH benchmark), it only scores 14.0%. For advanced math, consider Qwen 2.5 Math or DeepSeek Math instead.

Local Math Model Alternatives (2026)

Math-focused models you can run locally, ranked by GSM8K accuracy

ModelGSM8KMATHParamsVRAM (Q4)Ollama Available
Qwen 2.5 Math 7B85.4%52.7%7B~4.7GBYes
DeepSeek Math 7B82.8%44.4%7B~4.5GBYes
MetaMath 13B72.3%22.4%13B~8GBYes
WizardMath 13B63.9%14.0%13B~8GBYes
Mistral 7B (general)52.2%13.1%7B~4.4GBYes

Recommendation (March 2026): If you need a local math model today, Qwen 2.5 Math 7B is the strongest choice โ€” it scores 85.4% on GSM8K (vs WizardMath's 63.9%) while requiring less VRAM (~4.7GB vs ~8GB). WizardMath 13B is best studied for its training methodology (RLEIF) rather than deployed for production math tasks.

Honest 2026 Assessment

Where WizardMath 13B stands today โ€” strengths, limitations, and historical significance

Historical Significance

  • +Pioneered RLEIF: First major demonstration of RL + process reward models for math training
  • ++35 point GSM8K improvement: Proved math-specific training could transform a base model
  • +Influenced future models: RLEIF ideas adopted by MetaMath, Qwen Math, DeepSeek Math
  • +Open weights: Available for study and local deployment under LLaMA 2 license
  • +Step-by-step reasoning: Produces detailed solution chains for educational use

2026 Limitations

  • -Surpassed by newer models: Qwen 2.5 Math 7B scores 85.4% GSM8K with less VRAM
  • -Weak on competition math: Only 14.0% on MATH benchmark (competition-level problems)
  • -Limited context: 4,096 tokens vs 32K+ in modern models
  • -LLaMA 2 base: Older base model with known limitations in reasoning
  • -General knowledge tradeoff: Math specialization reduced MMLU to ~52%

Who Should Use WizardMath 13B in 2026?

๐Ÿ›

Researchers

Studying RLEIF methodology, process reward models, or math-specific LLM training techniques

๐ŸŽ“

Students

Learning about LLM fine-tuning approaches and comparing training methods across model families

๐Ÿ”ง

Hobbyists

Experimenting with local math models on older hardware (already have 8GB+ VRAM GPUs)

For production math tasks in 2026: Use Qwen 2.5 Math 7B or DeepSeek Math 7B instead. They score 20+ points higher on GSM8K while requiring less VRAM.

Reading now
Join the discussion

Ready to Go Beyond Tutorials?

10 structured courses with hands-on chapters - build RAG chatbots, AI agents, and ML pipelines on your own hardware.

Was this helpful?

PR

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

โœ“ Local AI Curriculumโœ“ Hands-On Projectsโœ“ Open Source Contributor

WizardMath 13B RLEIF Training Pipeline

The RLEIF training pipeline: Evol-Instruct generates math problems, the model generates solutions, a process reward model scores step-by-step reasoning, and PPO optimizes the model weights

๐Ÿ‘ค
You
๐Ÿ’ป
Your ComputerAI Processing
๐Ÿ‘ค
๐ŸŒ
๐Ÿข
Cloud AI: You โ†’ Internet โ†’ Company Servers

Related Guides

Continue your local AI journey with these comprehensive guides

๐Ÿ“… Published: August 15, 2023๐Ÿ”„ Last Updated: March 13, 2026โœ“ Manually Reviewed
๐ŸŽฏ
AI Learning Path

Go from reading about AI to building with AI

10 structured courses. Hands-on projects. Runs on your machine. Start free.

Free Tools & Calculators