WizardMath 7B
Reinforced Evol-Instruct for Math
WizardMath 7B is a math-specialized fine-tune of Llama 2 7B trained with Reinforced Evol-Instruct, a technique that evolves math problems into progressively harder variants during training. It scored 54.9% on GSM8K, a 3.7x improvement over base Llama 2 7B (14.6%).
Published August 2023 by the WizardLM team (arXiv:2308.09583). While long surpassed by newer models as of 2026, WizardMath demonstrated that math-specific fine-tuning could dramatically boost a small model's reasoning ability.
What Is WizardMath 7B?
Model Details
- Developer: WizardLM Team
- Base Model: Llama 2 7B
- Release: August 2023
- Specialization: Mathematical reasoning
- Context Length: 4,096 tokens
- License: Llama 2 Community License
- Paper: arXiv:2308.09583
Key Results
The headline achievement: WizardMath 7B improved GSM8K performance from 14.6% (base Llama 2 7B) to 54.9%, a 3.7x improvement through math-specific fine-tuning alone.
On the harder MATH benchmark (competition-level problems), it scored 10.7%, modest but still a significant improvement over the base model's 2.5%.
Note: These are honest numbers. WizardMath 7B handles grade-school arithmetic and basic algebra well, but struggles with competition math, advanced calculus, and proofs.
Reinforced Evol-Instruct Training
WizardMath's training innovation combines evolutionary instruction generation with reinforcement learning, creating increasingly difficult math problems for the model to learn from.
Step 1: Evol-Instruct
Starting from seed math problems, the system evolves them into harder variants by adding constraints, deepening the required reasoning, and making abstract quantities concrete.
Example: "What is 2+3?" evolves into "If 2x + 3 = 15 and y = x², find y"
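The evolution step works by prompting a generator model with a seed problem plus an evolution directive. A minimal sketch of that prompt construction, assuming a simple template design (the directive wording here is illustrative, not the paper's exact prompts):

```python
# Illustrative Evol-Instruct directives; the paper's actual evolution
# prompts differ in wording.
EVOLUTION_DIRECTIVES = [
    "Add one extra constraint or condition to the problem.",
    "Deepen the reasoning required to reach the answer.",
    "Replace abstract quantities with concrete numbers and context.",
]

def build_evolution_prompt(seed_problem: str, directive: str) -> str:
    """Wrap a seed problem in an evolution directive for the generator model."""
    return (
        "Rewrite the following math problem into a harder variant.\n"
        f"Directive: {directive}\n"
        f"Problem: {seed_problem}\n"
        "Harder problem:"
    )

prompt = build_evolution_prompt("What is 2+3?", EVOLUTION_DIRECTIVES[0])
```

Each evolved problem is then checked and fed back in as a new seed, so difficulty compounds over rounds.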
Step 2: Process Supervision
Each step of the solution is evaluated, not just the final answer. This teaches the model to generate correct intermediate reasoning.
The model learns that getting to the answer through correct steps matters more than guessing right.
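A toy sketch of step-level scoring makes the idea concrete. In the real pipeline the verifier is a learned step-reward model; here a plain predicate stands in for it, and the averaging scheme is an assumption for illustration:

```python
def process_reward(steps, check) -> float:
    """Average per-step correctness: reward correct intermediate
    reasoning, not just the final answer. `check` stands in for a
    learned step verifier."""
    scores = [1.0 if check(step) else 0.0 for step in steps]
    return sum(scores) / len(scores)

def toy_check(step: str) -> bool:
    """Toy verifier: a step like '15-3=12' passes if the equality holds."""
    lhs, rhs = step.split("=")
    return eval(lhs) == eval(rhs)  # eval is for this toy example only

# Both steps of solving 2x + 3 = 15 check out, so reward is 1.0;
# a wrong intermediate step lowers the reward even if the answer is right.
reward = process_reward(["15-3=12", "12/2=6"], toy_check)
```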
Step 3: Reinforcement Learning
PPO (Proximal Policy Optimization) fine-tunes the model to prefer correct solution paths, using reward signals based on answer correctness.
Similar to RLHF but focused specifically on mathematical correctness rather than helpfulness.
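The reward fed to PPO can be sketched as a blend of final-answer correctness and the step-level score. The 50/50 weighting below is an assumption for illustration, not the paper's exact formulation:

```python
def rl_reward(final_answer: str, gold: str, step_score: float,
              alpha: float = 0.5) -> float:
    """Scalar reward for the PPO stage: blend final-answer correctness
    (+1/-1) with the process-supervision score in [0, 1]. The `alpha`
    weighting is an assumed value, not the paper's."""
    outcome = 1.0 if final_answer.strip() == gold.strip() else -1.0
    return alpha * outcome + (1.0 - alpha) * step_score

reward = rl_reward("36", "36", step_score=1.0)  # fully correct path -> 1.0
```

A path that guesses the right answer with bad intermediate steps scores lower than one that reasons correctly throughout, which is exactly the behavior process supervision is meant to encourage.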
Why This Mattered in 2023
WizardMath was one of the first demonstrations that a small open-source model could be dramatically improved on specific tasks through clever training methodology. The jump from 14.6% to 54.9% on GSM8K showed that task-specific fine-tuning could close much of the gap with much larger models.
Source: "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct" โ Luo et al., August 2023 (arXiv:2308.09583)
Real Benchmarks
GSM8K comparison showing WizardMath 7B vs other 7B-class models. The improvement over base Llama 2 7B is dramatic, but note that newer models (2024-2025) now significantly exceed these scores.
Source: arXiv:2308.09583 Table 2. GSM8K is grade-school math (word problems). MATH is competition-level (AMC/AIME difficulty).
Benchmark Details
| Model | GSM8K | MATH | Improvement over Llama 2 7B (GSM8K / MATH, pts) |
|---|---|---|---|
| Llama 2 7B (base) | 14.6% | 2.5% | Baseline |
| WizardMath 7B | 54.9% | 10.7% | +40.3 / +8.2 |
| WizardMath 13B | 63.9% | 14.0% | +49.3 / +11.5 |
| WizardMath 70B | 81.6% | 22.7% | +67.0 / +20.2 |
All scores from the WizardMath paper (arXiv:2308.09583). GSM8K uses 8-shot CoT evaluation. MATH uses 4-shot evaluation.
Hardware comparison for 7B-class models (the quality column is each model's approximate GSM8K score):
| Model | Size | RAM Required | Speed | GSM8K (approx.) | Cost/Month |
|---|---|---|---|---|---|
| WizardMath 7B | 3.8GB Q4 | 6GB | ~25 tok/s | 55% | Free |
| Llama 2 7B | 3.8GB Q4 | 6GB | ~25 tok/s | 15% | Free |
| Mistral 7B | 4.1GB Q4 | 6GB | ~28 tok/s | 52% | Free |
| Qwen 2.5 Math 7B | 4.4GB Q4 | 6GB | ~25 tok/s | 83% | Free |
VRAM & Quantization Guide
Because WizardMath is based on the Llama 2 7B architecture, standard GGUF quantization options apply.
Quantization Options
| Quantization | File Size | RAM/VRAM | Quality Impact | Notes |
|---|---|---|---|---|
| Q4_0 (Ollama default) | ~3.8GB | ~6GB | Moderate | Best for most users |
| Q4_K_M | ~4.1GB | ~6.5GB | Low-moderate | Better math accuracy than Q4_0 |
| Q5_K_M | ~4.8GB | ~7.5GB | Low | Recommended if you have 8GB+ VRAM |
| Q8_0 | ~7.2GB | ~10GB | Minimal | Best quality with 12GB+ VRAM |
For math tasks, less aggressive quantization (more bits per weight) helps, since numerical precision matters. Q5_K_M is the sweet spot if your hardware supports it.
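The file sizes in the table follow directly from bits-per-weight arithmetic. A quick sanity check (the bits-per-weight averages are approximate):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound file size: parameters x bits per weight, in GB.
    Real GGUF files add metadata and keep some tensors at higher
    precision, so actual sizes run a little above this estimate."""
    return n_params * bits_per_weight / 8 / 1e9

q4 = gguf_size_gb(7e9, 4.5)  # Q4_K_M averages roughly 4.5 bits/weight
q8 = gguf_size_gb(7e9, 8.5)  # Q8_0 roughly 8.5 bits/weight
```

This lands near the ~4.1GB and ~7.2GB figures above, which is why every 7B model in the comparison table has nearly identical file sizes at the same quantization.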
Ollama Setup
1. Install Ollama: download it from ollama.com or use the official install script.
2. Pull WizardMath 7B: downloads the quantized model (~3.8GB).
3. Test math reasoning: verify the model with a sample problem.
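As commands, the setup steps look like this (a sketch: the install script URL is Ollama's standard one, and `wizard-math` is the model tag used elsewhere on this page; verify the exact tag on the Ollama registry):

```shell
# 1. Install Ollama (Linux/macOS install script)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull the quantized WizardMath 7B model (~3.8GB, Q4_0 by default)
ollama pull wizard-math

# 3. Sanity-check math reasoning
ollama run wizard-math "Solve step by step: If 2x + 3 = 15, what is x?"
```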
Python API Example
```python
import requests

def solve_math(problem: str) -> str:
    """Send a math problem to WizardMath via the Ollama API."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "wizard-math",
            "prompt": f"Solve step by step: {problem}",
            "stream": False,
            "options": {
                "temperature": 0.1,  # low temperature for math accuracy
                "num_ctx": 4096
            }
        }
    )
    return response.json()["response"]

# Example usage
answer = solve_math("A store sells apples for $2 each. "
                    "If you buy 5 or more, you get a 10% discount. "
                    "How much do 7 apples cost?")
print(answer)
```

Tip: Use a low temperature (0.1-0.3) for math to reduce random variation in numerical answers.
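When scripting over many problems, it helps to pull the final number out of the free-form solution. A minimal helper, assuming the common GSM8K-style convention of grading the last number in the output (not part of WizardMath itself). For the apple example, 7 x $2 = $14, minus the 10% discount, gives $12.60:

```python
import re

def extract_final_number(text: str):
    """Return the last number in a step-by-step solution, or None.
    Grading the final number is a common GSM8K-style convention."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

extract_final_number("After the 10% discount, 7 apples cost $12.60.")  # -> 12.6
```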
Honest Capabilities Assessment
Good At
- Basic arithmetic: addition, subtraction, multiplication, division
- Grade-school word problems: the GSM8K benchmark type
- Step-by-step explanations: trained to show work
- Basic algebra: linear equations, simple polynomials
- Percentage/ratio problems: common homework problems
Struggles With
- Competition math: only 10.7% on the MATH benchmark
- Calculus: integration, series, multivariate
- Abstract algebra/proofs: no formal reasoning ability
- Multi-step problems: accuracy drops as step count grows
- Large numbers: arithmetic errors with 5+ digit numbers
2026 Assessment: Should You Use WizardMath 7B?
Short Answer: Probably Not
WizardMath 7B was a breakthrough in 2023, but newer math models dramatically outperform it. Qwen 2.5 Math 7B scores ~83% on GSM8K, nearly 30 points higher than WizardMath 7B's 54.9% at the same model size.
WizardMath remains historically significant as one of the first successful math-specific fine-tunes, but for actual math work in 2026, use a newer model.
Better Math Models in 2026
| Model | GSM8K | MATH | Ollama | License |
|---|---|---|---|---|
| Qwen 2.5 Math 7B | ~83% | ~50% | qwen2.5-math:7b | Apache 2.0 |
| Qwen 2.5 7B (general) | ~80% | ~45% | qwen2.5:7b | Apache 2.0 |
| Llama 3 8B | ~79% | ~30% | llama3:8b | Meta License |
| WizardMath 7B | 54.9% | 10.7% | wizard-math | Llama 2 License |
For math-specific use in 2026, run `ollama pull qwen2.5-math:7b`. Same hardware requirements as WizardMath but dramatically better math performance.
Authoritative Sources
- Luo et al., "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct," August 2023 (arXiv:2308.09583)

[Figure: Reinforced Evol-Instruct training pipeline. Evol-Instruct generates evolved math problems, process supervision evaluates solution steps, and PPO optimizes for correctness.]
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.