Best Ollama Models (March 2026): Top 15 Ranked by Task
The best Ollama models in 2026 are Llama 3.3 70B for overall quality (86.0 MMLU, needs 40GB VRAM), Qwen 2.5 Coder 32B for coding (92.7% HumanEval, 20GB VRAM), and Qwen 2.5 32B as the best mid-range all-rounder (83.2 MMLU, 20GB VRAM). For 8GB-VRAM setups, Llama 3.1 8B remains the most versatile option at ~5GB VRAM.
Quick Pick: Best Ollama Model for Your Hardware
| Your Setup | Best Model | Install Command |
|---|---|---|
| 8GB RAM, no GPU | Llama 3.2 3B | ollama pull llama3.2 |
| 16GB RAM / 8GB VRAM | Llama 3.1 8B | ollama pull llama3.1:8b |
| 24GB VRAM | Qwen 2.5 32B | ollama pull qwen2.5:32b |
| 48GB+ VRAM | Llama 3.3 70B | ollama pull llama3.3:70b |
Why Ollama Model Choice Matters
Picking the right Ollama model is the single biggest factor in your local AI experience. A well-matched model runs fast, produces quality output, and fits your hardware. A poor choice gives you either slow responses or disappointing quality.
The Ollama library contains 500+ models, but most users only need to know about 10-15 models that consistently outperform the rest. This guide ranks those top models by task — coding, chat, reasoning, creative writing, and RAG — with real VRAM requirements and speed benchmarks so you can pick the right one immediately.
All models listed here are free, open-weight, and run entirely on your hardware. No API keys, no subscriptions, no data leaving your machine.
Top 15 Ollama Models Ranked
Overall Ranking (March 2026)
| Rank | Model | Parameters | VRAM (Q4) | Best For | HumanEval | MMLU | Speed (RTX 4090) |
|---|---|---|---|---|---|---|---|
| 1 | Llama 3.3 70B | 70B | ~40GB | General, reasoning | 81.7% | 86.0 | ~18 tok/s |
| 2 | Qwen 2.5 32B | 32B | ~20GB | General, multilingual | 79.5% | 83.2 | ~35 tok/s |
| 3 | Qwen 2.5 Coder 32B | 32B | ~20GB | Coding | 92.7% | 76.4 | ~35 tok/s |
| 4 | DeepSeek R1 32B | 32B | ~20GB | Reasoning, math | 72.6% | 79.8 | ~30 tok/s |
| 5 | Llama 3.1 8B | 8B | ~5GB | General (budget) | 72.6% | 68.4 | ~85 tok/s |
| 6 | Qwen 2.5 Coder 7B | 7B | ~5GB | Coding (budget) | 88.4% | 64.2 | ~90 tok/s |
| 7 | DeepSeek R1 14B | 14B | ~9GB | Reasoning (mid-range) | 68.3% | 73.1 | ~55 tok/s |
| 8 | Mistral Small 24B | 24B | ~15GB | Multilingual, chat | 71.2% | 81.0 | ~40 tok/s |
| 9 | Gemma 2 27B | 27B | ~17GB | Instruction following | 64.4% | 78.1 | ~38 tok/s |
| 10 | Phi-4 Mini 3.8B | 3.8B | ~3GB | Small model king | 67.8% | 68.5 | ~110 tok/s |
| 11 | Llama 3.2 3B | 3B | ~2GB | Starter, low-resource | 48.2% | 58.3 | ~120 tok/s |
| 12 | Mistral 7B | 7B | ~5GB | General, fast | 56.7% | 64.1 | ~90 tok/s |
| 13 | Qwen 2.5 Coder 1.5B | 1.5B | ~1.5GB | Autocomplete | 70.6% | 46.8 | ~150 tok/s |
| 14 | Nomic Embed Text | 137M | ~0.5GB | Embeddings, RAG | N/A | N/A | ~1000 tok/s |
| 15 | Llama 3.2 Vision 11B | 11B | ~8GB | Image understanding | N/A | 73.2 | ~50 tok/s |
Benchmark sources: HumanEval and MMLU scores from official model cards on Hugging Face and release announcements (Meta, Alibaba, DeepSeek, Google, Microsoft). Speed figures are community-reported estimates from r/LocalLLaMA and vary by prompt length, context size, and system configuration. Your results may differ.
Best Models by Task
Best for Coding
Coding models need to understand syntax, generate working functions, and debug errors.
| Model | Size | VRAM | HumanEval | Install |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | ~20GB | 92.7% | ollama pull qwen2.5-coder:32b |
| Qwen 2.5 Coder 7B | 7B | ~5GB | 88.4% | ollama pull qwen2.5-coder:7b |
| DeepSeek Coder V2 Lite | 16B | ~10GB | 81.1% | ollama pull deepseek-coder-v2:16b |
| Qwen 2.5 Coder 1.5B | 1.5B | ~1.5GB | 70.6% | ollama pull qwen2.5-coder:1.5b |
Why Qwen 2.5 Coder dominates: The Qwen Coder series was trained on 5.5 trillion tokens with a code-heavy data mix spanning 92 programming languages. The 32B version scores 92.7% on HumanEval, rivaling GPT-4o on code generation benchmarks. The 7B version at 88.4% outperforms models 4x its size.
Best setup for AI-assisted coding:
- Use Qwen 2.5 Coder 1.5B for fast autocomplete in Continue.dev or Cursor
- Use Qwen 2.5 Coder 7B or 32B for chat, refactoring, and code review
- See our best local AI coding models guide for detailed comparison
Best for General Chat
Chat models handle conversation, Q&A, summarization, and everyday tasks.
| Model | Size | VRAM | MMLU | Install |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | ~40GB | 86.0 | ollama pull llama3.3:70b |
| Qwen 2.5 32B | 32B | ~20GB | 83.2 | ollama pull qwen2.5:32b |
| Mistral Small 24B | 24B | ~15GB | 81.0 | ollama pull mistral-small:24b |
| Llama 3.1 8B | 8B | ~5GB | 68.4 | ollama pull llama3.1:8b |
| Phi-4 Mini 3.8B | 3.8B | ~3GB | 68.5 | ollama pull phi4-mini |
Llama 3.3 70B is the best local chat model if you have the hardware. It replaced Llama 3.1 70B with better instruction following and reduced hallucination. For most users, Qwen 2.5 32B hits the sweet spot of quality and resource requirements.
Phi-4 Mini is remarkable at 3.8B parameters — it matches Llama 3.1 8B on MMLU while using 40% less VRAM.
Best for Reasoning and Math
Reasoning models excel at logic puzzles, math, analysis, and multi-step problem solving.
| Model | Size | VRAM | MATH | Install |
|---|---|---|---|---|
| DeepSeek R1 32B | 32B | ~20GB | 79.8 | ollama pull deepseek-r1:32b |
| DeepSeek R1 14B | 14B | ~9GB | 73.1 | ollama pull deepseek-r1:14b |
| DeepSeek R1 7B | 7B | ~5GB | 62.4 | ollama pull deepseek-r1:7b |
| Qwen 2.5 32B | 32B | ~20GB | 68.9 | ollama pull qwen2.5:32b |
DeepSeek R1 uses chain-of-thought reasoning — you can see the model's thinking process before it gives the final answer. This makes it significantly better at math, logic, and complex analysis compared to standard models. The 14B version is the best value: strong reasoning at just 9GB VRAM.
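R1 models in Ollama wrap this reasoning in `<think>...</think>` tags before the final answer. If you consume the output programmatically and only want the answer, you can strip that block first; a minimal sketch (the sample string is illustrative, not real model output):

```python
import re

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks, leaving only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

# Illustrative R1-style output (not a real model response)
raw = "<think>17 has no divisors other than 1 and itself.</think>17 is prime."
print(strip_reasoning(raw))  # -> 17 is prime.
```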
Best for RAG (Document Chat)
RAG models work alongside embedding models to answer questions from your documents.
For the language model (answers questions):
| Model | Size | VRAM | Why |
|---|---|---|---|
| Llama 3.1 8B | 8B | ~5GB | Best at grounding answers in provided context |
| Qwen 2.5 32B | 32B | ~20GB | Better comprehension for complex documents |
For the embedding model (indexes documents):
| Model | Size | VRAM | Install |
|---|---|---|---|
| nomic-embed-text | 137M | ~0.5GB | ollama pull nomic-embed-text |
| mxbai-embed-large | 335M | ~0.7GB | ollama pull mxbai-embed-large |
nomic-embed-text is the standard choice for RAG with Ollama. It works with Open WebUI, AnythingLLM, and most RAG frameworks. See our RAG local setup guide for a complete walkthrough.
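Under the hood, a RAG pipeline embeds each document chunk with the embedding model, then ranks chunks against the query embedding by cosine similarity. A minimal sketch of the ranking step, using made-up 4-dimensional vectors in place of real nomic-embed-text embeddings (which have 768 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up chunk embeddings; real ones come from the embedding model
chunks = {
    "refund policy": [0.9, 0.1, 0.0, 0.2],
    "shipping times": [0.1, 0.8, 0.3, 0.0],
}
query = [0.85, 0.15, 0.05, 0.25]  # embedding of "how do I get my money back?"

best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)  # -> refund policy
```

The top-ranked chunks are then pasted into the language model's prompt as context.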
Best for Vision (Image Understanding)
| Model | Size | VRAM | Install |
|---|---|---|---|
| Llama 3.2 Vision 11B | 11B | ~8GB | ollama pull llama3.2-vision:11b |
| Llama 3.2 Vision 90B | 90B | ~55GB | ollama pull llama3.2-vision:90b |
Vision models can describe images, read text from screenshots (OCR), analyze charts, and answer questions about photos. The 11B version handles most tasks well at 8GB VRAM.
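Programmatically, Ollama's /api/generate endpoint accepts vision input as base64-encoded images in an images array. A sketch of building that request body (the payload is constructed but not sent, and the image bytes are a stand-in for a real file):

```python
import base64
import json

def vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a JSON body for POST http://localhost:11434/api/generate."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    })

# Stand-in bytes; in practice: open("photo.png", "rb").read()
body = vision_payload("llama3.2-vision:11b", "Describe this image.", b"\x89PNG...")
print(json.loads(body)["model"])  # -> llama3.2-vision:11b
```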
Models by Hardware Budget
8GB RAM / No Dedicated GPU
You're limited to 3B-4B parameter models with CPU inference. Expect 5-15 tok/s.
# Best picks for 8GB RAM
ollama pull llama3.2 # 3B - best general quality
ollama pull phi4-mini # 3.8B - surprisingly capable
ollama pull gemma2:2b # 2B - fastest, basic tasks
16GB RAM / 8GB VRAM (RTX 3060, M1/M2 16GB)
The sweet spot for most users. 7B-8B models run at full GPU speed.
# Best picks for 16GB / 8GB VRAM
ollama pull llama3.1:8b # Best general-purpose 8B
ollama pull qwen2.5-coder:7b # Best coding 7B
ollama pull deepseek-r1:7b # Reasoning with chain-of-thought
ollama pull nomic-embed-text # Embeddings for RAG
24GB VRAM (RTX 4090, M3 Pro 36GB)
Access to 32B models — a massive quality jump over 8B.
# Best picks for 24GB VRAM
ollama pull qwen2.5:32b # Best overall 32B
ollama pull qwen2.5-coder:32b # Best coding model period
ollama pull deepseek-r1:32b # Best reasoning model
48GB+ VRAM (RTX 5090 32GB + offload, 2x GPUs, M4 Max 64GB)
Run 70B models — quality in the neighborhood of GPT-4 Turbo.
# Best picks for 48GB+
ollama pull llama3.3:70b # Best overall local model
ollama pull qwen2.5:72b # Excellent multilingual
Speed and VRAM Reference Table
| Model | Q4_K_M Size | VRAM Used | RTX 3060 12GB | RTX 4090 24GB | Mac M4 Max 64GB |
|---|---|---|---|---|---|
| Gemma 2 2B | 1.6 GB | ~2 GB | 110 tok/s | 150 tok/s | 80 tok/s |
| Llama 3.2 3B | 2.0 GB | ~3 GB | 90 tok/s | 120 tok/s | 65 tok/s |
| Phi-4 Mini 3.8B | 2.5 GB | ~3 GB | 85 tok/s | 110 tok/s | 60 tok/s |
| Mistral 7B | 4.1 GB | ~5 GB | 55 tok/s | 90 tok/s | 40 tok/s |
| Llama 3.1 8B | 4.7 GB | ~6 GB | 50 tok/s | 85 tok/s | 38 tok/s |
| Qwen 2.5 Coder 7B | 4.4 GB | ~5 GB | 55 tok/s | 90 tok/s | 40 tok/s |
| DeepSeek R1 14B | 8.7 GB | ~10 GB | 25 tok/s | 55 tok/s | 28 tok/s |
| Mistral Small 24B | 14 GB | ~16 GB | CPU only | 40 tok/s | 25 tok/s |
| Gemma 2 27B | 16 GB | ~18 GB | CPU only | 38 tok/s | 24 tok/s |
| Qwen 2.5 32B | 19 GB | ~21 GB | CPU only | 35 tok/s | 22 tok/s |
| DeepSeek R1 32B | 19 GB | ~21 GB | CPU only | 30 tok/s | 20 tok/s |
| Llama 3.3 70B | 40 GB | ~42 GB | CPU only | CPU offload | 18 tok/s |
Speeds are community-reported estimates and vary significantly by prompt length, context size, quantization, and system load. Treat as rough comparisons, not precise measurements.
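Tokens-per-second translates directly into perceived latency. A back-of-envelope helper, assuming a rough ratio of ~1.3 tokens per English word (the exact ratio varies by tokenizer and text):

```python
def response_seconds(words: int, tok_per_s: float, tokens_per_word: float = 1.3) -> float:
    """Rough generation time for a response of the given word count."""
    return words * tokens_per_word / tok_per_s

# A ~400-word answer: Llama 3.3 70B at 18 tok/s vs Llama 3.1 8B at 85 tok/s
print(round(response_seconds(400, 18)))  # -> 29
print(round(response_seconds(400, 85)))  # -> 6
```

In other words, the 70B model makes you wait about half a minute per long answer where the 8B model feels near-instant.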
How to Pick the Right Model
Decision Flowchart
Step 1: What's your VRAM?
- Under 4GB → Gemma 2 2B or Llama 3.2 1B
- 4-8GB → 7B-8B models
- 8-16GB → 14B-24B models
- 16-40GB → 32B models
- 40GB+ → 70B models
Step 2: What's your primary use case?
- General chat → Llama 3.1/3.3 or Qwen 2.5 (largest that fits)
- Coding → Qwen 2.5 Coder (largest that fits)
- Reasoning/math → DeepSeek R1 (largest that fits)
- Fast autocomplete → Qwen 2.5 Coder 1.5B
- Document Q&A → Llama 3.1 8B + nomic-embed-text
Step 3: Speed vs Quality?
- Need fast responses → Pick one size down from your maximum
- Need best quality → Pick the largest that fits your VRAM
- Running multiple models → Leave 4-6GB headroom for the OS and second model
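The steps above can be sketched as a small helper. This simplifies the guide's recommendations to one pick per tier per task (thresholds follow the quick-pick table and the VRAM column of the ranking; below ~5GB no DeepSeek R1 variant from this guide fits, so Phi-4 Mini stands in as the small-model fallback):

```python
def pick_model(vram_gb: float, task: str = "chat") -> str:
    """Rough model picker following this guide's tables.
    Tasks: chat, coding, reasoning. Thresholds are approximate."""
    tiers = [
        # (min VRAM in GB, pick per task)
        (40, {"chat": "llama3.3:70b", "coding": "qwen2.5-coder:32b", "reasoning": "deepseek-r1:32b"}),
        (20, {"chat": "qwen2.5:32b", "coding": "qwen2.5-coder:32b", "reasoning": "deepseek-r1:32b"}),
        (9,  {"chat": "llama3.1:8b", "coding": "qwen2.5-coder:7b",  "reasoning": "deepseek-r1:14b"}),
        (5,  {"chat": "llama3.1:8b", "coding": "qwen2.5-coder:7b",  "reasoning": "deepseek-r1:7b"}),
        (0,  {"chat": "llama3.2",    "coding": "qwen2.5-coder:1.5b", "reasoning": "phi4-mini"}),
    ]
    for min_vram, picks in tiers:
        if vram_gb >= min_vram:
            return picks[task]
    return "llama3.2"

print(pick_model(24, "coding"))  # -> qwen2.5-coder:32b
print(pick_model(8, "chat"))     # -> llama3.1:8b
```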
Common Mistakes to Avoid
- Running a model that barely fits — If your model uses 23.5GB of 24GB VRAM, you'll get swapping and slowdowns. Leave 2-3GB headroom.
- Using general models for coding — Qwen 2.5 Coder 7B massively outperforms Llama 3.1 8B on code tasks despite being smaller. Use specialized models.
- Ignoring quantization — Always use Q4_K_M (Ollama default). Full precision wastes VRAM with negligible quality gain.
- Chasing parameter count — A well-trained 32B model (Qwen 2.5) often outperforms a mediocre 70B model. Quality of training data matters more than size alone.
Model Management Tips
Check Installed Models
ollama list
# NAME ID SIZE MODIFIED
# llama3.1:8b 365c0bd3c000 4.7 GB 2 days ago
# qwen2.5-coder:7b 12345abc 4.4 GB 1 day ago
Free Up Disk Space
# Remove models you no longer use
ollama rm codellama:7b
ollama rm mistral:7b
# Models are stored in:
# macOS: ~/.ollama/models
# Linux: /usr/share/ollama/.ollama/models
# Windows: C:\Users\<user>\.ollama\models
Pull Specific Quantizations
# Default (Q4_K_M) — best balance
ollama pull llama3.1:8b
# Higher quality (Q5_K_M) — 10-15% more VRAM
ollama pull llama3.1:8b-instruct-q5_K_M
# Smallest (Q2_K) — 30% less VRAM, noticeable quality loss
ollama pull llama3.1:8b-instruct-q2_K
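The size differences between quantizations follow directly from bits-per-weight. A rough estimator (the bits-per-weight figures are approximate averages for each GGUF scheme, and real files add some overhead, so expect small deviations from the actual download sizes):

```python
# Approximate average bits per weight for common GGUF quantizations
BITS_PER_WEIGHT = {"q2_K": 2.6, "q4_K_M": 4.85, "q5_K_M": 5.7, "f16": 16.0}

def model_gb(params_billion: float, quant: str = "q4_K_M") -> float:
    """Estimated model file size in GB; actual sizes vary by architecture."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("q2_K", "q4_K_M", "q5_K_M", "f16"):
    print(f"llama3.1:8b @ {q}: ~{model_gb(8, q):.1f} GB")
```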
Set Context Window Size
# In Ollama chat, increase context window:
/set parameter num_ctx 8192
# Or create a file called Modelfile for persistent settings:
# FROM llama3.1:8b
# PARAMETER num_ctx 8192
# PARAMETER temperature 0.7
# Then: ollama create my-llama -f Modelfile
Key Takeaways
- Qwen 2.5 Coder 32B is the best local coding model — 92.7% HumanEval
- Llama 3.3 70B is the best overall model if you have 48GB+ VRAM/RAM
- DeepSeek R1 is the best reasoning model with visible chain-of-thought
- Phi-4 Mini 3.8B punches far above its weight for small hardware
- Always use Q4_K_M quantization (Ollama default) — best quality-per-VRAM
- Match model to task — specialized models (Coder, R1) beat general models on their domains
- nomic-embed-text is the go-to embedding model for RAG
Next Steps
- Set up Open WebUI for a ChatGPT-like interface with your models
- Find models for 8GB RAM if you're on limited hardware
- Set up Continue.dev for AI coding with Ollama
- Compare Jan vs LM Studio vs Ollama for model management
- Check VRAM requirements for detailed GPU sizing
- Run GPT-OSS locally — OpenAI's first open-source model on Ollama
- Run Llama 4 Scout locally — Meta's 109B MoE with 10M token context
- Try Qwen3-Coder — Alibaba's best coding model (70.6% SWE-bench)
- RTX 5090 vs 5080 for local AI — which GPU to buy for running models
- LMArena leaderboard explained — how AI models are ranked by 6M+ votes
The Ollama model ecosystem evolves rapidly. We test and update this ranking monthly. Last verified March 2026.