Best Ollama Models (March 2026): Top 15 Ranked by Task

March 17, 2026
20 min read
Local AI Master Research Team

The best Ollama models in 2026 are Llama 3.3 70B for overall quality (86.0 MMLU, needs 40GB VRAM), Qwen 2.5 Coder 32B for coding (92.7% HumanEval, 20GB VRAM), and Qwen 2.5 32B as the best mid-range all-rounder (83.2 MMLU, 20GB VRAM). For 8GB setups, Llama 3.1 8B remains the most versatile option at 5GB VRAM.

Quick Pick: Best Ollama Model for Your Hardware

| Your Setup | Best Model | Install Command |
|---|---|---|
| 8GB RAM, no GPU | Llama 3.2 3B | ollama pull llama3.2 |
| 16GB RAM / 8GB VRAM | Llama 3.1 8B | ollama pull llama3.1:8b |
| 24GB VRAM | Qwen 2.5 32B | ollama pull qwen2.5:32b |
| 48GB+ VRAM | Llama 3.3 70B | ollama pull llama3.3:70b |

Why Ollama Model Choice Matters

Picking the right Ollama model is the single biggest factor in your local AI experience. A well-matched model runs fast, produces quality output, and fits your hardware. A poor choice gives you either slow responses or disappointing quality.

The Ollama library contains 500+ models, but most users only need to know about 10-15 models that consistently outperform the rest. This guide ranks those top models by task — coding, chat, reasoning, creative writing, and RAG — with real VRAM requirements and speed benchmarks so you can pick the right one immediately.

All models listed here are free, open-weight, and run entirely on your hardware. No API keys, no subscriptions, no data leaving your machine.


Top 15 Ollama Models Ranked

Overall Ranking (March 2026)

| Rank | Model | Parameters | VRAM (Q4) | Best For | HumanEval | MMLU | Speed (RTX 4090) |
|---|---|---|---|---|---|---|---|
| 1 | Llama 3.3 70B | 70B | ~40GB | General, reasoning | 81.7% | 86.0 | ~18 tok/s |
| 2 | Qwen 2.5 32B | 32B | ~20GB | General, multilingual | 79.5% | 83.2 | ~35 tok/s |
| 3 | Qwen 2.5 Coder 32B | 32B | ~20GB | Coding | 92.7% | 76.4 | ~35 tok/s |
| 4 | DeepSeek R1 32B | 32B | ~20GB | Reasoning, math | 72.6% | 79.8 | ~30 tok/s |
| 5 | Llama 3.1 8B | 8B | ~5GB | General (budget) | 72.6% | 68.4 | ~85 tok/s |
| 6 | Qwen 2.5 Coder 7B | 7B | ~5GB | Coding (budget) | 88.4% | 64.2 | ~90 tok/s |
| 7 | DeepSeek R1 14B | 14B | ~9GB | Reasoning (mid-range) | 68.3% | 73.1 | ~55 tok/s |
| 8 | Mistral Small 24B | 24B | ~15GB | Multilingual, chat | 71.2% | 81.0 | ~40 tok/s |
| 9 | Gemma 2 27B | 27B | ~17GB | Instruction following | 64.4% | 78.1 | ~38 tok/s |
| 10 | Phi-4 Mini 3.8B | 3.8B | ~3GB | Small model king | 67.8% | 68.5 | ~110 tok/s |
| 11 | Llama 3.2 3B | 3B | ~2GB | Starter, low-resource | 48.2% | 58.3 | ~120 tok/s |
| 12 | Mistral 7B | 7B | ~5GB | General, fast | 56.7% | 64.1 | ~90 tok/s |
| 13 | Qwen 2.5 Coder 1.5B | 1.5B | ~1.5GB | Autocomplete | 70.6% | 46.8 | ~150 tok/s |
| 14 | Nomic Embed Text | 137M | ~0.5GB | Embeddings, RAG | N/A | N/A | ~1000 tok/s |
| 15 | Llama 3.2 Vision 11B | 11B | ~8GB | Image understanding | N/A | 73.2 | ~50 tok/s |

Benchmark sources: HumanEval and MMLU scores from official model cards on Hugging Face and release announcements (Meta, Alibaba, DeepSeek, Google, Microsoft). Speed figures are community-reported estimates from r/LocalLLaMA and vary by prompt length, context size, and system configuration. Your results may differ.
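The VRAM (Q4) column follows a simple rule of thumb: Q4_K_M quantization stores roughly 4.8 bits (about 0.6 bytes) per parameter, plus a gigabyte or two of overhead for the KV cache and runtime buffers. A quick sketch of that arithmetic (the constants are rough approximations for back-of-envelope planning, not Ollama internals):

```python
def estimate_q4_vram_gb(params_billions: float, overhead_gb: float = 1.5) -> float:
    """Rough Q4_K_M VRAM estimate: ~0.6 bytes per parameter
    plus fixed overhead for the KV cache and runtime buffers."""
    bytes_per_param = 0.6  # approximation for Q4_K_M mixed quantization
    return params_billions * bytes_per_param + overhead_gb

# Sanity-check against the table above (estimates land within a few GB):
for params in (8, 32, 70):
    print(f"{params}B -> ~{estimate_q4_vram_gb(params):.1f} GB")
```

Larger context windows grow the KV cache, so treat the overhead term as a floor, not a ceiling.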


Best Models by Task

Best for Coding

Coding models need to understand syntax, generate working functions, and debug errors.

| Model | Size | VRAM | HumanEval | Install |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | ~20GB | 92.7% | ollama pull qwen2.5-coder:32b |
| Qwen 2.5 Coder 7B | 7B | ~5GB | 88.4% | ollama pull qwen2.5-coder:7b |
| DeepSeek Coder V2 Lite | 16B | ~10GB | 81.1% | ollama pull deepseek-coder-v2:16b |
| Qwen 2.5 Coder 1.5B | 1.5B | ~1.5GB | 70.6% | ollama pull qwen2.5-coder:1.5b |

Why Qwen 2.5 Coder dominates: The Qwen Coder series was trained on 5.5 trillion tokens of code data spanning 92 programming languages. The 32B version scores 92.7% on HumanEval, matching or exceeding GPT-4o on code generation. The 7B version at 88.4% outperforms models 4x its size.
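Once pulled, every model in this table is reachable through Ollama's local REST API, which listens on port 11434 by default and exposes a /api/generate endpoint. A minimal stdlib-only sketch, assuming an Ollama server is running locally:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """One-shot, non-streaming generation against a local Ollama server."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires `ollama pull qwen2.5-coder:7b` and a running Ollama server:
# print(generate("qwen2.5-coder:7b", "Write a Python function that reverses a string."))
```

With "stream": True (the default) the server instead returns one JSON object per generated chunk, which is what chat UIs use for the typing effect.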

Best setup for AI-assisted coding: run the largest Qwen 2.5 Coder that fits your VRAM for chat and code generation, and keep Qwen 2.5 Coder 1.5B loaded for fast autocomplete. Both plug into editor extensions like Continue.dev (see Next Steps).

Best for General Chat

Chat models handle conversation, Q&A, summarization, and everyday tasks.

| Model | Size | VRAM | MMLU | Install |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | ~40GB | 86.0 | ollama pull llama3.3:70b |
| Qwen 2.5 32B | 32B | ~20GB | 83.2 | ollama pull qwen2.5:32b |
| Mistral Small 24B | 24B | ~15GB | 81.0 | ollama pull mistral-small:24b |
| Llama 3.1 8B | 8B | ~5GB | 68.4 | ollama pull llama3.1:8b |
| Phi-4 Mini 3.8B | 3.8B | ~3GB | 68.5 | ollama pull phi4-mini |

Llama 3.3 70B is the best local chat model if you have the hardware. It replaced Llama 3.1 70B with better instruction following and reduced hallucination. For most users, Qwen 2.5 32B hits the sweet spot of quality and resource requirements.

Phi-4 Mini is remarkable at 3.8B parameters — it matches Llama 3.1 8B on MMLU while using 40% less VRAM.

Best for Reasoning and Math

Reasoning models excel at logic puzzles, math, analysis, and multi-step problem solving.

| Model | Size | VRAM | MATH | Install |
|---|---|---|---|---|
| DeepSeek R1 32B | 32B | ~20GB | 79.8 | ollama pull deepseek-r1:32b |
| DeepSeek R1 14B | 14B | ~9GB | 73.1 | ollama pull deepseek-r1:14b |
| DeepSeek R1 7B | 7B | ~5GB | 62.4 | ollama pull deepseek-r1:7b |
| Qwen 2.5 32B | 32B | ~20GB | 68.9 | ollama pull qwen2.5:32b |

DeepSeek R1 uses chain-of-thought reasoning — you can see the model's thinking process before it gives the final answer. This makes it significantly better at math, logic, and complex analysis compared to standard models. The 14B version is the best value: strong reasoning at just 9GB VRAM.
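R1 wraps its visible reasoning in <think>...</think> tags in the raw output. When you only want the final answer, say when piping the model into a script, the reasoning block is easy to strip (a small sketch; verify the tag format against your model version):

```python
import re

# DeepSeek R1 emits its chain-of-thought inside <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(raw: str) -> str:
    """Remove the chain-of-thought block, keeping only the final answer."""
    return THINK_BLOCK.sub("", raw).strip()

raw = "<think>17 is prime because no integer from 2 to 4 divides it.</think>Yes, 17 is prime."
print(strip_reasoning(raw))  # -> Yes, 17 is prime.
```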

Best for RAG (Document Chat)

RAG models work alongside embedding models to answer questions from your documents.

For the language model (answers questions):

| Model | Size | VRAM | Why |
|---|---|---|---|
| Llama 3.1 8B | 8B | ~5GB | Best at grounding answers in provided context |
| Qwen 2.5 32B | 32B | ~20GB | Better comprehension for complex documents |

For the embedding model (indexes documents):

| Model | Size | VRAM | Install |
|---|---|---|---|
| nomic-embed-text | 137M | ~0.5GB | ollama pull nomic-embed-text |
| mxbai-embed-large | 335M | ~0.7GB | ollama pull mxbai-embed-large |

nomic-embed-text is the standard choice for RAG with Ollama. It works with Open WebUI, AnythingLLM, and most RAG frameworks. See our RAG local setup guide for a complete walkthrough.
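Under the hood, RAG retrieval is nearest-neighbor search over those embeddings: the chunk whose vector is most similar to the query vector gets fed to the language model as context. A toy sketch of the scoring step with cosine similarity (the 3-dimensional vectors are stand-ins; real nomic-embed-text vectors have 768 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for nomic-embed-text output:
query = [0.9, 0.1, 0.0]
chunks = {"pricing page": [0.8, 0.2, 0.1], "error logs": [0.0, 0.3, 0.9]}
best = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
print(best)  # -> pricing page
```

Frameworks like Open WebUI and AnythingLLM do exactly this, just with a vector database instead of a dict.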

Best for Vision (Image Understanding)

| Model | Size | VRAM | Install |
|---|---|---|---|
| Llama 3.2 Vision 11B | 11B | ~8GB | ollama pull llama3.2-vision:11b |
| Llama 3.2 Vision 90B | 90B | ~55GB | ollama pull llama3.2-vision:90b |

Vision models can describe images, read text from screenshots (OCR), analyze charts, and answer questions about photos. The 11B version handles most tasks well at 8GB VRAM.


Models by Hardware Budget

8GB RAM / No Dedicated GPU

You're limited to 3B-4B parameter models with CPU inference. Expect 5-15 tok/s.

# Best picks for 8GB RAM
ollama pull llama3.2          # 3B - best general quality
ollama pull phi4-mini          # 3.8B - surprisingly capable
ollama pull gemma2:2b          # 2B - fastest, basic tasks

16GB RAM / 8GB VRAM (RTX 3060, M1/M2 16GB)

The sweet spot for most users. 7B-8B models run at full GPU speed.

# Best picks for 16GB / 8GB VRAM
ollama pull llama3.1:8b        # Best general-purpose 8B
ollama pull qwen2.5-coder:7b   # Best coding 7B
ollama pull deepseek-r1:7b     # Reasoning with chain-of-thought
ollama pull nomic-embed-text   # Embeddings for RAG

24GB VRAM (RTX 4090, M3 Pro 36GB)

Access to 32B models — a massive quality jump over 8B.

# Best picks for 24GB VRAM
ollama pull qwen2.5:32b         # Best overall 32B
ollama pull qwen2.5-coder:32b   # Best coding model period
ollama pull deepseek-r1:32b     # Best reasoning model

48GB+ VRAM (RTX 5090 32GB + offload, 2x GPUs, M4 Max 64GB)

Run 70B models — quality comparable to GPT-4 Turbo.

# Best picks for 48GB+
ollama pull llama3.3:70b       # Best overall local model
ollama pull qwen2.5:72b        # Excellent multilingual

Speed and VRAM Reference Table

| Model | Q4_K_M Size | VRAM Used | RTX 3060 12GB | RTX 4090 24GB | Mac M4 Max 64GB |
|---|---|---|---|---|---|
| Gemma 2 2B | 1.6 GB | ~2 GB | 110 tok/s | 150 tok/s | 80 tok/s |
| Llama 3.2 3B | 2.0 GB | ~3 GB | 90 tok/s | 120 tok/s | 65 tok/s |
| Phi-4 Mini 3.8B | 2.5 GB | ~3 GB | 85 tok/s | 110 tok/s | 60 tok/s |
| Mistral 7B | 4.1 GB | ~5 GB | 55 tok/s | 90 tok/s | 40 tok/s |
| Llama 3.1 8B | 4.7 GB | ~6 GB | 50 tok/s | 85 tok/s | 38 tok/s |
| Qwen 2.5 Coder 7B | 4.4 GB | ~5 GB | 55 tok/s | 90 tok/s | 40 tok/s |
| DeepSeek R1 14B | 8.7 GB | ~10 GB | 25 tok/s | 55 tok/s | 28 tok/s |
| Mistral Small 24B | 14 GB | ~16 GB | CPU only | 40 tok/s | 25 tok/s |
| Gemma 2 27B | 16 GB | ~18 GB | CPU only | 38 tok/s | 24 tok/s |
| Qwen 2.5 32B | 19 GB | ~21 GB | CPU only | 35 tok/s | 22 tok/s |
| DeepSeek R1 32B | 19 GB | ~21 GB | CPU only | 30 tok/s | 20 tok/s |
| Llama 3.3 70B | 40 GB | ~42 GB | CPU only | CPU offload | 18 tok/s |

Speeds are community-reported estimates and vary significantly by prompt length, context size, quantization, and system load. Treat as rough comparisons, not precise measurements.
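To translate tok/s into wall-clock feel: a typical chat answer runs 300-500 tokens, and generation time is roughly tokens divided by speed (prompt processing adds a bit on top). A quick sketch:

```python
def response_seconds(answer_tokens: int, tok_per_s: float) -> float:
    """Generation time only; prompt (prefill) processing is extra."""
    return answer_tokens / tok_per_s

# A 400-token answer at speeds from the table (~5s and ~22s respectively):
for label, speed in [("Llama 3.1 8B on RTX 4090", 85), ("Llama 3.3 70B on M4 Max", 18)]:
    print(f"{label}: ~{response_seconds(400, speed):.0f}s")
```

The practical takeaway: anything above ~30 tok/s feels interactive; below ~10 tok/s you are waiting on the model.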


How to Pick the Right Model

Decision Flowchart

Step 1: What's your VRAM?

  • Under 4GB → Gemma 2 2B or Llama 3.2 1B
  • 4-8GB → 7B-8B models
  • 8-16GB → 14B-24B models
  • 16-24GB → 32B models
  • 40GB+ → 70B models

Step 2: What's your primary use case?

  • General chat → Llama 3.1/3.3 or Qwen 2.5 (largest that fits)
  • Coding → Qwen 2.5 Coder (largest that fits)
  • Reasoning/math → DeepSeek R1 (largest that fits)
  • Fast autocomplete → Qwen 2.5 Coder 1.5B
  • Document Q&A → Llama 3.1 8B + nomic-embed-text

Step 3: Speed vs Quality?

  • Need fast responses → Pick one size down from your maximum
  • Need best quality → Pick the largest that fits your VRAM
  • Running multiple models → Leave 4-6GB headroom for the OS and second model
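The three steps above collapse into a single lookup. This is a sketch that mirrors the flowchart, not an official tool; adjust the tiers and picks to your own hardware:

```python
def pick_model(vram_gb: float, task: str = "chat") -> str:
    """Suggest an Ollama model tag from VRAM budget and task, per the flowchart."""
    # (VRAM ceiling in GB, chat pick, coding pick, reasoning pick)
    tiers = [
        (4,            "llama3.2",          "qwen2.5-coder:1.5b", "llama3.2"),
        (8,            "llama3.1:8b",       "qwen2.5-coder:7b",   "deepseek-r1:7b"),
        (16,           "mistral-small:24b", "qwen2.5-coder:7b",   "deepseek-r1:14b"),
        (40,           "qwen2.5:32b",       "qwen2.5-coder:32b",  "deepseek-r1:32b"),
        (float("inf"), "llama3.3:70b",      "qwen2.5-coder:32b",  "deepseek-r1:32b"),
    ]
    column = {"chat": 1, "coding": 2, "reasoning": 3}[task]
    for tier in tiers:
        if vram_gb <= tier[0]:
            return tier[column]
    return tiers[-1][column]  # unreachable: the inf ceiling catches everything

print(pick_model(24, "coding"))  # -> qwen2.5-coder:32b
```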

Common Mistakes to Avoid

  1. Running a model that barely fits — If your model uses 23.5GB of 24GB VRAM, you'll get swapping and slowdowns. Leave 2-3GB headroom.
  2. Using general models for coding — Qwen 2.5 Coder 7B massively outperforms Llama 3.1 8B on code tasks despite being smaller. Use specialized models.
  3. Ignoring quantization — Always use Q4_K_M (Ollama default). Full precision wastes VRAM with negligible quality gain.
  4. Chasing parameter count — A well-trained 32B model (Qwen 2.5) often outperforms a mediocre 70B model. Quality of training data matters more than size alone.

Model Management Tips

Check Installed Models

ollama list
# NAME                     ID            SIZE     MODIFIED
# llama3.1:8b              365c0bd3c000  4.7 GB   2 days ago
# qwen2.5-coder:7b         12345abc      4.4 GB   1 day ago

Free Up Disk Space

# Remove models you no longer use
ollama rm codellama:7b
ollama rm mistral:7b

# Models are stored in:
# macOS: ~/.ollama/models
# Linux: /usr/share/ollama/.ollama/models
# Windows: C:\Users\<user>\.ollama\models

Pull Specific Quantizations

# Default (Q4_K_M) — best balance
ollama pull llama3.1:8b

# Higher quality (Q5_K_M) — 10-15% more VRAM
ollama pull llama3.1:8b-instruct-q5_K_M

# Smallest (Q2_K) — 30% less VRAM, noticeable quality loss
ollama pull llama3.1:8b-instruct-q2_K

Set Context Window Size

# In Ollama chat, increase context window:
/set parameter num_ctx 8192

# Or bake persistent settings into a custom model with a Modelfile:
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
EOF

ollama create my-llama -f Modelfile
ollama run my-llama

Key Takeaways

  1. Qwen 2.5 Coder 32B is the best local coding model — 92.7% HumanEval
  2. Llama 3.3 70B is the best overall model if you have 48GB+ VRAM/RAM
  3. DeepSeek R1 is the best reasoning model with visible chain-of-thought
  4. Phi-4 Mini 3.8B punches far above its weight for small hardware
  5. Always use Q4_K_M quantization (Ollama default) — best quality-per-VRAM
  6. Match model to task — specialized models (Coder, R1) beat general models on their domains
  7. nomic-embed-text is the go-to embedding model for RAG

Next Steps

  1. Set up Open WebUI for a ChatGPT-like interface with your models
  2. Find models for 8GB RAM if you're on limited hardware
  3. Set up Continue.dev for AI coding with Ollama
  4. Compare Jan vs LM Studio vs Ollama for model management
  5. Check VRAM requirements for detailed GPU sizing
  6. Run GPT-OSS locally — OpenAI's first open-source model on Ollama
  7. Run Llama 4 Scout locally — Meta's 109B MoE with 10M token context
  8. Try Qwen3-Coder — Alibaba's best coding model (70.6% SWE-bench)
  9. RTX 5090 vs 5080 for local AI — which GPU to buy for running models
  10. LMArena leaderboard explained — how AI models are ranked by 6M+ votes

The Ollama model ecosystem evolves rapidly. We test and update this ranking monthly. Last verified March 2026.


Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.
