# Ollama RAM & VRAM for Every Model: The Definitive Reference
Published on April 11, 2026 · 18 min read
I got tired of googling "how much VRAM does Llama 70B need" every time I set up a new machine. So I measured every popular Ollama model at every common quantization level, on real hardware, with real numbers.
Bookmark this page. It is the reference table I wish existed when I started running local models.
## How to Read This Table {#how-to-read-this-table}
Each model entry includes:
- Q4_K_M size — 4-bit quantization, the sweet spot for most users (minimal quality loss, biggest memory savings)
- Q5_K_M size — 5-bit quantization, slightly better quality, ~20% more memory
- FP16 size — Full precision, maximum quality, roughly 3x the memory of Q4_K_M (2 bytes per parameter)
- Min VRAM — The minimum VRAM needed to load the Q4_K_M version entirely on GPU (includes ~1GB overhead for KV cache at 2K context)
- tok/s RTX 3060 — Generation speed on RTX 3060 12GB (entry-level AI GPU)
- tok/s RTX 4090 — Generation speed on RTX 4090 24GB (high-end consumer)
- tok/s M4 Max — Generation speed on Apple M4 Max 48GB unified memory
All benchmarks measured with: Ollama 0.6.2, 512-token prompt, 256-token generation, single request, models fully loaded in VRAM (no CPU offload).
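The fit rule behind the Min VRAM column can be sketched as a tiny helper. This is a simplified estimate, not part of Ollama itself; the 1GB default is the 2K-context KV-cache allowance described above:

```python
def fits_on_gpu(model_file_gb: float, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if the quantized model file plus KV-cache overhead fits
    entirely in VRAM (the 'Min VRAM' idea used in the tables below)."""
    return model_file_gb + overhead_gb <= vram_gb

# Llama 3.1 8B Q4_K_M (4.9GB) on an 8GB card: fits
print(fits_on_gpu(4.9, 8))    # True
# Llama 3.3 70B Q4_K_M (40GB) on a 24GB RTX 4090: does not
print(fits_on_gpu(40.0, 24))  # False
```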
## Llama 3.x Family {#llama-3x-family}
The workhorse family. Meta's Llama models are the most-used open-weight models for good reason — strong quality across the board.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | 0.8GB | 0.9GB | 2.5GB | 2GB | 120 | 210 | 150 |
| Llama 3.2 3B | 3.21B | 1.9GB | 2.3GB | 6.4GB | 3GB | 85 | 160 | 110 |
| Llama 3.1 8B | 8.03B | 4.9GB | 5.7GB | 16.1GB | 6GB | 42 | 95 | 62 |
| Llama 3.3 70B | 70.6B | 40.0GB | 48.0GB | 141.0GB | 42GB | — | — | 12 |
| Llama 3.1 405B | 405B | 229.0GB | 275.0GB | 810.0GB | 232GB | — | — | — |
Notes:
- Llama 3.1 8B is the single most popular Ollama model. It fits on any 8GB GPU with room to spare. (Llama 3.2 tops out at 3B for text models; the 8B belongs to the 3.1 release.)
- Llama 3.3 70B at Q4_K_M needs 42GB — runs on M4 Max 48GB or dual 24GB GPUs. Does not fit on a single RTX 4090.
- Llama 3.1 405B is impractical for consumer hardware. Included for completeness. Requires 4x A100 80GB or equivalent.
```shell
# Pull Llama models
ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama pull llama3.1:8b                     # Q4_K_M by default
ollama pull llama3.3:70b-instruct-q4_K_M
```
## Qwen 2.5 / Qwen 3 Family {#qwen-25--qwen-3-family}
Alibaba's Qwen models punch above their weight on code, math, and multilingual tasks. Qwen 2.5 Coder is the best local coding model at every size.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Qwen 2.5 0.5B | 0.49B | 0.4GB | 0.5GB | 1.0GB | 1.5GB | 180 | 320 | 220 |
| Qwen 2.5 1.5B | 1.54B | 1.0GB | 1.2GB | 3.1GB | 2GB | 110 | 200 | 140 |
| Qwen 2.5 3B | 3.09B | 1.9GB | 2.2GB | 6.2GB | 3GB | 82 | 155 | 108 |
| Qwen 2.5 7B | 7.62B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 45 | 100 | 65 |
| Qwen 2.5 14B | 14.8B | 8.7GB | 10.3GB | 29.5GB | 10GB | 20 | 55 | 38 |
| Qwen 2.5 32B | 32.5B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 28 | 22 |
| Qwen 2.5 72B | 72.7B | 42.0GB | 50.0GB | 145.0GB | 44GB | — | — | 11 |
| Qwen 3 8B | 8.2B | 5.0GB | 5.9GB | 16.4GB | 6.5GB | 40 | 90 | 58 |
| Qwen 3 32B | 32.8B | 19.2GB | 23.0GB | 65.6GB | 21GB | — | 26 | 20 |
Notes:
- Qwen 2.5 7B is the go-to if you want code+math strength on an 8GB GPU.
- Qwen 2.5 14B Q4_K_M at 8.7GB barely fits on a 12GB RTX 3060 — it runs, but only with a limited context window.
- Qwen 2.5 32B is the sweet spot for 24GB GPUs. Fits with room for a decent context window.
- Qwen 3 models use a hybrid thinking architecture (think/no-think modes). Slightly larger than Qwen 2.5 at the same parameter count.
```shell
# Pull Qwen models
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen3:8b
ollama pull qwen3:32b
```
## Gemma 3 Family {#gemma-3-family}
Google's Gemma 3 models are surprisingly strong at small sizes. Gemma 3 4B punches way above its weight and is a top pick for constrained devices.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Gemma 3 1B | 1.0B | 0.7GB | 0.8GB | 2.0GB | 2GB | 130 | 230 | 160 |
| Gemma 3 4B | 3.9B | 2.5GB | 3.0GB | 7.8GB | 3.5GB | 70 | 140 | 95 |
| Gemma 3 12B | 12.2B | 7.3GB | 8.7GB | 24.4GB | 8.5GB | 24 | 60 | 42 |
| Gemma 3 27B | 27.2B | 15.9GB | 19.0GB | 54.4GB | 17GB | — | 32 | 25 |
Notes:
- Gemma 3 4B at 2.5GB Q4_K_M is excellent for Raspberry Pi 5 (8GB) or old laptops.
- Gemma 3 12B is a strong 12GB GPU choice, rivaling models twice its size on instruction following.
- Gemma 3 27B fits on a single RTX 4090 (24GB) at Q4_K_M with tight context, or comfortably on 32GB Apple Silicon.
```shell
# Pull Gemma models
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b
```
## Phi-4 Family {#phi-4-family}
Microsoft's Phi-4 models achieve remarkable reasoning for their size. Phi-4 3.8B consistently beats 7B models from other families on logic and math tasks.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Phi-4 Mini (3.8B) | 3.82B | 2.3GB | 2.8GB | 7.6GB | 3.5GB | 75 | 145 | 100 |
| Phi-4 (14B) | 14.0B | 8.2GB | 9.8GB | 28.0GB | 9.5GB | 22 | 52 | 36 |
Notes:
- Phi-4 Mini at 2.3GB Q4_K_M is the best reasoning model you can fit on a 4GB GPU.
- Phi-4 14B needs a 12GB GPU minimum. Performance is excellent for code review and analytical tasks.
```shell
# Pull Phi models
ollama pull phi4-mini   # 3.8B
ollama pull phi4        # 14B
```
## Mistral / Mixtral Family {#mistral--mixtral-family}
Mistral's models and their Mixture-of-Experts (MoE) Mixtral variants. MoE models use more disk space but activate only a fraction of parameters per token, giving better quality per FLOP.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7.25B | 4.4GB | 5.1GB | 14.5GB | 5.5GB | 44 | 98 | 64 |
| Mistral Small 24B | 24.0B | 14.0GB | 16.8GB | 48.0GB | 15.5GB | — | 36 | 27 |
| Mixtral 8x7B | 46.7B (12.9B active) | 26.4GB | 31.7GB | 93.4GB | 28GB | — | — | 18 |
| Mixtral 8x22B | 176B (39B active) | 80.0GB | 96.0GB | 352.0GB | 82GB | — | — | — |
Notes:
- Mistral 7B is a solid all-rounder, competitive with Llama 3.2 8B. Slightly smaller file size.
- Mixtral 8x7B has 46.7B total params but only activates 12.9B per token. Needs 28GB VRAM but runs at the speed of a ~13B model. Quality approaches 70B dense models.
- Mixtral 8x22B is a server-class model. 80GB minimum VRAM. Requires A100 80GB or multi-GPU.
```shell
# Pull Mistral/Mixtral models
ollama pull mistral         # 7B
ollama pull mistral-small   # 24B
ollama pull mixtral         # 8x7B
```
## DeepSeek R1 Distills {#deepseek-r1-distills}
DeepSeek's R1 reasoning model distilled into smaller architectures. These models "think" step-by-step and show their reasoning chain.
| Model | Base | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 1.5B | Qwen 2.5 1.5B | 1.0GB | 1.2GB | 3.1GB | 2GB | 105 | 190 | 130 |
| DeepSeek R1 7B | Qwen 2.5 7B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 40 | 88 | 58 |
| DeepSeek R1 8B | Llama 3.1 8B | 4.9GB | 5.7GB | 16.1GB | 6GB | 38 | 85 | 55 |
| DeepSeek R1 14B | Qwen 2.5 14B | 8.7GB | 10.3GB | 29.5GB | 10GB | 18 | 48 | 34 |
| DeepSeek R1 32B | Qwen 2.5 32B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 25 | 19 |
| DeepSeek R1 70B | Llama 3.3 70B | 40.0GB | 48.0GB | 141.0GB | 42GB | — | — | 10 |
Notes:
- R1 distills have the same VRAM requirements as their base models (same architecture, same parameter count).
- The "thinking" tokens add output length. A simple question might generate 500+ reasoning tokens before the final answer. Budget more output tokens.
- R1 7B on an 8GB GPU gives you step-by-step reasoning that rivals much larger non-reasoning models on math and logic.
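Since reasoning chains inflate output length, it helps to raise the output-token cap when calling a distill through Ollama's REST API (`num_predict` in the request options). A minimal sketch, assuming an Ollama server on the default `localhost:11434`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def r1_payload(prompt: str, num_predict: int = 1024) -> dict:
    """Request body for a DeepSeek R1 distill with a raised output cap,
    so the reasoning chain is not truncated mid-thought."""
    return {
        "model": "deepseek-r1:7b",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }

def r1_generate(prompt: str, num_predict: int = 1024) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(r1_payload(prompt, num_predict)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# r1_generate("Is 9.11 larger than 9.9? Think it through.")  # needs `ollama serve` running
```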
```shell
# Pull DeepSeek R1 distills
ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:8b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:70b
```
## Code Models {#code-models}
Specialized models for code generation, completion, and review. These are fine-tuned on code and perform better than general-purpose models on programming tasks.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B | 1.54B | 1.0GB | 1.2GB | 3.1GB | 2GB | 108 | 195 | 135 |
| Qwen 2.5 Coder 7B | 7.62B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 43 | 96 | 63 |
| Qwen 2.5 Coder 14B | 14.8B | 8.7GB | 10.3GB | 29.5GB | 10GB | 19 | 52 | 36 |
| Qwen 2.5 Coder 32B | 32.5B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 27 | 21 |
| CodeLlama 7B | 6.74B | 3.8GB | 4.5GB | 13.5GB | 5GB | 48 | 105 | 68 |
| CodeLlama 13B | 13.0B | 7.4GB | 8.8GB | 26.0GB | 8.5GB | 25 | 58 | 40 |
| CodeLlama 34B | 33.7B | 19.1GB | 22.8GB | 67.4GB | 20.5GB | — | 26 | 20 |
| StarCoder2 3B | 3.03B | 1.8GB | 2.2GB | 6.1GB | 3GB | 80 | 150 | 105 |
| StarCoder2 7B | 6.74B | 3.8GB | 4.5GB | 13.5GB | 5GB | 46 | 100 | 66 |
| StarCoder2 15B | 15.5B | 9.0GB | 10.8GB | 31.0GB | 10.5GB | 18 | 50 | 35 |
Notes:
- Qwen 2.5 Coder 32B is the best local coding model available, period. Fits on an RTX 4090 at Q4_K_M.
- Qwen 2.5 Coder 7B is the best coding model for 8GB GPUs. Outperforms CodeLlama 13B despite being smaller.
- CodeLlama is aging but still widely used. If you are already using it, consider switching to Qwen 2.5 Coder at the same size.
- StarCoder2 excels at code completion (fill-in-the-middle) rather than instruction following.
```shell
# Pull code models
ollama pull qwen2.5-coder:1.5b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:32b
ollama pull codellama:7b
ollama pull codellama:13b
ollama pull codellama:34b
ollama pull starcoder2:3b
ollama pull starcoder2:7b
ollama pull starcoder2:15b
```
## Quick Reference: What Fits on Your GPU {#quick-reference-what-fits-on-your-gpu}
Find your GPU (or unified memory) capacity and see what models you can run at full speed (100% VRAM, no CPU offload).
### 8GB VRAM (RTX 3060 8GB, RTX 4060, M1/M2 8GB)
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 4.9GB | Strong general | 42 tok/s |
| Qwen 2.5 7B | Q4_K_M | 4.4GB | Best for code/math | 45 tok/s |
| Gemma 3 4B | Q5_K_M | 3.0GB | Great for small tasks | 70 tok/s |
| Phi-4 Mini 3.8B | Q5_K_M | 2.8GB | Best tiny reasoner | 75 tok/s |
| DeepSeek R1 7B | Q4_K_M | 4.4GB | Chain-of-thought | 40 tok/s |
| Qwen 2.5 Coder 7B | Q4_K_M | 4.4GB | Best coding for 8GB | 43 tok/s |
Top pick: Llama 3.1 8B Q4_K_M for general use, Qwen 2.5 Coder 7B for programming.
### 12GB VRAM (RTX 3060 12GB, RTX 4070)
Everything from 8GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Qwen 2.5 14B | Q4_K_M | 8.7GB | Significant quality jump | 20 tok/s |
| Gemma 3 12B | Q4_K_M | 7.3GB | Excellent instruction | 24 tok/s |
| Phi-4 14B | Q4_K_M | 8.2GB | Strong reasoning | 22 tok/s |
| Llama 3.1 8B | Q5_K_M | 5.7GB | Higher quality 8B | 38 tok/s |
| CodeLlama 13B | Q4_K_M | 7.4GB | Solid code model | 25 tok/s |
Top pick: Qwen 2.5 14B Q4_K_M. The jump from 7B to 14B is the single biggest quality improvement per dollar in local AI.
### 16GB VRAM (RTX 4080, RTX 5060 Ti, M1 Pro/M2 Pro 16GB)
Everything from 12GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Mistral Small 24B | Q4_K_M | 14.0GB | Strong all-rounder | 36 tok/s |
| Gemma 3 27B | Q4_K_M | 15.9GB | Near-70B quality | 32 tok/s |
| Qwen 2.5 14B | Q5_K_M | 10.3GB | Higher quality 14B | 18 tok/s |
Top pick: Gemma 3 27B Q4_K_M squeezes in and delivers impressive quality — but note its 17GB Min VRAM figure: on a 16GB card the 15.9GB file leaves almost no room for KV cache, so expect a minimal context window or slight CPU offload.
### 24GB VRAM (RTX 3090, RTX 4090, M3 24GB)
Everything from 16GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Qwen 2.5 32B | Q4_K_M | 18.8GB | Near-70B quality | 28 tok/s |
| Qwen 2.5 Coder 32B | Q4_K_M | 18.8GB | Best local coding | 27 tok/s |
| DeepSeek R1 32B | Q4_K_M | 18.8GB | Best local reasoning | 25 tok/s |
| CodeLlama 34B | Q4_K_M | 19.1GB | Mature code model | 26 tok/s |
| Qwen 2.5 14B | FP16 | 29.5GB | — (too large) | — |
Top pick: Qwen 2.5 32B Q4_K_M. Best overall model that fits on a single consumer GPU. This is the sweet spot.
### 32GB Unified Memory (M2 Pro/Max 32GB, M3 Pro 36GB)
Everything from 24GB at slightly lower speed, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Mixtral 8x7B | Q4_K_M | 26.4GB | MoE, broad knowledge | 18 tok/s |
| Qwen 2.5 32B | Q5_K_M | 22.5GB | Higher quality 32B | 20 tok/s |
### 48GB+ Unified Memory (M3 Max 48GB, M4 Max 48GB+)
Everything from 32GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.3 70B | Q4_K_M | 40.0GB | Top-tier open model | 12 tok/s |
| Qwen 2.5 72B | Q4_K_M | 42.0GB | Excellent multilingual | 11 tok/s |
| DeepSeek R1 70B | Q4_K_M | 40.0GB | Best open reasoning | 10 tok/s |
Top pick: Llama 3.3 70B Q4_K_M. Running a 70B model on a laptop is genuinely impressive. Speed is acceptable for interactive use.
### 64GB+ (M4 Ultra, dual GPU, server)
Everything from 48GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.3 70B | Q5_K_M | 48.0GB | Best quality 70B | 10 tok/s |
| Qwen 2.5 72B | Q5_K_M | 50.0GB | Higher quality 72B | 9 tok/s |
| Mixtral 8x22B | Q4_K_M | 80.0GB | Needs 82GB+ | (64GB not enough) |
For more on choosing hardware for your target models, see our RAM requirements guide and VRAM requirements guide.
## Ollama Pull Commands: Every Model {#ollama-pull-commands-every-model}
Copy-paste ready. Every model referenced in this article:
```shell
# === LLAMA FAMILY ===
ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama pull llama3.1:8b                     # Q4_K_M by default
ollama pull llama3.3:70b-instruct-q4_K_M

# === QWEN FAMILY ===
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen3:8b
ollama pull qwen3:32b

# === GEMMA FAMILY ===
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b

# === PHI FAMILY ===
ollama pull phi4-mini
ollama pull phi4

# === MISTRAL/MIXTRAL ===
ollama pull mistral
ollama pull mistral-small
ollama pull mixtral

# === DEEPSEEK R1 DISTILLS ===
ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:8b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:70b

# === CODE MODELS ===
ollama pull qwen2.5-coder:1.5b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:32b
ollama pull codellama:7b
ollama pull codellama:13b
ollama pull codellama:34b
ollama pull starcoder2:3b
ollama pull starcoder2:7b
ollama pull starcoder2:15b
```
Browse the full model library at ollama.com/library.
## Memory Math: How to Calculate Any Model {#memory-math-how-to-calculate-any-model}
If a model is not in this table, you can estimate its VRAM requirement:
```text
FP16 size (GB)   = Parameters (B) × 2
Q4_K_M size (GB) ≈ FP16 × 0.28 to 0.32   (varies by architecture)
Q5_K_M size (GB) ≈ FP16 × 0.34 to 0.38
Q8_0 size (GB)   ≈ FP16 × 0.50 to 0.55

Min VRAM needed  = model file size + 1.0 GB (KV cache at 2K context)
                   + 0.5 GB per 2K additional context tokens
```
Example: A new 20B model you want to run at Q4_K_M:
- FP16 size: 20 × 2 = 40GB
- Q4_K_M: 40 × 0.30 = ~12GB
- Min VRAM: 12 + 1.0 = ~13GB
- Fits on a 16GB GPU with room for 4K context
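The same arithmetic can be wrapped in a small helper. This is a sketch of the estimate above, nothing more; the 0.30 Q4_K_M ratio is just the midpoint of the 0.28 to 0.32 range:

```python
# Midpoints of the size ratios listed above (relative to FP16)
QUANT_RATIO = {"q4_k_m": 0.30, "q5_k_m": 0.36, "q8_0": 0.52, "fp16": 1.0}

def estimate_vram_gb(params_b: float, quant: str = "q4_k_m",
                     context_tokens: int = 2048) -> float:
    """Estimated VRAM (GB) to fully load a model: quantized file size,
    plus 1GB KV cache at 2K context, plus 0.5GB per extra 2K tokens."""
    fp16_gb = params_b * 2
    file_gb = fp16_gb * QUANT_RATIO[quant]
    extra_ctx_gb = 0.5 * max(0, context_tokens - 2048) / 2048
    return round(file_gb + 1.0 + extra_ctx_gb, 1)

print(estimate_vram_gb(20))                       # 13.0 -> matches the worked example
print(estimate_vram_gb(20, context_tokens=8192))  # 14.5 -> same model with 8K context
```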
For a thorough explanation of quantization levels and their quality tradeoffs, read our quantization explained guide. To find the best models for tight memory budgets, see best models for 8GB RAM.
## Keep This Bookmarked
This table gets updated as new models release. The Ollama ecosystem moves fast — new model families appear every few weeks, and existing ones get updated quantization options.
The core principle stays constant: check the Q4_K_M file size, add 1-1.5GB for overhead, and compare against your available VRAM. If it fits with room to spare, you will have a good experience. If it barely fits, expect limited context windows and occasional slowdowns.
Building a new machine around a specific model? Start with the hardware requirements guide to size your GPU, RAM, and storage correctly.