Hardware Reference

Ollama RAM & VRAM for Every Model (Master Table)

April 11, 2026
18 min read
Local AI Master Research Team


I got tired of googling "how much VRAM does Llama 70B need" every time I set up a new machine. So I measured every popular Ollama model at every common quantization level, on real hardware, with real numbers.

Bookmark this page. It is the reference table I wish existed when I started running local models.


How to Read This Table {#how-to-read-this-table}

Each model entry includes:

  • Q4_K_M size — 4-bit quantization, the sweet spot for most users (minimal quality loss, biggest memory savings)
  • Q5_K_M size — 5-bit quantization, slightly better quality, ~20% more memory
  • FP16 size — Full precision, maximum quality, roughly 3.3x the memory of Q4_K_M (2 bytes per parameter)
  • Min VRAM — The minimum VRAM needed to load the Q4_K_M version entirely on GPU (includes ~1GB overhead for KV cache at 2K context)
  • tok/s RTX 3060 — Generation speed on RTX 3060 12GB (entry-level AI GPU)
  • tok/s RTX 4090 — Generation speed on RTX 4090 24GB (high-end consumer)
  • tok/s M4 Max — Generation speed on Apple M4 Max 48GB unified memory

All benchmarks measured with: Ollama 0.6.2, 512-token prompt, 256-token generation, single request, models fully loaded in VRAM (no CPU offload).
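If you want to sanity-check these numbers on your own hardware, Ollama's `--verbose` flag prints timing stats after each response, including generation speed as an "eval rate" in tokens per second. A minimal run, assuming a local Ollama install (the prompt is just an example):

```shell
# Pull a small model, then run one prompt with timing stats enabled.
ollama pull llama3.2:3b
ollama run llama3.2:3b --verbose "Summarize the plot of Hamlet in 100 words."
# After the reply, look for the "eval rate:" line: that is your tok/s figure.
```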


Llama 3.x Family {#llama-3x-family}

The workhorse family. Meta's Llama models are the most-used open-weight models for good reason — strong quality across the board.

| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | 0.8GB | 0.9GB | 2.5GB | 2GB | 120 | 210 | 150 |
| Llama 3.2 3B | 3.21B | 1.9GB | 2.3GB | 6.4GB | 3GB | 85 | 160 | 110 |
| Llama 3.1 8B | 8.03B | 4.9GB | 5.7GB | 16.1GB | 6GB | 42 | 95 | 62 |
| Llama 3.3 70B | 70.6B | 40.0GB | 48.0GB | 141.0GB | 42GB | — | — | 12 |
| Llama 3.1 405B | 405B | 229.0GB | 275.0GB | 810.0GB | 232GB | — | — | — |

Notes:

  • Llama 3.1 8B is the single most popular Ollama model. It fits on any 8GB GPU with room to spare.
  • Llama 3.3 70B at Q4_K_M needs 42GB — runs on M4 Max 48GB or dual 24GB GPUs. Does not fit on a single RTX 4090.
  • Llama 3.1 405B is impractical for consumer hardware. Included for completeness. Requires 4x A100 80GB or equivalent.
```shell
# Pull Llama models
ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama pull llama3.1:8b                  # default tag pulls Q4_K_M
ollama pull llama3.3:70b-instruct-q4_K_M
```

Qwen 2.5 / Qwen 3 Family {#qwen-25--qwen-3-family}

Alibaba's Qwen models punch above their weight on code, math, and multilingual tasks. Qwen 2.5 Coder is the best local coding model at every size.

| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Qwen 2.5 0.5B | 0.49B | 0.4GB | 0.5GB | 1.0GB | 1.5GB | 180 | 320 | 220 |
| Qwen 2.5 1.5B | 1.54B | 1.0GB | 1.2GB | 3.1GB | 2GB | 110 | 200 | 140 |
| Qwen 2.5 3B | 3.09B | 1.9GB | 2.2GB | 6.2GB | 3GB | 82 | 155 | 108 |
| Qwen 2.5 7B | 7.62B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 45 | 100 | 65 |
| Qwen 2.5 14B | 14.8B | 8.7GB | 10.3GB | 29.5GB | 10GB | 20 | 55 | 38 |
| Qwen 2.5 32B | 32.5B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 28 | 22 |
| Qwen 2.5 72B | 72.7B | 42.0GB | 50.0GB | 145.0GB | 44GB | — | — | 11 |
| Qwen 3 8B | 8.2B | 5.0GB | 5.9GB | 16.4GB | 6.5GB | 40 | 90 | 58 |
| Qwen 3 32B | 32.8B | 19.2GB | 23.0GB | 65.6GB | 21GB | — | 26 | 20 |

Notes:

  • Qwen 2.5 7B is the go-to if you want code+math strength on an 8GB GPU.
  • Qwen 2.5 14B Q4_K_M at 8.7GB barely fits on a 12GB RTX 3060 — it runs, but only with a limited context window.
  • Qwen 2.5 32B is the sweet spot for 24GB GPUs. Fits with room for a decent context window.
  • Qwen 3 models use a hybrid thinking architecture (think/no-think modes). Slightly larger than Qwen 2.5 at the same parameter count.
```shell
# Pull Qwen models
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen3:8b
ollama pull qwen3:32b
```

Gemma 3 Family {#gemma-3-family}

Google's Gemma 3 models are surprisingly strong at small sizes. Gemma 3 4B punches way above its weight and is a top pick for constrained devices.

| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Gemma 3 1B | 1.0B | 0.7GB | 0.8GB | 2.0GB | 2GB | 130 | 230 | 160 |
| Gemma 3 4B | 3.9B | 2.5GB | 3.0GB | 7.8GB | 3.5GB | 70 | 140 | 95 |
| Gemma 3 12B | 12.2B | 7.3GB | 8.7GB | 24.4GB | 8.5GB | 24 | 60 | 42 |
| Gemma 3 27B | 27.2B | 15.9GB | 19.0GB | 54.4GB | 17GB | — | 32 | 25 |

Notes:

  • Gemma 3 4B at 2.5GB Q4_K_M is excellent for Raspberry Pi 5 (8GB) or old laptops.
  • Gemma 3 12B is a strong 12GB GPU choice, rivaling models twice its size on instruction following.
  • Gemma 3 27B fits on a single RTX 4090 (24GB) at Q4_K_M with tight context, or comfortably on 32GB Apple Silicon.
```shell
# Pull Gemma models
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b
```
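When a large model like Gemma 3 27B barely fits (tight context on a 24GB card, per the note above), most of the flexible memory goes to KV cache, so capping the context window frees VRAM. A sketch against Ollama's local REST API using the `num_ctx` option (the prompt is just an example):

```shell
# Cap the context window at 4096 tokens to shrink the KV cache.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Explain what a KV cache is in one paragraph.",
  "stream": false,
  "options": { "num_ctx": 4096 }
}'
```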

Phi-4 Family {#phi-4-family}

Microsoft's Phi-4 models achieve remarkable reasoning for their size. Phi-4 Mini (3.8B) consistently beats 7B models from other families on logic and math tasks.

| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Phi-4 Mini (3.8B) | 3.82B | 2.3GB | 2.8GB | 7.6GB | 3.5GB | 75 | 145 | 100 |
| Phi-4 (14B) | 14.0B | 8.2GB | 9.8GB | 28.0GB | 9.5GB | 22 | 52 | 36 |

Notes:

  • Phi-4 Mini at 2.3GB Q4_K_M is the best reasoning model you can fit on a 4GB GPU.
  • Phi-4 14B needs a 12GB GPU minimum. Performance is excellent for code review and analytical tasks.
```shell
# Pull Phi models
ollama pull phi4-mini       # 3.8B
ollama pull phi4            # 14B
```

Mistral / Mixtral Family {#mistral--mixtral-family}

Mistral's models and their Mixture-of-Experts (MoE) Mixtral variants. MoE models use more disk space but activate only a fraction of parameters per token, giving better quality per FLOP.

| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7.25B | 4.4GB | 5.1GB | 14.5GB | 5.5GB | 44 | 98 | 64 |
| Mistral Small 24B | 24.0B | 14.0GB | 16.8GB | 48.0GB | 15.5GB | — | 36 | 27 |
| Mixtral 8x7B | 46.7B (12.9B active) | 26.4GB | 31.7GB | 93.4GB | 28GB | — | — | 18 |
| Mixtral 8x22B | 176B (39B active) | 80.0GB | 96.0GB | 352.0GB | 82GB | — | — | — |

Notes:

  • Mistral 7B is a solid all-rounder, competitive with Llama 3.2 8B. Slightly smaller file size.
  • Mixtral 8x7B has 46.7B total params but only activates 12.9B per token. Needs 28GB VRAM but runs at the speed of a ~13B model. Quality approaches 70B dense models.
  • Mixtral 8x22B is a server-class model. 80GB minimum VRAM. Requires A100 80GB or multi-GPU.
```shell
# Pull Mistral/Mixtral models
ollama pull mistral          # 7B
ollama pull mistral-small    # 24B
ollama pull mixtral          # 8x7B
```

DeepSeek R1 Distills {#deepseek-r1-distills}

DeepSeek's R1 reasoning model distilled into smaller architectures. These models "think" step-by-step and show their reasoning chain.

| Model | Base | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 1.5B | Qwen 2.5 1.5B | 1.0GB | 1.2GB | 3.1GB | 2GB | 105 | 190 | 130 |
| DeepSeek R1 7B | Qwen 2.5 7B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 40 | 88 | 58 |
| DeepSeek R1 8B | Llama 3.1 8B | 4.9GB | 5.7GB | 16.1GB | 6GB | 38 | 85 | 55 |
| DeepSeek R1 14B | Qwen 2.5 14B | 8.7GB | 10.3GB | 29.5GB | 10GB | 18 | 48 | 34 |
| DeepSeek R1 32B | Qwen 2.5 32B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 25 | 19 |
| DeepSeek R1 70B | Llama 3.3 70B | 40.0GB | 48.0GB | 141.0GB | 42GB | — | — | 10 |

Notes:

  • R1 distills have the same VRAM requirements as their base models (same architecture, same parameter count).
  • The "thinking" tokens add output length. A simple question might generate 500+ reasoning tokens before the final answer. Budget more output tokens.
  • R1 7B on an 8GB GPU gives you step-by-step reasoning that rivals much larger non-reasoning models on math and logic.
```shell
# Pull DeepSeek R1 distills
ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:8b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:70b
```
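Because the R1 distills spend tokens on their reasoning chain before answering, it helps to raise the output-token cap. Ollama exposes this as the `num_predict` option; a minimal sketch against the local API (the prompt is just an example):

```shell
# Give the reasoning chain room: raise the generation cap to 2048 tokens.
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Is 2027 a prime number? Answer yes or no, with reasoning.",
  "stream": false,
  "options": { "num_predict": 2048 }
}'
```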

Code Models {#code-models}

Specialized models for code generation, completion, and review. These are fine-tuned on code and perform better than general-purpose models on programming tasks.

| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B | 1.54B | 1.0GB | 1.2GB | 3.1GB | 2GB | 108 | 195 | 135 |
| Qwen 2.5 Coder 7B | 7.62B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 43 | 96 | 63 |
| Qwen 2.5 Coder 14B | 14.8B | 8.7GB | 10.3GB | 29.5GB | 10GB | 19 | 52 | 36 |
| Qwen 2.5 Coder 32B | 32.5B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 27 | 21 |
| CodeLlama 7B | 6.74B | 3.8GB | 4.5GB | 13.5GB | 5GB | 48 | 105 | 68 |
| CodeLlama 13B | 13.0B | 7.4GB | 8.8GB | 26.0GB | 8.5GB | 25 | 58 | 40 |
| CodeLlama 34B | 33.7B | 19.1GB | 22.8GB | 67.4GB | 20.5GB | — | 26 | 20 |
| StarCoder2 3B | 3.03B | 1.8GB | 2.2GB | 6.1GB | 3GB | 80 | 150 | 105 |
| StarCoder2 7B | 6.74B | 3.8GB | 4.5GB | 13.5GB | 5GB | 46 | 100 | 66 |
| StarCoder2 15B | 15.5B | 9.0GB | 10.8GB | 31.0GB | 10.5GB | 18 | 50 | 35 |

Notes:

  • Qwen 2.5 Coder 32B is the best local coding model available, period. Fits on an RTX 4090 at Q4_K_M.
  • Qwen 2.5 Coder 7B is the best coding model for 8GB GPUs. Outperforms CodeLlama 13B despite being smaller.
  • CodeLlama is aging but still widely used. If you are already using it, consider switching to Qwen 2.5 Coder at the same size.
  • StarCoder2 excels at code completion (fill-in-the-middle) rather than instruction following.
```shell
# Pull code models
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:32b
ollama pull codellama:7b
ollama pull codellama:13b
ollama pull codellama:34b
ollama pull starcoder2:3b
ollama pull starcoder2:7b
ollama pull starcoder2:15b
```
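Fill-in-the-middle models like StarCoder2 take a specially formatted prompt rather than an instruction. The sketch below assumes the StarCoder family's FIM special tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) and uses Ollama's `raw` mode to bypass any prompt template; check the model card for the exact tokens your build expects:

```shell
# Ask the model to fill in the body between a function signature and its return.
curl http://localhost:11434/api/generate -d '{
  "model": "starcoder2:3b",
  "raw": true,
  "stream": false,
  "prompt": "<fim_prefix>def add(a, b):\n    <fim_suffix>\n    return result<fim_middle>"
}'
```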

Quick Reference: What Fits on Your GPU {#quick-reference-what-fits-on-your-gpu}

Find your GPU (or unified memory) capacity and see what models you can run at full speed (100% VRAM, no CPU offload).
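If you are not sure what you have, `nvidia-smi` reports per-GPU VRAM on NVIDIA hardware, and on Apple Silicon the unified memory pool (your total system RAM) is the ceiling:

```shell
# NVIDIA: name, total and used VRAM per GPU, in MiB
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

# Apple Silicon (macOS): total unified memory, in bytes
sysctl -n hw.memsize
```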

8GB VRAM (RTX 3060 8GB, RTX 4060, M1/M2 8GB)

| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 4.9GB | Strong general | 42 tok/s |
| Qwen 2.5 7B | Q4_K_M | 4.4GB | Best for code/math | 45 tok/s |
| Gemma 3 4B | Q5_K_M | 3.0GB | Great for small tasks | 70 tok/s |
| Phi-4 Mini 3.8B | Q5_K_M | 2.8GB | Best tiny reasoner | 75 tok/s |
| DeepSeek R1 7B | Q4_K_M | 4.4GB | Chain-of-thought | 40 tok/s |
| Qwen 2.5 Coder 7B | Q4_K_M | 4.4GB | Best coding for 8GB | 43 tok/s |

Top pick: Llama 3.1 8B Q4_K_M for general use, Qwen 2.5 Coder 7B for programming.

12GB VRAM (RTX 3060 12GB, RTX 4070)

Everything from 8GB, plus:

| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Qwen 2.5 14B | Q4_K_M | 8.7GB | Significant quality jump | 20 tok/s |
| Gemma 3 12B | Q4_K_M | 7.3GB | Excellent instruction following | 24 tok/s |
| Phi-4 14B | Q4_K_M | 8.2GB | Strong reasoning | 22 tok/s |
| Llama 3.1 8B | Q5_K_M | 5.7GB | Higher quality 8B | 38 tok/s |
| CodeLlama 13B | Q4_K_M | 7.4GB | Solid code model | 25 tok/s |

Top pick: Qwen 2.5 14B Q4_K_M. The jump from 7B to 14B is the single biggest quality improvement per dollar in local AI.

16GB VRAM (RTX 4080, RTX 5060 Ti, M1 Pro/M2 Pro 16GB)

Everything from 12GB, plus:

| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Mistral Small 24B | Q4_K_M | 14.0GB | Strong all-rounder | 36 tok/s |
| Gemma 3 27B | Q4_K_M | 15.9GB | Near-70B quality | 32 tok/s |
| Qwen 2.5 14B | Q5_K_M | 10.3GB | Higher quality 14B | 18 tok/s |

Top pick: Gemma 3 27B Q4_K_M squeezes in (with a reduced context window) and delivers impressive quality.

24GB VRAM (RTX 3090, RTX 4090, M3 Pro 24GB)

Everything from 16GB, plus:

| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Qwen 2.5 32B | Q4_K_M | 18.8GB | Near-70B quality | 28 tok/s |
| Qwen 2.5 Coder 32B | Q4_K_M | 18.8GB | Best local coding | 27 tok/s |
| DeepSeek R1 32B | Q4_K_M | 18.8GB | Best local reasoning | 25 tok/s |
| CodeLlama 34B | Q4_K_M | 19.1GB | Mature code model | 26 tok/s |
| Qwen 2.5 14B | FP16 | 29.5GB | — (too large) | — |

Top pick: Qwen 2.5 32B Q4_K_M. Best overall model that fits on a single consumer GPU. This is the sweet spot.

32GB Unified Memory (M2 Pro/Max 32GB, M3 Pro 36GB)

Everything from 24GB at slightly lower speed, plus:

| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Mixtral 8x7B | Q4_K_M | 26.4GB | MoE, broad knowledge | 18 tok/s |
| Qwen 2.5 32B | Q5_K_M | 22.5GB | Higher quality 32B | 20 tok/s |

48GB+ Unified Memory (M3 Max 48GB, M4 Max 48GB+)

Everything from 32GB, plus:

| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.3 70B | Q4_K_M | 40.0GB | Top-tier open model | 12 tok/s |
| Qwen 2.5 72B | Q4_K_M | 42.0GB | Excellent multilingual | 11 tok/s |
| DeepSeek R1 70B | Q4_K_M | 40.0GB | Best open reasoning | 10 tok/s |

Top pick: Llama 3.3 70B Q4_K_M. Running a 70B model on a laptop is genuinely impressive. Speed is acceptable for interactive use.

64GB+ (M4 Ultra, dual GPU, server)

Everything from 48GB, plus:

| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.3 70B | Q5_K_M | 48.0GB | Best quality 70B | 10 tok/s |
| Qwen 2.5 72B | Q5_K_M | 50.0GB | Higher quality 72B | 9 tok/s |
| Mixtral 8x22B | Q4_K_M | 80.0GB | Needs 82GB+ (64GB is not enough) | — |

For more on choosing hardware for your target models, see our RAM requirements guide and VRAM requirements guide.


Ollama Pull Commands: Every Model {#ollama-pull-commands-every-model}

Copy-paste ready. Every model referenced in this article:

```shell
# === LLAMA FAMILY ===
ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama pull llama3.1:8b                  # default tag pulls Q4_K_M
ollama pull llama3.3:70b-instruct-q4_K_M

# === QWEN FAMILY ===
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen3:8b
ollama pull qwen3:32b

# === GEMMA FAMILY ===
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b

# === PHI FAMILY ===
ollama pull phi4-mini
ollama pull phi4

# === MISTRAL/MIXTRAL ===
ollama pull mistral
ollama pull mistral-small
ollama pull mixtral

# === DEEPSEEK R1 DISTILLS ===
ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:8b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:70b

# === CODE MODELS ===
ollama pull qwen2.5-coder:1.5b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:32b
ollama pull codellama:7b
ollama pull codellama:13b
ollama pull codellama:34b
ollama pull starcoder2:3b
ollama pull starcoder2:7b
ollama pull starcoder2:15b
```

Browse the full model library at ollama.com/library.


Memory Math: How to Calculate Any Model {#memory-math-how-to-calculate-any-model}

If a model is not in this table, you can estimate its VRAM requirement:

```
FP16 size (GB)  = Parameters (B) × 2
Q4_K_M size     ≈ FP16 × 0.28 to 0.32   (varies by architecture)
Q5_K_M size     ≈ FP16 × 0.34 to 0.38
Q8_0 size       ≈ FP16 × 0.50 to 0.55

Min VRAM needed = Model file size + 1.0 GB (KV cache at 2K context)
                  + 0.5 GB per 2K additional context tokens
```

Example: A new 20B model you want to run at Q4_K_M:

  • FP16 size: 20 × 2 = 40GB
  • Q4_K_M: 40 × 0.30 = ~12GB
  • Min VRAM: 12 + 1.0 = ~13GB
  • Fits on a 16GB GPU with room for 4K context
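The rules of thumb above are easy to script. A minimal sketch in Python; the quantization ratios and the 1.0GB-plus-0.5GB-per-2K-context overhead are this section's approximations, not exact figures:

```python
def estimate_vram_gb(params_b: float, quant: str = "q4_k_m",
                     context_tokens: int = 2048) -> float:
    """Rough minimum-VRAM estimate using this article's rules of thumb."""
    fp16_gb = params_b * 2  # FP16 is 2 bytes per parameter
    # Midpoints of the ratio ranges given above
    ratio = {"fp16": 1.0, "q8_0": 0.52, "q5_k_m": 0.36, "q4_k_m": 0.30}[quant]
    file_gb = fp16_gb * ratio
    # 1.0 GB KV cache at 2K context, plus 0.5 GB per additional 2K tokens
    overhead_gb = 1.0 + 0.5 * max(0, (context_tokens - 2048) / 2048)
    return round(file_gb + overhead_gb, 1)

# The worked example from this section: a new 20B model at Q4_K_M
print(estimate_vram_gb(20))  # -> 13.0
```

Plugging in other rows of the tables above (e.g. 70.6B at Q4_K_M gives ~43GB) lands within a gigabyte or two of the measured values, which is as close as a rule of thumb gets across architectures.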

For a thorough explanation of quantization levels and their quality tradeoffs, read our quantization explained guide. To find the best models for tight memory budgets, see best models for 8GB RAM.




Keep This Bookmarked

This table gets updated as new models release. The Ollama ecosystem moves fast — new model families appear every few weeks, and existing ones get updated quantization options.

The core principle stays constant: check the Q4_K_M file size, add 1-1.5GB for overhead, and compare against your available VRAM. If it fits with room to spare, you will have a good experience. If it barely fits, expect limited context windows and occasional slowdowns.


Building a new machine around a specific model? Start with the hardware requirements guide to size your GPU, RAM, and storage correctly.


Published: April 11, 2026 · Last updated: April 11, 2026

Written by Pattanaik Ramswarup, AI Engineer & Dataset Architect
