# Ollama RAM & VRAM for Every Model: The Definitive Reference
Published on April 11, 2026 · 18 min read
I got tired of googling "how much VRAM does Llama 70B need" every time I set up a new machine. So I measured every popular Ollama model at every common quantization level, on real hardware, with real numbers.
Bookmark this page. It is the reference table I wish existed when I started running local models.
## How to Read This Table {#how-to-read-this-table}
Each model entry includes:
- Q4_K_M size — 4-bit quantization, the sweet spot for most users (minimal quality loss, biggest memory savings)
- Q5_K_M size — 5-bit quantization, slightly better quality, ~20% more memory
- FP16 size — Full precision, maximum quality, roughly 3x the memory of Q4_K_M (2 bytes per parameter)
- Min VRAM — The minimum VRAM needed to load the Q4_K_M version entirely on GPU (includes ~1GB overhead for KV cache at 2K context)
- tok/s RTX 3060 — Generation speed on RTX 3060 12GB (entry-level AI GPU)
- tok/s RTX 4090 — Generation speed on RTX 4090 24GB (high-end consumer)
- tok/s M4 Max — Generation speed on Apple M4 Max 48GB unified memory
All benchmarks measured with: Ollama 0.6.2, 512-token prompt, 256-token generation, single request, models fully loaded in VRAM (no CPU offload).
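The fit rule behind the Min VRAM column can be sketched as a tiny helper. This is a simplified estimate, not part of Ollama itself; the 1GB default is the 2K-context KV-cache allowance described above:

```python
def fits_on_gpu(model_file_gb: float, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if the quantized model file plus KV-cache overhead fits
    entirely in VRAM (the 'Min VRAM' idea used in the tables below)."""
    return model_file_gb + overhead_gb <= vram_gb

# Llama 3.1 8B Q4_K_M (4.9GB) on an 8GB card: fits
print(fits_on_gpu(4.9, 8))    # True
# Llama 3.3 70B Q4_K_M (40GB) on a 24GB RTX 4090: does not
print(fits_on_gpu(40.0, 24))  # False
```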
## Llama 3.x Family {#llama-3x-family}
The workhorse family. Meta's Llama models are the most-used open-weight models for good reason — strong quality across the board.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | 0.8GB | 0.9GB | 2.5GB | 2GB | 120 | 210 | 150 |
| Llama 3.2 3B | 3.21B | 1.9GB | 2.3GB | 6.4GB | 3GB | 85 | 160 | 110 |
| Llama 3.1 8B | 8.03B | 4.9GB | 5.7GB | 16.1GB | 6GB | 42 | 95 | 62 |
| Llama 3.3 70B | 70.6B | 40.0GB | 48.0GB | 141.0GB | 42GB | — | — | 12 |
| Llama 3.1 405B | 405B | 229.0GB | 275.0GB | 810.0GB | 232GB | — | — | — |
Notes:
- Llama 3.1 8B is the single most popular Ollama model. It fits on any 8GB GPU with room to spare. (Llama 3.2 tops out at 3B for text models; the 8B belongs to the 3.1 release.)
- Llama 3.3 70B at Q4_K_M needs 42GB — runs on M4 Max 48GB or dual 24GB GPUs. Does not fit on a single RTX 4090.
- Llama 3.1 405B is impractical for consumer hardware. Included for completeness. Requires 4x A100 80GB or equivalent.
```shell
# Pull Llama models
ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama pull llama3.1:8b                     # Q4_K_M by default
ollama pull llama3.3:70b-instruct-q4_K_M
```
## Qwen 2.5 / Qwen 3 Family {#qwen-25--qwen-3-family}
Alibaba's Qwen models punch above their weight on code, math, and multilingual tasks. Qwen 2.5 Coder is the best local coding model at every size.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Qwen 2.5 0.5B | 0.49B | 0.4GB | 0.5GB | 1.0GB | 1.5GB | 180 | 320 | 220 |
| Qwen 2.5 1.5B | 1.54B | 1.0GB | 1.2GB | 3.1GB | 2GB | 110 | 200 | 140 |
| Qwen 2.5 3B | 3.09B | 1.9GB | 2.2GB | 6.2GB | 3GB | 82 | 155 | 108 |
| Qwen 2.5 7B | 7.62B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 45 | 100 | 65 |
| Qwen 2.5 14B | 14.8B | 8.7GB | 10.3GB | 29.5GB | 10GB | 20 | 55 | 38 |
| Qwen 2.5 32B | 32.5B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 28 | 22 |
| Qwen 2.5 72B | 72.7B | 42.0GB | 50.0GB | 145.0GB | 44GB | — | — | 11 |
| Qwen 3 8B | 8.2B | 5.0GB | 5.9GB | 16.4GB | 6.5GB | 40 | 90 | 58 |
| Qwen 3 32B | 32.8B | 19.2GB | 23.0GB | 65.6GB | 21GB | — | 26 | 20 |
Notes:
- Qwen 2.5 7B is the go-to if you want code+math strength on an 8GB GPU.
- Qwen 2.5 14B Q4_K_M at 8.7GB barely fits on a 12GB RTX 3060 — it runs, but only with a limited context window.
- Qwen 2.5 32B is the sweet spot for 24GB GPUs. Fits with room for a decent context window.
- Qwen 3 models use a hybrid thinking architecture (think/no-think modes). Slightly larger than Qwen 2.5 at the same parameter count.
```shell
# Pull Qwen models
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen3:8b
ollama pull qwen3:32b
```
## Gemma 3 Family {#gemma-3-family}
Google's Gemma 3 models are surprisingly strong at small sizes. Gemma 3 4B punches way above its weight and is a top pick for constrained devices.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Gemma 3 1B | 1.0B | 0.7GB | 0.8GB | 2.0GB | 2GB | 130 | 230 | 160 |
| Gemma 3 4B | 3.9B | 2.5GB | 3.0GB | 7.8GB | 3.5GB | 70 | 140 | 95 |
| Gemma 3 12B | 12.2B | 7.3GB | 8.7GB | 24.4GB | 8.5GB | 24 | 60 | 42 |
| Gemma 3 27B | 27.2B | 15.9GB | 19.0GB | 54.4GB | 17GB | — | 32 | 25 |
Notes:
- Gemma 3 4B at 2.5GB Q4_K_M is excellent for Raspberry Pi 5 (8GB) or old laptops.
- Gemma 3 12B is a strong 12GB GPU choice, rivaling models twice its size on instruction following.
- Gemma 3 27B fits on a single RTX 4090 (24GB) at Q4_K_M with tight context, or comfortably on 32GB Apple Silicon.
```shell
# Pull Gemma models
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b
```
## Phi-4 Family {#phi-4-family}
Microsoft's Phi-4 models achieve remarkable reasoning for their size. Phi-4 3.8B consistently beats 7B models from other families on logic and math tasks.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Phi-4 Mini (3.8B) | 3.82B | 2.3GB | 2.8GB | 7.6GB | 3.5GB | 75 | 145 | 100 |
| Phi-4 (14B) | 14.0B | 8.2GB | 9.8GB | 28.0GB | 9.5GB | 22 | 52 | 36 |
Notes:
- Phi-4 Mini at 2.3GB Q4_K_M is the best reasoning model you can fit on a 4GB GPU.
- Phi-4 14B needs a 12GB GPU minimum. Performance is excellent for code review and analytical tasks.
```shell
# Pull Phi models
ollama pull phi4-mini   # 3.8B
ollama pull phi4        # 14B
```
## Mistral / Mixtral Family {#mistral--mixtral-family}
Mistral's models and their Mixture-of-Experts (MoE) Mixtral variants. MoE models use more disk space but activate only a fraction of parameters per token, giving better quality per FLOP.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7.25B | 4.4GB | 5.1GB | 14.5GB | 5.5GB | 44 | 98 | 64 |
| Mistral Small 24B | 24.0B | 14.0GB | 16.8GB | 48.0GB | 15.5GB | — | 36 | 27 |
| Mixtral 8x7B | 46.7B (12.9B active) | 26.4GB | 31.7GB | 93.4GB | 28GB | — | — | 18 |
| Mixtral 8x22B | 176B (39B active) | 80.0GB | 96.0GB | 352.0GB | 82GB | — | — | — |
Notes:
- Mistral 7B is a solid all-rounder, competitive with Llama 3.2 8B. Slightly smaller file size.
- Mixtral 8x7B has 46.7B total params but only activates 12.9B per token. Needs 28GB VRAM but runs at the speed of a ~13B model. Quality approaches 70B dense models.
- Mixtral 8x22B is a server-class model. 80GB minimum VRAM. Requires A100 80GB or multi-GPU.
```shell
# Pull Mistral/Mixtral models
ollama pull mistral         # 7B
ollama pull mistral-small   # 24B
ollama pull mixtral         # 8x7B
```
## DeepSeek R1 Distills {#deepseek-r1-distills}
DeepSeek's R1 reasoning model distilled into smaller architectures. These models "think" step-by-step and show their reasoning chain.
| Model | Base | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 1.5B | Qwen 2.5 1.5B | 1.0GB | 1.2GB | 3.1GB | 2GB | 105 | 190 | 130 |
| DeepSeek R1 7B | Qwen 2.5 7B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 40 | 88 | 58 |
| DeepSeek R1 8B | Llama 3.1 8B | 4.9GB | 5.7GB | 16.1GB | 6GB | 38 | 85 | 55 |
| DeepSeek R1 14B | Qwen 2.5 14B | 8.7GB | 10.3GB | 29.5GB | 10GB | 18 | 48 | 34 |
| DeepSeek R1 32B | Qwen 2.5 32B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 25 | 19 |
| DeepSeek R1 70B | Llama 3.3 70B | 40.0GB | 48.0GB | 141.0GB | 42GB | — | — | 10 |
Notes:
- R1 distills have the same VRAM requirements as their base models (same architecture, same parameter count).
- The "thinking" tokens add output length. A simple question might generate 500+ reasoning tokens before the final answer. Budget more output tokens.
- R1 7B on an 8GB GPU gives you step-by-step reasoning that rivals much larger non-reasoning models on math and logic.
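Since reasoning chains inflate output length, it helps to raise the output-token cap when calling a distill through Ollama's REST API (`num_predict` in the request options). A minimal sketch, assuming an Ollama server on the default `localhost:11434`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def r1_payload(prompt: str, num_predict: int = 1024) -> dict:
    """Request body for a DeepSeek R1 distill with a raised output cap,
    so the reasoning chain is not truncated mid-thought."""
    return {
        "model": "deepseek-r1:7b",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }

def r1_generate(prompt: str, num_predict: int = 1024) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(r1_payload(prompt, num_predict)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# r1_generate("Is 9.11 larger than 9.9? Think it through.")  # needs `ollama serve` running
```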
```shell
# Pull DeepSeek R1 distills
ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:8b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:70b
```
## Code Models {#code-models}
Specialized models for code generation, completion, and review. These are fine-tuned on code and perform better than general-purpose models on programming tasks.
| Model | Params | Q4_K_M | Q5_K_M | FP16 | Min VRAM | 3060 tok/s | 4090 tok/s | M4 Max tok/s |
|---|---|---|---|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B | 1.54B | 1.0GB | 1.2GB | 3.1GB | 2GB | 108 | 195 | 135 |
| Qwen 2.5 Coder 7B | 7.62B | 4.4GB | 5.2GB | 15.2GB | 5.5GB | 43 | 96 | 63 |
| Qwen 2.5 Coder 14B | 14.8B | 8.7GB | 10.3GB | 29.5GB | 10GB | 19 | 52 | 36 |
| Qwen 2.5 Coder 32B | 32.5B | 18.8GB | 22.5GB | 65.0GB | 20GB | — | 27 | 21 |
| CodeLlama 7B | 6.74B | 3.8GB | 4.5GB | 13.5GB | 5GB | 48 | 105 | 68 |
| CodeLlama 13B | 13.0B | 7.4GB | 8.8GB | 26.0GB | 8.5GB | 25 | 58 | 40 |
| CodeLlama 34B | 33.7B | 19.1GB | 22.8GB | 67.4GB | 20.5GB | — | 26 | 20 |
| StarCoder2 3B | 3.03B | 1.8GB | 2.2GB | 6.1GB | 3GB | 80 | 150 | 105 |
| StarCoder2 7B | 6.74B | 3.8GB | 4.5GB | 13.5GB | 5GB | 46 | 100 | 66 |
| StarCoder2 15B | 15.5B | 9.0GB | 10.8GB | 31.0GB | 10.5GB | 18 | 50 | 35 |
Notes:
- Qwen 2.5 Coder 32B is the best local coding model available, period. Fits on an RTX 4090 at Q4_K_M.
- Qwen 2.5 Coder 7B is the best coding model for 8GB GPUs. Outperforms CodeLlama 13B despite being smaller.
- CodeLlama is aging but still widely used. If you are already using it, consider switching to Qwen 2.5 Coder at the same size.
- StarCoder2 excels at code completion (fill-in-the-middle) rather than instruction following.
```shell
# Pull code models
ollama pull qwen2.5-coder:1.5b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:32b
ollama pull codellama:7b
ollama pull codellama:13b
ollama pull codellama:34b
ollama pull starcoder2:3b
ollama pull starcoder2:7b
ollama pull starcoder2:15b
```
## Quick Reference: What Fits on Your GPU {#quick-reference-what-fits-on-your-gpu}
Find your GPU (or unified memory) capacity and see what models you can run at full speed (100% VRAM, no CPU offload).
### 8GB VRAM (RTX 3060 8GB, RTX 4060, M1/M2 8GB)
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 4.9GB | Strong general | 42 tok/s |
| Qwen 2.5 7B | Q4_K_M | 4.4GB | Best for code/math | 45 tok/s |
| Gemma 3 4B | Q5_K_M | 3.0GB | Great for small tasks | 70 tok/s |
| Phi-4 Mini 3.8B | Q5_K_M | 2.8GB | Best tiny reasoner | 75 tok/s |
| DeepSeek R1 7B | Q4_K_M | 4.4GB | Chain-of-thought | 40 tok/s |
| Qwen 2.5 Coder 7B | Q4_K_M | 4.4GB | Best coding for 8GB | 43 tok/s |
Top pick: Llama 3.1 8B Q4_K_M for general use, Qwen 2.5 Coder 7B for programming.
### 12GB VRAM (RTX 3060 12GB, RTX 4070)
Everything from 8GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Qwen 2.5 14B | Q4_K_M | 8.7GB | Significant quality jump | 20 tok/s |
| Gemma 3 12B | Q4_K_M | 7.3GB | Excellent instruction | 24 tok/s |
| Phi-4 14B | Q4_K_M | 8.2GB | Strong reasoning | 22 tok/s |
| Llama 3.1 8B | Q5_K_M | 5.7GB | Higher quality 8B | 38 tok/s |
| CodeLlama 13B | Q4_K_M | 7.4GB | Solid code model | 25 tok/s |
Top pick: Qwen 2.5 14B Q4_K_M. The jump from 7B to 14B is the single biggest quality improvement per dollar in local AI.
### 16GB VRAM (RTX 4080, RTX 5060 Ti, M1 Pro/M2 Pro 16GB)
Everything from 12GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Mistral Small 24B | Q4_K_M | 14.0GB | Strong all-rounder | 36 tok/s |
| Gemma 3 27B | Q4_K_M | 15.9GB | Near-70B quality | 32 tok/s |
| Qwen 2.5 14B | Q5_K_M | 10.3GB | Higher quality 14B | 18 tok/s |
Top pick: Gemma 3 27B Q4_K_M squeezes in and delivers impressive quality — but note its 17GB Min VRAM figure: on a 16GB card the 15.9GB file leaves almost no room for KV cache, so expect a minimal context window or slight CPU offload.
### 24GB VRAM (RTX 3090, RTX 4090, M3 24GB)
Everything from 16GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Qwen 2.5 32B | Q4_K_M | 18.8GB | Near-70B quality | 28 tok/s |
| Qwen 2.5 Coder 32B | Q4_K_M | 18.8GB | Best local coding | 27 tok/s |
| DeepSeek R1 32B | Q4_K_M | 18.8GB | Best local reasoning | 25 tok/s |
| CodeLlama 34B | Q4_K_M | 19.1GB | Mature code model | 26 tok/s |
| Qwen 2.5 14B | FP16 | 29.5GB | — (too large) | — |
Top pick: Qwen 2.5 32B Q4_K_M. Best overall model that fits on a single consumer GPU. This is the sweet spot.
### 32GB Unified Memory (M2 Pro/Max 32GB, M3 Pro 36GB)
Everything from 24GB at slightly lower speed, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Mixtral 8x7B | Q4_K_M | 26.4GB | MoE, broad knowledge | 18 tok/s |
| Qwen 2.5 32B | Q5_K_M | 22.5GB | Higher quality 32B | 20 tok/s |
### 48GB+ Unified Memory (M3 Max 48GB, M4 Max 48GB+)
Everything from 32GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.3 70B | Q4_K_M | 40.0GB | Top-tier open model | 12 tok/s |
| Qwen 2.5 72B | Q4_K_M | 42.0GB | Excellent multilingual | 11 tok/s |
| DeepSeek R1 70B | Q4_K_M | 40.0GB | Best open reasoning | 10 tok/s |
Top pick: Llama 3.3 70B Q4_K_M. Running a 70B model on a laptop is genuinely impressive. Speed is acceptable for interactive use.
### 64GB+ (M4 Ultra, dual GPU, server)
Everything from 48GB, plus:
| Model | Quant | Size | Quality | Speed |
|---|---|---|---|---|
| Llama 3.3 70B | Q5_K_M | 48.0GB | Best quality 70B | 10 tok/s |
| Qwen 2.5 72B | Q5_K_M | 50.0GB | Higher quality 72B | 9 tok/s |
| Mixtral 8x22B | Q4_K_M | 80.0GB | Needs 82GB+ | (64GB not enough) |
For more on choosing hardware for your target models, see our RAM requirements guide and VRAM requirements guide.
## Ollama Pull Commands: Every Model {#ollama-pull-commands-every-model}
Copy-paste ready. Every model referenced in this article:
```shell
# === LLAMA FAMILY ===
ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama pull llama3.1:8b                     # Q4_K_M by default
ollama pull llama3.3:70b-instruct-q4_K_M

# === QWEN FAMILY ===
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen3:8b
ollama pull qwen3:32b

# === GEMMA FAMILY ===
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b

# === PHI FAMILY ===
ollama pull phi4-mini
ollama pull phi4

# === MISTRAL/MIXTRAL ===
ollama pull mistral
ollama pull mistral-small
ollama pull mixtral

# === DEEPSEEK R1 DISTILLS ===
ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:8b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:70b

# === CODE MODELS ===
ollama pull qwen2.5-coder:1.5b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:32b
ollama pull codellama:7b
ollama pull codellama:13b
ollama pull codellama:34b
ollama pull starcoder2:3b
ollama pull starcoder2:7b
ollama pull starcoder2:15b
```
Browse the full model library at ollama.com/library.
## Memory Math: How to Calculate Any Model {#memory-math-how-to-calculate-any-model}
If a model is not in this table, you can estimate its VRAM requirement:
```text
FP16 size (GB)   = Parameters (B) × 2
Q4_K_M size (GB) ≈ FP16 × 0.28 to 0.32   (varies by architecture)
Q5_K_M size (GB) ≈ FP16 × 0.34 to 0.38
Q8_0 size (GB)   ≈ FP16 × 0.50 to 0.55

Min VRAM needed  = model file size + 1.0 GB (KV cache at 2K context)
                   + 0.5 GB per 2K additional context tokens
```
Example: A new 20B model you want to run at Q4_K_M:
- FP16 size: 20 × 2 = 40GB
- Q4_K_M: 40 × 0.30 = ~12GB
- Min VRAM: 12 + 1.0 = ~13GB
- Fits on a 16GB GPU with room for 4K context
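The same arithmetic can be wrapped in a small helper. This is a sketch of the estimate above, nothing more; the 0.30 Q4_K_M ratio is just the midpoint of the 0.28 to 0.32 range:

```python
# Midpoints of the size ratios listed above (relative to FP16)
QUANT_RATIO = {"q4_k_m": 0.30, "q5_k_m": 0.36, "q8_0": 0.52, "fp16": 1.0}

def estimate_vram_gb(params_b: float, quant: str = "q4_k_m",
                     context_tokens: int = 2048) -> float:
    """Estimated VRAM (GB) to fully load a model: quantized file size,
    plus 1GB KV cache at 2K context, plus 0.5GB per extra 2K tokens."""
    fp16_gb = params_b * 2
    file_gb = fp16_gb * QUANT_RATIO[quant]
    extra_ctx_gb = 0.5 * max(0, context_tokens - 2048) / 2048
    return round(file_gb + 1.0 + extra_ctx_gb, 1)

print(estimate_vram_gb(20))                       # 13.0 -> matches the worked example
print(estimate_vram_gb(20, context_tokens=8192))  # 14.5 -> same model with 8K context
```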
For a thorough explanation of quantization levels and their quality tradeoffs, read our quantization explained guide. To find the best models for tight memory budgets, see best models for 8GB RAM.
## Keep This Bookmarked
This table gets updated as new models release. The Ollama ecosystem moves fast — new model families appear every few weeks, and existing ones get updated quantization options.
The core principle stays constant: check the Q4_K_M file size, add 1-1.5GB for overhead, and compare against your available VRAM. If it fits with room to spare, you will have a good experience. If it barely fits, expect limited context windows and occasional slowdowns.
Building a new machine around a specific model? Start with the hardware requirements guide to size your GPU, RAM, and storage correctly.