VRAM Calculator for AI Models

Calculate exactly how much GPU memory you need to run any LLM locally. Enter the model parameters and quantization level to get instant VRAM estimates for Ollama, llama.cpp, and vLLM.

📅 Published: March 18, 2026 · 🔄 Last Updated: March 18, 2026 · ✓ Manually Reviewed

Calculate Your VRAM Needs

Example output from the calculator (for a ~7B model at Q4_K_M quantization with 4K context):

✅ Model Fits in Your GPU!

Total VRAM needed: 5.4 GB
Model weights: 3.9 GB
KV cache: 0.5 GB
Overhead: 1.0 GB

On an RTX 4090 (24 GB), this uses about 23% of available VRAM. Expected speed: full GPU speed.

How VRAM Is Calculated

GPU memory for LLMs has three main components:

Model Weights

The largest component. Formula: Parameters x Bytes_per_param. A 7B model at FP16 (2 bytes) = 14 GB. At Q4 (~0.56 bytes) = ~4 GB. Quantization is the primary lever for reducing this.
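As a sketch, the weights term can be computed directly. The bytes-per-parameter values below are approximations for common GGUF formats (Q8_0 stores roughly 8.5 bits per weight, Q4_K_M roughly 4.5):

```python
# Approximate bytes per parameter for common precision/quantization formats
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.06, "Q5_K_M": 0.71, "Q4_K_M": 0.56}

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate model-weight memory in GB: parameters x bytes per parameter."""
    return params_billion * bytes_per_param

print(weights_gb(7, BYTES_PER_PARAM["FP16"]))    # 14.0 GB
print(weights_gb(7, BYTES_PER_PARAM["Q4_K_M"]))  # ~3.9 GB
```

This is why quantization dominates the budget: dropping from FP16 to Q4_K_M cuts the weights term by roughly 3.5x.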

KV Cache

Grows with context length. Stores attention key/value pairs for each token. At 4K context: ~0.5-1 GB. At 32K: 2-4 GB. At 128K: 8-16 GB. Models with long context need more KV cache.
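The KV cache can be sized from the model's architecture. A sketch assuming an FP16 cache and a Llama-3-8B-style config (32 layers, 8 KV heads via grouped-query attention, head dimension 128); models without GQA use more KV heads and therefore more cache:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache in GiB: 2 (K and V) x layers x KV heads x head dim x bytes x tokens."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 2**30

# Llama-3-8B-style config, FP16 cache
print(kv_cache_gib(32, 8, 128, 4_096))    # 0.5 GiB at 4K context
print(kv_cache_gib(32, 8, 128, 131_072))  # 16.0 GiB at 128K context
```

Note the cache grows linearly with context length, which is why the 128K figure is 32x the 4K figure.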

Runtime Overhead

CUDA kernels, activation memory, and framework buffers. Typically 0.5-1.5 GB depending on the inference engine. Ollama and llama.cpp are efficient; vLLM uses more for batched serving.

MoE models note: Mixture-of-Experts models like Qwen3-Coder (480B) or GPT-OSS require VRAM for all parameters, not just active ones. A 480B MoE with 35B active still needs ~250 GB for the full model weights at Q4.
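Putting the three components together, a minimal estimator sketch (the 0.56 bytes/parameter for Q4_K_M, the 0.5 GB KV cache at 4K context, and the 1.0 GB overhead are the approximations used above, not exact values):

```python
def total_vram_gb(params_billion: float, bytes_per_param: float = 0.56,
                  kv_cache_gb: float = 0.5, overhead_gb: float = 1.0) -> float:
    """Estimate total VRAM: weights + KV cache + runtime overhead.
    For MoE models, pass TOTAL parameters, not just the active subset."""
    weights = params_billion * bytes_per_param
    return weights + kv_cache_gb + overhead_gb

# A 7B model at Q4_K_M with 4K context
print(round(total_vram_gb(7), 1))  # 5.4 GB
```

This reproduces the worked example above: 3.9 GB weights + 0.5 GB KV cache + 1.0 GB overhead ≈ 5.4 GB.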

Quick Reference: Popular Models

| Model | Params | Q4_K_M | Q8_0 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~2.5 GB | ~4 GB | ~6.5 GB | Any 4GB GPU |
| Phi-3.5 Mini | 3.8B | ~3 GB | ~5 GB | ~8 GB | Any 4GB GPU |
| Mistral 7B | 7B | ~5 GB | ~8.5 GB | ~14.5 GB | RTX 3060 8GB |
| Llama 3.1 8B | 8B | ~5.5 GB | ~9 GB | ~16 GB | RTX 3060 12GB |
| Qwen 2.5 14B | 14B | ~9.5 GB | ~16 GB | ~28 GB | RTX 4060 Ti 16GB |
| Qwen 2.5 32B | 32B | ~22 GB | ~36 GB | ~64 GB | RTX 5090 32GB |
| Llama 3.3 70B | 70B | ~42 GB | ~72 GB | ~140 GB | 2x RTX 4090 |
| Llama 4 Scout | 109B MoE | ~55 GB | ~109 GB | ~218 GB | 3x RTX 4090 |

VRAM estimates include ~1 GB overhead for KV cache at 4K context. Actual usage varies by inference engine and context length. See GPU comparison for hardware recommendations.

Frequently Asked Questions

How much VRAM do I need for a 7B parameter model?

A 7B parameter model needs approximately 5 GB VRAM at Q4_K_M quantization (the most common), 8.5 GB at Q8_0, or 14.5 GB at FP16 (full precision). Most consumer GPUs (RTX 3060 12GB, RTX 4060 8GB) handle Q4 7B models easily. On Apple Silicon, 7B models run well on any Mac with 8GB+ unified memory.

How is VRAM calculated for LLMs?

The formula is: VRAM = Parameters x Bytes_per_parameter + Context_overhead. At FP16, each parameter uses 2 bytes (7B x 2 = 14GB). At Q4 (4-bit), each uses ~0.56 bytes (7B x 0.56 = ~4GB). Add 1-2 GB for KV cache and overhead at standard context lengths (4K-8K tokens). Longer contexts (32K+) need significantly more KV cache memory. MoE models need VRAM for all parameters, not just active ones.

What is the best quantization level for local AI?

Q4_K_M is the sweet spot for most users — it reduces model size by ~75% with minimal quality loss (typically <2% on benchmarks vs FP16). Q5_K_M offers slightly better quality at ~40% more VRAM. Q8_0 is near-lossless but uses 2x the VRAM of Q4. Go lower (Q2_K, Q3_K) only if your hardware requires it — quality drops become noticeable. See our quantization guide for detailed comparisons.

Can I run models larger than my VRAM?

Yes, through CPU offloading. Ollama and llama.cpp automatically split models between GPU VRAM and system RAM. Layers that fit in VRAM run at GPU speed; the rest run on CPU. Performance degrades linearly: if 50% of layers are on CPU, expect roughly 50% slower generation. Apple Silicon Macs handle this gracefully since GPU and CPU share unified memory.

Does context length affect VRAM usage?

Yes. The KV (key-value) cache grows with context length. At 4K tokens, KV cache adds ~0.5-1 GB for most 7B models. At 32K tokens, it can add 2-4 GB. At 128K tokens, 8-16 GB extra. This is why a model that "fits" in VRAM at short context may not fit at full context. Our calculator accounts for this — adjust the context length slider to see the impact.


Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.

✓ 10+ Years in ML/AI · ✓ 77K Dataset Creator · ✓ Open Source Contributor