VRAM Calculator for AI Models
Find out how much GPU memory you need to run any LLM locally. Enter the model parameters and quantization level to get instant VRAM estimates for Ollama, llama.cpp, and vLLM.
Calculate Your VRAM Needs
How VRAM Is Calculated
GPU memory for LLMs has three main components:
Model Weights
The largest component. Formula: Parameters x Bytes_per_param. A 7B model at FP16 (2 bytes) = 14 GB. At Q4 (~0.56 bytes) = ~4 GB. Quantization is the primary lever for reducing this.
KV Cache
Grows with context length. Stores attention key/value pairs for each token. At 4K context: ~0.5-1 GB. At 32K: 2-4 GB. At 128K: 8-16 GB. Models with long context need more KV cache.
Runtime Overhead
CUDA kernels, activation memory, and framework buffers. Typically 0.5-1.5 GB depending on the inference engine. Ollama and llama.cpp are efficient; vLLM uses more for batched serving.
MoE models note: Mixture-of-Experts models like Qwen3-Coder (480B) or GPT-OSS require VRAM for all parameters, not just the active ones. A 480B MoE with 35B active still needs ~270 GB for the full model weights at Q4 (~0.56 bytes/param).
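The three components above can be combined into a rough estimator. This is a sketch, not an exact accounting: the bytes-per-parameter values are approximate GGUF averages, and the default KV-cache and overhead figures are this guide's ballpark numbers for a 4K context.

```python
# Rough VRAM estimate: weights + KV cache + runtime overhead.
# Bytes-per-parameter values are approximate GGUF averages (assumption).
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.06, "q4_k_m": 0.56}

def estimate_vram_gb(params_b: float, quant: str = "q4_k_m",
                     kv_cache_gb: float = 1.0, overhead_gb: float = 1.0) -> float:
    """Total VRAM in GB for a model with params_b billion parameters."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return weights_gb + kv_cache_gb + overhead_gb

# A 7B model at Q4_K_M with ~1 GB KV cache and ~1 GB overhead:
print(round(estimate_vram_gb(7, "q4_k_m"), 1))  # ≈ 5.9 GB
print(round(estimate_vram_gb(7, "fp16"), 1))    # ≈ 16.0 GB
```

For an MoE model, pass the total parameter count (e.g. 480 for Qwen3-Coder), not the active count — all experts must be resident in memory.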
Quick Reference: Popular Models
| Model | Params | Q4_K_M | Q8_0 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~2.5 GB | ~4 GB | ~6.5 GB | Any 4GB GPU |
| Phi-3.5 Mini | 3.8B | ~3 GB | ~5 GB | ~8 GB | Any 4GB GPU |
| Mistral 7B | 7B | ~5 GB | ~8.5 GB | ~14.5 GB | RTX 3060 8GB |
| Llama 3.1 8B | 8B | ~5.5 GB | ~9 GB | ~16 GB | RTX 3060 12GB |
| Qwen 2.5 14B | 14B | ~9.5 GB | ~16 GB | ~28 GB | RTX 4060 Ti 16GB |
| Qwen 2.5 32B | 32B | ~22 GB | ~36 GB | ~64 GB | RTX 5090 32GB |
| Llama 3.3 70B | 70B | ~42 GB | ~72 GB | ~140 GB | 2x RTX 4090 |
| Llama 4 Scout | 109B MoE | ~55 GB | ~109 GB | ~218 GB | 3x RTX 4090 |
VRAM estimates include ~1 GB overhead for KV cache at 4K context. Actual usage varies by inference engine and context length. See GPU comparison for hardware recommendations.
Frequently Asked Questions
How much VRAM do I need for a 7B parameter model?
A 7B parameter model needs approximately 5 GB VRAM at Q4_K_M quantization (the most common), 8.5 GB at Q8_0, or 14.5 GB at FP16 (full precision). Most consumer GPUs (RTX 3060 12GB, RTX 4060 8GB) handle Q4 7B models easily. On Apple Silicon, 7B models run well on any Mac with 8GB+ unified memory.
How is VRAM calculated for LLMs?
The formula is: VRAM = Parameters x Bytes_per_parameter + Context_overhead. At FP16, each parameter uses 2 bytes (7B x 2 = 14GB). At Q4 (4-bit), each uses ~0.56 bytes (7B x 0.56 = ~4GB). Add 1-2 GB for KV cache and overhead at standard context lengths (4K-8K tokens). Longer contexts (32K+) need significantly more KV cache memory. MoE models need VRAM for all parameters, not just active ones.
What is the best quantization level for local AI?
Q4_K_M is the sweet spot for most users — it reduces model size by ~70% vs FP16 with minimal quality loss (typically <2% on benchmarks). Q5_K_M offers slightly better quality at ~15-20% more VRAM. Q8_0 is near-lossless but uses roughly 1.8x the VRAM of Q4. Go lower (Q2_K, Q3_K) only if your hardware requires it — quality drops become noticeable. See our quantization guide for detailed comparisons.
Can I run models larger than my VRAM?
Yes, through CPU offloading. Ollama and llama.cpp automatically split models between GPU VRAM and system RAM. Layers that fit in VRAM run at GPU speed; the rest run on the CPU. The slowdown is worse than proportional: CPU layers are roughly an order of magnitude slower than GPU layers, so offloading even half the layers can cut generation speed several-fold, not merely in half. Apple Silicon Macs handle this gracefully since GPU and CPU share unified memory.
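A toy model makes the math concrete. The per-layer timings below are purely illustrative assumptions (not measurements), but they show why the CPU portion dominates total decode time:

```python
# Toy model of split GPU/CPU decoding. Assumes a fixed per-token cost
# per layer and that a CPU layer is ~10x slower than a GPU layer
# (illustrative assumptions, not benchmarks).
def tokens_per_sec(n_layers: int, n_gpu_layers: int,
                   gpu_ms_per_layer: float = 0.5,
                   cpu_slowdown: float = 10.0) -> float:
    n_cpu = n_layers - n_gpu_layers
    ms = n_gpu_layers * gpu_ms_per_layer + n_cpu * gpu_ms_per_layer * cpu_slowdown
    return 1000.0 / ms

full = tokens_per_sec(32, 32)  # all 32 layers on GPU
half = tokens_per_sec(32, 16)  # half offloaded to CPU
print(round(full, 1), round(half, 1), round(full / half, 1))
# 62.5 tok/s on GPU vs 11.4 tok/s split — a 5.5x slowdown, not 2x
```

Under these assumptions, putting 50% of layers on CPU makes generation about 5.5x slower, because the CPU half contributes 10x its GPU-resident cost.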
Does context length affect VRAM usage?
Yes. The KV (key-value) cache grows with context length. At 4K tokens, KV cache adds ~0.5-1 GB for most 7B models. At 32K tokens, it can add 2-4 GB. At 128K tokens, 8-16 GB extra. This is why a model that "fits" in VRAM at short context may not fit at full context. Our calculator accounts for this — adjust the context length slider to see the impact.
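The KV cache can be computed exactly from a model's architecture: 2 tensors (keys and values) x layers x KV heads x head dimension x bytes per element, per cached token. The sketch below plugs in Llama 3.1 8B's published config (32 layers, 8 KV heads via grouped-query attention, head_dim 128) at FP16:

```python
# KV-cache size per context length. Defaults are Llama 3.1 8B's config:
# 32 layers, 8 KV heads (GQA), head_dim 128, FP16 (2 bytes) cache.
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2 = one key tensor + one value tensor per layer
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token_bytes / 2**30  # binary GB

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(ctx):.2f} GB")
# 4K → 0.50 GB, 32K → 4.00 GB, 128K → 16.00 GB
```

These match the ranges quoted above; models without grouped-query attention (more KV heads) or with more layers sit at the high end of each range.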
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.
Related Guides
Continue your local AI journey with these comprehensive guides