VRAM Calculator for AI Models

Calculate exactly how much GPU memory you need to run any LLM locally. Enter the model parameters and quantization level to get instant VRAM estimates for Ollama, llama.cpp, and vLLM.

📅 Published: March 18, 2026 · 🔄 Last Updated: March 18, 2026 · ✓ Manually Reviewed

Calculate Your VRAM Needs

Example output from the calculator (for a ~7B model at Q4_K_M quantization with 4K context):

✅ Model Fits in Your GPU!

Total VRAM needed: 5.4 GB
Model weights: 3.9 GB
KV cache: 0.5 GB
Overhead: 1.0 GB

On an RTX 4090 (24 GB), this uses about 23% of available VRAM. Expected speed: full GPU speed.

How VRAM Is Calculated

GPU memory for LLMs has three main components:

Model Weights

The largest component. Formula: Parameters x Bytes_per_param. A 7B model at FP16 (2 bytes) = 14 GB. At Q4 (~0.56 bytes) = ~4 GB. Quantization is the primary lever for reducing this.
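As a sketch, the weights term can be computed directly. The bytes-per-parameter values below are approximations for common GGUF formats (Q8_0 stores roughly 8.5 bits per weight, Q4_K_M roughly 4.5):

```python
# Approximate bytes per parameter for common precision/quantization formats
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.06, "Q5_K_M": 0.71, "Q4_K_M": 0.56}

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate model-weight memory in GB: parameters x bytes per parameter."""
    return params_billion * bytes_per_param

print(weights_gb(7, BYTES_PER_PARAM["FP16"]))    # 14.0 GB
print(weights_gb(7, BYTES_PER_PARAM["Q4_K_M"]))  # ~3.9 GB
```

This is why quantization dominates the budget: dropping from FP16 to Q4_K_M cuts the weights term by roughly 3.5x.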

KV Cache

Grows with context length. Stores attention key/value pairs for each token. At 4K context: ~0.5-1 GB. At 32K: 2-4 GB. At 128K: 8-16 GB. Models with long context need more KV cache.
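The KV cache can be sized from the model's architecture. A sketch assuming an FP16 cache and a Llama-3-8B-style config (32 layers, 8 KV heads via grouped-query attention, head dimension 128); models without GQA use more KV heads and therefore more cache:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache in GiB: 2 (K and V) x layers x KV heads x head dim x bytes x tokens."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 2**30

# Llama-3-8B-style config, FP16 cache
print(kv_cache_gib(32, 8, 128, 4_096))    # 0.5 GiB at 4K context
print(kv_cache_gib(32, 8, 128, 131_072))  # 16.0 GiB at 128K context
```

Note the cache grows linearly with context length, which is why the 128K figure is 32x the 4K figure.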

Runtime Overhead

CUDA kernels, activation memory, and framework buffers. Typically 0.5-1.5 GB depending on the inference engine. Ollama and llama.cpp are efficient; vLLM uses more for batched serving.

MoE models note: Mixture-of-Experts models like Qwen3-Coder (480B) or GPT-OSS require VRAM for all parameters, not just active ones. A 480B MoE with 35B active still needs ~250 GB for the full model weights at Q4.
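Putting the three components together, a minimal estimator sketch (the 0.56 bytes/parameter for Q4_K_M, the 0.5 GB KV cache at 4K context, and the 1.0 GB overhead are the approximations used above, not exact values):

```python
def total_vram_gb(params_billion: float, bytes_per_param: float = 0.56,
                  kv_cache_gb: float = 0.5, overhead_gb: float = 1.0) -> float:
    """Estimate total VRAM: weights + KV cache + runtime overhead.
    For MoE models, pass TOTAL parameters, not just the active subset."""
    weights = params_billion * bytes_per_param
    return weights + kv_cache_gb + overhead_gb

# A 7B model at Q4_K_M with 4K context
print(round(total_vram_gb(7), 1))  # 5.4 GB
```

This reproduces the worked example above: 3.9 GB weights + 0.5 GB KV cache + 1.0 GB overhead ≈ 5.4 GB.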

Quick Reference: Popular Models

| Model | Params | Q4_K_M | Q8_0 | FP16 | Min GPU |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~2.5 GB | ~4 GB | ~6.5 GB | Any 4GB GPU |
| Phi-3.5 Mini | 3.8B | ~3 GB | ~5 GB | ~8 GB | Any 4GB GPU |
| Mistral 7B | 7B | ~5 GB | ~8.5 GB | ~14.5 GB | RTX 3060 8GB |
| Llama 3.1 8B | 8B | ~5.5 GB | ~9 GB | ~16 GB | RTX 3060 12GB |
| Qwen 2.5 14B | 14B | ~9.5 GB | ~16 GB | ~28 GB | RTX 4060 Ti 16GB |
| Qwen 2.5 32B | 32B | ~22 GB | ~36 GB | ~64 GB | RTX 5090 32GB |
| Llama 3.3 70B | 70B | ~42 GB | ~72 GB | ~140 GB | 2x RTX 4090 |
| Llama 4 Scout | 109B MoE | ~55 GB | ~109 GB | ~218 GB | 3x RTX 4090 |

VRAM estimates include ~1 GB overhead for KV cache at 4K context. Actual usage varies by inference engine and context length. See GPU comparison for hardware recommendations.

Frequently Asked Questions

How much VRAM do I need for a 7B parameter model?

A 7B parameter model needs approximately 5 GB VRAM at Q4_K_M quantization (the most common), 8.5 GB at Q8_0, or 14.5 GB at FP16 (full precision). Most consumer GPUs (RTX 3060 12GB, RTX 4060 8GB) handle Q4 7B models easily. On Apple Silicon, 7B models run well on any Mac with 8GB+ unified memory.

How is VRAM calculated for LLMs?

The formula is: VRAM = Parameters x Bytes_per_parameter + Context_overhead. At FP16, each parameter uses 2 bytes (7B x 2 = 14GB). At Q4 (4-bit), each uses ~0.56 bytes (7B x 0.56 = ~4GB). Add 1-2 GB for KV cache and overhead at standard context lengths (4K-8K tokens). Longer contexts (32K+) need significantly more KV cache memory. MoE models need VRAM for all parameters, not just active ones.

What is the best quantization level for local AI?

Q4_K_M is the sweet spot for most users — it reduces model size by ~75% with minimal quality loss (typically <2% on benchmarks vs FP16). Q5_K_M offers slightly better quality at ~40% more VRAM. Q8_0 is near-lossless but uses 2x the VRAM of Q4. Go lower (Q2_K, Q3_K) only if your hardware requires it — quality drops become noticeable. See our quantization guide for detailed comparisons.

Can I run models larger than my VRAM?

Yes, through CPU offloading. Ollama and llama.cpp automatically split models between GPU VRAM and system RAM. Layers that fit in VRAM run at GPU speed; the rest run on CPU. Performance degrades linearly: if 50% of layers are on CPU, expect roughly 50% slower generation. Apple Silicon Macs handle this gracefully since GPU and CPU share unified memory.

Does context length affect VRAM usage?

Yes. The KV (key-value) cache grows with context length. At 4K tokens, KV cache adds ~0.5-1 GB for most 7B models. At 32K tokens, it can add 2-4 GB. At 128K tokens, 8-16 GB extra. This is why a model that "fits" in VRAM at short context may not fit at full context. Our calculator accounts for this — adjust the context length slider to see the impact.


Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.

✓ 10+ Years in ML/AI · ✓ 77K Dataset Creator · ✓ Open Source Contributor