AI VRAM Requirements 2026: GPU Sizes for 7B, 13B, 70B Models
How much VRAM do you need for AI models? With Q4 quantization, a 7B model needs 4-6GB, a 13B model needs 8-10GB, a 32B model needs ~20GB, and a 70B model needs 40GB+. As a rule of thumb: VRAM (GB) ≈ parameters (in billions) × bytes-per-param (about 0.55 for Q4_K_M, 1 for Q8_0, 2 for FP16) × 1.2 for overhead. An 8GB GPU comfortably runs 7B-8B models at Q4; 16GB handles roughly 14B-20B; 24GB handles the 32B class; 70B at Q4 needs about 42GB, which means a 48GB card, two 24GB cards, or CPU offload.
VRAM Quick Reference
VRAM Requirements by Model Size
Quick Reference Table
| Model Size | FP16 | Q8_0 | Q5_K_M | Q4_K_M |
|---|---|---|---|---|
| 7B | 14GB | 8GB | 6GB | 5GB |
| 8B | 16GB | 9GB | 7GB | 6GB |
| 13B | 26GB | 14GB | 10GB | 9GB |
| 14B | 28GB | 15GB | 11GB | 10GB |
| 32B | 64GB | 34GB | 24GB | 20GB |
| 34B | 68GB | 36GB | 26GB | 22GB |
| 70B | 140GB | 75GB | 52GB | 42GB |
| 72B | 144GB | 78GB | 54GB | 44GB |
VRAM Formula
VRAM (GB) = Parameters (B) × Bytes_per_param × 1.2
Bytes per param:
- FP16/BF16: 2 bytes
- Q8_0: 1 byte
- Q5_K_M: 0.7 bytes
- Q4_K_M: 0.55 bytes
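The formula is easy to turn into a small estimator. The sketch below is illustrative only: the function and constant names are made up for this example, and the 1.2 overhead factor is the article's rule of thumb rather than a measured value. The quick reference table appears to round with a smaller allowance in places, so expect the formula to land a little higher on the FP16 and 70B entries.

```python
# Bytes per parameter, taken from the list above.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.7,
    "Q4_K_M": 0.55,
}

# Rule-of-thumb allowance (~20%) for runtime buffers and a short context.
OVERHEAD = 1.2


def estimate_vram_gb(params_billion: float, quant: str, overhead: float = OVERHEAD) -> float:
    """Return params (billions) x bytes-per-param x overhead, in GB."""
    return params_billion * BYTES_PER_PARAM[quant] * overhead


if __name__ == "__main__":
    for size in (7, 13, 32, 70):
        row = ", ".join(
            f"{quant}: {estimate_vram_gb(size, quant):.1f}GB" for quant in BYTES_PER_PARAM
        )
        print(f"{size}B -> {row}")
```

Treat the result as a lower bound for planning: long contexts add KV cache on top of it.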
Quantization Impact
What You Lose at Each Level
| Quantization | VRAM Savings | Quality Loss |
|---|---|---|
| FP16 (baseline) | 0% | 0% |
| Q8_0 | ~47% | ~1% |
| Q5_K_M | ~65% | ~2-3% |
| Q4_K_M | ~72% | ~3-5% |
| Q3_K_M | ~78% | ~5-10% |
| Q2_K | ~82% | ~10-20% |
Recommendation: Q4_K_M is the sweet spot—significant savings with minimal quality loss.
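The savings column follows almost directly from the bytes-per-param figures in the formula section. A minimal sketch, assuming those figures (the function name is hypothetical): it reproduces the Q5_K_M and Q4_K_M rows closely, while Q8_0 comes out at 50% rather than ~47% because real Q8_0 files also store a small per-block scale.

```python
FP16_BYTES = 2.0  # baseline bytes per parameter


def vram_savings_pct(bytes_per_param: float) -> float:
    """Percentage of weight memory saved relative to FP16."""
    return (1 - bytes_per_param / FP16_BYTES) * 100


if __name__ == "__main__":
    # Bytes-per-param values from the article's formula section.
    for name, bpp in {"Q8_0": 1.0, "Q5_K_M": 0.7, "Q4_K_M": 0.55}.items():
        print(f"{name}: ~{vram_savings_pct(bpp):.0f}% smaller than FP16")
```

The quality-loss column cannot be derived this way; it depends on the model and the benchmark.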
Context Window VRAM
Context length adds to base VRAM requirements:
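| Context | Additional VRAM (70B, FP16 KV cache) |
|---|---|
| 4K | +~1.3GB |
| 8K | +~2.7GB |
| 16K | +~5.4GB |
| 32K | +~10.7GB |
Formula: KV cache bytes ≈ 2 × layers × kv_heads × head_dim × tokens × 2 bytes. The figures assume Llama 3.1 70B's architecture (80 layers, 8 KV heads, head dimension 128); the cost grows linearly with context length.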
GPU Recommendations by Use Case
Casual Use / Learning
RTX 4060 8GB ($299)
- Runs: 7B models comfortably
- Use: Learning, simple chat
Hobbyist
RTX 4070 Ti Super 16GB ($799)
- Runs: 7B-14B models comfortably, gpt-oss 20B; 32B only with CPU offload
- Use: Daily AI assistant, coding help
Power User
RTX 4090 24GB ($1,599)
- Runs: 32B-class models at Q4; 70B Q4 (~42GB) only with partial CPU offload
- Use: Serious local AI, development
Professional
RTX 5090 32GB (~$3,600 street, mid-2026)
- Runs: 32B at Q5 with room for larger contexts; 70B Q4 with partial CPU offload
- Use: Production, enterprise
- Note: GDDR7 shortages have kept the 5090 well above its $1,999 MSRP through mid-2026.
Sweet Spot for Mid-Range (new in 2026)
RTX 5070 Ti SUPER / 5080 SUPER 24GB (~$1,000-1,300)
- Runs: 32B at Q4/Q5, gpt-oss 20B, 70B with light CPU offload
- Use: The best new value tier — 24GB at a mainstream price, between the 16GB 5070 Ti/5080 and the 32GB 5090.
Enterprise
Dual RTX 4090 48GB (~$3,200)
- Runs: 70B Q4 (~42GB) fully on GPU; 70B Q8 (~75GB) and 120B+ models need CPU offload or more VRAM
- Use: Large models, training
Not sure which card matches your budget and target model size? Our interactive which-GPU-to-buy tool maps your VRAM needs to a specific recommendation, and the full best-GPUs-for-AI guide breaks down bandwidth, price, and power for every current card.
What Fits on Your GPU?
8GB VRAM (RTX 4060, 4060 Ti 8GB)
| Model | Quantization | Fits? |
|---|---|---|
| Llama 3.1 8B | Q4_K_M | Yes ✓ |
| Mistral 7B | Q4_K_M | Yes ✓ |
| Phi-3 14B | Q4_K_M | No ✗ (~10GB, needs CPU offload) |
| DeepSeek Coder 7B | Q4_K_M | Yes ✓ |
16GB VRAM (RTX 4070 Ti Super, 4080)
| Model | Quantization | Fits? |
|---|---|---|
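| Llama 3.1 8B | Q8_0 | Yes ✓ |
| Phi-3 14B | Q4_K_M | Yes ✓ |
| gpt-oss 20B | Q4_K_M | Tight |
| DeepSeek 32B | Q4_K_M | No ✗ (~20GB, needs CPU offload) |
| Mixtral 8x7B | Q4_K_M | No ✗ (~47B total parameters) |
| Llama 3.1 70B | Q4_K_M | No ✗ |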
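24GB VRAM (RTX 3090, 4090)
| Model | Quantization | Fits? |
|---|---|---|
| Gemma 3 27B | Q4_K_M | Yes ✓ |
| Qwen3 32B | Q4_K_M | Yes ✓ |
| Llama 3.1 70B | Q4_K_M | No ✗ (~42GB, needs CPU offload) |
| Qwen 72B | Q4_K_M | No ✗ (~44GB) |
| Llama 4 Maverick | Q4_K_M | No ✗ (MoE, ~400B total parameters) |
| DeepSeek V3 | Q4_K_M | No ✗ (~380GB+) |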
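The Yes / Tight / No verdicts can be approximated by comparing a card's VRAM against the Q4_K_M column of the quick reference table. This is a rough sketch, not a definitive rule: the function name and the 1.5GB headroom threshold are assumptions chosen for illustration, and real headroom depends on context length and on whatever else is using the GPU.

```python
# Q4_K_M footprints in GB, copied from the quick reference table.
Q4_K_M_GB = {7: 5, 8: 6, 13: 9, 14: 10, 32: 20, 34: 22, 70: 42, 72: 44}

# Hypothetical margin for KV cache, display output, and driver buffers.
HEADROOM_GB = 1.5


def fit_verdict(params_billion: int, vram_gb: float, headroom_gb: float = HEADROOM_GB) -> str:
    """Classify a model size against a card: 'yes', 'tight', or 'no'."""
    need = Q4_K_M_GB[params_billion]
    if need + headroom_gb <= vram_gb:
        return "yes"    # fits with room for context
    if need <= vram_gb:
        return "tight"  # weights fit, little left for context
    return "no"         # needs CPU offload or a bigger card


if __name__ == "__main__":
    for size, card in [(8, 8), (14, 8), (14, 16), (32, 24), (70, 24)]:
        print(f"{size}B on {card}GB: {fit_verdict(size, card)}")
```

A "no" here still allows hybrid CPU/GPU inference; it only means the model cannot be fully GPU-resident.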
How Much VRAM for 2026's Most Popular Models?
The generic 7B/13B/70B buckets are a great starting point, but the models people actually download in 2026 don't fall neatly on those round numbers — and several are Mixture-of-Experts (MoE), which changes the math. Here are real, current models with approximate VRAM at the quant most people run (Q4_K_M unless noted). Figures include a typical 4K-8K context; budget more for long context (see the KV cache section below).
| Model (2026) | Type | Params | Approx VRAM (Q4_K_M) | Fits on |
|---|---|---|---|---|
| Llama 3.1 8B | Dense | 8B | ~5-6GB | 8GB card |
| Gemma 3 12B | Dense | 12B | ~8-9GB | 12GB card |
| gpt-oss 20B | MoE | 20.9B (3.6B active) | ~14-16GB | 16GB card |
| Gemma 3 27B | Dense | 27B | ~17-18GB | 24GB card |
| Qwen3 32B | Dense | 32B | ~20-22GB | 24GB card |
| DeepSeek R1 Distill 32B | Dense | 32B | ~20-22GB | 24GB card |
| Llama 3.3 70B | Dense | 70B | ~42-43GB | 48GB / 2×24GB |
| Qwen2.5 72B | Dense | 72B | ~44GB | 48GB / 2×24GB |
| Qwen3 235B | MoE | 235B (22B active) | ~130-140GB | Multi-GPU / 192GB Mac |
| gpt-oss 120B | MoE | 116.8B (5.1B active) | ~60-65GB clean | 1× 80GB (H100) or 96GB+ |
| DeepSeek V3 / R1 | MoE | 671B (37B active) | ~380GB+ at Q4 | Server-class only |
A few things worth calling out, because they trip people up:
- MoE memory is the total, not the active count. gpt-oss 120B only activates ~5B parameters per token, so it's fast like a small model — but you still have to hold all ~117B weights in memory. The clean single-GPU answer is ~60-65GB (so an 80GB H100, or a 96GB+ workstation card); a 24GB consumer card cannot host it GPU-resident.
- gpt-oss 20B is the new "16GB sweet spot." It runs comfortably on an RTX 4080/4090 or a 24GB Mac and is the most capable open model that fits a single mainstream card in 2026.
- The 32B class (Qwen3 32B, Gemma 3 27B, DeepSeek R1 distills) is the real home for 24GB cards — better quality than 8B, and you keep headroom for context. If your main use is coding, our guide to picking the right model size for coding (7B vs 14B vs 32B vs 70B) walks through where the quality jumps actually matter.
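To make the MoE point concrete, the sketch below separates the memory that must be resident (all experts) from the slice touched per token. It is an approximation under stated assumptions: the class and method names are invented for this example, the 0.55 bytes-per-param figure is the article's Q4_K_M value, and the result covers weights only, before runtime overhead and KV cache.

```python
from dataclasses import dataclass

Q4_BYTES_PER_PARAM = 0.55  # Q4_K_M figure from the formula section


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # every expert; all of it must sit in memory
    active_params_b: float  # parameters used per token; drives speed, not memory

    def weights_gb(self, bytes_per_param: float = Q4_BYTES_PER_PARAM) -> float:
        """Memory needed to hold the full set of weights."""
        return self.total_params_b * bytes_per_param

    def active_gb(self, bytes_per_param: float = Q4_BYTES_PER_PARAM) -> float:
        """Weights actually read for one token (a rough proxy for speed)."""
        return self.active_params_b * bytes_per_param


if __name__ == "__main__":
    # Parameter counts taken from the table above.
    for model in (MoEModel("gpt-oss 20B", 20.9, 3.6), MoEModel("gpt-oss 120B", 116.8, 5.1)):
        print(
            f"{model.name}: hold ~{model.weights_gb():.0f}GB, "
            f"read ~{model.active_gb():.1f}GB per token"
        )
```

For gpt-oss 120B this gives roughly 64GB of resident weights against under 3GB read per token, which is why it is fast like a small model yet still needs an 80GB-class card.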
How Does Context Length (KV Cache) Eat Your VRAM?
The model weights are only half the story. Every token you keep in context lives in the KV cache, and on long contexts the cache can rival or exceed the model itself. A rough, current rule: for a 7B-8B model with GQA, each 1,000 tokens of context adds about 0.1GB to VRAM in FP16. Scale that up by model size and context, and a Llama-3-8B at 32K context burns roughly 4GB on KV cache alone, nearly as much as the 4-5GB Q4 weights.
Approximate KV-cache formula (FP16):
KV cache bytes ≈ 2 × layers × kv_heads × head_dim × tokens × 2 bytes
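As a worked instance of this formula, the sketch below plugs in Llama 3 8B's published configuration (32 layers, 8 KV heads, head dimension 128). Those architecture numbers are not stated in this article and are brought in as an assumption, and the function name is illustrative.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    tokens: int,
    bytes_per_elem: int = 2,  # FP16
) -> int:
    """KV cache size: keys and values (the leading 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


if __name__ == "__main__":
    GIB = 1024**3
    # Assumed Llama 3 8B config: 32 layers, 8 KV heads, head_dim 128.
    per_token = kv_cache_bytes(32, 8, 128, tokens=1)
    print(f"per token: {per_token / 1024:.0f} KiB")
    for ctx in (4096, 8192, 32768):
        print(f"{ctx} tokens: {kv_cache_bytes(32, 8, 128, ctx) / GIB:.2f} GiB")
```

At 32,768 tokens the result is exactly 4 GiB, in line with the "roughly 4GB" figure quoted above.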
Two modern features dramatically reduce this:
- GQA (Grouped-Query Attention), used by almost every 2026 model (Llama 3.x, Qwen3, Gemma 3), cuts KV cache 50-75% or more by sharing keys/values across query heads, with little to no quality loss. It's baked into the model, so you get it for free, and the kv_heads term in the formula already accounts for it.
- KV cache quantization (8-bit or 4-bit), supported in llama.cpp, vLLM, and others, cuts cache memory by roughly another 50-75%. This is how people fit 32K+ context into an 8GB card. In llama.cpp, enable it with the cache-type flags (e.g. --cache-type-k q8_0 --cache-type-v q8_0); in Ollama, set OLLAMA_KV_CACHE_TYPE=q8_0 with flash attention enabled (OLLAMA_FLASH_ATTENTION=1).
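The two reductions above stack, which a short calculation makes visible. This is a sketch under assumptions: the 8B-class shape (32 layers, 32 query heads, 8 KV heads, head dimension 128) is taken from Llama 3 8B's published config, the "no GQA" row is a hypothetical variant for comparison, and the 8.5 and 4.5 bits per element reflect how llama.cpp's q8_0 and q4_0 block formats store a scale alongside each block of 32 values.

```python
GIB = 1024**3


def kv_cache_gib(
    layers: int, kv_heads: int, head_dim: int, tokens: int, bits_per_elem: float = 16.0
) -> float:
    """KV cache in GiB; the leading 2 covers keys and values."""
    elems = 2 * layers * kv_heads * head_dim * tokens
    return elems * bits_per_elem / 8 / GIB


if __name__ == "__main__":
    ctx = 32768
    scenarios = {
        "no GQA (32 KV heads), FP16": kv_cache_gib(32, 32, 128, ctx),
        "GQA (8 KV heads), FP16": kv_cache_gib(32, 8, 128, ctx),
        "GQA + q8_0 cache": kv_cache_gib(32, 8, 128, ctx, bits_per_elem=8.5),
        "GQA + q4_0 cache": kv_cache_gib(32, 8, 128, ctx, bits_per_elem=4.5),
    }
    for label, size in scenarios.items():
        print(f"{label}: {size:.2f} GiB")
```

Under these assumptions the cache drops from 16 GiB to 4 GiB with GQA, then to about 2.1 GiB with an 8-bit cache and about 1.1 GiB with a 4-bit one.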
If you want exact numbers for a specific model + context + quant combination, plug it into our VRAM calculator tool rather than estimating — it accounts for KV cache, GQA, and overhead so you don't guess wrong on a $1,000+ GPU.
VRAM vs System RAM: When Does RAM Step In?
VRAM is where you want the whole model to live — it's roughly 10-20× faster than system RAM for inference. But when a model doesn't fit, Ollama and llama.cpp spill the overflow layers to system RAM (CPU offload), which works but is much slower for the offloaded portion. That makes system RAM your safety net, especially for MoE giants and partial-offload setups. If you're sizing a build, read our companion RAM requirements for local AI guide alongside this one — the two budgets are different, and people routinely over-buy VRAM while starving the system RAM that long contexts and CPU-offload actually need.
Optimizing VRAM Usage
1. Choose Right Quantization
# Q4 for 70B (~42GB: fits in 48GB, or runs on 24GB with CPU offload)
ollama run llama3.1:70b-instruct-q4_K_M
# Q5 if you have headroom (~52GB)
ollama run llama3.1:70b-instruct-q5_K_M
2. Reduce Context
# Default context: start the model as usual
ollama run model
# Reduced context saves VRAM: set it inside the interactive session
>>> /set parameter num_ctx 4096
3. Unload Unused Models
# Keep only active model loaded
ollama stop model_name
4. GPU Layers for Hybrid
# Partial GPU, rest on CPU: offload 30 layers (set inside the interactive session)
ollama run model
>>> /set parameter num_gpu 30
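Picking the layer count for a hybrid setup is mostly division. The following is a minimal sketch, assuming all layers are the same size and reserving a fixed amount of VRAM for KV cache and buffers; both simplifications, the 2GB reserve, and the function name are assumptions for illustration. The 80-layer count for a 70B model comes from Llama 3.1 70B's published config.

```python
import math


def gpu_layers_that_fit(
    model_gb: float, n_layers: int, vram_gb: float, reserve_gb: float = 2.0
) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # 70B at Q4_K_M (~42GB from the quick reference table) on a 24GB card.
    layers = gpu_layers_that_fit(model_gb=42, n_layers=80, vram_gb=24)
    print(f"offload about {layers} of 80 layers to the GPU")
```

Start a few layers below the estimate and raise it while watching actual VRAM use, since longer contexts eat into the reserve.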
Multi-GPU Setups
Combining VRAM
| Setup | Total VRAM | Usable |
|---|---|---|
| 2× RTX 4090 | 48GB | ~44GB |
| RTX 4090 + 3090 | 48GB | ~42GB |
| 2× RTX 5090 | 64GB | ~58GB |
Configuration
# Automatic multi-GPU in llama.cpp
./llama-cli -m model.gguf -ngl 99 # Offloads all layers, split across visible GPUs
# Ollama multi-GPU
CUDA_VISIBLE_DEVICES=0,1 ollama serve
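With mismatched cards, an even split can overfill the smaller GPU. llama.cpp accepts a --tensor-split list of proportions, and one simple way to choose them is in proportion to each card's free memory. The sketch below only does that arithmetic; the free-memory numbers and the function name are hypothetical, and it does not query the GPUs.

```python
def tensor_split(free_gb: list[float]) -> list[float]:
    """Proportions of the model to place on each GPU, by free memory."""
    total = sum(free_gb)
    if total <= 0:
        raise ValueError("no free GPU memory reported")
    return [round(gb / total, 2) for gb in free_gb]


if __name__ == "__main__":
    # Hypothetical free memory on a 32GB card and a 24GB card.
    ratios = tensor_split([29.0, 22.0])
    flag = ",".join(f"{r:.2f}" for r in ratios)
    print(f"--tensor-split {flag}")
```

Read the real free-memory values from nvidia-smi before computing the split, since desktop use and other processes change them.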
Key Takeaways
- Q4_K_M is the sweet spot for most users
- 24GB handles models up to the 32B class at Q4; 70B at Q4 needs ~42GB (48GB, 2×24GB, or CPU offload)
- Context length adds significant VRAM
- Multi-GPU helps but with overhead
- Budget more VRAM than minimum for headroom
Next Steps
- Browse the best Ollama models — VRAM requirements for every model
- AWQ vs GPTQ vs GGUF — quantization formats that determine VRAM usage
- Choose your GPU based on VRAM needs
- Find models for 8GB RAM — budget hardware recommendations
- Set up Open WebUI once your hardware is ready
VRAM is the key constraint for local AI. Understanding these requirements helps you choose the right hardware and optimize your setup.