AI VRAM Requirements 2026: GPU Sizes for 7B, 13B, 70B Models

February 4, 2026
18 min read
Local AI Master Research Team

How much VRAM do you need for AI models? With Q4 quantization, a 7B model needs 4-6GB, a 13B model needs 8-10GB, a 32B model needs ~20GB, and a 70B model needs 40GB+. As a rule of thumb: VRAM (GB) ≈ parameters (in billions) × bytes-per-param (about 0.5 for Q4, 1 for Q8, 2 for FP16) × 1.2 for overhead. An 8GB GPU comfortably runs 7B-8B models; 16GB handles up to ~14B; 24GB handles the 32B class. A 70B model at Q4 needs roughly 42GB, which means two 24GB cards, a 48GB card, or CPU offload.

VRAM Quick Reference

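| VRAM | Models (Q4 unless noted) | Example GPUs |
|---|---|---|
| 8GB | 7B-8B models | RTX 4060 / 4060 Ti 8GB |
| 16GB | Up to ~14B models | RTX 4070 Ti Super |
| 24GB | 32B-class models; 70B only with CPU offload | RTX 4090 |
| 48GB+ | 70B Q4 at 48GB; 70B Q8 and 120B-class need ~64-80GB | Dual GPUs/Pro |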

VRAM Requirements by Model Size

Quick Reference Table

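| Model Size | FP16 | Q8_0 | Q5_K_M | Q4_K_M |
|---|---|---|---|---|
| 7B | 14GB | 8GB | 6GB | 5GB |
| 8B | 16GB | 9GB | 7GB | 6GB |
| 13B | 26GB | 14GB | 10GB | 9GB |
| 14B | 28GB | 15GB | 11GB | 10GB |
| 32B | 64GB | 34GB | 24GB | 20GB |
| 34B | 68GB | 36GB | 26GB | 22GB |
| 70B | 140GB | 75GB | 52GB | 42GB |
| 72B | 144GB | 78GB | 54GB | 44GB |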

VRAM Formula

VRAM (GB) = Parameters (B) × Bytes_per_param × 1.2

Bytes per param:
- FP16/BF16: 2 bytes
- Q8_0: 1 byte
- Q5_K_M: 0.7 bytes
- Q4_K_M: 0.55 bytes
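
As a minimal sketch, the formula above can be written as a few lines of Python. The function name `estimate_vram_gb` and the constant names are illustrative assumptions, not part of any tool; the bytes-per-param values and the 1.2 overhead factor come straight from this section.

```python
# Bytes per parameter, taken from the list above.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.7,
    "Q4_K_M": 0.55,
}

OVERHEAD = 1.2  # the article's 20% allowance for runtime overhead


def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Rule-of-thumb VRAM estimate in GB: params x bytes/param x 1.2."""
    return params_billions * BYTES_PER_PARAM[quant] * OVERHEAD


if __name__ == "__main__":
    for size in (7, 13, 32, 70):
        row = ", ".join(
            f"{q}: {estimate_vram_gb(size, q):.1f}GB" for q in BYTES_PER_PARAM
        )
        print(f"{size}B -> {row}")
```

The results land within a few GB of the quick reference table, which appears to round slightly differently (for example, a 70B model at Q4_K_M comes out at 46.2GB here versus 42GB in the table). Treat both as estimates.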

Quantization Impact

What You Lose at Each Level

| Quantization | VRAM Savings | Quality Loss |
|---|---|---|
| FP16 (baseline) | 0% | 0% |
| Q8_0 | ~47% | ~1% |
| Q5_K_M | ~65% | ~2-3% |
| Q4_K_M | ~72% | ~3-5% |
| Q3_K_M | ~78% | ~5-10% |
| Q2_K | ~82% | ~10-20% |

Recommendation: Q4_K_M is the sweet spot—significant savings with minimal quality loss.
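
The savings column follows from the bytes-per-param values used in this article's VRAM formula. The sketch below shows the arithmetic; the helper name `savings_vs_fp16` is hypothetical and the byte values are the article's approximations.

```python
FP16_BYTES = 2.0

# Bytes per parameter from the VRAM formula section of this article.
BYTES_PER_PARAM = {"Q8_0": 1.0, "Q5_K_M": 0.7, "Q4_K_M": 0.55}


def savings_vs_fp16(bytes_per_param: float) -> float:
    """Fraction of weight memory saved relative to FP16."""
    return 1.0 - bytes_per_param / FP16_BYTES


for name, bpp in BYTES_PER_PARAM.items():
    print(f"{name}: {savings_vs_fp16(bpp):.0%} smaller than FP16")
```

With exactly 1 byte per parameter, Q8_0 works out to 50% rather than the table's ~47%. The gap is likely because Q8_0 also stores per-block scale factors, which puts it closer to 8.5 bits per weight.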

Context Window VRAM

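Context length adds to base VRAM requirements. The KV cache grows linearly with context, so doubling the context doubles the extra memory. Approximate figures for a 70B model with an FP16 cache, assuming the Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128):

| Context | Additional VRAM (70B) |
|---|---|
| 4K | +1.3GB |
| 8K | +2.7GB |
| 16K | +5.4GB |
| 32K | +10.7GB |

Formula: ~(2 × layers × kv_heads × head_dim × context × 2 bytes) / 1e9 GB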

GPU Recommendations by Use Case

Casual Use / Learning

RTX 4060 8GB ($299)

  • Runs: 7B models comfortably
  • Use: Learning, simple chat

Hobbyist

RTX 4070 Ti Super 16GB ($799)

  • Runs: 14B models fully on GPU; 32B models and Mixtral 8x7B only with partial CPU offload
  • Use: Daily AI assistant, coding help

Power User

RTX 4090 24GB ($1,599)

  • Runs: 32B-class models at Q4/Q5; 70B Q4 only with CPU offload
  • Use: Serious local AI, development

Professional

RTX 5090 32GB (~$3,600 street, mid-2026)

  • Runs: 32B at Q5 with long contexts; 70B only with CPU offload or a second GPU
  • Use: Production, enterprise
  • Note: GDDR7 shortages have kept the 5090 well above its $1,999 MSRP through mid-2026.

Sweet Spot for Mid-Range (new in 2026)

RTX 5070 Ti SUPER / 5080 SUPER 24GB (~$1,000-1,300)

  • Runs: 32B at Q4/Q5, gpt-oss 20B, 70B with light CPU offload
  • Use: The best new value tier — 24GB at a mainstream price, between the 16GB 5070 Ti/5080 and the 32GB 5090.

Enterprise

Dual RTX 4090 48GB (~$3,200)

  • Runs: 70B Q4 (~42GB) with room for context; 70B Q8 and 120B-class models need more
  • Use: Large models, training

Not sure which card matches your budget and target model size? Our interactive which-GPU-to-buy tool maps your VRAM needs to a specific recommendation, and the full best-GPUs-for-AI guide breaks down bandwidth, price, and power for every current card.

What Fits on Your GPU?

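8GB VRAM (RTX 4060, RTX 4060 Ti 8GB)

| Model | Quantization | Fits? |
|---|---|---|
| Llama 3.1 8B | Q4_K_M | Yes ✓ |
| Mistral 7B | Q4_K_M | Yes ✓ |
| Phi-3 14B | Q4_K_M | No ✗ (~10GB; needs partial CPU offload) |
| DeepSeek Coder 7B | Q4_K_M | Yes ✓ |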

16GB VRAM (RTX 4070 Ti Super, 4080)

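| Model | Quantization | Fits? |
|---|---|---|
| Llama 3.1 70B | Q4_K_M | No ✗ |
| Llama 3.1 8B | Q8_0 | Yes ✓ |
| Mixtral 8x7B | Q4_K_M | No ✗ (~47B total parameters, ~26GB) |
| DeepSeek 32B | Q4_K_M | No ✗ (~20GB; needs partial CPU offload) |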

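24GB VRAM (RTX 3090, 4090)

| Model | Quantization | Fits? |
|---|---|---|
| Gemma 3 27B | Q4_K_M | Yes ✓ |
| Qwen3 32B | Q4_K_M | Yes ✓ |
| Llama 3.1 70B | Q4_K_M | No ✗ (~42GB; runs with CPU offload) |
| Qwen 72B | Q4_K_M | No ✗ (~44GB; runs with CPU offload) |
| Llama 4 Maverick | Q4_K_M | No ✗ (~400B total parameters) |
| DeepSeek V3 | Q4_K_M | No ✗ (671B total parameters, ~380GB+) |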

The generic 7B/13B/70B buckets are a great starting point, but the models people actually download in 2026 don't fall neatly on those round numbers — and several are Mixture-of-Experts (MoE), which changes the math. Here are real, current models with approximate VRAM at the quant most people run (Q4_K_M unless noted). Figures include a typical 4K-8K context; budget more for long context (see the KV cache section below).

| Model (2026) | Type | Params | Approx VRAM (Q4_K_M) | Fits on |
|---|---|---|---|---|
| Llama 3.1 8B | Dense | 8B | ~5-6GB | 8GB card |
| Gemma 3 12B | Dense | 12B | ~8-9GB | 12GB card |
| gpt-oss 20B | MoE | 20.9B (3.6B active) | ~14-16GB | 16GB card |
| Gemma 3 27B | Dense | 27B | ~17-18GB | 24GB card |
| Qwen3 32B | Dense | 32B | ~20-22GB | 24GB card |
| DeepSeek R1 Distill 32B | Dense | 32B | ~20-22GB | 24GB card |
| Llama 3.3 70B | Dense | 70B | ~42-43GB | 48GB / 2×24GB |
| Qwen2.5 72B | Dense | 72B | ~44GB | 48GB / 2×24GB |
| Qwen3 235B | MoE | 235B (22B active) | ~130-140GB | Multi-GPU / 192GB Mac |
| gpt-oss 120B | MoE | 116.8B (5.1B active) | ~60-65GB | 1× 80GB (H100) or 96GB+ |
| DeepSeek V3 / R1 | MoE | 671B (37B active) | ~380GB+ | Server-class only |

A few things worth calling out, because they trip people up:

  • MoE memory is the total, not the active count. gpt-oss 120B only activates ~5B parameters per token, so it's fast like a small model — but you still have to hold all ~117B weights in memory. The clean single-GPU answer is ~60-65GB (so an 80GB H100, or a 96GB+ workstation card); a 24GB consumer card cannot host it GPU-resident.
  • gpt-oss 20B is the new "16GB sweet spot." It runs comfortably on an RTX 4080/4090 or a 24GB Mac and is the most capable open model that fits a single mainstream card in 2026.
  • The 32B class (Qwen3 32B, Gemma 3 27B, DeepSeek R1 distills) is the real home for 24GB cards — better quality than 8B, and you keep headroom for context. If your main use is coding, our guide to picking the right model size for coding (7B vs 14B vs 32B vs 70B) walks through where the quality jumps actually matter.
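
To make the MoE point concrete, here is a small sketch comparing the weights that must stay resident with the weights each token actually reads. It assumes roughly 0.55 bytes per parameter (the Q4_K_M figure used earlier in this article) and ignores KV cache and overhead; `weights_gb` is a hypothetical helper.

```python
def weights_gb(params_billions: float, bytes_per_param: float = 0.55) -> float:
    """Memory needed just to hold the weights (no KV cache, no overhead)."""
    return params_billions * bytes_per_param


# (total params, active params) in billions, from the table above.
MODELS = {
    "gpt-oss 20B": (20.9, 3.6),
    "gpt-oss 120B": (116.8, 5.1),
    "Qwen3 235B": (235.0, 22.0),
}

for name, (total, active) in MODELS.items():
    resident = weights_gb(total)  # what must sit in memory
    touched = weights_gb(active)  # what each token actually reads
    print(f"{name}: hold ~{resident:.0f}GB, read ~{touched:.1f}GB per token")
```

The resident figure is what determines which card you need; the per-token figure is why MoE models generate quickly once they are loaded.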

How Does Context Length (KV Cache) Eat Your VRAM?

The model weights are only half the story. Every token you keep in context lives in the KV cache, and on long contexts the cache can rival or exceed the model itself. A rough rule: for a 7B-8B model with GQA, each 1,000 tokens of context adds about 0.1GB of VRAM in FP16. Scale that up by model size and context, and a Llama-3-8B at 32K context uses roughly 4GB on KV cache alone, nearly as much as the 4-5GB Q4 weights.

Approximate KV-cache formula (FP16):

KV cache bytes ≈ 2 × layers × kv_heads × head_dim × tokens × 2 bytes
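
A minimal Python sketch of this formula follows. The layer and head counts are the commonly published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128), used here as an assumption; `kv_cache_bytes` is an illustrative name.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head_dim 128.
for tokens in (4096, 8192, 32768):
    gib = kv_cache_bytes(32, 8, 128, tokens) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB")
```

Under those assumptions the cache comes to 0.50 GiB at 4K, 1.00 GiB at 8K, and 4.00 GiB at 32K, which matches the roughly 4GB figure quoted above and shows that the growth is linear in context length.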

Two modern features dramatically reduce this:

  • GQA (Grouped-Query Attention), used by almost every 2026 model (Llama 3.x, Qwen3, Gemma 3), cuts the KV cache by 50-75% or more by sharing keys/values across query heads, with little to no quality loss. It's baked into the model, so you get it for free.
  • KV cache quantization (FP8 or INT4), supported in llama.cpp, vLLM, and others, cuts cache memory another 50-75%. This is how people fit 32K+ context into an 8GB card. In llama.cpp, enable it with the cache-type flags (e.g. --cache-type-k q8_0 --cache-type-v q8_0); in Ollama, set the OLLAMA_KV_CACHE_TYPE environment variable (e.g. q8_0) with flash attention enabled.
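
The effect of both features can be sketched with the same KV-cache formula. The model shape below (32 layers, 32 query heads, 8 KV heads, head dimension 128) is an assumed Llama 3 8B configuration, and the quantized bytes-per-element values are assumptions based on llama.cpp block formats (about 8.5 bits for q8_0 and 4.5 bits for q4_0).

```python
def kv_cache_gib(layers, kv_heads, head_dim, tokens, bytes_per_elem):
    """KV cache in GiB for the given attention shape and element size."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 2**30


TOKENS = 32768
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128  # assumed Llama 3 8B shape

# Approximate bytes per cached element; the quantized values are assumptions
# based on llama.cpp block formats (q8_0 ~ 8.5 bits, q4_0 ~ 4.5 bits).
CACHE_BYTES = {"f16": 2.0, "q8_0": 8.5 / 8, "q4_0": 4.5 / 8}

# Without GQA there would be one KV head per query head.
mha = kv_cache_gib(LAYERS, QUERY_HEADS, HEAD_DIM, TOKENS, CACHE_BYTES["f16"])
print(f"No GQA, f16 cache: {mha:.2f} GiB")

for name, nbytes in CACHE_BYTES.items():
    gqa = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, nbytes)
    print(f"GQA, {name} cache: {gqa:.2f} GiB")
```

With these numbers, GQA alone takes a 32K cache from 16 GiB down to 4 GiB, and quantizing the cache to q8_0 roughly halves it again.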

If you want exact numbers for a specific model + context + quant combination, plug it into our VRAM calculator tool rather than estimating — it accounts for KV cache, GQA, and overhead so you don't guess wrong on a $1,000+ GPU.

VRAM vs System RAM: When Does RAM Step In?

VRAM is where you want the whole model to live — it's roughly 10-20× faster than system RAM for inference. But when a model doesn't fit, Ollama and llama.cpp spill the overflow layers to system RAM (CPU offload), which works but is much slower for the offloaded portion. That makes system RAM your safety net, especially for MoE giants and partial-offload setups. If you're sizing a build, read our companion RAM requirements for local AI guide alongside this one — the two budgets are different, and people routinely over-buy VRAM while starving the system RAM that long contexts and CPU-offload actually need.
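
For a rough sense of how a partial offload splits, the sketch below estimates how many layers fit on the GPU. It is a simplification under stated assumptions: weights are treated as evenly spread across layers, and a fixed reserve (2GB here, an arbitrary choice) is held back for KV cache and runtime buffers. `gpu_layers_that_fit` is a hypothetical helper, not an Ollama or llama.cpp function.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes (simplification) that weights are spread evenly across layers and
    that reserve_gb is set aside for KV cache and runtime buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# 70B at Q4_K_M (~42GB, 80 layers assumed) on a 24GB card.
print(gpu_layers_that_fit(model_gb=42, n_layers=80, vram_gb=24))
```

In this example about half of the layers stay on the GPU and the rest run from system RAM, which is why a 70B model on a single 24GB card works but generates noticeably slower than a model that fits entirely in VRAM.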

Optimizing VRAM Usage

1. Choose Right Quantization

# Q4_K_M: ~42GB for 70B, so a 24GB card needs partial CPU offload
ollama run llama3.1:70b-instruct-q4_K_M

# Q5_K_M (~52GB) if you have the memory for it
ollama run llama3.1:70b-instruct-q5_K_M

2. Reduce Context

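# Context length is a model parameter, not an `ollama run` flag.
# Set it inside an interactive session:
ollama run model
>>> /set parameter num_ctx 4096

# Or set a server-wide default before starting Ollama:
OLLAMA_CONTEXT_LENGTH=4096 ollama serve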

3. Unload Unused Models

# Keep only active model loaded
ollama stop model_name

4. GPU Layers for Hybrid

# Partial GPU, rest on CPU: num_gpu is the number of layers sent to the GPU
ollama run model
>>> /set parameter num_gpu 30

Multi-GPU Setups

Combining VRAM

| Setup | Total VRAM | Usable |
|---|---|---|
| 2× RTX 4090 | 48GB | ~44GB |
| RTX 4090 + 3090 | 48GB | ~42GB |
| 2× RTX 5090 | 64GB | ~58GB |

Configuration

# Automatic multi-GPU in llama.cpp (the binary is llama-cli in current builds, formerly main)
./llama-cli -m model.gguf -ngl 99 # Offloads all layers, split across visible GPUs

# Ollama multi-GPU
CUDA_VISIBLE_DEVICES=0,1 ollama serve

Key Takeaways

  1. Q4_K_M is the sweet spot for most users
  2. 24GB handles models up to the 32B class; 70B at Q4 needs ~42GB (two 24GB cards or CPU offload)
  3. Context length adds significant VRAM
  4. Multi-GPU helps but with overhead
  5. Budget more VRAM than minimum for headroom

Next Steps

  1. Browse the best Ollama models — VRAM requirements for every model
  2. AWQ vs GPTQ vs GGUF — quantization formats that determine VRAM usage
  3. Choose your GPU based on VRAM needs
  4. Find models for 8GB RAM — budget hardware recommendations
  5. Set up Open WebUI once your hardware is ready

VRAM is the key constraint for local AI. Understanding these requirements helps you choose the right hardware and optimize your setup.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.


📅 Published: February 4, 2026🔄 Last Updated: June 20, 2026✓ Manually Reviewed
