

Quantization Calculator

Pick a model size and quantization level — see VRAM needed, quality retention vs full-precision, and the cheapest GPU that fits. Covers all standard formats: Q2 through Q8 GGUF, AWQ INT4, GPTQ INT4, FP8, BF16, and FP16. Updated for 2026 hardware (RTX 5090, H100, MI300X, M3 Ultra).

📅 Published: May 9, 2026 · 🔄 Last Updated: May 9, 2026 · ✓ Manually Reviewed

Total VRAM needed

6.8 GB

Model weights: 4.2 GB
KV cache @ 8K context: 1.1 GB
Runtime overhead: 1.5 GB

Quality vs FP16

96.0%

The sweet spot. Best quality/size trade-off for most use cases. Recommended default.

Recommended hardware

Cheapest fit: RTX 3060 12GB · ~$280

Also fits on: RTX 4060 (8GB), RTX 4070 Ti Super (16GB)
Approximation. Exact VRAM varies ±10% by model architecture (head count, layer count, vocab size). KV cache assumes FP16 GQA with 8 KV heads — most modern models. For MoE models without expert offload, total weights memory is required even though only a fraction is activated per token. Cross-check with VRAM Calculator for model-specific accuracy.
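For a rough sanity check away from the page, here is a minimal Python sketch of the approximation described in that note. The constants (8 KV heads, 128 head dimension, 1.5 GB runtime overhead) are assumptions chosen to mirror the example output above, not the calculator's exact internals:

```python
def estimate_vram_gb(params_b, bits_per_weight, n_layers=32, n_kv_heads=8,
                     head_dim=128, seq_len=8192, overhead_gb=1.5):
    """Rough VRAM estimate: weights + FP16 KV cache + runtime overhead.

    The defaults (8 KV heads, 128 head_dim, 1.5 GB overhead) are assumptions
    that mirror the example output above, not exact per-model values.
    """
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes x seq_len
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * 2 * seq_len / 1e9
    return weights_gb + kv_gb + overhead_gb

# A 7B model at Q4_K_M (~4.83 bits/weight) with 8K context reproduces
# roughly the 6.8 GB total shown above.
print(f"{estimate_vram_gb(7, 4.83):.1f} GB")
```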

When to use which quantization

Q4_K_M — Recommended Default

~70% memory reduction versus BF16 (about 4.83 bits per weight) with ~96% quality retention. Fits 7B on 8GB, 13B on 12GB, 32B on 24GB, 70B on 48GB. The right choice for almost every local deployment.
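Those fits are simple arithmetic at ~4.83 bits per weight; a quick sketch to verify them (KV cache and runtime overhead excluded, so real headroom is tighter):

```python
# Weights-only memory at Q4_K_M (~4.83 bits/weight). KV cache and
# overhead come on top, so real headroom is tighter than shown.
for params_b, card_gb in [(7, 8), (13, 12), (32, 24), (70, 48)]:
    weights_gb = params_b * 4.83 / 8
    print(f"{params_b}B -> {weights_gb:.1f} GB weights (fits {card_gb} GB card)")
```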

Q5_K_M / Q6_K — Quality-First

Use when you have memory headroom and quality matters (production RAG, complex reasoning). 98-99% quality retention at 20-40% more memory than Q4.

FP8 — Production GPU Serving

For H100/H200/RTX 5090/MI300X. Near-lossless quality, dedicated FP8 tensor cores, default in vLLM/SGLang for production. Half the memory of BF16.

AWQ INT4 — vLLM Production

When you need 4× memory reduction on older GPUs without FP8 cores. ~95% quality retention, very fast on consumer GPUs (RTX 3090/4090).

Q3_K_M — Memory-Constrained

Use when Q4 won't fit. Accept ~10% quality loss. Most viable on 70B+ models where the larger size absorbs quantization damage.

Q2_K — Last Resort

Significant quality loss on small models. Only acceptable on 70B+ when nothing else fits. Often better to choose a smaller model at higher quant.

Frequently asked questions

Which quantization should I use for daily LLM work?
Q4_K_M is the right default for almost everyone. It uses ~70% less memory than BF16 with quality typically within 1-3 percentage points on standard benchmarks (MMLU, HumanEval, etc.). Step up to Q5_K_M or Q6_K when you have memory headroom and quality matters (production RAG, complex reasoning). Step down to Q3_K_M only when Q4 won't fit and you accept noticeable quality loss on smaller models. Stay above Q3 if your model is under 13B parameters.
How accurate is "quality retention" — is Q4_K_M really 96% as good as FP16?
The 96% number is an average across MMLU, HumanEval, and GSM8K — public benchmarks where the gap is small. Real-world quality loss varies. Reasoning-heavy tasks (math, multi-step planning) lose more. Style and creative writing lose less. For coding specifically, Q4_K_M on a 7B model loses ~2-4 HumanEval points vs BF16; on a 70B model, less than 1 point. The bigger the model, the smaller the relative quality loss from quantization — which is why aggressive quantization is more viable on 70B+ than on 7B.
What's the difference between Q4_K_M, AWQ INT4, and GPTQ INT4?
All three are 4-bit quantization but optimized for different runtimes. Q4_K_M is the llama.cpp / Ollama format — uses K-quants (block quantization with importance weighting) plus mixed precision for sensitive layers, ends up ~4.83 bits per weight on average. AWQ is the vLLM/SGLang format — protects the 1% of weights critical for activation quality, very fast on GPU. GPTQ is the older format — layer-by-layer quantization with calibration, largely replaced by AWQ for new deployments. Quality is roughly Q4_K_M ≥ AWQ > GPTQ. Speed on GPU is roughly AWQ > GPTQ > Q4_K_M.
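In practice the choice of 4-bit format is mostly a runtime decision. A hypothetical loading sketch for the two main paths; model paths and repo names are placeholders:

```python
# GGUF Q4_K_M via llama-cpp-python (llama.cpp / Ollama ecosystem)
from llama_cpp import Llama
llm_gguf = Llama(model_path="model.Q4_K_M.gguf", n_gpu_layers=-1)

# AWQ INT4 via vLLM (GPU production serving)
from vllm import LLM
llm_awq = LLM(model="org/model-AWQ", quantization="awq")
```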
When should I use FP8 instead of INT4?
FP8 (E4M3 or E5M2) is the right choice when you have FP8-capable hardware — H100, H200, MI300X, RTX 5090. FP8 keeps near-FP16 quality (>99% retention) at half the memory of BF16, runs through dedicated FP8 tensor cores at higher throughput than any INT4 path on GPU, and is the production standard at vLLM/SGLang/TRT-LLM in 2026. Use FP8 for production serving on modern hardware; use Q4/AWQ when you need 4× memory reduction or run on older GPUs without FP8 cores.
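A minimal vLLM sketch of this, assuming an FP8-capable GPU; the model name is a placeholder. vLLM can quantize a BF16 checkpoint to FP8 on the fly via the quantization flag:

```python
from vllm import LLM

# Placeholder model name. Requires FP8-capable hardware
# (H100/H200/RTX 5090/MI300X) for the tensor-core throughput benefit.
llm = LLM(model="org/model-bf16", quantization="fp8")  # on-the-fly FP8 weight quant
outputs = llm.generate(["Explain the KV cache in one sentence."])
print(outputs[0].outputs[0].text)
```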
Why does the calculator show different VRAM than the VRAM Calculator?
They're complementary. The VRAM Calculator is model-specific — pick a real model and get its exact VRAM. This calculator is quantization-first — see how the same model size scales across all quant levels at once, useful for comparing trade-offs before committing. For a final number, cross-check with the VRAM Calculator on your specific model. Differences of ±10% are normal due to per-model architecture variation (head count, layer count, vocab size).
How is KV cache calculated here?
We approximate KV cache memory as 2 × layers × kv_heads × head_dim × 2_bytes × seq_len, assuming FP16 KV with 8 KV heads (typical GQA). Actual numbers vary: Llama 3 70B has 8 KV heads × 128 head_dim × 80 layers; DeepSeek V3 uses MLA which compresses KV by ~5×; older non-GQA models have full attention heads as KV heads. For most modern models (Llama 3+, Qwen 2+, Mistral) this approximation is within 15%. For DeepSeek V3/V4, the real KV cache is 5× smaller than shown here.
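The same formula as a small Python helper, with Llama 3 70B's published dimensions plugged in as a worked example:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """FP16 KV cache size: 2 (K and V) x layers x kv_heads x head_dim
    x bytes x sequence length. Batch size 1; multiply for larger batches."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len / 1e9

# Llama 3 70B: 80 layers, 8 KV heads, 128 head_dim, at 8K context
print(f"{kv_cache_gb(80, 8, 128, 8192):.2f} GB")  # ~2.68 GB
```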
What about MoE models — does the calculator handle them correctly?
For MoE models like DeepSeek V3 (671B total / 37B active) or Kimi K2.6 (1T total / 32B active), all weights must be loaded into VRAM by default — even though only a fraction activates per token. The "Use active-params only" toggle exists for the rare case where you offload inactive experts to CPU/disk, accepting much slower inference for lower VRAM. Most production MoE deployments load everything; only set the toggle if you actually run with expert offload (very rare in 2026 outside research setups).
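To see why the toggle matters, here is the weights-only arithmetic for DeepSeek V3 at a Q4_K_M-equivalent ~4.83 bits per weight (a sketch; real deployments add KV cache and overhead on top):

```python
# DeepSeek V3's published total/active parameter counts: 671B / 37B.
total_b, active_b = 671, 37
bpw = 4.83  # Q4_K_M-equivalent bits per weight

print(f"all experts loaded: {total_b * bpw / 8:.0f} GB")   # ~405 GB
print(f"active params only: {active_b * bpw / 8:.0f} GB")  # ~22 GB, expert offload
```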
Does quantization affect fine-tuning or only inference?
Standard quantization (Q4, Q8, FP8) is for inference. For fine-tuning, you typically keep weights in BF16 — except in QLoRA, where the base model loads in NF4 (Q4-equivalent) but only the small LoRA adapter trains in BF16. QLoRA lets you fine-tune a 70B model on a single 24GB GPU, with quality close to full BF16 fine-tuning. See our QLoRA Fine-Tuning Guide for the full setup.
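For reference, a minimal QLoRA loading sketch with Hugging Face transformers + peft; the model name is a placeholder and the LoRA hyperparameters are illustrative, not tuned recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights load in NF4 (Q4-equivalent)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute runs in BF16
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "org/model-70b",                        # placeholder model name
    quantization_config=bnb,
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()          # only the LoRA adapters train in BF16
```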

Master quantization for production

The calculator gets you a number. The course gets you a deployment.

Local AI Deployment course covers KV-cache quantization, FP8 vs INT4 trade-offs, mixed-precision serving, and production tuning — the stuff that matters when quality starts dropping after you ship. First chapter free, no card.



Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum · ✓ Hands-On Projects · ✓ Open Source Contributor
