

Quantization Calculator

Pick a model size and quantization level — see VRAM needed, quality retention vs full-precision, and the cheapest GPU that fits. Covers all standard formats: Q2 through Q8 GGUF, AWQ INT4, GPTQ INT4, FP8, BF16, and FP16. Updated for 2026 hardware (RTX 5090, H100, MI300X, M3 Ultra).

📅 Published: May 9, 2026 · 🔄 Last Updated: May 9, 2026 · ✓ Manually Reviewed

Total VRAM needed

6.8 GB

Model weights: 4.2 GB
KV cache @ 8K context: 1.1 GB
Runtime overhead: 1.5 GB

Quality vs FP16

96.0%

The sweet spot. Best quality/size trade-off for most use cases. Recommended default.

Recommended hardware

Cheapest fit: RTX 3060 12GB · ~$280

Also fits on: RTX 4060 (8GB), RTX 4070 Ti Super (16GB)
Approximation. Exact VRAM varies ±10% by model architecture (head count, layer count, vocab size). KV cache assumes FP16 GQA with 8 KV heads — most modern models. For MoE models without expert offload, total weights memory is required even though only a fraction is activated per token. Cross-check with VRAM Calculator for model-specific accuracy.
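For a rough sanity check away from the page, here is a minimal Python sketch of the approximation described in that note. The constants (8 KV heads, 128 head dimension, 1.5 GB runtime overhead) are assumptions chosen to mirror the example output above, not the calculator's exact internals:

```python
def estimate_vram_gb(params_b, bits_per_weight, n_layers=32, n_kv_heads=8,
                     head_dim=128, seq_len=8192, overhead_gb=1.5):
    """Rough VRAM estimate: weights + FP16 KV cache + runtime overhead.

    The defaults (8 KV heads, 128 head_dim, 1.5 GB overhead) are assumptions
    that mirror the example output above, not exact per-model values.
    """
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes x seq_len
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * 2 * seq_len / 1e9
    return weights_gb + kv_gb + overhead_gb

# A 7B model at Q4_K_M (~4.83 bits/weight) with 8K context reproduces
# roughly the 6.8 GB total shown above.
print(f"{estimate_vram_gb(7, 4.83):.1f} GB")
```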

When to use which quantization

Q4_K_M — Recommended Default

~70% memory reduction versus BF16 (about 4.83 bits per weight) with ~96% quality retention. Fits 7B on 8GB, 13B on 12GB, 32B on 24GB, 70B on 48GB. The right choice for almost every local deployment.
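Those fits are simple arithmetic at ~4.83 bits per weight; a quick sketch to verify them (KV cache and runtime overhead excluded, so real headroom is tighter):

```python
# Weights-only memory at Q4_K_M (~4.83 bits/weight). KV cache and
# overhead come on top, so real headroom is tighter than shown.
for params_b, card_gb in [(7, 8), (13, 12), (32, 24), (70, 48)]:
    weights_gb = params_b * 4.83 / 8
    print(f"{params_b}B -> {weights_gb:.1f} GB weights (fits {card_gb} GB card)")
```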

Q5_K_M / Q6_K — Quality-First

Use when you have memory headroom and quality matters (production RAG, complex reasoning). 98-99% quality retention at 20-40% more memory than Q4.

FP8 — Production GPU Serving

For H100/H200/RTX 5090/MI300X. Near-lossless quality, dedicated FP8 tensor cores, default in vLLM/SGLang for production. Half the memory of BF16.

AWQ INT4 — vLLM Production

When you need 4× memory reduction on older GPUs without FP8 cores. ~95% quality retention, very fast on consumer GPUs (RTX 3090/4090).

Q3_K_M — Memory-Constrained

Use when Q4 won't fit. Accept ~10% quality loss. Most viable on 70B+ models where the larger size absorbs quantization damage.

Q2_K — Last Resort

Significant quality loss on small models. Only acceptable on 70B+ when nothing else fits. Often better to choose a smaller model at higher quant.

Frequently asked questions

Which quantization should I use for daily LLM work?
Q4_K_M is the right default for almost everyone. It uses ~70% less memory than BF16 with quality typically within 1-3 percentage points on standard benchmarks (MMLU, HumanEval, etc.). Step up to Q5_K_M or Q6_K when you have memory headroom and quality matters (production RAG, complex reasoning). Step down to Q3_K_M only when Q4 won't fit and you accept noticeable quality loss on smaller models. Stay above Q3 if your model is under 13B parameters.
How accurate is "quality retention" — is Q4_K_M really 96% as good as FP16?
The 96% number is an average across MMLU, HumanEval, and GSM8K — public benchmarks where the gap is small. Real-world quality loss varies. Reasoning-heavy tasks (math, multi-step planning) lose more. Style and creative writing lose less. For coding specifically, Q4_K_M on a 7B model loses ~2-4 HumanEval points vs BF16; on a 70B model, less than 1 point. The bigger the model, the smaller the relative quality loss from quantization — which is why aggressive quantization is more viable on 70B+ than on 7B.
What's the difference between Q4_K_M, AWQ INT4, and GPTQ INT4?
All three are 4-bit quantization but optimized for different runtimes. Q4_K_M is the llama.cpp / Ollama format — uses K-quants (block quantization with importance weighting) plus mixed precision for sensitive layers, ends up ~4.83 bits per weight on average. AWQ is the vLLM/SGLang format — protects the 1% of weights critical for activation quality, very fast on GPU. GPTQ is the older format — layer-by-layer quantization with calibration, largely replaced by AWQ for new deployments. Quality is roughly Q4_K_M ≥ AWQ > GPTQ. Speed on GPU is roughly AWQ > GPTQ > Q4_K_M.
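In practice the choice of 4-bit format is mostly a runtime decision. A hypothetical loading sketch for the two main paths; model paths and repo names are placeholders:

```python
# GGUF Q4_K_M via llama-cpp-python (llama.cpp / Ollama ecosystem)
from llama_cpp import Llama
llm_gguf = Llama(model_path="model.Q4_K_M.gguf", n_gpu_layers=-1)

# AWQ INT4 via vLLM (GPU production serving)
from vllm import LLM
llm_awq = LLM(model="org/model-AWQ", quantization="awq")
```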
When should I use FP8 instead of INT4?
FP8 (E4M3 or E5M2) is the right choice when you have FP8-capable hardware — H100, H200, MI300X, RTX 5090. FP8 keeps near-FP16 quality (>99% retention) at half the memory of BF16, runs through dedicated FP8 tensor cores at higher throughput than any INT4 path on GPU, and is the production standard at vLLM/SGLang/TRT-LLM in 2026. Use FP8 for production serving on modern hardware; use Q4/AWQ when you need 4× memory reduction or run on older GPUs without FP8 cores.
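A minimal vLLM sketch of this, assuming an FP8-capable GPU; the model name is a placeholder. vLLM can quantize a BF16 checkpoint to FP8 on the fly via the quantization flag:

```python
from vllm import LLM

# Placeholder model name. Requires FP8-capable hardware
# (H100/H200/RTX 5090/MI300X) for the tensor-core throughput benefit.
llm = LLM(model="org/model-bf16", quantization="fp8")  # on-the-fly FP8 weight quant
outputs = llm.generate(["Explain the KV cache in one sentence."])
print(outputs[0].outputs[0].text)
```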
Why does the calculator show different VRAM than the VRAM Calculator?
They're complementary. The VRAM Calculator is model-specific — pick a real model and get its exact VRAM. This calculator is quantization-first — see how the same model size scales across all quant levels at once, useful for comparing trade-offs before committing. For a final number, cross-check with the VRAM Calculator on your specific model. Differences of ±10% are normal due to per-model architecture variation (head count, layer count, vocab size).
How is KV cache calculated here?
We approximate KV cache memory as 2 × layers × kv_heads × head_dim × 2_bytes × seq_len, assuming FP16 KV with 8 KV heads (typical GQA). Actual numbers vary: Llama 3 70B has 8 KV heads × 128 head_dim × 80 layers; DeepSeek V3 uses MLA which compresses KV by ~5×; older non-GQA models have full attention heads as KV heads. For most modern models (Llama 3+, Qwen 2+, Mistral) this approximation is within 15%. For DeepSeek V3/V4, the real KV cache is 5× smaller than shown here.
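The same formula as a small Python helper, with Llama 3 70B's published dimensions plugged in as a worked example:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """FP16 KV cache size: 2 (K and V) x layers x kv_heads x head_dim
    x bytes x sequence length. Batch size 1; multiply for larger batches."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len / 1e9

# Llama 3 70B: 80 layers, 8 KV heads, 128 head_dim, at 8K context
print(f"{kv_cache_gb(80, 8, 128, 8192):.2f} GB")  # ~2.68 GB
```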
What about MoE models — does the calculator handle them correctly?
For MoE models like DeepSeek V3 (671B total / 37B active) or Kimi K2.6 (1T total / 32B active), all weights must be loaded into VRAM by default — even though only a fraction activates per token. The "Use active-params only" toggle exists for the rare case where you offload inactive experts to CPU/disk, accepting much slower inference for lower VRAM. Most production MoE deployments load everything; only set the toggle if you actually run with expert offload (very rare in 2026 outside research setups).
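To see why the toggle matters, here is the weights-only arithmetic for DeepSeek V3 at a Q4_K_M-equivalent ~4.83 bits per weight (a sketch; real deployments add KV cache and overhead on top):

```python
# DeepSeek V3's published total/active parameter counts: 671B / 37B.
total_b, active_b = 671, 37
bpw = 4.83  # Q4_K_M-equivalent bits per weight

print(f"all experts loaded: {total_b * bpw / 8:.0f} GB")   # ~405 GB
print(f"active params only: {active_b * bpw / 8:.0f} GB")  # ~22 GB, expert offload
```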
Does quantization affect fine-tuning or only inference?
Standard quantization (Q4, Q8, FP8) is for inference. For fine-tuning, you typically keep weights in BF16 — except in QLoRA, where the base model loads in NF4 (Q4-equivalent) but only the small LoRA adapter trains in BF16. QLoRA lets you fine-tune a 70B model on a single 24GB GPU, with quality close to full BF16 fine-tuning. See our QLoRA Fine-Tuning Guide for the full setup.
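For reference, a minimal QLoRA loading sketch with Hugging Face transformers + peft; the model name is a placeholder and the LoRA hyperparameters are illustrative, not tuned recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights load in NF4 (Q4-equivalent)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute runs in BF16
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "org/model-70b",                        # placeholder model name
    quantization_config=bnb,
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()          # only the LoRA adapters train in BF16
```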

Master quantization for production

The calculator gets you a number. The course gets you a deployment.

Local AI Deployment course covers KV-cache quantization, FP8 vs INT4 trade-offs, mixed-precision serving, and production tuning — the stuff that matters when quality starts dropping after you ship. First chapter free, no card.



Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum · ✓ Hands-On Projects · ✓ Open Source Contributor
