Performance

KV Cache & PagedAttention Complete Guide (2026): Memory, Quantization, Prefix Caching

May 2, 2026
24 min read
LocalAimaster Research Team


The KV cache is the single biggest memory cost in long-context LLM serving. For Llama 3.1 70B at 131K context, the KV cache alone is 43 GB per request, more than half the size of the FP8 model weights. Every modern serving engine (vLLM, SGLang, TensorRT-LLM) revolves around managing it efficiently. PagedAttention, prefix caching, FP8 KV, GQA/MLA/CLA, CPU offload, and disaggregated prefill are all answers to the same question: how do we serve more requests with less KV cache pain?

This guide covers the full stack: how the KV cache works mechanically, why PagedAttention beats contiguous allocation, when to enable FP8 / INT8 KV quantization, configuring prefix caching for shared system prompts, when CPU offload helps vs hurts, the disaggregated-prefill architecture for long-prompt workloads, and per-engine configuration for vLLM, SGLang, TensorRT-LLM, and llama.cpp.

Table of Contents

  1. Why the KV Cache Matters
  2. KV Cache Memory Math
  3. GQA / MQA / MLA / CLA: KV Reduction by Architecture
  4. PagedAttention Explained
  5. Prefix Caching
  6. RadixAttention (SGLang)
  7. KV Quantization: FP8, INT8, INT4
  8. CPU Offload Trade-offs
  9. Disaggregated Prefill
  10. Continuous Batching
  11. vLLM Configuration
  12. SGLang Configuration
  13. TensorRT-LLM Configuration
  14. llama.cpp KV Cache
  15. Tuning by Workload
  16. Real Benchmarks
  17. Troubleshooting
  18. FAQ


Why the KV Cache Matters {#why}

Autoregressive generation: at step t, the query for token t attends to the keys and values of all earlier positions 0..t-1 (plus its own freshly computed K and V). Recomputing K and V for those old positions at every step would be wasted work, so they are computed once and cached.

Memory growth is linear in sequence length and dominates as context grows:

| Context | Llama 3.1 8B (BF16) | Llama 3.1 70B (BF16) | Llama 3.1 405B (BF16) |
|---------|---------------------|----------------------|-----------------------|
| 4K      | 0.5 GB              | 1.3 GB               | 3.0 GB                |
| 32K     | 4 GB                | 10.7 GB              | 24 GB                 |
| 128K    | 16 GB               | 43 GB                | 96 GB                 |
| 200K    | 25 GB               | 67 GB                | 150 GB                |

For batch=8 at 32K context on Llama 3.1 70B: 86 GB just for KV cache. Add 140 GB for weights = 226 GB total — needs 4x H100 80GB.


KV Cache Memory Math {#math}

Per-token KV size:

kv_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element

For Llama 3.1 70B with GQA (8 KV heads / 64 query heads):

2 × 80 layers × 8 KV heads × 128 head_dim × 2 (BF16) = 327,680 bytes ≈ 320 KB per token

Per-request at 32K context: 320 KB × 32768 = 10.7 GB.

For full MHA (no GQA), this would be 8x larger: 86 GB at 32K. GQA is mandatory for serving these models.

For DeepSeek V3 with MLA: KV is compressed to ~50 KB per token = 1.6 GB at 32K. The MLA dimensionality reduction is what makes 671B at 128K context viable.
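
The per-token and per-request numbers above are easy to script. A minimal Python sketch of the same arithmetic (the layer/head counts are the public Llama 3.1 70B values; swap in your own model config):

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_element=2):
    # 2 accounts for K and V; bytes_per_element is 2 for BF16, 1 for FP8/INT8
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size

# Llama 3.1 70B with GQA: 80 layers, 8 KV heads, head_dim 128, BF16
print(kv_cache_bytes(80, 8, 128, 32_768) / 1e9)                 # ~10.7 GB per 32K request
print(kv_cache_bytes(80, 8, 128, 32_768, batch_size=8) / 1e9)   # ~86 GB for batch=8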


GQA / MQA / MLA / CLA: KV Reduction by Architecture {#architecture}

| Technique | Reduction Factor | Quality Loss | Examples |
|-----------|------------------|--------------|----------|
| MHA (baseline) | 1x | 0% | GPT-2, original Llama |
| MQA (Multi-Query) | num_heads (e.g., 32x) | 1-2% | PaLM, Falcon |
| GQA (Grouped-Query) | num_heads / num_kv_heads (e.g., 8x) | <0.5% | Llama 3, Qwen 2.5, Mistral |
| MLA (Multi-Head Latent) | 5-10x via low-rank K/V projection | <0.5% | DeepSeek V2, V3 |
| CLA (Cross-Layer) | 2-3x via shared K/V across layers | <0.5% | Hunyuan-Large |
| GQA + MLA | 30-80x combined | <1% | DeepSeek V3 (effective) |
| GQA + CLA | 16-24x combined | <1% | Hunyuan-Large (effective) |

For new model designs in 2026, GQA is the baseline; MLA or CLA layered on top is the frontier.
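
To see the mechanism rather than just the ratio, here is a minimal PyTorch sketch of a GQA decode step: only 8 KV heads are cached, and each is broadcast across its group of 8 query heads at attention time. Shapes follow a Llama-3.1-70B-style layer; this is an illustration, not a serving kernel.

import torch

num_q_heads, num_kv_heads = 64, 8
group_size = num_q_heads // num_kv_heads          # 8 query heads share each cached KV head
batch, seq, head_dim = 1, 16, 128

q = torch.randn(batch, num_q_heads, 1, head_dim)          # query for the single new token
k = torch.randn(batch, num_kv_heads, seq, head_dim)       # cached keys: per KV head only
v = torch.randn(batch, num_kv_heads, seq, head_dim)       # cached values: per KV head only

# Expand each cached KV head across its query-head group, then do ordinary attention
k_exp = k.repeat_interleave(group_size, dim=1)
v_exp = v.repeat_interleave(group_size, dim=1)
attn = torch.softmax(q @ k_exp.transpose(-1, -2) / head_dim ** 0.5, dim=-1) @ v_exp
print(attn.shape)   # torch.Size([1, 64, 1, 128]): full 64-head output from 8 cached KV heads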



PagedAttention Explained {#pagedattention}

Naive allocation reserves max_seq_len of contiguous KV per request. Problem: for short sequences 90%+ of that reservation is wasted, and the large contiguous chunks fragment the KV pool so free memory can't be reused.

PagedAttention (Kwon et al., 2023):

  • Allocate KV in fixed pages (default 16 tokens)
  • Each request holds a list of page IDs (mapping to non-contiguous physical pages)
  • Pages allocated on-demand as sequence grows
  • Pages freed when request ends
  • Pages can be shared across requests via reference counting (enables prefix caching and copy-on-write for beam search)

Result vs naive:

  • 60-90% less wasted memory
  • 2-4x batch size on same VRAM
  • Native support for variable-length, prefix sharing, and dynamic eviction

PagedAttention requires custom CUDA kernels (vLLM, SGLang, TRT-LLM all have them).
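
The custom kernels are the hard part; the bookkeeping itself is simple. A toy Python sketch of the page-table side (illustrative only, not vLLM's actual data structures):

class PagedKVPool:
    # Fixed-size pages, allocated on demand, shared via reference counts
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.ref_count = [0] * num_pages   # how many sequences use each physical page
        self.block_tables = {}             # request id -> list of physical page ids
        self.seq_lens = {}                 # request id -> tokens written so far

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        length = self.seq_lens.get(req_id, 0)
        if length % self.page_size == 0:   # last page is full, grab a new one on demand
            page = self.free_pages.pop()
            self.ref_count[page] += 1
            table.append(page)
        self.seq_lens[req_id] = length + 1

    def free_request(self, req_id):
        for page in self.block_tables.pop(req_id, []):
            self.ref_count[page] -= 1
            if self.ref_count[page] == 0:  # a shared page is only reclaimed when unreferenced
                self.free_pages.append(page)
        self.seq_lens.pop(req_id, None)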


Prefix Caching {#prefix-caching}

If many requests share a long system prompt + few-shot examples, prefix caching reuses the KV for that shared prefix.

Without prefix caching:

Request 1: [prefix 5K tokens] [user A 200 tokens] → 5200 prefill tokens
Request 2: [prefix 5K tokens] [user B 200 tokens] → 5200 prefill tokens
Request 3: [prefix 5K tokens] [user C 200 tokens] → 5200 prefill tokens
Total: 15600 prefill tokens

With prefix caching:

Request 1: [prefix 5K] [user A] → 5200 prefill (caches prefix)
Request 2: [user B] → 200 prefill (reuses prefix KV)
Request 3: [user C] → 200 prefill (reuses prefix KV)
Total: 5600 prefill tokens

TTFT speedup on prefix-sharing workloads: 3-10x. Memory: shared pages reference-counted, freed when no requests reference them. Enable in vLLM (default in 0.7+) with --enable-prefix-caching.
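
The same switch exists on vLLM's offline Python API. A minimal sketch of the shared-prefix pattern (argument names track the CLI flags above and current vLLM releases, but verify against your version's docs; the prompt text is a hypothetical placeholder):

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    enable_prefix_caching=True,   # reuse KV pages for shared prompt prefixes
)

system_prefix = "You are a support agent for ExampleCo. Policies: ..."  # shared ~5K-token prefix
params = SamplingParams(max_tokens=256)

# The first call pays full prefill and populates the prefix cache; later calls with the same
# leading tokens only prefill the user-specific suffix.
for user_msg in ["Reset my password", "Where is my invoice?", "Cancel my plan"]:
    out = llm.generate(system_prefix + "\n" + user_msg, params)
    print(out[0].outputs[0].text[:80])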


RadixAttention (SGLang) {#radixattention}

RadixAttention is SGLang's evolution of prefix caching: all cached KV is indexed in a radix tree keyed by token sequences. Matching is hierarchical, so multiple requests share partial prefixes naturally.

[system A][user 1] cached
[system A][user 2] cached  ← shares [system A]
[system B][user 1] cached  ← stores new prefix

For multi-tenant agentic systems with diverse system prompts, RadixAttention captures more reuse than flat prefix caching. SGLang's typical advantage over vLLM on prefix-heavy workloads: 1.3-2x throughput.
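
A toy sketch of the idea, using a plain token-level trie rather than SGLang's compressed radix tree (purely illustrative): longest_match tells the scheduler how many leading tokens already have cached KV, so only the remaining suffix needs prefill.

class TrieNode:
    def __init__(self):
        self.children = {}    # next token id -> TrieNode
        self.kv_ref = None    # stand-in for a reference to the cached KV pages for this path

class PrefixIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.kv_ref = object()      # placeholder for real KV page handles

    def longest_match(self, tokens):
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

index = PrefixIndex()
index.insert([1, 2, 3, 4, 5])              # [system A][user 1]
print(index.longest_match([1, 2, 3, 9]))   # 3: only the new suffix needs prefill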


KV Quantization: FP8, INT8, INT4 {#kv-quant}

Quantize KV cache values to fewer bits, halving or quartering memory.

| Quant | Bytes/element | Memory vs BF16 | Quality Loss | Best HW |
|-------|---------------|----------------|--------------|---------|
| BF16 | 2 | 100% | 0% | Any |
| FP8 E5M2 | 1 | 50% | <0.1 ppl | H100, H200, MI300X |
| FP8 E4M3 | 1 | 50% | <0.1 ppl | H100, H200 |
| INT8 | 1 | 50% | <0.5 ppl | Most GPUs |
| INT4 | 0.5 | 25% | 0.5-1.5 ppl | Experimental |

For H100/H200 production: FP8 E5M2 KV cache is essentially free (no measurable quality loss, 50% memory savings). For older GPUs (A100): INT8 KV is the equivalent.

Enable in vLLM:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --kv-cache-dtype fp8_e5m2 \
    --max-model-len 65536
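
The practical payoff is concurrency: halving bytes per element roughly doubles how many long-context requests fit in the same KV budget. A quick sketch (the 80 GB pool size is an assumption for a 4x H100 tensor-parallel group after weights; the model dimensions are the Llama 3.1 70B values from the memory-math section):

kv_budget_gb = 80                          # assumed KV pool left after weights
layers, kv_heads, head_dim = 80, 8, 128    # Llama 3.1 70B (GQA)
seq_len = 32_768

for name, bytes_per_el in [("BF16 KV", 2), ("FP8 KV", 1)]:
    per_req_gb = 2 * layers * kv_heads * head_dim * bytes_per_el * seq_len / 1e9
    print(f"{name}: {per_req_gb:.1f} GB per 32K request, "
          f"max concurrent ~{int(kv_budget_gb // per_req_gb)}")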

CPU Offload Trade-offs {#cpu-offload}

When VRAM cannot hold the needed KV (e.g., 200K context on single GPU), offload to CPU RAM:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --cpu-offload-gb 64 \
    --max-model-len 200000

Cost: PCIe 4.0 x16 = 32 GB/s; transferring 1 GB of KV during decode = ~31ms. For latency-sensitive workloads, this kills TTFT and per-token latency.

When offload helps:

  • Batch summarization where latency is irrelevant
  • Solo developer with single GPU + lots of system RAM
  • Burst workloads that occasionally exceed VRAM

When offload hurts:

  • Chat / agents (latency-sensitive)
  • Batched serving (PCIe becomes bottleneck)

Better alternatives: FP8 KV, MLA models, more GPUs.


Disaggregated Prefill {#disaggregated}

Standard serving: prefill (compute KV for prompt) and decode (generate tokens) co-located. They have opposite bottlenecks:

  • Prefill: compute-bound (full attention over prompt)
  • Decode: bandwidth-bound (load weights for one token at a time)

When co-located, prefill bursts stall decode steps and spike inter-token latency. Disaggregated prefill (vLLM 0.6+, SGLang) splits the cluster:

  • N "prefill" GPUs do prompt processing
  • M "decode" GPUs do token generation
  • KV cache shipped over NVLink / RDMA

Result for long-prompt workloads (avg prompt >5K): 2-3x throughput, 2x lower decode latency. Cost: cluster complexity, higher minimum hardware.

Worth it for: B2B SaaS with long prompts, RAG, agentic systems. Not for: short-prompt chat, single-GPU deployments.
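
A rough feel for the hand-off cost, reusing the per-token math from earlier: shipping the KV for one 32K-token Llama 3.1 70B prompt takes tens of milliseconds over NVLink but hundreds over PCIe, which is why the prefill-to-decode link matters. Bandwidth figures below are ballpark assumptions, not measurements.

kv_bytes = 2 * 80 * 8 * 128 * 2 * 32_768   # ~10.7 GB: 32K-token KV for Llama 3.1 70B, BF16

# Assumed per-direction bandwidths; check your actual hardware
links_gb_s = {"PCIe 4.0 x16": 32, "400G RDMA NIC": 50, "NVLink (H100)": 450}

for name, gb_s in links_gb_s.items():
    ms = kv_bytes / (gb_s * 1e9) * 1e3
    print(f"{name}: ~{ms:.0f} ms to ship one 32K-prompt KV cache")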


Continuous Batching {#continuous-batching}

Static batching waits for the slowest request in a batch to finish before starting new ones — wastes GPU cycles. Continuous batching (introduced in Orca, popularized by vLLM) dynamically swaps in new requests as old ones complete, on a per-iteration basis.

Result: 2-5x throughput improvement on real-world traffic with mixed request lengths. Standard in vLLM, SGLang, TensorRT-LLM. Often combined with PagedAttention because dynamic batching needs flexible KV allocation.
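
A toy iteration-level scheduler makes the contrast with static batching concrete (purely illustrative; real engines do this inside the model's forward loop and combine it with paged KV allocation):

from collections import deque

def continuous_batching(requests, max_batch=4):
    # requests: list of (req_id, tokens_to_generate); returns the batch contents per iteration
    waiting, running, trace = deque(requests), {}, []
    while waiting or running:
        # Iteration-level scheduling: refill free batch slots from the queue every step
        while waiting and len(running) < max_batch:
            req_id, n = waiting.popleft()
            running[req_id] = n
        trace.append(sorted(running))
        # One decode step for every running request; finished requests free their slot immediately
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]
    return trace

# Short requests finish and hand their slot to queued work instead of waiting for the longest one
print(continuous_batching([("a", 2), ("b", 8), ("c", 3), ("d", 2), ("e", 5)]))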


vLLM Configuration {#vllm}

Maximum-batch production config on H100 80GB:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --kv-cache-dtype fp8_e5m2 \
    --enable-prefix-caching \
    --block-size 16 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 16384

Key knobs:

  • --gpu-memory-utilization: 0.85-0.95 depending on co-tenants
  • --max-num-seqs: max concurrent requests
  • --max-num-batched-tokens: prefill chunking budget per iteration
  • --enable-chunked-prefill: split long prompts across iterations to reduce decode latency
  • --swap-space: CPU swap GB for evicted KV (default 4)

See vLLM Complete Setup Guide.
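
If you embed vLLM in Python instead of running the server, the same knobs are exposed on the offline LLM constructor. A hedged sketch (argument names mirror the CLI flags but can shift between releases, so check your version's EngineArgs):

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    max_model_len=32_768,
    gpu_memory_utilization=0.92,
    kv_cache_dtype="fp8_e5m2",
    enable_prefix_caching=True,
    block_size=16,
    max_num_seqs=256,
    max_num_batched_tokens=16_384,
)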


SGLang Configuration {#sglang}

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --tp 4 \
    --kv-cache-dtype fp8_e5m2 \
    --enable-cache-report \
    --schedule-policy fcfs \
    --max-prefill-tokens 16384 \
    --context-length 32768

SGLang's RadixAttention is on by default; --enable-cache-report logs hit rates.


TensorRT-LLM Configuration {#tensorrt}

KV cache config goes in the engine build:

trtllm-build \
    --checkpoint_dir /trt_ckpt/llama-70b \
    --output_dir /trt_engines/llama-70b \
    --gemm_plugin bf16 \
    --kv_cache_type paged \
    --max_input_len 32768 \
    --max_output_len 4096 \
    --max_batch_size 32 \
    --use_paged_context_fmha enable \
    --tokens_per_block 64 \
    --use_fp8_context_fmha enable

use_fp8_context_fmha enables FP8 for attention KV operations on H100. See TensorRT-LLM Setup.


llama.cpp KV Cache {#llamacpp}

llama.cpp has simpler KV management — no PagedAttention, no continuous batching (single-user focused). Useful flags:

./llama-cli \
    -m model.gguf \
    -ngl 999 \
    -c 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -fa

--cache-type-k/v q8_0 quantizes KV to INT8 (50% memory savings). -fa enables FlashAttention. For long context, combine with --n-cpu-moe to offload MoE experts to CPU.


Tuning by Workload {#tuning}

| Workload | Priority | Recommended Config |
|----------|----------|--------------------|
| Chat (short prompt, short output) | Low TTFT | Default vLLM, prefix caching, no offload |
| Code completion | Throughput | n-gram speculation + FP8 KV |
| Long-context RAG | Long context | FP8 KV, chunked prefill, disaggregated if scale |
| Batch summarization | Throughput | Max batch size, aggressive KV quant, optional CPU offload |
| Agentic system (shared system prompt) | Prefix reuse | RadixAttention (SGLang), prefix caching enabled |
| Multi-tenant API | Mixed | Continuous batching, FP8 KV, generous max-num-seqs |

Real Benchmarks {#benchmarks}

Llama 3.1 70B on 4x H100 80GB, vLLM 0.7:

| Config | Max Batch | TTFT (1K prompt) | Throughput |
|--------|-----------|------------------|------------|
| BF16 KV, no prefix cache | 16 | 280 ms | 800 tok/s agg |
| BF16 KV, prefix cache enabled | 16 | 80 ms (cached) | 1200 tok/s agg |
| FP8 KV, prefix cache | 32 | 80 ms (cached) | 1900 tok/s agg |
| FP8 KV, prefix cache, chunked prefill | 32 | 80 ms (cached) | 2100 tok/s agg |

DeepSeek V3 671B on 8x H100 80GB, SGLang:

| Config | Max Batch | TTFT (32K prompt) | Throughput |
|--------|-----------|-------------------|------------|
| FP8 KV, RadixAttention | 16 | 1.8 s | 1500 tok/s agg |
| FP8 KV, disaggregated prefill | 32 | 1.2 s | 2400 tok/s agg |

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---------|-------|-----|
| OOM at startup | gpu_memory_utilization too high | Lower to 0.85 |
| OOM during peak load | KV pressure from concurrent long contexts | Enable FP8 KV; lower max-num-seqs |
| TTFT degrades over time | KV cache fragmentation | Restart vLLM (rare bug, fixed in 0.7+) |
| Prefix cache hit rate low | Variable system prompts | Use RadixAttention (SGLang) |
| CPU offload very slow | PCIe bottleneck | Disable; use FP8 KV instead |
| Disaggregated prefill misconfig | Network too slow | Need NVLink or RDMA between prefill/decode |
| Wrong output with FP8 KV | Quantization bug | Try fp8_e4m3 instead of fp8_e5m2 |
| Long context degrades quality | KV quant + RoPE scaling | Use BF16 KV at very long context, or MLA model |

FAQ {#faq}

See answers to common KV cache questions below.


Sources: Kwon et al. (2023) PagedAttention / vLLM paper | SGLang paper (RadixAttention) | DeepSeek V2 MLA paper (arXiv 2405.04434) | Yu et al. (2022) Orca continuous batching | vLLM docs | TensorRT-LLM KV cache docs | Internal benchmarks 4x H100 + 8x H100.




Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
