KV Cache & PagedAttention Complete Guide (2026): Memory, Quantization, Prefix Caching
The KV cache is the single biggest memory cost in long-context LLM serving. For Llama 3.1 70B at 131K context, the KV cache alone is about 43 GB per request — on the same order as the FP8 model weights. Every modern serving engine (vLLM, SGLang, TensorRT-LLM) revolves around managing it efficiently. PagedAttention, prefix caching, FP8 KV, GQA/MLA/CLA, CPU offload, and disaggregated prefill are all answers to "how do we serve more requests with less KV cache pain?"
This guide covers the full stack: how the KV cache works mechanically, why PagedAttention beats contiguous allocation, when to enable FP8 / INT8 KV quantization, configuring prefix caching for shared system prompts, when CPU offload helps vs hurts, the disaggregated-prefill architecture for long-prompt workloads, and per-engine configuration for vLLM, SGLang, TensorRT-LLM, and llama.cpp.
Table of Contents
- Why the KV Cache Matters
- KV Cache Memory Math
- GQA / MQA / MLA / CLA: KV Reduction by Architecture
- PagedAttention Explained
- Prefix Caching
- RadixAttention (SGLang)
- KV Quantization: FP8, INT8, INT4
- CPU Offload Trade-offs
- Disaggregated Prefill
- Continuous Batching
- vLLM Configuration
- SGLang Configuration
- TensorRT-LLM Configuration
- llama.cpp KV Cache
- Tuning by Workload
- Real Benchmarks
- Troubleshooting
Why the KV Cache Matters {#why}
Autoregressive generation: at step t, the model attends from query token t back to all keys and values at positions 0..t-1. To avoid recomputing K and V for old positions every step, they're cached.
Memory growth is linear in sequence length and dominates as context grows:
| Context | Llama 3.1 8B (BF16) | Llama 3.1 70B (BF16) | Llama 3.1 405B (BF16) |
|---|---|---|---|
| 4K | 0.5 GB | 1.3 GB | 3.0 GB |
| 32K | 4 GB | 10.7 GB | 24 GB |
| 128K | 16 GB | 43 GB | 96 GB |
| 200K | 25 GB | 67 GB | 150 GB |
For batch=8 at 32K context on Llama 3.1 70B: 86 GB just for KV cache. Add 140 GB for weights = 226 GB total — needs 4x H100 80GB.
KV Cache Memory Math {#math}
Per-token KV size:
kv_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
For Llama 3.1 70B with GQA (8 KV heads / 64 query heads):
2 × 80 layers × 8 KV heads × 128 head_dim × 2 (BF16) = 327,680 bytes ≈ 320 KB per token
Per-request at 32K context: 320 KB × 32768 = 10.7 GB.
For full MHA (no GQA), this would be 8x larger: 86 GB at 32K. GQA is mandatory for serving these models.
For DeepSeek V3 with MLA: KV is compressed to ~50 KB per token = 1.6 GB at 32K. The MLA dimensionality reduction is what makes 671B at 128K context viable.
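The per-token formula and the per-request figures above can be sketched as a small helper. This is pure Python for illustration; the config dict encodes the Llama 3.1 70B GQA numbers quoted in the text.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    # K and V are both cached, hence the leading factor of 2
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_gb_per_request(context_len, **cfg):
    # KV grows linearly with context length; GB here uses 10^9 bytes
    return context_len * kv_bytes_per_token(**cfg) / 1e9

# Llama 3.1 70B with GQA: 80 layers, 8 KV heads, head_dim 128, BF16 (2 bytes)
llama_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_element=2)

print(kv_bytes_per_token(**llama_70b))                   # 327680 bytes = 320 KB/token
print(round(kv_gb_per_request(32768, **llama_70b), 1))   # 10.7 GB at 32K context
```

Swapping `num_kv_heads=8` for `num_kv_heads=64` reproduces the 8x MHA penalty described above.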
GQA / MQA / MLA / CLA: KV Reduction by Architecture {#architecture}
| Technique | Reduction Factor | Quality Loss | Examples |
|---|---|---|---|
| MHA (baseline) | 1x | 0% | GPT-2, original Llama |
| MQA (Multi-Query) | num_heads (e.g., 32x) | 1-2% | PaLM, Falcon |
| GQA (Grouped-Query) | num_heads / num_kv_heads (e.g., 8x) | <0.5% | Llama 3, Qwen 2.5, Mistral |
| MLA (Multi-Head Latent) | 5-10x via low-rank K/V projection | <0.5% | DeepSeek V2, V3 |
| CLA (Cross-Layer) | 2-3x via shared K/V across layers | <0.5% | Hunyuan-Large |
| GQA + MLA | 30-80x combined | <1% | DeepSeek V3 (effective) |
| GQA + CLA | 16-24x combined | <1% | Hunyuan-Large (effective) |
For new model designs in 2026, GQA is the baseline; MLA or CLA layered on top is the frontier.
PagedAttention Explained {#pagedattention}
Naive allocation: reserve a contiguous max_seq_len slab of KV per request. Problems: 90%+ of the slab is wasted on short sequences, and the fixed contiguous slabs fragment VRAM so free memory can't be reused by other requests.
PagedAttention (Kwon et al., 2023):
- Allocate KV in fixed pages (default 16 tokens)
- Each request holds a list of page IDs (mapping to non-contiguous physical pages)
- Pages allocated on-demand as sequence grows
- Pages freed when request ends
- Pages can be shared across requests via reference counting (enables prefix caching and copy-on-write for beam search)
Result vs naive:
- 60-90% less wasted memory
- 2-4x batch size on same VRAM
- Native support for variable-length, prefix sharing, and dynamic eviction
PagedAttention requires custom CUDA kernels (vLLM, SGLang, TRT-LLM all have them).
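The bookkeeping side of the scheme above can be modeled in a few lines. This is a toy sketch — a free list of fixed-size pages, a per-request page table, and reference counts for shared pages — not the GPU kernels that make it fast; all names here are illustrative.

```python
PAGE_SIZE = 16  # tokens per page (vLLM's default block size)

class PagePool:
    """Free list of physical pages plus reference counts for sharing."""
    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.refcount = {}

    def alloc(self):
        page = self.free.pop()
        self.refcount[page] = 1
        return page

    def share(self, page):
        self.refcount[page] += 1      # e.g. a second request reuses a prefix page

    def release(self, page):
        self.refcount[page] -= 1
        if self.refcount[page] == 0:  # last reference gone: page returns to pool
            del self.refcount[page]
            self.free.append(page)

class Request:
    """Logical page i of this sequence maps to page_table[i] in the pool."""
    def __init__(self, pool):
        self.pool = pool
        self.page_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % PAGE_SIZE == 0:   # current page full: allocate on demand
            self.page_table.append(self.pool.alloc())
        self.num_tokens += 1

pool = PagePool(num_pages=1024)
req = Request(pool)
for _ in range(40):                  # 40 tokens -> ceil(40/16) = 3 pages
    req.append_token()
print(len(req.page_table))           # 3
print(len(pool.free))                # 1021 pages still free for other requests
```

Note how memory is claimed one page at a time as the sequence grows, instead of a max_seq_len slab up front — that is the entire source of the 60-90% savings.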
Prefix Caching {#prefix-caching}
If many requests share a long system prompt + few-shot examples, prefix caching reuses the KV for that shared prefix.
Without prefix caching:
Request 1: [prefix 5K tokens] [user A 200 tokens] → 5200 prefill tokens
Request 2: [prefix 5K tokens] [user B 200 tokens] → 5200 prefill tokens
Request 3: [prefix 5K tokens] [user C 200 tokens] → 5200 prefill tokens
Total: 15600 prefill tokens
With prefix caching:
Request 1: [prefix 5K] [user A] → 5200 prefill (caches prefix)
Request 2: [user B] → 200 prefill (reuses prefix KV)
Request 3: [user C] → 200 prefill (reuses prefix KV)
Total: 5600 prefill tokens
TTFT speedup on prefix-sharing workloads: 3-10x. Memory: shared pages reference-counted, freed when no requests reference them. Enable in vLLM (default in 0.7+) with --enable-prefix-caching.
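The prefill arithmetic above generalizes to any request count. A back-of-envelope helper (illustrative, matching the 5K-prefix / 200-token example):

```python
def prefill_tokens(n_requests, prefix_len, user_len, prefix_cached):
    """Total prefill tokens across n_requests sharing one prefix."""
    if not prefix_cached:
        return n_requests * (prefix_len + user_len)
    # First request pays for the prefix; later requests reuse its KV pages
    return (prefix_len + user_len) + (n_requests - 1) * user_len

print(prefill_tokens(3, 5000, 200, prefix_cached=False))  # 15600
print(prefill_tokens(3, 5000, 200, prefix_cached=True))   # 5600
```

At higher request counts the ratio approaches (prefix_len + user_len) / user_len — here 26x less prefill compute in the limit, which is where the 3-10x TTFT gains come from.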
RadixAttention (SGLang) {#radixattention}
SGLang's evolution of prefix caching: stores all KV in a radix tree keyed by token sequences. Hierarchical match — multiple requests share partial prefixes naturally.
[system A][user 1] cached
[system A][user 2] cached ← shares [system A]
[system B][user 1] cached ← stores new prefix
For multi-tenant agentic systems with diverse system prompts, RadixAttention captures more reuse than flat prefix caching. SGLang's typical advantage over vLLM on prefix-heavy workloads: 1.3-2x throughput.
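The lookup structure can be sketched as a trie over token IDs, where matching a new request against the tree tells you how many leading tokens already have cached KV. This is a simplified model — real RadixAttention compresses edges (a radix tree rather than a per-token trie) and attaches KV pages and eviction metadata to nodes; all names below are illustrative.

```python
class RadixNode:
    def __init__(self):
        self.children = {}  # token id -> RadixNode

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record a served sequence so its KV can be reused later."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens):
        """Return how many leading tokens of a new request hit cached KV."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = RadixCache()
cache.insert([1, 2, 3, 10])               # [system A][user 1]
cache.insert([1, 2, 3, 11])               # [system A][user 2] shares [system A]
print(cache.match_prefix([1, 2, 3, 12]))  # 3: the shared [system A] prefix hits
print(cache.match_prefix([9, 9]))         # 0: new system prompt, cold prefill
```

The key difference from flat prefix caching: partial overlaps ([system A] alone) are found automatically, without declaring a fixed prefix boundary.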
KV Quantization: FP8, INT8, INT4 {#kv-quant}
Quantize KV cache values to fewer bits, halving or quartering memory.
| Quant | Bytes/element | Memory vs BF16 | Quality Loss (Δ perplexity) | Best HW |
|---|---|---|---|---|
| BF16 | 2 | 100% | baseline | Any |
| FP8 E5M2 | 1 | 50% | <0.1 ppl | H100, H200, MI300X |
| FP8 E4M3 | 1 | 50% | <0.1 ppl | H100, H200 |
| INT8 W8A8 | 1 | 50% | <0.5 ppl | Most GPUs |
| INT4 | 0.5 | 25% | 0.5-1.5 ppl | Experimental |
For H100/H200 production: FP8 E5M2 KV cache is essentially free (no measurable quality loss, 50% memory savings). For older GPUs (A100): INT8 KV is the equivalent.
Enable in vLLM:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--kv-cache-dtype fp8_e5m2 \
--max-model-len 65536
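To build intuition for what KV quantization does, here is a toy round trip of symmetric INT8 quantization in pure Python. This is a sketch of the idea, not vLLM's actual kernel (which quantizes per head/channel with calibrated scales); FP8 follows the same pattern with a hardware float grid instead of an integer one.

```python
def quantize_int8(values):
    """Map floats onto a symmetric integer grid [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

# A made-up slice of cached K values
kv_slice = [0.12, -0.97, 0.45, 0.003, -0.31]
q, scale = quantize_int8(kv_slice)        # q stored in 1 byte per value
restored = dequantize_int8(q, scale)

max_err = max(abs(a - b) for a, b in zip(kv_slice, restored))
print(max_err < scale)                    # True: error bounded by one grid step
```

Half the bytes per element, with rounding error no larger than one quantization step — which is why the perplexity deltas in the table above are so small.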
CPU Offload Trade-offs {#cpu-offload}
When VRAM cannot hold the needed KV (e.g., 200K context on single GPU), offload to CPU RAM:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--cpu-offload-gb 64 \
--max-model-len 200000
Cost: PCIe 4.0 x16 = 32 GB/s; transferring 1 GB of KV during decode = ~31ms. For latency-sensitive workloads, this kills TTFT and per-token latency.
When offload helps:
- Batch summarization where latency is irrelevant
- Solo developer with single GPU + lots of system RAM
- Burst workloads that occasionally exceed VRAM
When offload hurts:
- Chat / agents (latency-sensitive)
- Batched serving (PCIe becomes bottleneck)
Better alternatives: FP8 KV, MLA models, more GPUs.
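The PCIe cost above is worth quantifying per context. A rough estimate following the 32 GB/s figure (illustrative; real achievable bandwidth is usually below the spec peak):

```python
def transfer_ms(kv_gb, link_gb_per_s=32.0):
    """Time to move kv_gb of KV cache over a link, in milliseconds."""
    return kv_gb / link_gb_per_s * 1000.0

print(round(transfer_ms(1.0)))    # 31 ms for 1 GB over PCIe 4.0 x16
print(round(transfer_ms(10.7)))   # 334 ms to pull one 32K Llama-70B context back
```

A third of a second per swapped-in context explains why offload is acceptable for batch jobs but ruinous for interactive decode.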
Disaggregated Prefill {#disaggregated}
Standard serving: prefill (compute KV for prompt) and decode (generate tokens) co-located. They have opposite bottlenecks:
- Prefill: compute-bound (full attention over prompt)
- Decode: bandwidth-bound (load weights for one token at a time)
When co-located, prefill bursts block decode latency. Disaggregated prefill (vLLM 0.6+, SGLang) splits the cluster:
- N "prefill" GPUs do prompt processing
- M "decode" GPUs do token generation
- KV cache shipped over NVLink / RDMA
Result for long-prompt workloads (avg prompt >5K): 2-3x throughput, 2x lower decode latency. Cost: cluster complexity, higher minimum hardware.
Worth it for: B2B SaaS with long prompts, RAG, agentic systems. Not for: short-prompt chat, single-GPU deployments.
Continuous Batching {#continuous-batching}
Static batching waits for the slowest request in a batch to finish before starting new ones — wastes GPU cycles. Continuous batching (introduced in Orca, popularized by vLLM) dynamically swaps in new requests as old ones complete, on a per-iteration basis.
Result: 2-5x throughput improvement on real-world traffic with mixed request lengths. Standard in vLLM, SGLang, TensorRT-LLM. Often combined with PagedAttention because dynamic batching needs flexible KV allocation.
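The scheduling difference can be shown with a toy iteration loop. Each request needs some number of decode steps; under continuous batching, freed slots are refilled from the queue every iteration instead of waiting for the whole batch to drain. A sketch, not any engine's scheduler:

```python
from collections import deque

def static_batch_steps(steps_needed, batch_size):
    """Static batching: each batch runs until its slowest request finishes."""
    total = 0
    for i in range(0, len(steps_needed), batch_size):
        total += max(steps_needed[i:i + batch_size])
    return total

def continuous_batch_steps(steps_needed, batch_size):
    """Continuous batching: refill free slots from the queue each iteration."""
    queue = deque(steps_needed)
    running, iterations = [], 0
    while queue or running:
        while queue and len(running) < batch_size:   # admit new work immediately
            running.append(queue.popleft())
        running = [s - 1 for s in running]           # one decode step for everyone
        running = [s for s in running if s > 0]      # finished requests exit
        iterations += 1
    return iterations

work = [100, 10, 10, 10, 10, 10, 10, 10]             # one long request, seven short
print(static_batch_steps(work, 4))                   # 110: long request stalls its batch
print(continuous_batch_steps(work, 4))               # 100: short requests slip through
```

The gap widens with more length skew, which is why real-world mixed traffic sees 2-5x rather than the modest gain in this tiny example.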
vLLM Configuration {#vllm}
Maximum-batch production config on H100 80GB:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--kv-cache-dtype fp8_e5m2 \
--enable-prefix-caching \
--block-size 16 \
--max-num-seqs 256 \
--max-num-batched-tokens 16384
Key knobs:
- --gpu-memory-utilization: 0.85-0.95 depending on co-tenants
- --max-num-seqs: max concurrent requests
- --max-num-batched-tokens: prefill chunking budget per iteration
- --enable-chunked-prefill: split long prompts across iterations to reduce decode latency
- --swap-space: CPU swap space in GB for evicted KV (default 4)
See vLLM Complete Setup Guide.
SGLang Configuration {#sglang}
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 4 \
--kv-cache-dtype fp8_e5m2 \
--enable-cache-report \
--schedule-policy fcfs \
--max-prefill-tokens 16384 \
--context-length 32768
SGLang's RadixAttention is on by default; --enable-cache-report logs hit rates.
TensorRT-LLM Configuration {#tensorrt}
KV cache config goes in the engine build:
trtllm-build \
--checkpoint_dir /trt_ckpt/llama-70b \
--output_dir /trt_engines/llama-70b \
--gemm_plugin bf16 \
--kv_cache_type paged \
--max_input_len 32768 \
--max_output_len 4096 \
--max_batch_size 32 \
--use_paged_context_fmha enable \
--tokens_per_block 64 \
--use_fp8_context_fmha enable
use_fp8_context_fmha enables FP8 for attention KV operations on H100. See TensorRT-LLM Setup.
llama.cpp KV Cache {#llamacpp}
llama.cpp has simpler KV management — no PagedAttention, no continuous batching (single-user focused). Useful flags:
./llama-cli \
-m model.gguf \
-ngl 999 \
-c 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-fa
--cache-type-k/v q8_0 quantizes KV to INT8 (50% memory savings). -fa enables FlashAttention. For long context, combine with --n-cpu-moe to offload MoE experts to CPU.
Tuning by Workload {#tuning}
| Workload | Priority | Recommended Config |
|---|---|---|
| Chat (short prompt, short output) | Low TTFT | Default vLLM, prefix caching, no offload |
| Code completion | Throughput | n-gram speculation + FP8 KV |
| Long-context RAG | Long context | FP8 KV, chunked prefill, disaggregated if scale |
| Batch summarization | Throughput | Max batch size, aggressive KV quant, optional CPU offload |
| Agentic system (shared system prompt) | Prefix reuse | RadixAttention (SGLang), prefix caching enabled |
| Multi-tenant API | Mixed | Continuous batching, FP8 KV, generous max-num-seqs |
Real Benchmarks {#benchmarks}
Llama 3.1 70B on 4x H100 80GB, vLLM 0.7:
| Config | Max Batch | TTFT (1K prompt) | Throughput |
|---|---|---|---|
| BF16 KV, no prefix cache | 16 | 280 ms | 800 tok/s agg |
| BF16 KV, prefix cache enabled | 16 | 80 ms (cached) | 1200 tok/s agg |
| FP8 KV, prefix cache | 32 | 80 ms (cached) | 1900 tok/s agg |
| FP8 KV, prefix cache, chunked prefill | 32 | 80 ms (cached) | 2100 tok/s agg |
DeepSeek V3 671B on 8x H100 80GB, SGLang:
| Config | Max Batch | TTFT (32K prompt) | Throughput |
|---|---|---|---|
| FP8 KV, RadixAttention | 16 | 1.8 s | 1500 tok/s agg |
| FP8 KV, disaggregated prefill | 32 | 1.2 s | 2400 tok/s agg |
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM at startup | gpu_memory_utilization too high | Lower to 0.85 |
| OOM during peak load | KV pressure from concurrent long contexts | Enable FP8 KV; lower max-num-seqs |
| TTFT degrades over time | KV cache fragmentation | Restart vLLM (rare bug, fixed in 0.7+) |
| Prefix cache hit rate low | Variable system prompts | Use RadixAttention (SGLang) |
| CPU offload very slow | PCIe bottleneck | Disable; use FP8 KV instead |
| Disaggregated prefill misconfig | Network too slow | Need NVLink or RDMA between prefill/decode |
| Wrong output with FP8 KV | Quantization bug | Try fp8_e4m3 instead of fp8_e5m2 |
| Long context degrades quality | KV quant + RoPE scaling | Use BF16 KV at very long context, or MLA model |
Sources: Kwon et al. (2023) PagedAttention / vLLM paper | SGLang paper (RadixAttention) | DeepSeek V2 MLA paper (arXiv 2405.04434) | Yu et al. (2022) Orca continuous batching | vLLM docs | TensorRT-LLM KV cache docs | Internal benchmarks 4x H100 + 8x H100.