Performance

KV Cache & PagedAttention Complete Guide (2026): Memory, Quantization, Prefix Caching

May 2, 2026
24 min read
LocalAimaster Research Team


The KV cache is the single biggest memory cost in long-context LLM serving. For Llama 3.1 70B at 131K context, the KV cache alone is 43 GB per request, more than half the size of the FP8 model weights. Every modern serving engine (vLLM, SGLang, TensorRT-LLM) revolves around managing it efficiently. PagedAttention, prefix caching, FP8 KV, GQA/MLA/CLA, CPU offload, and disaggregated prefill are all answers to the same question: how do we serve more requests with less KV cache pain?

This guide covers the full stack: how the KV cache works mechanically, why PagedAttention beats contiguous allocation, when to enable FP8 / INT8 KV quantization, configuring prefix caching for shared system prompts, when CPU offload helps vs hurts, the disaggregated-prefill architecture for long-prompt workloads, and per-engine configuration for vLLM, SGLang, TensorRT-LLM, and llama.cpp.

Table of Contents

  1. Why the KV Cache Matters
  2. KV Cache Memory Math
  3. GQA / MQA / MLA / CLA: KV Reduction by Architecture
  4. PagedAttention Explained
  5. Prefix Caching
  6. RadixAttention (SGLang)
  7. KV Quantization: FP8, INT8, INT4
  8. CPU Offload Trade-offs
  9. Disaggregated Prefill
  10. Continuous Batching
  11. vLLM Configuration
  12. SGLang Configuration
  13. TensorRT-LLM Configuration
  14. llama.cpp KV Cache
  15. Tuning by Workload
  16. Real Benchmarks
  17. Troubleshooting
  18. FAQ


Why the KV Cache Matters {#why}

Autoregressive generation: at step t, the query for token t attends to the keys and values of all earlier positions 0..t-1 (plus its own freshly computed K and V). Recomputing K and V for those old positions at every step would be wasted work, so they are computed once and cached.

Memory growth is linear in sequence length and dominates as context grows:

| Context | Llama 3.1 8B (BF16) | Llama 3.1 70B (BF16) | Llama 3.1 405B (BF16) |
|---------|---------------------|----------------------|-----------------------|
| 4K      | 0.5 GB              | 1.3 GB               | 3.0 GB                |
| 32K     | 4 GB                | 10.7 GB              | 24 GB                 |
| 128K    | 16 GB               | 43 GB                | 96 GB                 |
| 200K    | 25 GB               | 67 GB                | 150 GB                |

For batch=8 at 32K context on Llama 3.1 70B: 86 GB just for KV cache. Add 140 GB for weights = 226 GB total — needs 4x H100 80GB.


KV Cache Memory Math {#math}

Per-token KV size:

kv_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element

For Llama 3.1 70B with GQA (8 KV heads / 64 query heads):

2 × 80 layers × 8 KV heads × 128 head_dim × 2 (BF16) = 327,680 bytes ≈ 320 KB per token

Per-request at 32K context: 320 KB × 32768 = 10.7 GB.

For full MHA (no GQA), this would be 8x larger: 86 GB at 32K. GQA is mandatory for serving these models.

For DeepSeek V3 with MLA: KV is compressed to ~50 KB per token = 1.6 GB at 32K. The MLA dimensionality reduction is what makes 671B at 128K context viable.
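
The per-token and per-request numbers above are easy to script. A minimal Python sketch of the same arithmetic (the layer/head counts are the public Llama 3.1 70B values; swap in your own model config):

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_element=2):
    # 2 accounts for K and V; bytes_per_element is 2 for BF16, 1 for FP8/INT8
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size

# Llama 3.1 70B with GQA: 80 layers, 8 KV heads, head_dim 128, BF16
print(kv_cache_bytes(80, 8, 128, 32_768) / 1e9)                 # ~10.7 GB per 32K request
print(kv_cache_bytes(80, 8, 128, 32_768, batch_size=8) / 1e9)   # ~86 GB for batch=8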


GQA / MQA / MLA / CLA: KV Reduction by Architecture {#architecture}

| Technique | Reduction Factor | Quality Loss | Examples |
|-----------|------------------|--------------|----------|
| MHA (baseline) | 1x | 0% | GPT-2, original Llama |
| MQA (Multi-Query) | num_heads (e.g., 32x) | 1-2% | PaLM, Falcon |
| GQA (Grouped-Query) | num_heads / num_kv_heads (e.g., 8x) | <0.5% | Llama 3, Qwen 2.5, Mistral |
| MLA (Multi-Head Latent) | 5-10x via low-rank K/V projection | <0.5% | DeepSeek V2, V3 |
| CLA (Cross-Layer) | 2-3x via shared K/V across layers | <0.5% | Hunyuan-Large |
| GQA + MLA | 30-80x combined | <1% | DeepSeek V3 (effective) |
| GQA + CLA | 16-24x combined | <1% | Hunyuan-Large (effective) |

For new model designs in 2026, GQA is the baseline; MLA or CLA layered on top is the frontier.
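
To see the mechanism rather than just the ratio, here is a minimal PyTorch sketch of a GQA decode step: only 8 KV heads are cached, and each is broadcast across its group of 8 query heads at attention time. Shapes follow a Llama-3.1-70B-style layer; this is an illustration, not a serving kernel.

import torch

num_q_heads, num_kv_heads = 64, 8
group_size = num_q_heads // num_kv_heads          # 8 query heads share each cached KV head
batch, seq, head_dim = 1, 16, 128

q = torch.randn(batch, num_q_heads, 1, head_dim)          # query for the single new token
k = torch.randn(batch, num_kv_heads, seq, head_dim)       # cached keys: per KV head only
v = torch.randn(batch, num_kv_heads, seq, head_dim)       # cached values: per KV head only

# Expand each cached KV head across its query-head group, then do ordinary attention
k_exp = k.repeat_interleave(group_size, dim=1)
v_exp = v.repeat_interleave(group_size, dim=1)
attn = torch.softmax(q @ k_exp.transpose(-1, -2) / head_dim ** 0.5, dim=-1) @ v_exp
print(attn.shape)   # torch.Size([1, 64, 1, 128]): full 64-head output from 8 cached KV heads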



PagedAttention Explained {#pagedattention}

Naive allocation reserves max_seq_len of contiguous KV per request. Problem: for short sequences 90%+ of that reservation is wasted, and the large contiguous chunks fragment the KV pool so free memory can't be reused.

PagedAttention (Kwon et al., 2023):

  • Allocate KV in fixed pages (default 16 tokens)
  • Each request holds a list of page IDs (mapping to non-contiguous physical pages)
  • Pages allocated on-demand as sequence grows
  • Pages freed when request ends
  • Pages can be shared across requests via reference counting (enables prefix caching and copy-on-write for beam search)

Result vs naive:

  • 60-90% less wasted memory
  • 2-4x batch size on same VRAM
  • Native support for variable-length, prefix sharing, and dynamic eviction

PagedAttention requires custom CUDA kernels (vLLM, SGLang, TRT-LLM all have them).
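
The custom kernels are the hard part; the bookkeeping itself is simple. A toy Python sketch of the page-table side (illustrative only, not vLLM's actual data structures):

class PagedKVPool:
    # Fixed-size pages, allocated on demand, shared via reference counts
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.ref_count = [0] * num_pages   # how many sequences use each physical page
        self.block_tables = {}             # request id -> list of physical page ids
        self.seq_lens = {}                 # request id -> tokens written so far

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        length = self.seq_lens.get(req_id, 0)
        if length % self.page_size == 0:   # last page is full, grab a new one on demand
            page = self.free_pages.pop()
            self.ref_count[page] += 1
            table.append(page)
        self.seq_lens[req_id] = length + 1

    def free_request(self, req_id):
        for page in self.block_tables.pop(req_id, []):
            self.ref_count[page] -= 1
            if self.ref_count[page] == 0:  # a shared page is only reclaimed when unreferenced
                self.free_pages.append(page)
        self.seq_lens.pop(req_id, None)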


Prefix Caching {#prefix-caching}

If many requests share a long system prompt + few-shot examples, prefix caching reuses the KV for that shared prefix.

Without prefix caching:

Request 1: [prefix 5K tokens] [user A 200 tokens] → 5200 prefill tokens
Request 2: [prefix 5K tokens] [user B 200 tokens] → 5200 prefill tokens
Request 3: [prefix 5K tokens] [user C 200 tokens] → 5200 prefill tokens
Total: 15600 prefill tokens

With prefix caching:

Request 1: [prefix 5K] [user A] → 5200 prefill (caches prefix)
Request 2: [user B] → 200 prefill (reuses prefix KV)
Request 3: [user C] → 200 prefill (reuses prefix KV)
Total: 5600 prefill tokens

TTFT speedup on prefix-sharing workloads: 3-10x. Memory: shared pages reference-counted, freed when no requests reference them. Enable in vLLM (default in 0.7+) with --enable-prefix-caching.
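
The same switch exists on vLLM's offline Python API. A minimal sketch of the shared-prefix pattern (argument names track the CLI flags above and current vLLM releases, but verify against your version's docs; the prompt text is a hypothetical placeholder):

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    enable_prefix_caching=True,   # reuse KV pages for shared prompt prefixes
)

system_prefix = "You are a support agent for ExampleCo. Policies: ..."  # shared ~5K-token prefix
params = SamplingParams(max_tokens=256)

# The first call pays full prefill and populates the prefix cache; later calls with the same
# leading tokens only prefill the user-specific suffix.
for user_msg in ["Reset my password", "Where is my invoice?", "Cancel my plan"]:
    out = llm.generate(system_prefix + "\n" + user_msg, params)
    print(out[0].outputs[0].text[:80])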


RadixAttention (SGLang) {#radixattention}

RadixAttention is SGLang's evolution of prefix caching: all cached KV is indexed in a radix tree keyed by token sequences. Matching is hierarchical, so multiple requests share partial prefixes naturally.

[system A][user 1] cached
[system A][user 2] cached  ← shares [system A]
[system B][user 1] cached  ← stores new prefix

For multi-tenant agentic systems with diverse system prompts, RadixAttention captures more reuse than flat prefix caching. SGLang's typical advantage over vLLM on prefix-heavy workloads: 1.3-2x throughput.
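
A toy sketch of the idea, using a plain token-level trie rather than SGLang's compressed radix tree (purely illustrative): longest_match tells the scheduler how many leading tokens already have cached KV, so only the remaining suffix needs prefill.

class TrieNode:
    def __init__(self):
        self.children = {}    # next token id -> TrieNode
        self.kv_ref = None    # stand-in for a reference to the cached KV pages for this path

class PrefixIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.kv_ref = object()      # placeholder for real KV page handles

    def longest_match(self, tokens):
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

index = PrefixIndex()
index.insert([1, 2, 3, 4, 5])              # [system A][user 1]
print(index.longest_match([1, 2, 3, 9]))   # 3: only the new suffix needs prefill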


KV Quantization: FP8, INT8, INT4 {#kv-quant}

Quantize KV cache values to fewer bits, halving or quartering memory.

| Quant | Bytes/element | Memory vs BF16 | Quality Loss | Best HW |
|-------|---------------|----------------|--------------|---------|
| BF16 | 2 | 100% | 0% | Any |
| FP8 E5M2 | 1 | 50% | <0.1 ppl | H100, H200, MI300X |
| FP8 E4M3 | 1 | 50% | <0.1 ppl | H100, H200 |
| INT8 | 1 | 50% | <0.5 ppl | Most GPUs |
| INT4 | 0.5 | 25% | 0.5-1.5 ppl | Experimental |

For H100/H200 production: FP8 E5M2 KV cache is essentially free (no measurable quality loss, 50% memory savings). For older GPUs (A100): INT8 KV is the equivalent.

Enable in vLLM:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --kv-cache-dtype fp8_e5m2 \
    --max-model-len 65536
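
The practical payoff is concurrency: halving bytes per element roughly doubles how many long-context requests fit in the same KV budget. A quick sketch (the 80 GB pool size is an assumption for a 4x H100 tensor-parallel group after weights; the model dimensions are the Llama 3.1 70B values from the memory-math section):

kv_budget_gb = 80                          # assumed KV pool left after weights
layers, kv_heads, head_dim = 80, 8, 128    # Llama 3.1 70B (GQA)
seq_len = 32_768

for name, bytes_per_el in [("BF16 KV", 2), ("FP8 KV", 1)]:
    per_req_gb = 2 * layers * kv_heads * head_dim * bytes_per_el * seq_len / 1e9
    print(f"{name}: {per_req_gb:.1f} GB per 32K request, "
          f"max concurrent ~{int(kv_budget_gb // per_req_gb)}")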

CPU Offload Trade-offs {#cpu-offload}

When VRAM cannot hold the needed KV (e.g., 200K context on single GPU), offload to CPU RAM:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --cpu-offload-gb 64 \
    --max-model-len 200000

Cost: PCIe 4.0 x16 = 32 GB/s; transferring 1 GB of KV during decode = ~31ms. For latency-sensitive workloads, this kills TTFT and per-token latency.

When offload helps:

  • Batch summarization where latency is irrelevant
  • Solo developer with single GPU + lots of system RAM
  • Burst workloads that occasionally exceed VRAM

When offload hurts:

  • Chat / agents (latency-sensitive)
  • Batched serving (PCIe becomes bottleneck)

Better alternatives: FP8 KV, MLA models, more GPUs.


Disaggregated Prefill {#disaggregated}

Standard serving: prefill (compute KV for prompt) and decode (generate tokens) co-located. They have opposite bottlenecks:

  • Prefill: compute-bound (full attention over prompt)
  • Decode: bandwidth-bound (load weights for one token at a time)

When co-located, prefill bursts stall decode steps and spike inter-token latency. Disaggregated prefill (vLLM 0.6+, SGLang) splits the cluster:

  • N "prefill" GPUs do prompt processing
  • M "decode" GPUs do token generation
  • KV cache shipped over NVLink / RDMA

Result for long-prompt workloads (avg prompt >5K): 2-3x throughput, 2x lower decode latency. Cost: cluster complexity, higher minimum hardware.

Worth it for: B2B SaaS with long prompts, RAG, agentic systems. Not for: short-prompt chat, single-GPU deployments.
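
A rough feel for the hand-off cost, reusing the per-token math from earlier: shipping the KV for one 32K-token Llama 3.1 70B prompt takes tens of milliseconds over NVLink but hundreds over PCIe, which is why the prefill-to-decode link matters. Bandwidth figures below are ballpark assumptions, not measurements.

kv_bytes = 2 * 80 * 8 * 128 * 2 * 32_768   # ~10.7 GB: 32K-token KV for Llama 3.1 70B, BF16

# Assumed per-direction bandwidths; check your actual hardware
links_gb_s = {"PCIe 4.0 x16": 32, "400G RDMA NIC": 50, "NVLink (H100)": 450}

for name, gb_s in links_gb_s.items():
    ms = kv_bytes / (gb_s * 1e9) * 1e3
    print(f"{name}: ~{ms:.0f} ms to ship one 32K-prompt KV cache")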


Continuous Batching {#continuous-batching}

Static batching waits for the slowest request in a batch to finish before starting new ones — wastes GPU cycles. Continuous batching (introduced in Orca, popularized by vLLM) dynamically swaps in new requests as old ones complete, on a per-iteration basis.

Result: 2-5x throughput improvement on real-world traffic with mixed request lengths. Standard in vLLM, SGLang, TensorRT-LLM. Often combined with PagedAttention because dynamic batching needs flexible KV allocation.
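
A toy iteration-level scheduler makes the contrast with static batching concrete (purely illustrative; real engines do this inside the model's forward loop and combine it with paged KV allocation):

from collections import deque

def continuous_batching(requests, max_batch=4):
    # requests: list of (req_id, tokens_to_generate); returns the batch contents per iteration
    waiting, running, trace = deque(requests), {}, []
    while waiting or running:
        # Iteration-level scheduling: refill free batch slots from the queue every step
        while waiting and len(running) < max_batch:
            req_id, n = waiting.popleft()
            running[req_id] = n
        trace.append(sorted(running))
        # One decode step for every running request; finished requests free their slot immediately
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]
    return trace

# Short requests finish and hand their slot to queued work instead of waiting for the longest one
print(continuous_batching([("a", 2), ("b", 8), ("c", 3), ("d", 2), ("e", 5)]))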


vLLM Configuration {#vllm}

Maximum-batch production config on H100 80GB:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --kv-cache-dtype fp8_e5m2 \
    --enable-prefix-caching \
    --block-size 16 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 16384

Key knobs:

  • --gpu-memory-utilization: 0.85-0.95 depending on co-tenants
  • --max-num-seqs: max concurrent requests
  • --max-num-batched-tokens: prefill chunking budget per iteration
  • --enable-chunked-prefill: split long prompts across iterations to reduce decode latency
  • --swap-space: CPU swap GB for evicted KV (default 4)

See vLLM Complete Setup Guide.
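
If you embed vLLM in Python instead of running the server, the same knobs are exposed on the offline LLM constructor. A hedged sketch (argument names mirror the CLI flags but can shift between releases, so check your version's EngineArgs):

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    max_model_len=32_768,
    gpu_memory_utilization=0.92,
    kv_cache_dtype="fp8_e5m2",
    enable_prefix_caching=True,
    block_size=16,
    max_num_seqs=256,
    max_num_batched_tokens=16_384,
)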


SGLang Configuration {#sglang}

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --tp 4 \
    --kv-cache-dtype fp8_e5m2 \
    --enable-cache-report \
    --schedule-policy fcfs \
    --max-prefill-tokens 16384 \
    --context-length 32768

SGLang's RadixAttention is on by default; --enable-cache-report logs hit rates.


TensorRT-LLM Configuration {#tensorrt}

KV cache config goes in the engine build:

trtllm-build \
    --checkpoint_dir /trt_ckpt/llama-70b \
    --output_dir /trt_engines/llama-70b \
    --gemm_plugin bf16 \
    --kv_cache_type paged \
    --max_input_len 32768 \
    --max_output_len 4096 \
    --max_batch_size 32 \
    --use_paged_context_fmha enable \
    --tokens_per_block 64 \
    --use_fp8_context_fmha enable

use_fp8_context_fmha enables FP8 for attention KV operations on H100. See TensorRT-LLM Setup.


llama.cpp KV Cache {#llamacpp}

llama.cpp has simpler KV management — no PagedAttention, no continuous batching (single-user focused). Useful flags:

./llama-cli \
    -m model.gguf \
    -ngl 999 \
    -c 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -fa

--cache-type-k/v q8_0 quantizes KV to INT8 (50% memory savings). -fa enables FlashAttention. For long context, combine with --n-cpu-moe to offload MoE experts to CPU.


Tuning by Workload {#tuning}

| Workload | Priority | Recommended Config |
|----------|----------|--------------------|
| Chat (short prompt, short output) | Low TTFT | Default vLLM, prefix caching, no offload |
| Code completion | Throughput | n-gram speculation + FP8 KV |
| Long-context RAG | Long context | FP8 KV, chunked prefill, disaggregated if scale |
| Batch summarization | Throughput | Max batch size, aggressive KV quant, optional CPU offload |
| Agentic system (shared system prompt) | Prefix reuse | RadixAttention (SGLang), prefix caching enabled |
| Multi-tenant API | Mixed | Continuous batching, FP8 KV, generous max-num-seqs |

Real Benchmarks {#benchmarks}

Llama 3.1 70B on 4x H100 80GB, vLLM 0.7:

| Config | Max Batch | TTFT (1K prompt) | Throughput |
|--------|-----------|------------------|------------|
| BF16 KV, no prefix cache | 16 | 280 ms | 800 tok/s agg |
| BF16 KV, prefix cache enabled | 16 | 80 ms (cached) | 1200 tok/s agg |
| FP8 KV, prefix cache | 32 | 80 ms (cached) | 1900 tok/s agg |
| FP8 KV, prefix cache, chunked prefill | 32 | 80 ms (cached) | 2100 tok/s agg |

DeepSeek V3 671B on 8x H100 80GB, SGLang:

| Config | Max Batch | TTFT (32K prompt) | Throughput |
|--------|-----------|-------------------|------------|
| FP8 KV, RadixAttention | 16 | 1.8 s | 1500 tok/s agg |
| FP8 KV, disaggregated prefill | 32 | 1.2 s | 2400 tok/s agg |

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---------|-------|-----|
| OOM at startup | gpu_memory_utilization too high | Lower to 0.85 |
| OOM during peak load | KV pressure from concurrent long contexts | Enable FP8 KV; lower max-num-seqs |
| TTFT degrades over time | KV cache fragmentation | Restart vLLM (rare bug, fixed in 0.7+) |
| Prefix cache hit rate low | Variable system prompts | Use RadixAttention (SGLang) |
| CPU offload very slow | PCIe bottleneck | Disable; use FP8 KV instead |
| Disaggregated prefill misconfig | Network too slow | Need NVLink or RDMA between prefill/decode |
| Wrong output with FP8 KV | Quantization bug | Try fp8_e4m3 instead of fp8_e5m2 |
| Long context degrades quality | KV quant + RoPE scaling | Use BF16 KV at very long context, or MLA model |

FAQ {#faq}

See answers to common KV cache questions below.


Sources: Kwon et al. (2023) PagedAttention / vLLM paper | SGLang paper (RadixAttention) | DeepSeek V2 MLA paper (arXiv 2405.04434) | Yu et al. (2022) Orca continuous batching | vLLM docs | TensorRT-LLM KV cache docs | Internal benchmarks 4x H100 + 8x H100.




Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
