Speculative Decoding Complete Guide (2026): Draft Models, EAGLE, Medusa, n-grams
Speculative decoding is the highest-ROI inference optimization for single-stream LLM serving in 2026: 2-4x speedup with output mathematically identical to greedy or sampled decoding. The trick: a small, fast model proposes K tokens, the large target model verifies all K in a single forward pass, and accepted tokens are essentially free. LLM inference is memory-bandwidth bound, so verifying K tokens costs roughly the same wall time as generating 1; when the draft is right, a single target forward pass yields up to K+1 tokens instead of 1.
This guide covers every modern speculation technique — separate draft models, EAGLE-2 / EAGLE-3, Medusa, n-gram / prompt-lookup decoding, and Multi-Token Prediction (MTP) — with real configuration for vLLM, SGLang, TensorRT-LLM, and llama.cpp. Includes acceptance-rate tuning, when speculation hurts, batched-serving caveats, and decision trees for picking the right technique per workload.
Table of Contents
- Why Speculative Decoding Works (Bandwidth Bottleneck)
- The Core Algorithm
- Draft Model Speculation
- EAGLE-2 and EAGLE-3
- Medusa Heads
- n-gram / Prompt-Lookup Decoding
- Multi-Token Prediction (MTP)
- vLLM Configuration
- SGLang Configuration
- TensorRT-LLM Configuration
- llama.cpp Speculative Mode
- Tuning Acceptance Rate
- Sampling vs Greedy: Speedup Trade-off
- Batched Serving Caveats
- When Speculation Hurts
- Decision Tree by Workload
- Real Benchmarks
- FAQ
Why Speculative Decoding Works (Bandwidth Bottleneck) {#why}
LLM inference per-token cost has two parts:
- Compute: ~2 × params × batch × tokens FLOPs
- Memory bandwidth: load all weights from HBM into compute units (~params × bytes_per_weight per forward pass)
For a single user generating one token at a time on a 70B BF16 model: 140 GB of weights must move from HBM → SMs each step. On H100 (3 TB/s HBM), that's a 47ms floor regardless of how few tokens you generate. The actual matmul takes <1ms. So generating 1 token vs 5 tokens in a single forward pass costs roughly the same wall time — the bandwidth dominates.
Speculative decoding exploits this: speculate K tokens with a cheap process, run the target over those K drafted tokens in one forward pass (yielding target logits at K+1 positions), and accept the longest prefix that matches what the target would have generated. If α is the acceptance rate (the fraction of drafted tokens accepted), each target forward pass produces on average about 1 + α·K tokens instead of 1: the accepted drafts plus one token the target itself always contributes.
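As a back-of-the-envelope check, the expected gain is easy to compute. The sketch below is a simplified cost model rather than any engine's implementation; the draft cost ratio (draft step time relative to one target forward pass) is an assumption you would measure on your own hardware, and scheduler and verification overheads are ignored, so it is optimistic.

```python
# Toy cost model for speculative decoding (ignores scheduler and
# verification overhead, so real numbers come out somewhat lower).

def expected_speedup(k: int, acceptance_rate: float, draft_cost: float) -> float:
    """acceptance_rate: fraction of drafted tokens accepted (the metric engines report).
    draft_cost: time of one draft step relative to one target forward pass."""
    # Tokens per target pass: accepted drafts plus the one token the target
    # always contributes (bonus token on full acceptance, corrective otherwise).
    tokens_per_pass = 1 + acceptance_rate * k
    # Wall time per pass: one target forward pass plus k draft steps.
    time_per_pass = 1 + k * draft_cost
    return tokens_per_pass / time_per_pass

print(expected_speedup(k=5, acceptance_rate=0.7, draft_cost=0.05))  # ~3.6
print(expected_speedup(k=5, acceptance_rate=0.4, draft_cost=0.05))  # ~2.4
```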
The Core Algorithm {#algorithm}
Given: target model T, draft generator D, current sequence x[:t]
1. Sample K draft tokens: d[1..K] = D(x[:t])
2. Run T on x[:t] + d[:K] in a single forward pass
→ get target logits at positions t, t+1, ..., t+K
3. For each position i in 1..K:
- Sample target token from T's distribution at position t+i-1
- If target token == d[i]: accept and continue
- Else: stop, replace d[i] with target's token, discard d[i+1..K]
4. Append accepted tokens + corrective token to x
5. Repeat
For greedy decoding, "sample target token" = argmax. For sampling (temperature > 0), use Leviathan et al. (2023) rejection sampling: accept d[i] with probability min(1, p_target(d[i]) / p_draft(d[i])); on rejection, sample from max(0, p_target − p_draft) normalized.
Both modes produce output mathematically identical in distribution to non-speculative decoding. No quality loss.
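To make the acceptance rule concrete, here is a minimal, framework-free sketch of the verification step with Leviathan-style rejection sampling. It illustrates the math rather than any engine's implementation; p_target and p_draft are assumed to be per-position probability distributions (rows over the vocabulary) obtained from the two models' softmaxed logits, and the bonus token sampled when all K drafts are accepted is omitted for brevity.

```python
import numpy as np

def verify_drafts(draft_tokens, p_target, p_draft, rng=None):
    """Accept a prefix of draft_tokens such that the output is distributed
    exactly as if the target model had sampled alone (Leviathan et al., 2023).

    draft_tokens: list[int] of K proposed token ids
    p_target, p_draft: arrays of shape (K, vocab_size); row i is that model's
        probability distribution at the i-th drafted position
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i, tok] / p_draft[i, tok]):
            out.append(tok)
            continue
        # On rejection, sample the corrective token from the residual
        # distribution max(0, p_target - p_draft), renormalized, and stop.
        residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
        residual /= residual.sum()
        out.append(int(rng.choice(len(residual), p=residual)))
        break
    return out
```

For greedy decoding the same loop degenerates to an exact argmax comparison at each position.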
Draft Model Speculation {#draft-model}
Use a small aligned model as the draft generator. Typical pairings:
| Target | Draft | Acceptance Rate |
|---|---|---|
| Llama 3.1 70B Instruct | Llama 3.2 1B Instruct | 60-75% |
| Llama 3.1 70B Instruct | Llama 3.2 3B Instruct | 70-80% |
| Llama 3.1 405B Instruct | Llama 3.1 8B Instruct | 65-75% |
| Qwen 2.5 72B Instruct | Qwen 2.5 1.5B Instruct | 65-75% |
| Qwen 2.5 72B Instruct | Qwen 2.5 7B Instruct | 75-82% |
| DeepSeek V3 | DeepSeek V2 Lite | 55-70% |
Requirements: same tokenizer, similar instruction-tuning style, smaller than target by 5-100x. Speedup at K=5: ~2.0-2.5x for greedy, ~1.5-2x for sampling.
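Before wiring up a draft model, it is worth sanity-checking the tokenizer requirement programmatically rather than by eyeballing model cards. A minimal sketch using Hugging Face transformers (the model names are just example pairings from the table above):

```python
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
draft_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Vocabularies must match token-for-token: draft and target token ids are
# compared directly during verification, so "mostly the same" is not enough.
assert target_tok.get_vocab() == draft_tok.get_vocab(), "tokenizer mismatch"

# Chat-template drift also lowers acceptance: the same conversation should
# tokenize to the same ids under both models' templates.
msgs = [{"role": "user", "content": "hello"}]
print(target_tok.apply_chat_template(msgs) == draft_tok.apply_chat_template(msgs))
```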
EAGLE-2 and EAGLE-3 {#eagle}
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) trains an auxiliary feature-prediction layer on top of the target. Instead of a separate draft model, EAGLE predicts the target's hidden states for the next K positions and decodes tokens from those hidden states.
| Variant | Released | Key Innovation | Speedup |
|---|---|---|---|
| EAGLE-1 | 2024 | Feature-level draft | 2.5-3x |
| EAGLE-2 | 2024 | Dynamic draft tree | 3-3.5x |
| EAGLE-3 | 2025 | Multi-layer hidden state aggregation | 3.5-4x |
EAGLE-2's key change is a dynamic draft tree: instead of a fixed linear chain of K speculative tokens, it builds a context-dependent tree of candidate continuations and verifies all branches in one forward pass using tree attention. That buys a higher effective K for the same draft-token budget.
Trained EAGLE heads are available on HuggingFace for major target models (Llama 3.1, Qwen 2.5, DeepSeek). For custom targets, training EAGLE on 4 GPUs takes 4-8 hours.
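To see what "verify all branches in one forward pass" means mechanically, the sketch below builds the attention mask a flattened draft tree implies: every candidate may attend to the shared prefix (not shown) and to its own ancestors, but never to sibling branches. This is a toy illustration of tree attention, not EAGLE's actual code.

```python
import numpy as np

def tree_attention_mask(parents):
    """Boolean mask for a draft tree flattened into one sequence.

    parents[i] is the index of node i's parent within the tree, or -1 if the
    node hangs directly off the committed prefix. mask[i, j] is True when
    candidate i may attend to candidate j (j is i itself or an ancestor of i).
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Two branches off the root candidate 0: 0 -> 1 -> 3 and 0 -> 2 -> 4.
print(tree_attention_mask([-1, 0, 0, 1, 2]).astype(int))
```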
Medusa Heads {#medusa}
Medusa (Cai et al., 2024) trains K parallel prediction heads on top of the target model. At inference, all K heads predict simultaneously — no separate draft pass needed.
Trade-offs vs EAGLE:
- Simpler architecture (K linear heads vs full draft layer)
- Lower acceptance rate (~50-65% vs EAGLE's 70-80%)
- No tree attention (simpler verification)
- Speedup: 1.8-2.5x
Medusa is being superseded by EAGLE-2/3 in production. For new deployments in 2026, prefer EAGLE-2 unless you specifically want the simpler training pipeline.
n-gram / Prompt-Lookup Decoding {#ngram}
Zero-training speculation: maintain a buffer of recent context tokens. When generating, look up the most recent N tokens in the prompt + generated text — if a match exists, copy the next K tokens from that match as the draft.
Prompt: "def fibonacci(n):\n if n <= 1:\n return n\n return fibonacci(n-1) + fibonacci(n-2)"
User: "Now write iterative version."
Generation: "def fibonacci(n):\n if n <= 1:\n return n\n [draft from prompt: a, b = 0, 1...]"
Acceptance rate: 30-60% for code, 20-40% for chat, 50-70% for structured output (JSON, XML), 80%+ for repetitive content (long-context summarization, edit operations).
Speedup: 1.3-2x for typical workloads, up to 3x for repetitive content. Cost: zero extra VRAM, zero training. For most workloads, this is the first speculation method to try.
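The mechanism fits in a few lines. The sketch below is a simplified, library-free version of prompt lookup; real implementations (e.g. vLLM's ngram method) try several n-gram sizes and handle batching, but the core idea is just substring matching over token ids.

```python
def prompt_lookup_draft(tokens, ngram_size=3, k=5):
    """Propose up to k draft tokens by finding an earlier occurrence of the
    last `ngram_size` tokens and copying what followed it.

    tokens: full context so far (prompt + generated text) as token ids.
    Returns [] when there is no match, i.e. fall back to normal decoding.
    """
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Search backwards so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            return tokens[start + ngram_size:start + ngram_size + k]
    return []

# The last trigram [5, 8, 2] also appears at the start, so the two tokens
# that followed it there ([9, 4]) are proposed as the draft.
print(prompt_lookup_draft([5, 8, 2, 9, 4, 5, 8, 2], ngram_size=3, k=2))  # [9, 4]
```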
Multi-Token Prediction (MTP) {#mtp}
Some models (DeepSeek V3, Llama-Nemotron-Ultra, Hunyuan-Large) include MTP weights — auxiliary heads trained to predict the next 1-2 tokens beyond the current. At inference, MTP heads serve as built-in speculators.
```bash
# vLLM auto-detects MTP weights
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
```
DeepSeek V3 MTP gives ~1.6x speedup with no extra training or weights to download.
vLLM Configuration {#vllm}
Draft Model
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5}'
```
EAGLE-2
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-70B", "num_speculative_tokens": 5}'
```
n-gram
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4, "prompt_lookup_min": 2}'
```
MTP (DeepSeek V3)
```bash
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
```
See vLLM Complete Setup Guide.
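Whichever method you pick, speculation is invisible to clients: requests go through vLLM's normal OpenAI-compatible endpoint and only the server-side tokens/sec changes. A minimal client call against the servers above (the base URL and port assume default vllm serve settings):

```python
from openai import OpenAI

# Speculation happens entirely server-side; the client is unchanged.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    temperature=0.3,  # lower temperature keeps acceptance (and speedup) high
)
print(resp.choices[0].message.content)
```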
SGLang Configuration {#sglang}
```bash
# EAGLE
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path yuhuili/EAGLE-LLaMA3.1-Instruct-70B \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 4
```
SGLang's EAGLE implementation includes tree-attention with configurable topk; it generally outperforms vLLM on EAGLE specifically.
TensorRT-LLM Configuration {#tensorrt}
For maximum H100/H200 throughput, build TRT-LLM engines with speculation:
```bash
trtllm-build \
    --checkpoint_dir /trt_ckpt/llama-70b \
    --output_dir /trt_engines/llama-70b-spec \
    --gemm_plugin bf16 \
    --speculative_decoding_mode eagle \
    --max_draft_len 5 \
    --max_input_len 4096 \
    --max_output_len 2048
```
Then serve via Triton with the EAGLE draft engine alongside the target. See TensorRT-LLM Setup.
llama.cpp Speculative Mode {#llamacpp}
```bash
./llama-speculative \
    -m models/llama-3.1-70b-Q5_K_M.gguf \
    -md models/llama-3.2-1b-Q8_0.gguf \
    -ngl 999 -ngld 999 \
    --draft 5 \
    -c 8192 \
    -p "Explain quantum entanglement to a high schooler."
```
`-md` selects the draft model and `--draft` sets K. Speedup: 1.8-2.3x for typical 70B + 1B pairings.
Tuning Acceptance Rate {#tuning}
Measure acceptance per workload — vLLM logs this with --log-stats:
```
spec_decode/acceptance_rate: 0.72
spec_decode/draft_throughput: 850 tok/s
spec_decode/system_efficiency: 2.4x
```
Tuning levers:
- num_speculative_tokens: 3-8. Sweet spot is usually 4-5. Higher K = more wasted compute on rejected drafts.
- Draft model size: bigger draft = higher acceptance, slower draft step. 1-3B for 70B target is standard.
- Temperature alignment: ensure draft and target use same sampling parameters.
- Tokenizer match: must be exact. Llama draft for Qwen target won't work.
Rule of thumb: aim for acceptance above 0.6 at K=5 for a solid net speedup. Below 0.4, lower K, switch to n-gram, or disable speculation (a toy model of the K trade-off is sketched below).
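The sweet spot in K exists because tokens per pass saturates geometrically while draft cost grows linearly. The sketch below makes that visible with a toy chain model; the per-position acceptance probability p and the relative draft cost c are assumptions you would estimate from your own logs, and fixed overheads are ignored.

```python
# Sweep K under a toy chain model: a drafted prefix of length i survives
# verification with probability p**i; each draft step costs c relative to
# one target forward pass.

def speedup(k: int, p: float, c: float) -> float:
    tokens_per_pass = 1 + sum(p**i for i in range(1, k + 1))
    time_per_pass = 1 + k * c
    return tokens_per_pass / time_per_pass

for k in range(1, 9):
    print(k, round(speedup(k, p=0.75, c=0.08), 2))
# Peaks around K=4-5, then declines as rejected drafts waste compute.
```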
Sampling vs Greedy: Speedup Trade-off {#sampling}
| Decoding | Acceptance | Speedup |
|---|---|---|
| Greedy | 75-85% | 2.5-3.5x |
| Temp 0.3, top-p 0.95 | 65-75% | 2.0-2.8x |
| Temp 0.6, top-p 0.95 | 55-65% | 1.7-2.3x |
| Temp 1.0, top-p 0.95 | 40-55% | 1.3-1.8x |
| Temp 1.0, no top-p | 25-40% | 0.9-1.3x (often slower) |
For high-temperature creative generation, speculative decoding may not help. For chat (temp 0.5-0.7) and code (temp 0.0-0.3), it's a clear win.
Batched Serving Caveats {#batched}
Speedup as a function of batch size:
| Batch Size | Speculation Speedup | Regime |
|---|---|---|
| 1 | 2.5-3.5x | Bandwidth-bound (best speedup) |
| 4 | 1.8-2.5x | Mixed bound |
| 8 | 1.4-1.8x | Compute-saturated |
| 16 | 1.15-1.4x | Compute-bound |
| 32 | 1.05-1.2x | Diminishing returns |
| 64+ | 1.0-1.1x | Negligible |
For high-throughput batch serving (LLM-as-a-service): focus on continuous batching, FP8, and PagedAttention rather than speculation. For interactive APIs (chatbots, copilots, agents): speculation is essential. See vLLM Complete Setup.
When Speculation Hurts {#failure-modes}
| Symptom | Cause | Fix |
|---|---|---|
| Negative speedup | Acceptance <40% | Lower K, switch to n-gram, or disable |
| OOM after enabling | Draft + target VRAM exceeded | Use EAGLE (smaller) or n-gram (zero extra VRAM) |
| Output differs from non-spec | Bug or wrong rejection sampling | File issue; verify with --enforce-eager comparison |
| Speedup vanishes at batch 8+ | Compute-bound regime | Disable for batched serving |
| High variance in throughput | Acceptance varies by query | Acceptable; total throughput improves on average |
| Tokenizer mismatch error | Draft uses different vocab | Use family-matched draft (Llama 3.2 for Llama 3.1) |
| Quality degradation | Greedy + EAGLE, eager mode bug | Update vLLM to 0.7+ |
Decision Tree by Workload {#decision}
Single-user chat / agent (batch=1):
- Try n-gram speculation first (zero training, zero VRAM cost)
- If acceptance >50%, you're done
- If acceptance <50%, add EAGLE-2 (1-3 GB VRAM, 2.5-3.5x speedup)
- If EAGLE not available for your model, use draft model (1-3B same family)
Code generation:
- n-gram is excellent (high context match) — usually 2-3x out of the box
- Combine with EAGLE for stacking gains
Long-context summarization / edit operations:
- n-gram dominates — 60-80% acceptance from prompt overlap
- No need for separate draft
Batched API serving (batch>16):
- Skip speculation
- Focus on FP8, continuous batching, PagedAttention
MoE models (DeepSeek V3, Hunyuan-Large):
- Use built-in MTP if available (zero training)
- Otherwise EAGLE-2 with tree attention works well
Real Benchmarks {#benchmarks}
Single H100 80GB, Llama 3.1 70B AWQ INT4, batch=1, vLLM:
| Method | Tokens/sec | Speedup |
|---|---|---|
| Baseline (no speculation) | 28 | 1.0x |
| n-gram (K=5) | 42 | 1.5x |
| Draft: Llama 3.2 1B (K=5) | 62 | 2.2x |
| Draft: Llama 3.2 3B (K=5) | 58 | 2.1x |
| EAGLE-2 (K=5) | 84 | 3.0x |
| EAGLE-3 (K=7) | 102 | 3.6x |
DeepSeek V3 671B FP8, 8x H100, batch=1, SGLang:
| Method | Tokens/sec | Speedup |
|---|---|---|
| Baseline | 65 | 1.0x |
| MTP (built-in, K=1) | 105 | 1.6x |
| MTP + n-gram (K=4 total) | 130 | 2.0x |
FAQ {#faq}
Sources: Leviathan et al. (2023) Fast Inference from Transformers via Speculative Decoding | Cai et al. (2024) Medusa | Li et al. (2024) EAGLE-2 | Li et al. (2025) EAGLE-3 | vLLM speculative decoding docs | SGLang speculative decoding docs | Internal benchmarks H100.