Performance

Speculative Decoding Complete Guide (2026): Draft Models, EAGLE, Medusa, n-grams

May 2, 2026
22 min read
LocalAimaster Research Team


Speculative decoding is the highest-ROI inference optimization for single-stream LLM serving in 2026: a 2-4x speedup with output mathematically identical to ordinary greedy or sampled decoding. The trick: a small, fast model proposes K tokens and the large target model verifies all K in a single forward pass. Because LLM inference is memory-bandwidth bound, verifying K tokens costs roughly the same wall time as generating 1, so whenever drafts are accepted, one target forward pass produces several tokens instead of one.

This guide covers every modern speculation technique — separate draft models, EAGLE-2 / EAGLE-3, Medusa, n-gram / prompt-lookup decoding, and Multi-Token Prediction (MTP) — with real configuration for vLLM, SGLang, TensorRT-LLM, and llama.cpp. Includes acceptance-rate tuning, when speculation hurts, batched-serving caveats, and decision trees for picking the right technique per workload.

Table of Contents

  1. Why Speculative Decoding Works (Bandwidth Bottleneck)
  2. The Core Algorithm
  3. Draft Model Speculation
  4. EAGLE-2 and EAGLE-3
  5. Medusa Heads
  6. n-gram / Prompt-Lookup Decoding
  7. Multi-Token Prediction (MTP)
  8. vLLM Configuration
  9. SGLang Configuration
  10. TensorRT-LLM Configuration
  11. llama.cpp Speculative Mode
  12. Tuning Acceptance Rate
  13. Sampling vs Greedy: Speedup Trade-off
  14. Batched Serving Caveats
  15. When Speculation Hurts
  16. Decision Tree by Workload
  17. Real Benchmarks


Why Speculative Decoding Works (Bandwidth Bottleneck) {#why}

LLM inference per-token cost has two parts:

  • Compute: ~2 × params × batch × tokens FLOPs
  • Memory bandwidth: load all weights from HBM into compute units (~params × bytes_per_weight per forward pass)

For a single user generating one token at a time on a 70B BF16 model: 140 GB of weights must move from HBM → SMs each step. On H100 (3 TB/s HBM), that's a 47ms floor regardless of how few tokens you generate. The actual matmul takes <1ms. So generating 1 token vs 5 tokens in a single forward pass costs roughly the same wall time — the bandwidth dominates.

Speculative decoding exploits this: speculate K tokens with a cheap process, run the target over those K positions in one forward pass (getting its logits at K+1 positions), and accept the longest prefix that matches what the target would have generated. With per-token acceptance rate α, that yields roughly (1 - α^(K+1)) / (1 - α) tokens per target forward pass instead of 1.
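
The arithmetic behind both claims fits in a few lines. A back-of-envelope sketch in plain Python (the bandwidth numbers are the ones quoted above; the geometric formula is the expected-tokens result from Leviathan et al. under an independent per-token acceptance assumption; the draft cost ratio is an illustrative guess, not a measurement):

# Back-of-envelope model of speculative decoding speedup.
# All constants are illustrative assumptions; measure them on your own hardware.

WEIGHT_BYTES = 70e9 * 2   # 70B params in BF16 -> ~140 GB
HBM_BW = 3e12             # H100 HBM bandwidth, ~3 TB/s
print(f"bandwidth floor per forward pass: {WEIGHT_BYTES / HBM_BW * 1e3:.0f} ms")   # ~47 ms

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass with K drafts and per-token
    acceptance alpha (Leviathan et al.'s independent-acceptance model).
    Includes the bonus token emitted when every draft is accepted."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def rough_speedup(alpha: float, k: int, draft_cost: float = 0.05) -> float:
    """Speedup vs plain decoding. draft_cost is one draft step relative to one
    target step (roughly a 1B draft against a 70B target); scheduler overhead ignored."""
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * draft_cost)

for alpha in (0.4, 0.6, 0.75):
    print(alpha, round(rough_speedup(alpha, k=5), 2))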


The Core Algorithm {#algorithm}

Given: target model T, draft generator D, current sequence x[:t]
1. Sample K draft tokens: d[1..K] = D(x[:t])
2. Run T on x[:t] + d[:K] in a single forward pass
   → get target logits at positions t, t+1, ..., t+K
3. For each position i in 1..K:
   - Sample target token from T's distribution at position t+i-1
   - If target token == d[i]: accept and continue
   - Else: stop, replace d[i] with target's token, discard d[i+1..K]
4. Append the accepted tokens to x, plus the corrective token (on rejection) or a bonus token sampled from the target's distribution at position t+K (if all K drafts were accepted)
5. Repeat

For greedy decoding, "sample target token" = argmax. For sampling (temperature > 0), use Leviathan et al. (2023) rejection sampling: accept d[i] with probability min(1, p_target(d[i]) / p_draft(d[i])); on rejection, sample from max(0, p_target − p_draft) normalized.

Both modes produce output mathematically identical in distribution to non-speculative decoding. No quality loss.
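
A minimal sketch of the verify/accept step in Python with NumPy, assuming you already have the draft tokens plus both models' next-token distributions at each drafted position (for greedy decoding the same loop degenerates to an argmax equality check):

# Verify/accept loop from Leviathan et al. (2023), sketched over plain
# probability vectors. p_draft[i] and p_target[i] are the draft's and
# target's next-token distributions at draft position i; p_target carries
# one extra entry for the bonus position after the last draft.
import numpy as np

def verify(draft_tokens, p_draft, p_target, rng=np.random.default_rng()):
    """Return accepted draft tokens plus one corrective or bonus token."""
    out = []
    for i, tok in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            out.append(tok)                              # draft token accepted
        else:
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()                   # normalize max(0, p_T - p_D)
            out.append(int(rng.choice(len(residual), p=residual)))
            return out                                   # discard remaining drafts
    # All K drafts accepted: the last target distribution gives a free bonus token.
    out.append(int(rng.choice(len(p_target[-1]), p=p_target[-1])))
    return out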


Draft Model Speculation {#draft-model}

Use a small aligned model as the draft generator. Typical pairings:

| Target | Draft | Acceptance Rate |
|---|---|---|
| Llama 3.1 70B Instruct | Llama 3.2 1B Instruct | 60-75% |
| Llama 3.1 70B Instruct | Llama 3.2 3B Instruct | 70-80% |
| Llama 3.1 405B Instruct | Llama 3.1 8B Instruct | 65-75% |
| Qwen 2.5 72B Instruct | Qwen 2.5 1.5B Instruct | 65-75% |
| Qwen 2.5 72B Instruct | Qwen 2.5 7B Instruct | 75-82% |
| DeepSeek V3 | DeepSeek V2 Lite | 55-70% |

Requirements: same tokenizer, similar instruction-tuning style, smaller than target by 5-100x. Speedup at K=5: ~2.0-2.5x for greedy, ~1.5-2x for sampling.
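
The tokenizer requirement is strict enough to be worth checking before you deploy a pairing. A quick sketch with Hugging Face transformers (model names are just examples):

# Check that a candidate target/draft pairing shares a tokenizer.
# Speculation only works if token IDs mean the same thing in both models.
from transformers import AutoTokenizer

target_id = "meta-llama/Llama-3.1-70B-Instruct"
draft_id = "meta-llama/Llama-3.2-1B-Instruct"

t_tok = AutoTokenizer.from_pretrained(target_id)
d_tok = AutoTokenizer.from_pretrained(draft_id)

same_vocab = t_tok.get_vocab() == d_tok.get_vocab()
sample = "def fib(n): return n if n <= 1 else fib(n-1) + fib(n-2)"
same_ids = t_tok.encode(sample) == d_tok.encode(sample)
print(f"identical vocab: {same_vocab}, identical encoding of sample: {same_ids}")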


EAGLE-2 and EAGLE-3 {#eagle}

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) trains an auxiliary feature-prediction layer on top of the target. Instead of a separate draft model, EAGLE predicts the target's hidden states for the next K positions and decodes tokens from those hidden states.

| Variant | Released | Key Innovation | Speedup |
|---|---|---|---|
| EAGLE-1 | 2024 | Feature-level draft | 2.5-3x |
| EAGLE-2 | 2024 | Dynamic draft tree | 3-3.5x |
| EAGLE-3 | 2025 | Multi-layer hidden state aggregation | 3.5-4x |

EAGLE-2's headline change is the dynamic draft tree, verified with tree attention: instead of a fixed linear chain of K speculative tokens, it builds a context-dependent tree of candidate continuations and verifies every branch in one target forward pass, giving a higher effective K for the same memory budget.

Trained EAGLE heads are available on HuggingFace for major target models (Llama 3.1, Qwen 2.5, DeepSeek). For custom targets, training EAGLE on 4 GPUs takes 4-8 hours.
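
A conceptual sketch of feature-level drafting in PyTorch. This is not the official implementation: the real EAGLE head is a small transformer layer trained on the target's hidden states, and EAGLE-2/3 draft a tree rather than a chain, but the core loop (extrapolate the hidden state, decode it with the target's own LM head) looks like this:

# Conceptual sketch of EAGLE-style feature-level drafting (not the official code).
# The draft module predicts the target's NEXT hidden state from the current
# hidden state plus the embedding of the last decoded token, and the target's
# own lm_head turns that predicted hidden state into token logits.
import torch
import torch.nn as nn

class FeatureDraftHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # The real EAGLE head is a trained transformer decoder layer;
        # a single linear fusion keeps the sketch readable.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, hidden, last_tok_emb):
        return self.fuse(torch.cat([hidden, last_tok_emb], dim=-1))

@torch.no_grad()
def draft_chain(head, embed, lm_head, hidden, last_token, k=5):
    """Greedy chain of K feature-level draft tokens (EAGLE-2/3 build a tree instead)."""
    tokens = []
    for _ in range(k):
        hidden = head(hidden, embed(last_token))      # extrapolate the next hidden state
        last_token = lm_head(hidden).argmax(dim=-1)   # decode it with the target's LM head
        tokens.append(last_token)
    return tokens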


Medusa Heads {#medusa}

Medusa (Cai et al., 2024) trains K parallel prediction heads on top of the target model. At inference, all K heads predict simultaneously — no separate draft pass needed.

Trade-offs vs EAGLE:

  • Simpler architecture (K linear heads vs a trained draft layer)
  • Lower acceptance rate (~50-65% vs EAGLE's 70-80%), since each head predicts its offset independently instead of conditioning on the intermediate drafted tokens
  • Speedup: 1.8-2.5x

Medusa is being superseded by EAGLE-2/3 in production. For new deployments in 2026, prefer EAGLE-2 unless you specifically want the simpler training pipeline.


n-gram / Prompt-Lookup Decoding {#ngram}

Zero-training speculation: maintain a buffer of recent context tokens. When generating, look up the most recent N tokens in the prompt + generated text — if a match exists, copy the next K tokens from that match as the draft.

Prompt: "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)"
User: "Now write iterative version."
Generation: "def fibonacci(n):\n    if n <= 1:\n        return n\n    [draft from prompt: a, b = 0, 1...]"

Acceptance rate: 30-60% for code, 20-40% for chat, 50-70% for structured output (JSON, XML), 80%+ for repetitive content (long-context summarization, edit operations).

Speedup: 1.3-2x for typical workloads, up to 3x for repetitive content. Cost: zero extra VRAM, zero training. For most workloads, this is the first speculation method to try.


Multi-Token Prediction (MTP) {#mtp}

Some models (DeepSeek V3, Llama-Nemotron-Ultra, Hunyuan-Large) include MTP weights — auxiliary heads trained to predict the next 1-2 tokens beyond the current. At inference, MTP heads serve as built-in speculators.

# vLLM auto-detects MTP weights
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'

DeepSeek V3 MTP gives ~1.6x speedup with no extra training or weights to download.


vLLM Configuration {#vllm}

Draft Model

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5}'

EAGLE-2

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-70B", "num_speculative_tokens": 5}'

n-gram

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4, "prompt_lookup_min": 2}'

MTP (DeepSeek V3)

vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'

See vLLM Complete Setup Guide.
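
Whichever method you enable, the client side does not change: speculation is transparent through vLLM's OpenAI-compatible API and only shows up as lower latency. A minimal timing sketch with requests (the default port 8000 and the served model name are assumptions based on the commands above):

# Time a completion against a vLLM server started with one of the commands above.
# Speculative decoding is invisible to the client: same endpoint, same response
# schema, only tokens/sec changes.
import time
import requests

start = time.time()
resp = requests.post(
    "http://localhost:8000/v1/completions",   # vLLM's OpenAI-compatible endpoint
    json={
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": "Write a Python function that merges two sorted lists.",
        "max_tokens": 256,
        "temperature": 0.2,
    },
    timeout=300,
)
resp.raise_for_status()
elapsed = time.time() - start

body = resp.json()
tokens = body["usage"]["completion_tokens"]
print(body["choices"][0]["text"][:200])
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")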


SGLang Configuration {#sglang}

# EAGLE
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path yuhuili/EAGLE-LLaMA3.1-Instruct-70B \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 4

SGLang's EAGLE implementation includes tree-attention with configurable topk; it generally outperforms vLLM on EAGLE specifically.


TensorRT-LLM Configuration {#tensorrt}

For maximum H100/H200 throughput, build TRT-LLM engines with speculation:

trtllm-build \
    --checkpoint_dir /trt_ckpt/llama-70b \
    --output_dir /trt_engines/llama-70b-spec \
    --gemm_plugin bf16 \
    --speculative_decoding_mode eagle \
    --max_draft_len 5 \
    --max_input_len 4096 \
    --max_output_len 2048

Then serve via Triton with the EAGLE draft engine alongside the target. See TensorRT-LLM Setup.


llama.cpp Speculative Mode {#llamacpp}

./llama-speculative \
    -m models/llama-3.1-70b-Q5_K_M.gguf \
    -md models/llama-3.2-1b-Q8_0.gguf \
    -ngl 999 -ngld 999 \
    --draft 5 \
    -c 8192 \
    -p "Explain quantum entanglement to a high schooler."

-md is the draft model, --draft is K. Speedup: 1.8-2.3x for typical 70B + 1B pairings.


Tuning Acceptance Rate {#tuning}

Measure acceptance per workload. vLLM reports speculative-decoding metrics in its periodic stats logging:

spec_decode/acceptance_rate: 0.72
spec_decode/draft_throughput: 850 tok/s
spec_decode/system_efficiency: 2.4x

Tuning levers:

  • num_speculative_tokens: 3-8. Sweet spot is usually 4-5. Higher K = more wasted compute on rejected drafts.
  • Draft model size: bigger draft = higher acceptance, slower draft step. 1-3B for 70B target is standard.
  • Temperature alignment: ensure draft and target use same sampling parameters.
  • Tokenizer match: must be exact. Llama draft for Qwen target won't work.

Rule of thumb: aim for acceptance above 0.6 for a net speedup at K=5. If acceptance is below 0.4, lower K or switch to n-gram.
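
If you want a rough acceptance estimate before standing up a server, compare a draft's greedy predictions against the target's own greedy continuation offline. A sketch with Hugging Face transformers (model names are illustrative, and token-level agreement under teacher forcing only approximates the online acceptance rate):

# Rough offline estimate of greedy acceptance for a candidate pairing.
# Teacher-forces the target's own greedy continuation and checks how often
# the draft's argmax agrees at each generated position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-8B-Instruct"   # small stand-in target for the sketch
draft_id = "meta-llama/Llama-3.2-1B-Instruct"
device = "cuda"

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16).to(device)
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16).to(device)

prompt = "Write a Python function that parses an ISO-8601 date string."
inputs = tok(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    full = target.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy "ground truth"
    draft_logits = draft(full).logits                                      # draft, teacher-forced

start = inputs["input_ids"].shape[1]
pred = draft_logits[:, start - 1:-1, :].argmax(dim=-1)   # draft's guess for each generated position
actual = full[:, start:]
agreement = (pred == actual).float().mean().item()
print(f"token-level greedy agreement: {agreement:.2f}")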


Sampling vs Greedy: Speedup Trade-off {#sampling}

| Decoding | Acceptance | Speedup |
|---|---|---|
| Greedy | 75-85% | 2.5-3.5x |
| Temp 0.3, top-p 0.95 | 65-75% | 2.0-2.8x |
| Temp 0.6, top-p 0.95 | 55-65% | 1.7-2.3x |
| Temp 1.0, top-p 0.95 | 40-55% | 1.3-1.8x |
| Temp 1.0, no top-p | 25-40% | 0.9-1.3x (often slower) |

For high-temperature creative generation, speculative decoding may not help. For chat (temp 0.5-0.7) and code (temp 0.0-0.3), it's a clear win.


Batched Serving Caveats {#batched}

Speedup as a function of batch size:

| Batch Size | Speculation Speedup | Regime |
|---|---|---|
| 1 | 2.5-3.5x | Bandwidth-bound (best speedup) |
| 4 | 1.8-2.5x | Mixed |
| 8 | 1.4-1.8x | Compute-saturated |
| 16 | 1.15-1.4x | Compute-bound |
| 32 | 1.05-1.2x | Diminishing returns |
| 64+ | 1.0-1.1x | Negligible |

For high-throughput batch serving (LLM-as-a-service): focus on continuous batching, FP8, and PagedAttention rather than speculation. For interactive APIs (chatbots, copilots, agents): speculation is essential. See vLLM Complete Setup.


When Speculation Hurts {#failure-modes}

| Symptom | Cause | Fix |
|---|---|---|
| Negative speedup | Acceptance <40% | Lower K, switch to n-gram, or disable |
| OOM after enabling | Draft + target VRAM exceeded | Use EAGLE (smaller) or n-gram (zero extra VRAM) |
| Output differs from non-spec | Bug or wrong rejection sampling | File issue; verify with --enforce-eager comparison |
| Speedup vanishes at batch 8+ | Compute-bound regime | Disable for batched serving |
| High variance in throughput | Acceptance varies by query | Acceptable; total throughput improves on average |
| Tokenizer mismatch error | Draft uses different vocab | Use family-matched draft (Llama 3.2 for Llama 3.1) |
| Quality degradation | Greedy + EAGLE, eager mode bug | Update vLLM to 0.7+ |

Decision Tree by Workload {#decision}

Single-user chat / agent (batch=1):

  1. Try n-gram speculation first (zero training, zero VRAM cost)
  2. If acceptance >50%, you're done
  3. If acceptance <50%, add EAGLE-2 (1-3 GB VRAM, 2.5-3.5x speedup)
  4. If EAGLE not available for your model, use draft model (1-3B same family)

Code generation:

  1. n-gram is excellent (high context match) — usually 2-3x out of the box
  2. Combine with EAGLE for stacking gains

Long-context summarization / edit operations:

  1. n-gram dominates — 60-80% acceptance from prompt overlap
  2. No need for separate draft

Batched API serving (batch>16):

  1. Skip speculation
  2. Focus on FP8, continuous batching, PagedAttention

MoE models (DeepSeek V3, Hunyuan-Large):

  1. Use built-in MTP if available (zero training)
  2. Otherwise EAGLE-2 with tree attention works well
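
The same logic, condensed into a small helper (thresholds and labels are this guide's rules of thumb, not hard constants; adapt them to your own measurements):

# The decision tree above as a small helper. Thresholds are rules of thumb.
def pick_speculation(batch_size, workload, ngram_acceptance=None, has_builtin_mtp=False):
    if batch_size > 16:
        return "none: focus on FP8, continuous batching, PagedAttention"
    if has_builtin_mtp:
        return "built-in MTP (zero extra training)"
    if workload in {"code", "summarization", "editing"}:
        return "n-gram first; add EAGLE-2 if acceptance stays low"
    if ngram_acceptance is not None and ngram_acceptance >= 0.5:
        return "n-gram (good enough, zero extra VRAM)"
    return "EAGLE-2 if a trained head exists, else a 1-3B same-family draft model"

print(pick_speculation(batch_size=1, workload="chat", ngram_acceptance=0.35))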

Real Benchmarks {#benchmarks}

Single H100 80GB, Llama 3.1 70B AWQ INT4, batch=1, vLLM:

| Method | Tokens/sec | Speedup |
|---|---|---|
| Baseline (no speculation) | 28 | 1.0x |
| n-gram (K=5) | 42 | 1.5x |
| Draft: Llama 3.2 1B (K=5) | 62 | 2.2x |
| Draft: Llama 3.2 3B (K=5) | 58 | 2.1x |
| EAGLE-2 (K=5) | 84 | 3.0x |
| EAGLE-3 (K=7) | 102 | 3.6x |

DeepSeek V3 671B FP8, 8x H100, batch=1, SGLang:

| Method | Tokens/sec | Speedup |
|---|---|---|
| Baseline | 65 | 1.0x |
| MTP (built-in, K=1) | 105 | 1.6x |
| MTP + n-gram (K=4 total) | 130 | 2.0x |


Sources: Leviathan et al. (2023) Fast Inference from Transformers via Speculative Decoding | Cai et al. (2024) Medusa | Li et al. (2024) EAGLE-2 | Li et al. (2025) EAGLE-3 | vLLM speculative decoding docs | SGLang speculative decoding docs | Internal benchmarks H100.
