Speculative Decoding Complete Guide (2026): Draft Models, EAGLE, Medusa, n-grams
Speculative decoding is the highest-ROI inference optimization for single-stream LLM serving in 2026: 2-4x speedup with output mathematically identical to greedy or sampled decoding. The trick: a small, fast model proposes K tokens, the large target model verifies all K in a single forward pass, and accepted tokens are essentially free. LLM inference is memory-bandwidth bound, so verifying K tokens costs roughly the same wall time as generating 1; when the draft is right, a single target forward pass yields up to K+1 tokens instead of 1.
This guide covers every modern speculation technique — separate draft models, EAGLE-2 / EAGLE-3, Medusa, n-gram / prompt-lookup decoding, and Multi-Token Prediction (MTP) — with real configuration for vLLM, SGLang, TensorRT-LLM, and llama.cpp. Includes acceptance-rate tuning, when speculation hurts, batched-serving caveats, and decision trees for picking the right technique per workload.
Table of Contents
- Why Speculative Decoding Works (Bandwidth Bottleneck)
- The Core Algorithm
- Draft Model Speculation
- EAGLE-2 and EAGLE-3
- Medusa Heads
- n-gram / Prompt-Lookup Decoding
- Multi-Token Prediction (MTP)
- vLLM Configuration
- SGLang Configuration
- TensorRT-LLM Configuration
- llama.cpp Speculative Mode
- Tuning Acceptance Rate
- Sampling vs Greedy: Speedup Trade-off
- Batched Serving Caveats
- When Speculation Hurts
- Decision Tree by Workload
- Real Benchmarks
- FAQ
Why Speculative Decoding Works (Bandwidth Bottleneck) {#why}
LLM inference per-token cost has two parts:
- Compute: ~2 × params × batch × tokens FLOPs
- Memory bandwidth: load all weights from HBM into compute units (~params × bytes_per_weight per forward pass)
For a single user generating one token at a time on a 70B BF16 model: 140 GB of weights must move from HBM → SMs each step. On H100 (3 TB/s HBM), that's a 47ms floor regardless of how few tokens you generate. The actual matmul takes <1ms. So generating 1 token vs 5 tokens in a single forward pass costs roughly the same wall time — the bandwidth dominates.
Speculative decoding exploits this: speculate K tokens with a cheap process, run the target over those K drafted tokens in one forward pass (yielding target logits at K+1 positions), and accept the longest prefix that matches what the target would have generated. If α is the acceptance rate (the fraction of drafted tokens accepted), each target forward pass produces on average about 1 + α·K tokens instead of 1: the accepted drafts plus one token the target itself always contributes.
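As a back-of-the-envelope check, the expected gain is easy to compute. The sketch below is a simplified cost model rather than any engine's implementation; the draft cost ratio (draft step time relative to one target forward pass) is an assumption you would measure on your own hardware, and scheduler and verification overheads are ignored, so it is optimistic.

```python
# Toy cost model for speculative decoding (ignores scheduler and
# verification overhead, so real numbers come out somewhat lower).

def expected_speedup(k: int, acceptance_rate: float, draft_cost: float) -> float:
    """acceptance_rate: fraction of drafted tokens accepted (the metric engines report).
    draft_cost: time of one draft step relative to one target forward pass."""
    # Tokens per target pass: accepted drafts plus the one token the target
    # always contributes (bonus token on full acceptance, corrective otherwise).
    tokens_per_pass = 1 + acceptance_rate * k
    # Wall time per pass: one target forward pass plus k draft steps.
    time_per_pass = 1 + k * draft_cost
    return tokens_per_pass / time_per_pass

print(expected_speedup(k=5, acceptance_rate=0.7, draft_cost=0.05))  # ~3.6
print(expected_speedup(k=5, acceptance_rate=0.4, draft_cost=0.05))  # ~2.4
```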
The Core Algorithm {#algorithm}
Given: target model T, draft generator D, current sequence x[:t]
1. Sample K draft tokens: d[1..K] = D(x[:t])
2. Run T on x[:t] + d[:K] in a single forward pass
→ get target logits at positions t, t+1, ..., t+K
3. For each position i in 1..K:
- Sample target token from T's distribution at position t+i-1
- If target token == d[i]: accept and continue
- Else: stop, replace d[i] with target's token, discard d[i+1..K]
4. Append accepted tokens + corrective token to x
5. Repeat
For greedy decoding, "sample target token" = argmax. For sampling (temperature > 0), use Leviathan et al. (2023) rejection sampling: accept d[i] with probability min(1, p_target(d[i]) / p_draft(d[i])); on rejection, sample from max(0, p_target − p_draft) normalized.
Both modes produce output mathematically identical in distribution to non-speculative decoding. No quality loss.
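To make the acceptance rule concrete, here is a minimal, framework-free sketch of the verification step with Leviathan-style rejection sampling. It illustrates the math rather than any engine's implementation; p_target and p_draft are assumed to be per-position probability distributions (rows over the vocabulary) obtained from the two models' softmaxed logits, and the bonus token sampled when all K drafts are accepted is omitted for brevity.

```python
import numpy as np

def verify_drafts(draft_tokens, p_target, p_draft, rng=None):
    """Accept a prefix of draft_tokens such that the output is distributed
    exactly as if the target model had sampled alone (Leviathan et al., 2023).

    draft_tokens: list[int] of K proposed token ids
    p_target, p_draft: arrays of shape (K, vocab_size); row i is that model's
        probability distribution at the i-th drafted position
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i, tok] / p_draft[i, tok]):
            out.append(tok)
            continue
        # On rejection, sample the corrective token from the residual
        # distribution max(0, p_target - p_draft), renormalized, and stop.
        residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
        residual /= residual.sum()
        out.append(int(rng.choice(len(residual), p=residual)))
        break
    return out
```

For greedy decoding the same loop degenerates to an exact argmax comparison at each position.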
Draft Model Speculation {#draft-model}
Use a small aligned model as the draft generator. Typical pairings:
| Target | Draft | Acceptance Rate |
|---|---|---|
| Llama 3.1 70B Instruct | Llama 3.2 1B Instruct | 60-75% |
| Llama 3.1 70B Instruct | Llama 3.2 3B Instruct | 70-80% |
| Llama 3.1 405B Instruct | Llama 3.1 8B Instruct | 65-75% |
| Qwen 2.5 72B Instruct | Qwen 2.5 1.5B Instruct | 65-75% |
| Qwen 2.5 72B Instruct | Qwen 2.5 7B Instruct | 75-82% |
| DeepSeek V3 | DeepSeek V2 Lite | 55-70% |
Requirements: same tokenizer, similar instruction-tuning style, smaller than target by 5-100x. Speedup at K=5: ~2.0-2.5x for greedy, ~1.5-2x for sampling.
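Before wiring up a draft model, it is worth sanity-checking the tokenizer requirement programmatically rather than by eyeballing model cards. A minimal sketch using Hugging Face transformers (the model names are just example pairings from the table above):

```python
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
draft_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Vocabularies must match token-for-token: draft and target token ids are
# compared directly during verification, so "mostly the same" is not enough.
assert target_tok.get_vocab() == draft_tok.get_vocab(), "tokenizer mismatch"

# Chat-template drift also lowers acceptance: the same conversation should
# tokenize to the same ids under both models' templates.
msgs = [{"role": "user", "content": "hello"}]
print(target_tok.apply_chat_template(msgs) == draft_tok.apply_chat_template(msgs))
```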
EAGLE-2 and EAGLE-3 {#eagle}
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) trains an auxiliary feature-prediction layer on top of the target. Instead of a separate draft model, EAGLE predicts the target's hidden states for the next K positions and decodes tokens from those hidden states.
| Variant | Released | Key Innovation | Speedup |
|---|---|---|---|
| EAGLE-1 | 2024 | Feature-level draft | 2.5-3x |
| EAGLE-2 | 2024 | Dynamic draft tree | 3-3.5x |
| EAGLE-3 | 2025 | Multi-layer hidden state aggregation | 3.5-4x |
EAGLE-2's key change is a dynamic draft tree: instead of a fixed linear chain of K speculative tokens, it builds a context-dependent tree of candidate continuations and verifies all branches in one forward pass using tree attention. That buys a higher effective K for the same draft-token budget.
Trained EAGLE heads are available on HuggingFace for major target models (Llama 3.1, Qwen 2.5, DeepSeek). For custom targets, training EAGLE on 4 GPUs takes 4-8 hours.
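To see what "verify all branches in one forward pass" means mechanically, the sketch below builds the attention mask a flattened draft tree implies: every candidate may attend to the shared prefix (not shown) and to its own ancestors, but never to sibling branches. This is a toy illustration of tree attention, not EAGLE's actual code.

```python
import numpy as np

def tree_attention_mask(parents):
    """Boolean mask for a draft tree flattened into one sequence.

    parents[i] is the index of node i's parent within the tree, or -1 if the
    node hangs directly off the committed prefix. mask[i, j] is True when
    candidate i may attend to candidate j (j is i itself or an ancestor of i).
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Two branches off the root candidate 0: 0 -> 1 -> 3 and 0 -> 2 -> 4.
print(tree_attention_mask([-1, 0, 0, 1, 2]).astype(int))
```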
Medusa Heads {#medusa}
Medusa (Cai et al., 2024) trains K parallel prediction heads on top of the target model. At inference, all K heads predict simultaneously — no separate draft pass needed.
Trade-offs vs EAGLE:
- Simpler architecture (K linear heads vs full draft layer)
- Lower acceptance rate (~50-65% vs EAGLE's 70-80%)
- No tree attention (simpler verification)
- Speedup: 1.8-2.5x
Medusa is being superseded by EAGLE-2/3 in production. For new deployments in 2026, prefer EAGLE-2 unless you specifically want the simpler training pipeline.
n-gram / Prompt-Lookup Decoding {#ngram}
Zero-training speculation: maintain a buffer of recent context tokens. When generating, look up the most recent N tokens in the prompt + generated text — if a match exists, copy the next K tokens from that match as the draft.
Prompt: "def fibonacci(n):\n if n <= 1:\n return n\n return fibonacci(n-1) + fibonacci(n-2)"
User: "Now write iterative version."
Generation: "def fibonacci(n):\n if n <= 1:\n return n\n [draft from prompt: a, b = 0, 1...]"
Acceptance rate: 30-60% for code, 20-40% for chat, 50-70% for structured output (JSON, XML), 80%+ for repetitive content (long-context summarization, edit operations).
Speedup: 1.3-2x for typical workloads, up to 3x for repetitive content. Cost: zero extra VRAM, zero training. For most workloads, this is the first speculation method to try.
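The mechanism fits in a few lines. The sketch below is a simplified, library-free version of prompt lookup; real implementations (e.g. vLLM's ngram method) try several n-gram sizes and handle batching, but the core idea is just substring matching over token ids.

```python
def prompt_lookup_draft(tokens, ngram_size=3, k=5):
    """Propose up to k draft tokens by finding an earlier occurrence of the
    last `ngram_size` tokens and copying what followed it.

    tokens: full context so far (prompt + generated text) as token ids.
    Returns [] when there is no match, i.e. fall back to normal decoding.
    """
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Search backwards so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            return tokens[start + ngram_size:start + ngram_size + k]
    return []

# The last trigram [5, 8, 2] also appears at the start, so the two tokens
# that followed it there ([9, 4]) are proposed as the draft.
print(prompt_lookup_draft([5, 8, 2, 9, 4, 5, 8, 2], ngram_size=3, k=2))  # [9, 4]
```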
Multi-Token Prediction (MTP) {#mtp}
Some models (DeepSeek V3, Llama-Nemotron-Ultra, Hunyuan-Large) include MTP weights — auxiliary heads trained to predict the next 1-2 tokens beyond the current. At inference, MTP heads serve as built-in speculators.
```bash
# vLLM auto-detects MTP weights
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
```
DeepSeek V3 MTP gives ~1.6x speedup with no extra training or weights to download.
vLLM Configuration {#vllm}
Draft Model
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5}'
```
EAGLE-2
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-70B", "num_speculative_tokens": 5}'
```
n-gram
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4, "prompt_lookup_min": 2}'
```
MTP (DeepSeek V3)
```bash
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
```
See vLLM Complete Setup Guide.
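Whichever method you pick, speculation is invisible to clients: requests go through vLLM's normal OpenAI-compatible endpoint and only the server-side tokens/sec changes. A minimal client call against the servers above (the base URL and port assume default vllm serve settings):

```python
from openai import OpenAI

# Speculation happens entirely server-side; the client is unchanged.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    temperature=0.3,  # lower temperature keeps acceptance (and speedup) high
)
print(resp.choices[0].message.content)
```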
SGLang Configuration {#sglang}
```bash
# EAGLE
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path yuhuili/EAGLE-LLaMA3.1-Instruct-70B \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 4
```
SGLang's EAGLE implementation includes tree-attention with configurable topk; it generally outperforms vLLM on EAGLE specifically.
TensorRT-LLM Configuration {#tensorrt}
For maximum H100/H200 throughput, build TRT-LLM engines with speculation:
```bash
trtllm-build \
    --checkpoint_dir /trt_ckpt/llama-70b \
    --output_dir /trt_engines/llama-70b-spec \
    --gemm_plugin bf16 \
    --speculative_decoding_mode eagle \
    --max_draft_len 5 \
    --max_input_len 4096 \
    --max_output_len 2048
```
Then serve via Triton with the EAGLE draft engine alongside the target. See TensorRT-LLM Setup.
llama.cpp Speculative Mode {#llamacpp}
```bash
./llama-speculative \
    -m models/llama-3.1-70b-Q5_K_M.gguf \
    -md models/llama-3.2-1b-Q8_0.gguf \
    -ngl 999 -ngld 999 \
    --draft 5 \
    -c 8192 \
    -p "Explain quantum entanglement to a high schooler."
```
`-md` selects the draft model and `--draft` sets K. Speedup: 1.8-2.3x for typical 70B + 1B pairings.
Tuning Acceptance Rate {#tuning}
Measure acceptance per workload — vLLM logs this with --log-stats:
```
spec_decode/acceptance_rate: 0.72
spec_decode/draft_throughput: 850 tok/s
spec_decode/system_efficiency: 2.4x
```
Tuning levers:
- num_speculative_tokens: 3-8. Sweet spot is usually 4-5. Higher K = more wasted compute on rejected drafts.
- Draft model size: bigger draft = higher acceptance, slower draft step. 1-3B for 70B target is standard.
- Temperature alignment: ensure draft and target use same sampling parameters.
- Tokenizer match: must be exact. Llama draft for Qwen target won't work.
Rule of thumb: aim for acceptance above 0.6 at K=5 for a solid net speedup. Below 0.4, lower K, switch to n-gram, or disable speculation (a toy model of the K trade-off is sketched below).
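The sweet spot in K exists because tokens per pass saturates geometrically while draft cost grows linearly. The sketch below makes that visible with a toy chain model; the per-position acceptance probability p and the relative draft cost c are assumptions you would estimate from your own logs, and fixed overheads are ignored.

```python
# Sweep K under a toy chain model: a drafted prefix of length i survives
# verification with probability p**i; each draft step costs c relative to
# one target forward pass.

def speedup(k: int, p: float, c: float) -> float:
    tokens_per_pass = 1 + sum(p**i for i in range(1, k + 1))
    time_per_pass = 1 + k * c
    return tokens_per_pass / time_per_pass

for k in range(1, 9):
    print(k, round(speedup(k, p=0.75, c=0.08), 2))
# Peaks around K=4-5, then declines as rejected drafts waste compute.
```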
Sampling vs Greedy: Speedup Trade-off {#sampling}
| Decoding | Acceptance | Speedup |
|---|---|---|
| Greedy | 75-85% | 2.5-3.5x |
| Temp 0.3, top-p 0.95 | 65-75% | 2.0-2.8x |
| Temp 0.6, top-p 0.95 | 55-65% | 1.7-2.3x |
| Temp 1.0, top-p 0.95 | 40-55% | 1.3-1.8x |
| Temp 1.0, no top-p | 25-40% | 0.9-1.3x (often slower) |
For high-temperature creative generation, speculative decoding may not help. For chat (temp 0.5-0.7) and code (temp 0.0-0.3), it's a clear win.
Batched Serving Caveats {#batched}
Speedup as a function of batch size:
| Batch Size | Speculation Speedup | Regime |
|---|---|---|
| 1 | 2.5-3.5x | Bandwidth-bound (best speedup) |
| 4 | 1.8-2.5x | Mixed bound |
| 8 | 1.4-1.8x | Compute-saturated |
| 16 | 1.15-1.4x | Compute-bound |
| 32 | 1.05-1.2x | Diminishing returns |
| 64+ | 1.0-1.1x | Negligible |
For high-throughput batch serving (LLM-as-a-service): focus on continuous batching, FP8, and PagedAttention rather than speculation. For interactive APIs (chatbots, copilots, agents): speculation is essential. See vLLM Complete Setup.
When Speculation Hurts {#failure-modes}
| Symptom | Cause | Fix |
|---|---|---|
| Negative speedup | Acceptance <40% | Lower K, switch to n-gram, or disable |
| OOM after enabling | Draft + target VRAM exceeded | Use EAGLE (smaller) or n-gram (zero extra VRAM) |
| Output differs from non-spec | Bug or wrong rejection sampling | File issue; verify with --enforce-eager comparison |
| Speedup vanishes at batch 8+ | Compute-bound regime | Disable for batched serving |
| High variance in throughput | Acceptance varies by query | Acceptable; total throughput improves on average |
| Tokenizer mismatch error | Draft uses different vocab | Use family-matched draft (Llama 3.2 for Llama 3.1) |
| Quality degradation | Greedy + EAGLE, eager mode bug | Update vLLM to 0.7+ |
Decision Tree by Workload {#decision}
Single-user chat / agent (batch=1):
- Try n-gram speculation first (zero training, zero VRAM cost)
- If acceptance >50%, you're done
- If acceptance <50%, add EAGLE-2 (1-3 GB VRAM, 2.5-3.5x speedup)
- If EAGLE not available for your model, use draft model (1-3B same family)
Code generation:
- n-gram is excellent (high context match) — usually 2-3x out of the box
- Combine with EAGLE for stacking gains
Long-context summarization / edit operations:
- n-gram dominates — 60-80% acceptance from prompt overlap
- No need for separate draft
Batched API serving (batch>16):
- Skip speculation
- Focus on FP8, continuous batching, PagedAttention
MoE models (DeepSeek V3, Hunyuan-Large):
- Use built-in MTP if available (zero training)
- Otherwise EAGLE-2 with tree attention works well
Real Benchmarks {#benchmarks}
Single H100 80GB, Llama 3.1 70B AWQ INT4, batch=1, vLLM:
| Method | Tokens/sec | Speedup |
|---|---|---|
| Baseline (no speculation) | 28 | 1.0x |
| n-gram (K=5) | 42 | 1.5x |
| Draft: Llama 3.2 1B (K=5) | 62 | 2.2x |
| Draft: Llama 3.2 3B (K=5) | 58 | 2.1x |
| EAGLE-2 (K=5) | 84 | 3.0x |
| EAGLE-3 (K=7) | 102 | 3.6x |
DeepSeek V3 671B FP8, 8x H100, batch=1, SGLang:
| Method | Tokens/sec | Speedup |
|---|---|---|
| Baseline | 65 | 1.0x |
| MTP (built-in, K=1) | 105 | 1.6x |
| MTP + n-gram (K=4 total) | 130 | 2.0x |
FAQ {#faq}
Sources: Leviathan et al. (2023) Fast Inference from Transformers via Speculative Decoding | Cai et al. (2024) Medusa | Li et al. (2024) EAGLE-2 | Li et al. (2025) EAGLE-3 | vLLM speculative decoding docs | SGLang speculative decoding docs | Internal benchmarks H100.