Knowledge Distillation Complete Guide (2026): Compress 671B Models to 7B with Teacher-Student Training
Distillation is how 671B frontier models become 7B consumer-GPU models. By training a small "student" to mimic a large "teacher", you can compress capabilities at 5-50x size reduction with surprisingly small quality loss. The DeepSeek R1-Distill family is the most visible 2025 example — R1-Distill-Qwen-32B captures ~85% of R1's reasoning at 1/20th the inference cost. But distillation is also the quiet workhorse behind most production small models: domain QA assistants, code completion engines, and embedding rerankers are typically distilled from larger teachers rather than trained from scratch.
This guide covers the full distillation toolkit: logit / soft-target distillation (Hinton), sequence-level distillation (works with API teachers), hidden-state distillation, on-policy distillation (GKD), rationale and chain-of-thought distillation (R1-style). Includes data pipelines, training recipes for Hugging Face TRL / Axolotl, cross-tokenizer strategies, and decision trees for when distillation beats plain fine-tuning.
Table of Contents
- What Distillation Is
- Why It Works (Information Density)
- Logit / Soft-Target Distillation
- Sequence-Level Distillation
- Hidden-State Distillation
- On-Policy Distillation (GKD)
- Rationale / Chain-of-Thought Distillation
- The R1-Distill Recipe
- Cross-Tokenizer / Cross-Family Distillation
- Data Pipeline
- Training with TRL (GKDTrainer)
- Training with Axolotl
- Custom Distillation Loop (PyTorch)
- Distillation vs Fine-Tuning Decision Tree
- Real Benchmarks
- Legal / Licensing Considerations
- Troubleshooting
- FAQ
What Distillation Is {#what-it-is}
Train a small student model S to mimic a large teacher model T. Three main signal sources:
- Soft labels (logits): T's probability distribution over the vocab at each position.
- Generated text (sequences): T's actual output tokens — used as labels for plain SFT.
- Hidden states: T's intermediate layer activations.
Loss is typically a weighted combination of distillation loss (matching T) and standard cross-entropy (matching ground truth labels if available).
Output: S that approximates T's behavior at a fraction of the parameters.
Why It Works (Information Density) {#why}
A ground-truth label "the next token is X" carries log2(vocab_size) ≈ 17 bits of information (one of 128K vocab options). A teacher's full probability distribution over the vocab is much richer — it captures relative likelihoods of plausible alternatives.
Example: prompt "The capital of France is"
- Ground truth: "Paris" (1 token, 17 bits)
- Teacher distribution: {Paris: 0.85, the: 0.05, well: 0.02, located: 0.02, ...}
The distribution tells the student: not just "Paris is correct" but also "'the' is a reasonable continuation if a sentence reformulation is needed". This dense supervision lets a small model match a big model's behavior more efficiently than learning from labels alone.
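To put a rough number on that, a quick sanity check (the vocab size here is illustrative):
import math

vocab_size = 128_000
print(math.log2(vocab_size))   # ≈ 16.97 bits: a hard label picks one of ~128K tokens
# A soft target is instead a full 128K-entry probability vector at every position,
# so each training token supervises the student's entire output distribution.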
Logit / Soft-Target Distillation {#logit}
The classic Hinton (2015) recipe:
loss = α × KL(softmax(T_logits / temp) ‖ softmax(S_logits / temp)) × temp²
       + (1 − α) × CE(S_logits, ground_truth)
temp softens the distribution (higher = softer). Typical: temp=2-4, α=0.5-0.9.
Requirements:
- Same tokenizer (so logits compare position-by-position)
- Teacher and student see the same input
Implementation in PyTorch:
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temp=2.0, alpha=0.7):
    # Soft target: KL between temperature-softened teacher and student distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)
    # Hard target: standard cross-entropy against ground-truth labels (padding masked via -100).
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss
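A quick shape check with random tensors (the values are meaningless; this only confirms the expected (batch, seq, vocab) layout):
import torch

B, L, V = 2, 16, 32_000                          # batch, sequence length, vocab size
student_logits = torch.randn(B, L, V, requires_grad=True)
teacher_logits = torch.randn(B, L, V)
labels = torch.randint(0, V, (B, L))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()                                  # gradients flow only into the student logits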
Sequence-Level Distillation {#sequence-level}
Treat teacher generations as labels; do plain SFT on student. Pipeline:
1. For each prompt p in the dataset:
       completion = teacher.generate(p)
       append (p, completion) to distillation_data
2. SFT the student on distillation_data with standard cross-entropy
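A minimal sketch of step 1 against an OpenAI-compatible endpoint (for example a local vLLM server); the base URL, model name, and file paths are placeholders:
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM-style server

with open("prompts.jsonl") as fin, open("distillation_data.jsonl", "w") as fout:
    for line in fin:
        prompt = json.loads(line)["prompt"]
        resp = client.chat.completions.create(
            model="teacher-model",                # placeholder teacher name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.6,
            max_tokens=2048,
        )
        fout.write(json.dumps({
            "prompt": prompt,
            "completion": resp.choices[0].message.content,
        }) + "\n")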
Advantages:
- No tokenizer alignment required
- Works with closed-source teachers (GPT-4o, Claude — subject to license)
- Simplest to implement
- Captures teacher's behavior, not just per-token probabilities
Disadvantages:
- Off-policy: student trains on teacher's distribution, not its own
- Can compound errors at inference (addressed by on-policy distillation below)
This is what DeepSeek used for R1-Distill — pure SFT on R1 generations.
Hidden-State Distillation {#hidden-state}
Match teacher intermediate activations:
loss = MSE(W · S_hidden_layer_j, T_hidden_layer_k)
Where W is a learned projection if hidden dims differ, j and k are paired layers (e.g., student layer 4 ↔ teacher layer 16 for a 4x compression).
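A minimal sketch of that term in PyTorch, assuming both models are called with output_hidden_states=True; the hidden dimensions and layer pairing are illustrative:
import torch.nn as nn
import torch.nn.functional as F

# Learned projection W: student hidden dim -> teacher hidden dim (e.g., 2048 -> 8192).
proj = nn.Linear(2048, 8192, bias=False)

def hidden_state_loss(student_out, teacher_out, student_layer=4, teacher_layer=16):
    # hidden_states[i] has shape (batch, seq_len, hidden_dim); the layer pairing is a design choice.
    s_h = student_out.hidden_states[student_layer]
    t_h = teacher_out.hidden_states[teacher_layer]
    return F.mse_loss(proj(s_h), t_h)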
Useful when student and teacher share an architecture family. Examples: TinyBERT and MobileBERT match hidden states and attention maps; DistilBERT pairs logit distillation with a cosine alignment loss on hidden states.
For modern decoder-only LLMs, hidden-state distillation is less common in production — sequence-level + logit distillation typically suffices.
On-Policy Distillation (GKD) {#on-policy}
GKD (Agarwal et al., 2024) addresses off-policy mismatch:
For each training step:
1. Student generates completion from prompt p (its own distribution)
2. Teacher labels each position in student's completion with logit distribution
3. Compute distillation loss on student-generated sequence
Student trains on what it would actually produce at inference, getting teacher-level guidance per token. Result: better generation quality, especially for long-form and reasoning.
Implemented in TRL via GKDTrainer. Compute cost: 2-3x off-policy distillation (need teacher pass per training step), but quality gains usually justify it for high-stakes models.
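A simplified single-prompt sketch of this loop (not the exact GKD objective, which interpolates a generalized JSD via beta and mixes on- and off-policy data via lmbda; plain forward KL on the student's own tokens stands in for it here):
import torch
import torch.nn.functional as F

def on_policy_step(student, teacher, tokenizer, prompt, max_new_tokens=256, temp=1.0):
    # 1. Student samples a completion from its own distribution (no grad during generation).
    inputs = tokenizer(prompt, return_tensors="pt").to(student.device)
    prompt_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        seq = student.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)

    # 2. Both models score the student-generated sequence (assumes both share a device).
    student_logits = student(seq).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(seq).logits[:, :-1]

    # 3. Distillation loss only on the generated (non-prompt) positions.
    s = student_logits[:, prompt_len - 1:]
    t = teacher_logits[:, prompt_len - 1:]
    loss = F.kl_div(
        F.log_softmax(s / temp, dim=-1),
        F.softmax(t / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)
    return loss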
Rationale / Chain-of-Thought Distillation {#cot}
For reasoning models: distill not just the final answer but the chain-of-thought.
Prompt: "If 3x + 5 = 20, what is x?"
Teacher (R1) output:
<think>
3x + 5 = 20
3x = 20 - 5 = 15
x = 15 / 3 = 5
</think>
x = 5
Student trains on the entire output including thinking tokens. The student learns the reasoning pattern, not just the answer.
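Stored as a training record, one such trajectory might look like the dict below (field names and chat format are illustrative; the point is that the <think> block stays inside the assistant turn so it gets supervised):
record = {
    "messages": [
        {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
        {
            "role": "assistant",
            "content": "<think>\n3x + 5 = 20\n3x = 20 - 5 = 15\nx = 15 / 3 = 5\n</think>\nx = 5",
        },
    ]
}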
This is the technique behind DeepSeek R1-Distill — and why R1-Distill-Qwen-32B beats much larger non-reasoning models on AIME math problems despite being 1/20th the size of full R1.
The R1-Distill Recipe {#r1-distill}
The actual DeepSeek R1-Distill recipe:
1. Generate trajectories: Use full R1 (671B) on ~800K math + code + science prompts. Each output includes a <think>...</think> chain plus the final answer.
2. Filter for correctness: For math/code, verify the final answer against ground truth. For open-ended reasoning, use a verifier model. Discard trajectories where R1 was wrong (~30-40% of generations).
3. SFT the student: On filtered trajectories, plain cross-entropy SFT for 2-3 epochs. No RL, no DPO.
# Pseudocode
trajectories = []
for prompt in math_prompts + code_prompts + science_prompts:
    output = r1.generate(prompt, max_tokens=8192, include_thinking=True)
    if verify_correct(output.final_answer, prompt.ground_truth):
        trajectories.append({"prompt": prompt.text, "completion": output.full_text})

# Standard SFT loop on trajectories
trainer = SFTTrainer(model=qwen_2_5_32b_base, train_dataset=trajectories, ...)
trainer.train()
Result: R1-Distill-Qwen-32B with ~85% of R1's AIME performance at 1/20th the inference cost. Released as open weights — see DeepSeek R1 Local Setup for serving.
Cross-Tokenizer / Cross-Family Distillation {#cross-family}
Sequence-level distillation is tokenizer-agnostic — works across families.
For logit distillation across families: vocabularies don't align, so KL on raw logits doesn't work. Approaches:
| Approach | Complexity | Quality |
|---|---|---|
| Sequence-level only | Easy | Good |
| Top-K alignment (only common subwords) | Medium | OK |
| Embedding-based vocab projection | Hard | Best |
| MinED (Minimum Edit Distance, 2024) | Medium | Strong |
For most practical cross-family distillation: stick with sequence-level. The loss in supervision density is offset by access to a wider variety of teachers.
Pairing tip: pick a student from a similar tokenizer family (e.g., a Qwen 2.5 student for a DeepSeek teacher, both BPE with similar vocabularies) so partial logit distillation stays viable.
Data Pipeline {#data}
Production pipeline for distillation data:
1. Curate prompt set
- Cover target domains
- Mix difficulty levels
- Include edge cases
2. Generate teacher outputs
- Batch via vLLM / SGLang or API calls
- Sample with appropriate temperature (0.5-0.7 typical)
- Optionally generate multiple completions per prompt for diversity
3. Filter / clean
- Verify correctness for math/code
- Use reward model or LLM judge for open-ended
- Deduplicate exact matches
- Quality-score and keep top X%
4. Format for training
- ChatML, Alpaca, or model-specific template
- Token-count distribution check
- Train/eval split (95/5 typical)
Storage: 1M trajectories at avg 2K tokens = ~8 GB JSONL. Compressed shards on HF Datasets work well.
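A condensed sketch of steps 3-4, assuming the raw generations are a JSONL with prompt / completion / correct fields (all names here are illustrative):
import json
import random

def build_dataset(raw_path, train_path, eval_path, eval_frac=0.05):
    # Step 3: filter / clean: drop incorrect generations and exact duplicate completions.
    seen, records = set(), []
    with open(raw_path) as f:
        for line in f:
            ex = json.loads(line)   # e.g. {"prompt": ..., "completion": ..., "correct": true}
            if not ex.get("correct", True) or ex["completion"] in seen:
                continue
            seen.add(ex["completion"])
            # Step 4: format as chat messages for a chat_template-style trainer.
            records.append({"messages": [
                {"role": "user", "content": ex["prompt"]},
                {"role": "assistant", "content": ex["completion"]},
            ]})
    # Train/eval split (95/5 by default).
    random.shuffle(records)
    split = int(len(records) * (1 - eval_frac))
    for path, chunk in [(train_path, records[:split]), (eval_path, records[split:])]:
        with open(path, "w") as out:
            for r in chunk:
                out.write(json.dumps(r) + "\n")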
Training with TRL (GKDTrainer) {#trl}
from trl import GKDTrainer, GKDConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3", torch_dtype="auto")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

config = GKDConfig(
    output_dir="./student-distilled",
    learning_rate=5e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    lmbda=0.5,   # mix of on-policy vs off-policy
    beta=0.1,    # JSD divergence interpolation
    bf16=True,
)

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
GKDTrainer supports both on-policy and off-policy via the lmbda parameter. lmbda=1.0 is fully on-policy; lmbda=0.0 is fully off-policy SFT on teacher outputs.
Training with Axolotl {#axolotl}
For sequence-level distillation, use Axolotl with teacher-generated dataset:
base_model: Qwen/Qwen2.5-7B-Instruct
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
datasets:
  - path: ./distillation_data.jsonl
    type: chat_template
sequence_len: 4096
sample_packing: true
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 1e-4
optimizer: paged_adamw_8bit
bf16: auto
output_dir: ./distilled-qwen-7b
accelerate launch -m axolotl.cli.train config.yml
This QLoRA-on-distilled-data path covers the vast majority of practical sequence-level distillation work.
Custom Distillation Loop (PyTorch) {#custom}
Full logit + sequence distillation:
import torch
import torch.nn.functional as F
def train_step(student, teacher, batch, alpha=0.7, temp=2.0):
    # Teacher forward pass: no gradients needed.
    with torch.no_grad():
        teacher_out = teacher(input_ids=batch["input_ids"]).logits
    student_out = student(input_ids=batch["input_ids"]).logits

    # Soft loss: match the teacher's temperature-softened distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_out / temp, dim=-1),
        F.softmax(teacher_out / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)

    # Hard loss: cross-entropy against ground-truth labels (padding masked via -100).
    hard_loss = F.cross_entropy(
        student_out.view(-1, student_out.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss
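train_step only returns the loss; a minimal outer loop around it might look like this (the optimizer choice, hyperparameters, and train_loader are illustrative):
from torch.optim import AdamW

optimizer = AdamW(student.parameters(), lr=5e-6)
teacher.eval()   # teacher is frozen; only the student gets gradient updates

for epoch in range(3):
    for batch in train_loader:   # batches with input_ids and labels on the right device
        loss = train_step(student, teacher, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()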
For 7B student + 70B teacher on 8x H100: ~20K tokens/sec training. Compress 70B → 7B on 1M trajectories in ~24-48 hours.
Distillation vs Fine-Tuning Decision Tree {#decision}
Pick distillation when:
- You have a strong teacher and modest labeled data
- Goal is to compress big model to small
- Need to transfer reasoning patterns or chain-of-thought
- Have budget for teacher inference (e.g., 1M trajectories at ~$10 per 1M tokens lands around $10K for V3-class teachers)
Pick fine-tuning when:
- You have abundant labeled data (100K+ examples)
- Task is narrow (classification, extraction, format conversion)
- Want to specialize behavior, not transfer general capability
- No suitable teacher exists
Hybrid (most production):
- Distill general capability from teacher → base
- SFT/DPO on task-specific labels for specialization
- Optional: continue with RL/DPO on user feedback
See QLoRA Fine-Tuning Guide and DPO / ORPO / KTO Guide for the post-distillation steps.
Real Benchmarks {#benchmarks}
R1-Distill family vs same-size baselines (AIME 2024):
| Model | AIME % | MATH-500 | GPQA Diamond |
|---|---|---|---|
| Llama 3.1 8B (no distill) | 6.7 | 51.9 | 32.0 |
| R1-Distill-Llama-8B | 50.4 | 89.1 | 49.0 |
| Qwen 2.5 7B (no distill) | 11.7 | 70.3 | 36.0 |
| R1-Distill-Qwen-7B | 55.5 | 92.8 | 49.1 |
| Qwen 2.5 32B (no distill) | 16.5 | 76.4 | 49.5 |
| R1-Distill-Qwen-32B | 72.6 | 94.3 | 62.1 |
| DeepSeek R1 (671B teacher) | 79.8 | 97.3 | 71.5 |
Distillation gives 30-60 point boosts on hard reasoning at every size — far more than additional SFT data alone could provide.
Legal / Licensing Considerations {#licensing}
Before distilling, check teacher's license / ToS:
| Teacher | Output Use Allowed for Training? | Notes |
|---|---|---|
| Llama 3.1 / 3.2 | ✓ (with attribution; 700M MAU clause) | Permitted, attribution required |
| Qwen 2.5 | ✓ | Tongyi Qianwen license permits |
| DeepSeek V3 / R1 | ✓ | DeepSeek License permits |
| OLMo 2 | ✓ | Apache 2.0 — fully unrestricted |
| GLM-4.5V | ✓ | MIT — fully unrestricted |
| Mistral (Apache) | ✓ | Apache models permit |
| Mistral (commercial) | partial | Check specific license |
| GPT-4o (OpenAI) | ✗ | OpenAI ToS prohibits "developing models that compete" |
| Claude (Anthropic) | ✗ | Similar prohibition in ToS |
| Gemini (Google) | ✗ | Similar prohibition |
For open-source distillation, stick with permissively-licensed teachers. For internal-use models that won't be redistributed, the ToS questions get murkier — consult legal counsel for production use.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Student loss diverges | Temperature too high | Lower temp to 2.0 |
| Student matches teacher on train but fails inference | Off-policy mismatch | Switch to GKD on-policy |
| Cross-tokenizer KL fails | Vocab mismatch | Use sequence-level only |
| OOM with full teacher loaded | Teacher too large | Pre-generate teacher outputs offline (sequence-level) |
| Filtering discards too much | Verifier too strict | Use partial-credit grading or LLM judge |
| Distilled student ignores chain-of-thought | Missing think tokens in data | Re-format trajectories with explicit thinking blocks |
| Quality plateaus quickly | Insufficient data diversity | Add more prompt domains, sample multiple completions per prompt |
| Distillation slower than fine-tuning | Expected | Both teacher + student passes per step (or pre-generation cost) |
FAQ {#faq}
See answers to common distillation questions below.
Sources: Hinton et al. (2015) Distilling the Knowledge in a Neural Network | Kim & Rush (2016) Sequence-Level Distillation | Agarwal et al. (2024) GKD | DeepSeek R1 paper (arXiv 2501.12948) | TRL GKDTrainer docs | MiniLLM paper | Internal benchmarks 8x H100.