Knowledge Distillation Complete Guide (2026): Compress 671B Models to 7B with Teacher-Student Training
Distillation is how 671B frontier models become 7B consumer-GPU models. By training a small "student" to mimic a large "teacher", you can compress capabilities at 5-50x size reduction with surprisingly small quality loss. The DeepSeek R1-Distill family is the most visible 2025 example — R1-Distill-Qwen-32B captures ~85% of R1's reasoning at 1/20th the inference cost. But distillation is also the quiet workhorse behind most production small models: domain QA assistants, code completion engines, and embedding rerankers are typically distilled from larger teachers rather than trained from scratch.
This guide covers the full distillation toolkit: logit / soft-target distillation (Hinton), sequence-level distillation (works with API teachers), hidden-state distillation, on-policy distillation (GKD), rationale and chain-of-thought distillation (R1-style). Includes data pipelines, training recipes for Hugging Face TRL / Axolotl, cross-tokenizer strategies, and decision trees for when distillation beats plain fine-tuning.
Table of Contents
- What Distillation Is
- Why It Works (Information Density)
- Logit / Soft-Target Distillation
- Sequence-Level Distillation
- Hidden-State Distillation
- On-Policy Distillation (GKD)
- Rationale / Chain-of-Thought Distillation
- The R1-Distill Recipe
- Cross-Tokenizer / Cross-Family Distillation
- Data Pipeline
- Training with TRL (GKDTrainer)
- Training with Axolotl
- Custom Distillation Loop (PyTorch)
- Distillation vs Fine-Tuning Decision Tree
- Real Benchmarks
- Legal / Licensing Considerations
- Troubleshooting
- FAQ
What Distillation Is {#what-it-is}
Train a small student model S to mimic a large teacher model T. Three main signal sources:
- Soft labels (logits): T's probability distribution over the vocab at each position.
- Generated text (sequences): T's actual output tokens — used as labels for plain SFT.
- Hidden states: T's intermediate layer activations.
Loss is typically a weighted combination of distillation loss (matching T) and standard cross-entropy (matching ground truth labels if available).
Output: S that approximates T's behavior at a fraction of the parameters.
Why It Works (Information Density) {#why}
A ground-truth label "the next token is X" carries log2(vocab_size) ≈ 17 bits of information (one of 128K vocab options). A teacher's full probability distribution over the vocab is much richer — it captures relative likelihoods of plausible alternatives.
Example: prompt "The capital of France is"
- Ground truth: "Paris" (1 token, 17 bits)
- Teacher distribution: {Paris: 0.85, the: 0.05, well: 0.02, located: 0.02, ...}
The distribution tells the student: not just "Paris is correct" but also "'the' is a reasonable continuation if a sentence reformulation is needed". This dense supervision lets a small model match a big model's behavior more efficiently than learning from labels alone.
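To put a rough number on that, a quick sanity check (the vocab size here is illustrative):
import math

vocab_size = 128_000
print(math.log2(vocab_size))   # ≈ 16.97 bits: a hard label picks one of ~128K tokens
# A soft target is instead a full 128K-entry probability vector at every position,
# so each training token supervises the student's entire output distribution.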
Logit / Soft-Target Distillation {#logit}
The classic Hinton (2015) recipe:
loss = α × KL(softmax(T_logits / temp) ‖ softmax(S_logits / temp)) × temp²
       + (1 − α) × CE(S_logits, ground_truth)
temp softens the distribution (higher = softer). Typical: temp=2-4, α=0.5-0.9.
Requirements:
- Same tokenizer (so logits compare position-by-position)
- Teacher and student see the same input
Implementation in PyTorch:
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temp=2.0, alpha=0.7):
    # Soft target: KL between temperature-softened teacher and student distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)
    # Hard target: standard cross-entropy against ground-truth labels (padding masked via -100).
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss
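A quick shape check with random tensors (the values are meaningless; this only confirms the expected (batch, seq, vocab) layout):
import torch

B, L, V = 2, 16, 32_000                          # batch, sequence length, vocab size
student_logits = torch.randn(B, L, V, requires_grad=True)
teacher_logits = torch.randn(B, L, V)
labels = torch.randint(0, V, (B, L))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()                                  # gradients flow only into the student logits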
Sequence-Level Distillation {#sequence-level}
Treat teacher generations as labels; do plain SFT on student. Pipeline:
1. For each prompt p in the dataset:
       completion = teacher.generate(p)
       append (p, completion) to distillation_data
2. SFT the student on distillation_data with standard cross-entropy
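A minimal sketch of step 1 against an OpenAI-compatible endpoint (for example a local vLLM server); the base URL, model name, and file paths are placeholders:
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM-style server

with open("prompts.jsonl") as fin, open("distillation_data.jsonl", "w") as fout:
    for line in fin:
        prompt = json.loads(line)["prompt"]
        resp = client.chat.completions.create(
            model="teacher-model",                # placeholder teacher name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.6,
            max_tokens=2048,
        )
        fout.write(json.dumps({
            "prompt": prompt,
            "completion": resp.choices[0].message.content,
        }) + "\n")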
Advantages:
- No tokenizer alignment required
- Works with closed-source teachers (GPT-4o, Claude — subject to license)
- Simplest to implement
- Captures teacher's behavior, not just per-token probabilities
Disadvantages:
- Off-policy: student trains on teacher's distribution, not its own
- Can compound errors at inference (addressed by on-policy distillation below)
This is what DeepSeek used for R1-Distill — pure SFT on R1 generations.
Hidden-State Distillation {#hidden-state}
Match teacher intermediate activations:
loss = MSE(W · S_hidden_layer_j, T_hidden_layer_k)
Where W is a learned projection if hidden dims differ, j and k are paired layers (e.g., student layer 4 ↔ teacher layer 16 for a 4x compression).
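A minimal sketch of that term in PyTorch, assuming both models are called with output_hidden_states=True; the hidden dimensions and layer pairing are illustrative:
import torch.nn as nn
import torch.nn.functional as F

# Learned projection W: student hidden dim -> teacher hidden dim (e.g., 2048 -> 8192).
proj = nn.Linear(2048, 8192, bias=False)

def hidden_state_loss(student_out, teacher_out, student_layer=4, teacher_layer=16):
    # hidden_states[i] has shape (batch, seq_len, hidden_dim); the layer pairing is a design choice.
    s_h = student_out.hidden_states[student_layer]
    t_h = teacher_out.hidden_states[teacher_layer]
    return F.mse_loss(proj(s_h), t_h)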
Useful when student and teacher share an architecture family. Examples: TinyBERT and MobileBERT match hidden states and attention maps; DistilBERT pairs logit distillation with a cosine alignment loss on hidden states.
For modern decoder-only LLMs, hidden-state distillation is less common in production — sequence-level + logit distillation typically suffices.
On-Policy Distillation (GKD) {#on-policy}
GKD (Agarwal et al., 2024) addresses off-policy mismatch:
For each training step:
1. Student generates completion from prompt p (its own distribution)
2. Teacher labels each position in student's completion with logit distribution
3. Compute distillation loss on student-generated sequence
Student trains on what it would actually produce at inference, getting teacher-level guidance per token. Result: better generation quality, especially for long-form and reasoning.
Implemented in TRL via GKDTrainer. Compute cost: 2-3x off-policy distillation (need teacher pass per training step), but quality gains usually justify it for high-stakes models.
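A simplified single-prompt sketch of this loop (not the exact GKD objective, which interpolates a generalized JSD via beta and mixes on- and off-policy data via lmbda; plain forward KL on the student's own tokens stands in for it here):
import torch
import torch.nn.functional as F

def on_policy_step(student, teacher, tokenizer, prompt, max_new_tokens=256, temp=1.0):
    # 1. Student samples a completion from its own distribution (no grad during generation).
    inputs = tokenizer(prompt, return_tensors="pt").to(student.device)
    prompt_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        seq = student.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)

    # 2. Both models score the student-generated sequence (assumes both share a device).
    student_logits = student(seq).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(seq).logits[:, :-1]

    # 3. Distillation loss only on the generated (non-prompt) positions.
    s = student_logits[:, prompt_len - 1:]
    t = teacher_logits[:, prompt_len - 1:]
    loss = F.kl_div(
        F.log_softmax(s / temp, dim=-1),
        F.softmax(t / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)
    return loss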
Rationale / Chain-of-Thought Distillation {#cot}
For reasoning models: distill not just the final answer but the chain-of-thought.
Prompt: "If 3x + 5 = 20, what is x?"
Teacher (R1) output:
<think>
3x + 5 = 20
3x = 20 - 5 = 15
x = 15 / 3 = 5
</think>
x = 5
Student trains on the entire output including thinking tokens. The student learns the reasoning pattern, not just the answer.
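Stored as a training record, one such trajectory might look like the dict below (field names and chat format are illustrative; the point is that the <think> block stays inside the assistant turn so it gets supervised):
record = {
    "messages": [
        {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
        {
            "role": "assistant",
            "content": "<think>\n3x + 5 = 20\n3x = 20 - 5 = 15\nx = 15 / 3 = 5\n</think>\nx = 5",
        },
    ]
}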
This is the technique behind DeepSeek R1-Distill — and why R1-Distill-Qwen-32B beats much larger non-reasoning models on AIME math problems despite being 1/20th the size of full R1.
The R1-Distill Recipe {#r1-distill}
The actual DeepSeek R1-Distill recipe:
1. Generate trajectories: Use full R1 (671B) on ~800K math + code + science prompts. Each output includes a <think>...</think> chain plus the final answer.
2. Filter for correctness: For math/code, verify the final answer against ground truth. For open-ended reasoning, use a verifier model. Discard trajectories where R1 was wrong (~30-40% of generations).
3. SFT the student: On filtered trajectories, plain cross-entropy SFT for 2-3 epochs. No RL, no DPO.
# Pseudocode
trajectories = []
for prompt in math_prompts + code_prompts + science_prompts:
    output = r1.generate(prompt, max_tokens=8192, include_thinking=True)
    if verify_correct(output.final_answer, prompt.ground_truth):
        trajectories.append({"prompt": prompt.text, "completion": output.full_text})

# Standard SFT loop on trajectories
trainer = SFTTrainer(model=qwen_2_5_32b_base, train_dataset=trajectories, ...)
trainer.train()
Result: R1-Distill-Qwen-32B with ~85% of R1's AIME performance at 1/20th the inference cost. Released as open weights — see DeepSeek R1 Local Setup for serving.
Cross-Tokenizer / Cross-Family Distillation {#cross-family}
Sequence-level distillation is tokenizer-agnostic — works across families.
For logit distillation across families: vocabularies don't align, so KL on raw logits doesn't work. Approaches:
| Approach | Complexity | Quality |
|---|---|---|
| Sequence-level only | Easy | Good |
| Top-K alignment (only common subwords) | Medium | OK |
| Embedding-based vocab projection | Hard | Best |
| MinED (Minimum Edit Distance, 2024) | Medium | Strong |
For most practical cross-family distillation: stick with sequence-level. The loss in supervision density is offset by access to a wider variety of teachers.
Pairing tip: pick a student from a similar tokenizer family (e.g., a Qwen 2.5 student for a DeepSeek teacher, both BPE with similar vocabularies) so partial logit distillation stays viable.
Data Pipeline {#data}
Production pipeline for distillation data:
1. Curate prompt set
- Cover target domains
- Mix difficulty levels
- Include edge cases
2. Generate teacher outputs
- Batch via vLLM / SGLang or API calls
- Sample with appropriate temperature (0.5-0.7 typical)
- Optionally generate multiple completions per prompt for diversity
3. Filter / clean
- Verify correctness for math/code
- Use reward model or LLM judge for open-ended
- Deduplicate exact matches
- Quality-score and keep top X%
4. Format for training
- ChatML, Alpaca, or model-specific template
- Token-count distribution check
- Train/eval split (95/5 typical)
Storage: 1M trajectories at avg 2K tokens = ~8 GB JSONL. Compressed shards on HF Datasets work well.
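A condensed sketch of steps 3-4, assuming the raw generations are a JSONL with prompt / completion / correct fields (all names here are illustrative):
import json
import random

def build_dataset(raw_path, train_path, eval_path, eval_frac=0.05):
    # Step 3: filter / clean: drop incorrect generations and exact duplicate completions.
    seen, records = set(), []
    with open(raw_path) as f:
        for line in f:
            ex = json.loads(line)   # e.g. {"prompt": ..., "completion": ..., "correct": true}
            if not ex.get("correct", True) or ex["completion"] in seen:
                continue
            seen.add(ex["completion"])
            # Step 4: format as chat messages for a chat_template-style trainer.
            records.append({"messages": [
                {"role": "user", "content": ex["prompt"]},
                {"role": "assistant", "content": ex["completion"]},
            ]})
    # Train/eval split (95/5 by default).
    random.shuffle(records)
    split = int(len(records) * (1 - eval_frac))
    for path, chunk in [(train_path, records[:split]), (eval_path, records[split:])]:
        with open(path, "w") as out:
            for r in chunk:
                out.write(json.dumps(r) + "\n")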
Training with TRL (GKDTrainer) {#trl}
from trl import GKDTrainer, GKDConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3", torch_dtype="auto")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

config = GKDConfig(
    output_dir="./student-distilled",
    learning_rate=5e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    lmbda=0.5,   # mix of on-policy vs off-policy
    beta=0.1,    # JSD divergence interpolation
    bf16=True,
)

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
GKDTrainer supports both on-policy and off-policy via the lmbda parameter. lmbda=1.0 is fully on-policy; lmbda=0.0 is fully off-policy SFT on teacher outputs.
Training with Axolotl {#axolotl}
For sequence-level distillation, use Axolotl with teacher-generated dataset:
base_model: Qwen/Qwen2.5-7B-Instruct
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
datasets:
  - path: ./distillation_data.jsonl
    type: chat_template
sequence_len: 4096
sample_packing: true
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 1e-4
optimizer: paged_adamw_8bit
bf16: auto
output_dir: ./distilled-qwen-7b
accelerate launch -m axolotl.cli.train config.yml
This QLoRA-on-distilled-data path covers the vast majority of practical sequence-level distillation work.
Custom Distillation Loop (PyTorch) {#custom}
Full logit + sequence distillation:
import torch
import torch.nn.functional as F
def train_step(student, teacher, batch, alpha=0.7, temp=2.0):
    # Teacher forward pass: no gradients needed.
    with torch.no_grad():
        teacher_out = teacher(input_ids=batch["input_ids"]).logits
    student_out = student(input_ids=batch["input_ids"]).logits

    # Soft loss: match the teacher's temperature-softened distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_out / temp, dim=-1),
        F.softmax(teacher_out / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)

    # Hard loss: cross-entropy against ground-truth labels (padding masked via -100).
    hard_loss = F.cross_entropy(
        student_out.view(-1, student_out.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss
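train_step only returns the loss; a minimal outer loop around it might look like this (the optimizer choice, hyperparameters, and train_loader are illustrative):
from torch.optim import AdamW

optimizer = AdamW(student.parameters(), lr=5e-6)
teacher.eval()   # teacher is frozen; only the student gets gradient updates

for epoch in range(3):
    for batch in train_loader:   # batches with input_ids and labels on the right device
        loss = train_step(student, teacher, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()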
For 7B student + 70B teacher on 8x H100: ~20K tokens/sec training. Compress 70B → 7B on 1M trajectories in ~24-48 hours.
Distillation vs Fine-Tuning Decision Tree {#decision}
Pick distillation when:
- You have a strong teacher and modest labeled data
- Goal is to compress big model to small
- Need to transfer reasoning patterns or chain-of-thought
- Have budget for teacher inference (e.g., 1M trajectories at ~$10 per 1M tokens lands around $10K for V3-class teachers)
Pick fine-tuning when:
- You have abundant labeled data (100K+ examples)
- Task is narrow (classification, extraction, format conversion)
- Want to specialize behavior, not transfer general capability
- No suitable teacher exists
Hybrid (most production):
- Distill general capability from teacher → base
- SFT/DPO on task-specific labels for specialization
- Optional: continue with RL/DPO on user feedback
See QLoRA Fine-Tuning Guide and DPO / ORPO / KTO Guide for the post-distillation steps.
Real Benchmarks {#benchmarks}
R1-Distill family vs same-size baselines (AIME 2024):
| Model | AIME % | MATH-500 | GPQA Diamond |
|---|---|---|---|
| Llama 3.1 8B (no distill) | 6.7 | 51.9 | 32.0 |
| R1-Distill-Llama-8B | 50.4 | 89.1 | 49.0 |
| Qwen 2.5 7B (no distill) | 11.7 | 70.3 | 36.0 |
| R1-Distill-Qwen-7B | 55.5 | 92.8 | 49.1 |
| Qwen 2.5 32B (no distill) | 16.5 | 76.4 | 49.5 |
| R1-Distill-Qwen-32B | 72.6 | 94.3 | 62.1 |
| DeepSeek R1 (671B teacher) | 79.8 | 97.3 | 71.5 |
Distillation gives 30-60 point boosts on hard reasoning at every size — far more than additional SFT data alone could provide.
Legal / Licensing Considerations {#licensing}
Before distilling, check teacher's license / ToS:
| Teacher | Output Use Allowed for Training? | Notes |
|---|---|---|
| Llama 3.1 / 3.2 | ✓ (with attribution; 700M MAU clause) | Permitted, attribution required |
| Qwen 2.5 | ✓ | Tongyi Qianwen license permits |
| DeepSeek V3 / R1 | ✓ | DeepSeek License permits |
| OLMo 2 | ✓ | Apache 2.0 — fully unrestricted |
| GLM-4.5V | ✓ | MIT — fully unrestricted |
| Mistral (Apache) | ✓ | Apache models permit |
| Mistral (commercial) | partial | Check specific license |
| GPT-4o (OpenAI) | ✗ | OpenAI ToS prohibits "developing models that compete" |
| Claude (Anthropic) | ✗ | Similar prohibition in ToS |
| Gemini (Google) | ✗ | Similar prohibition |
For open-source distillation, stick with permissively-licensed teachers. For internal-use models that won't be redistributed, the ToS questions get murkier — consult legal counsel for production use.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Student loss diverges | Temperature too high | Lower temp to 2.0 |
| Student matches teacher on train but fails inference | Off-policy mismatch | Switch to GKD on-policy |
| Cross-tokenizer KL fails | Vocab mismatch | Use sequence-level only |
| OOM with full teacher loaded | Teacher too large | Pre-generate teacher outputs offline (sequence-level) |
| Filtering discards too much | Verifier too strict | Use partial-credit grading or LLM judge |
| Distilled student ignores chain-of-thought | Missing think tokens in data | Re-format trajectories with explicit thinking blocks |
| Quality plateaus quickly | Insufficient data diversity | Add more prompt domains, sample multiple completions per prompt |
| Distillation slower than fine-tuning | Expected | Both teacher + student passes per step (or pre-generation cost) |
FAQ {#faq}
See answers to common distillation questions below.
Sources: Hinton et al. (2015) Distilling the Knowledge in a Neural Network | Kim & Rush (2016) Sequence-Level Distillation | Agarwal et al. (2024) GKD | DeepSeek R1 paper (arXiv 2501.12948) | TRL GKDTrainer docs | MiniLLM paper | Internal benchmarks 8x H100.