
Knowledge Distillation Complete Guide (2026): Compress 671B Models to 7B with Teacher-Student Training

May 2, 2026
24 min read
LocalAimaster Research Team


Distillation is how 671B frontier models become 7B consumer-GPU models. By training a small "student" to mimic a large "teacher", you can compress capabilities at 5-50x size reduction with surprisingly small quality loss. The DeepSeek R1-Distill family is the most visible 2025 example — R1-Distill-Qwen-32B captures ~85% of R1's reasoning at 1/20th the inference cost. But distillation is also the quiet workhorse behind most production small models: domain QA assistants, code completion engines, and embedding rerankers are typically distilled from larger teachers rather than trained from scratch.

This guide covers the full distillation toolkit: logit / soft-target distillation (Hinton), sequence-level distillation (works with API teachers), hidden-state distillation, on-policy distillation (GKD), rationale and chain-of-thought distillation (R1-style). Includes data pipelines, training recipes for Hugging Face TRL / Axolotl, cross-tokenizer strategies, and decision trees for when distillation beats plain fine-tuning.

Table of Contents

  1. What Distillation Is
  2. Why It Works (Information Density)
  3. Logit / Soft-Target Distillation
  4. Sequence-Level Distillation
  5. Hidden-State Distillation
  6. On-Policy Distillation (GKD)
  7. Rationale / Chain-of-Thought Distillation
  8. The R1-Distill Recipe
  9. Cross-Tokenizer / Cross-Family Distillation
  10. Data Pipeline
  11. Training with TRL (GKDTrainer)
  12. Training with Axolotl
  13. Custom Distillation Loop (PyTorch)
  14. Distillation vs Fine-Tuning Decision Tree
  15. Real Benchmarks
  16. Legal / Licensing Considerations
  17. Troubleshooting


What Distillation Is {#what-it-is}

Train a small student model S to mimic a large teacher model T. Three main signal sources:

  1. Soft labels (logits): T's probability distribution over the vocab at each position.
  2. Generated text (sequences): T's actual output tokens — used as labels for plain SFT.
  3. Hidden states: T's intermediate layer activations.

Loss is typically a weighted combination of distillation loss (matching T) and standard cross-entropy (matching ground truth labels if available).

Output: S that approximates T's behavior at a fraction of the parameters.


Why It Works (Information Density) {#why}

A ground-truth label "the next token is X" carries log2(vocab_size) ≈ 17 bits of information (one of 128K vocab options). A teacher's full probability distribution over the vocab is much richer — it captures relative likelihoods of plausible alternatives.

Example: prompt "The capital of France is"

  • Ground truth: "Paris" (1 token, 17 bits)
  • Teacher distribution: {Paris: 0.85, the: 0.05, well: 0.02, located: 0.02, ...}

The distribution tells the student: not just "Paris is correct" but also "'the' is a reasonable continuation if a sentence reformulation is needed". This dense supervision lets a small model match a big model's behavior more efficiently than learning from labels alone.
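The 17-bit figure is easy to verify, and the toy distribution above can be inspected directly (the numbers are the illustrative ones from this section, not real model outputs):

```python
import math

# Information in a hard label: one choice out of a 128K-token vocabulary.
hard_label_bits = math.log2(128_000)
print(f"hard label: {hard_label_bits:.2f} bits")  # ~16.97

# The teacher's (truncated) soft targets from the example above.
teacher_probs = {"Paris": 0.85, "the": 0.05, "well": 0.02, "located": 0.02}

# Ranked alternatives — the extra signal a hard label throws away.
for token, p in sorted(teacher_probs.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {p:.2f}")
```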


Logit / Soft-Target Distillation {#logit}

The classic Hinton (2015) recipe:

loss = α × KL(softmax(T_logits / temp), softmax(S_logits / temp)) × temp²
       + (1-α) × CE(S_logits, ground_truth)

temp softens the distribution (higher = softer). Typical: temp=2-4, α=0.5-0.9.
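The effect of temp is easy to see with toy numbers (the logits below are hypothetical, not from any real model):

```python
import math

def softmax(logits, temp=1.0):
    # Divide logits by temp before normalizing; temp > 1 flattens the
    # distribution, exposing more mass on low-probability tokens.
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 4.0, 2.0, 0.0]  # four-token toy vocabulary
for t in (1.0, 2.0, 4.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
```

Higher temperatures surface the "dark knowledge" in the tail of the distribution, which is exactly what the student is meant to learn from.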

Requirements:

  • Same tokenizer (so logits compare position-by-position)
  • Teacher and student see the same input

Implementation in PyTorch:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temp=2.0, alpha=0.7):
    vocab_size = student_logits.size(-1)
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by temp² to keep gradient magnitudes constant (Hinton 2015)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)
    hard_loss = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,  # skip masked/padded positions
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss


Sequence-Level Distillation {#sequence-level}

Treat teacher generations as labels; do plain SFT on student. Pipeline:

1. For each prompt p in dataset:
     completion = teacher.generate(p)
     append (p, completion) to distillation_data
2. SFT student on distillation_data with standard cross-entropy
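A minimal sketch of steps 1-2, with the teacher abstracted behind a callable so any backend (vLLM batch generation, an API client) can slot in — `generate_fn` and the stub below are illustrative placeholders, not a real API:

```python
def build_distillation_dataset(prompts, generate_fn):
    """Pair each prompt with a teacher completion; the result is plain
    SFT data, so no tokenizer alignment with the teacher is needed."""
    data = []
    for p in prompts:
        completion = generate_fn(p)  # call out to the teacher
        data.append({"prompt": p, "completion": completion})
    return data

# Stub teacher for illustration:
dataset = build_distillation_dataset(["What is 2+2?"], lambda p: "4")
print(dataset)  # [{'prompt': 'What is 2+2?', 'completion': '4'}]
```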

Advantages:

  • No tokenizer alignment required
  • Works with closed-source teachers (GPT-4o, Claude — subject to license)
  • Simplest to implement
  • Captures teacher's behavior, not just per-token probabilities

Disadvantages:

  • Off-policy: student trains on teacher's distribution, not its own
  • Can compound errors at inference (covered by on-policy distillation below)

This is what DeepSeek used for R1-Distill — pure SFT on R1 generations.


Hidden-State Distillation {#hidden-state}

Match teacher intermediate activations:

loss = MSE(W · S_hidden_layer_j, T_hidden_layer_k)

Where W is a learned projection if hidden dims differ, j and k are paired layers (e.g., student layer 4 ↔ teacher layer 16 for a 4x compression).

Useful when student and teacher share architecture family. Examples: TinyBERT, MobileBERT, DistilBERT — all distilled BERT into smaller versions using hidden-state matching.

For modern decoder-only LLMs, hidden-state distillation is less common in production — sequence-level + logit distillation typically suffices.
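For reference, the matching loss with a learned projection looks like this — a sketch with toy dimensions (512-dim student, 2048-dim teacher, chosen for illustration only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learned projection W: student hidden dim -> teacher hidden dim.
proj = nn.Linear(512, 2048, bias=False)

def hidden_state_loss(student_hidden, teacher_hidden):
    # student_hidden: (batch, seq, 512) from student layer j
    # teacher_hidden: (batch, seq, 2048) from teacher layer k
    return F.mse_loss(proj(student_hidden), teacher_hidden)

s_h = torch.randn(2, 16, 512)    # fake student activations
t_h = torch.randn(2, 16, 2048)   # fake teacher activations
loss = hidden_state_loss(s_h, t_h)
```

In a real run, `proj` trains jointly with the student, and this term is added to the logit/CE losses with its own weight.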


On-Policy Distillation (GKD) {#on-policy}

GKD (Agarwal et al., 2024) addresses off-policy mismatch:

For each training step:
    1. Student generates completion from prompt p (its own distribution)
    2. Teacher labels each position in student's completion with logit distribution
    3. Compute distillation loss on student-generated sequence

Student trains on what it would actually produce at inference, getting teacher-level guidance per token. Result: better generation quality, especially for long-form and reasoning.

Implemented in TRL via GKDTrainer. Compute cost: 2-3x off-policy distillation (need teacher pass per training step), but quality gains usually justify it for high-stakes models.


Rationale / Chain-of-Thought Distillation {#cot}

For reasoning models: distill not just the final answer but the chain-of-thought.

Prompt: "If 3x + 5 = 20, what is x?"
Teacher (R1) output:
<think>
3x + 5 = 20
3x = 20 - 5 = 15
x = 15 / 3 = 5
</think>
x = 5

Student trains on the entire output including thinking tokens. The student learns the reasoning pattern, not just the answer.

This is the technique behind DeepSeek R1-Distill — and why R1-Distill-Qwen-32B beats much larger non-reasoning models on AIME math problems despite being 1/20th the size of full R1.


The R1-Distill Recipe {#r1-distill}

The actual DeepSeek R1-Distill recipe:

  1. Generate trajectories: Use full R1 (671B) on ~800K math + code + science prompts. Each output includes <think>...</think> chain plus final answer.

  2. Filter for correctness: For math/code, verify final answer against ground truth. For open-ended reasoning, use a verifier model. Discard trajectories where R1 was wrong (~30-40% of generations).

  3. SFT the student: On filtered trajectories, plain cross-entropy SFT for 2-3 epochs. No RL, no DPO.

# Pseudocode
trajectories = []
for prompt in math_prompts + code_prompts + science_prompts:
    output = r1.generate(prompt, max_tokens=8192, include_thinking=True)
    if verify_correct(output.final_answer, prompt.ground_truth):
        trajectories.append({"prompt": prompt.text, "completion": output.full_text})

# Standard SFT loop on trajectories
trainer = SFTTrainer(model=qwen_2_5_32b_base, train_dataset=trajectories, ...)
trainer.train()

Result: R1-Distill-Qwen-32B with ~85% of R1's AIME performance at 1/20th the inference cost. Released as open weights — see DeepSeek R1 Local Setup for serving.


Cross-Tokenizer / Cross-Family Distillation {#cross-family}

Sequence-level distillation is tokenizer-agnostic — works across families.

For logit distillation across families: vocabularies don't align, so KL on raw logits doesn't work. Approaches:

| Approach | Complexity | Quality |
| --- | --- | --- |
| Sequence-level only | Easy | Good |
| Top-K alignment (only common subwords) | Medium | OK |
| Embedding-based vocab projection | Hard | Best |
| MinED (Minimum Edit Distance, 2024) | Medium | Strong |

For most practical cross-family distillation: stick with sequence-level. The loss in supervision density is offset by access to a wider variety of teachers.

Pairing tip: pick student with similar tokenizer family (Qwen 2.5 student for DeepSeek teacher, both BPE with similar vocab) for partial logit-distillation viability.


Data Pipeline {#data}

Production pipeline for distillation data:

1. Curate prompt set
   - Cover target domains
   - Mix difficulty levels
   - Include edge cases

2. Generate teacher outputs
   - Batch via vLLM / SGLang or API calls
   - Sample with appropriate temperature (0.5-0.7 typical)
   - Optionally generate multiple completions per prompt for diversity

3. Filter / clean
   - Verify correctness for math/code
   - Use reward model or LLM judge for open-ended
   - Deduplicate exact matches
   - Quality-score and keep top X%

4. Format for training
   - ChatML, Alpaca, or model-specific template
   - Token-count distribution check
   - Train/eval split (95/5 typical)

Storage: 1M trajectories at avg 2K tokens = ~8 GB JSONL. Compressed shards on HF Datasets work well.
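Steps 3-4 are plain data wrangling; a minimal sketch of dedup, quality filtering, and the train/eval split (`score_fn` stands in for whatever verifier, reward model, or LLM judge you use — it's a placeholder, not a real API):

```python
import random

def prepare_training_data(rows, score_fn, keep_fraction=0.8, eval_fraction=0.05, seed=0):
    """Dedupe, quality-filter, and split teacher trajectories.

    rows: list of {"prompt": ..., "completion": ...} dicts.
    score_fn: placeholder for your verifier / reward model / LLM judge.
    """
    # 1. Deduplicate exact (prompt, completion) matches.
    seen, unique = set(), []
    for r in rows:
        key = (r["prompt"], r["completion"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 2. Quality-score and keep the top fraction.
    unique.sort(key=score_fn, reverse=True)
    kept = unique[: max(1, int(len(unique) * keep_fraction))]

    # 3. Shuffle, then carve out a small eval split (95/5 typical).
    random.Random(seed).shuffle(kept)
    n_eval = max(1, int(len(kept) * eval_fraction))
    return kept[n_eval:], kept[:n_eval]
```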


Training with TRL (GKDTrainer) {#trl}

from trl import GKDTrainer, GKDConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3", torch_dtype="auto")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

config = GKDConfig(
    output_dir="./student-distilled",
    learning_rate=5e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    lmbda=0.5,           # mix of on-policy vs off-policy
    beta=0.1,            # JSD divergence interpolation
    bf16=True,
)

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()

GKDTrainer supports both on-policy and off-policy via the lmbda parameter. lmbda=1.0 is fully on-policy; lmbda=0.0 is fully off-policy SFT on teacher outputs.


Training with Axolotl {#axolotl}

For sequence-level distillation, use Axolotl with teacher-generated dataset:

base_model: Qwen/Qwen2.5-7B-Instruct
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64

datasets:
  - path: ./distillation_data.jsonl
    type: chat_template

sequence_len: 4096
sample_packing: true

micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 1e-4
optimizer: paged_adamw_8bit
bf16: auto
output_dir: ./distilled-qwen-7b

Launch training with:

accelerate launch -m axolotl.cli.train config.yml

This QLoRA-on-distilled-data path covers the vast majority of practical sequence-level distillation work.


Custom Distillation Loop (PyTorch) {#custom}

Full logit + sequence distillation:

import torch
import torch.nn.functional as F

def train_step(student, teacher, batch, alpha=0.7, temp=2.0):
    with torch.no_grad():
        teacher_out = teacher(input_ids=batch["input_ids"]).logits
    student_out = student(input_ids=batch["input_ids"]).logits

    soft_loss = F.kl_div(
        F.log_softmax(student_out / temp, dim=-1),
        F.softmax(teacher_out / temp, dim=-1),
        reduction="batchmean",
    ) * (temp ** 2)
    hard_loss = F.cross_entropy(
        student_out.view(-1, student_out.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss

For 7B student + 70B teacher on 8x H100: ~20K tokens/sec training. Compress 70B → 7B on 1M trajectories in ~24-48 hours.


Distillation vs Fine-Tuning Decision Tree {#decision}

Pick distillation when:

  • You have a strong teacher and modest labeled data
  • Goal is to compress big model to small
  • Need to transfer reasoning patterns or chain-of-thought
  • Have budget for teacher inference (1M trajectories at ~2K tokens each is ~2B generated tokens; at ~$5 per million output tokens, roughly $10K for V3-class teachers)

Pick fine-tuning when:

  • You have abundant labeled data (100K+ examples)
  • Task is narrow (classification, extraction, format conversion)
  • Want to specialize behavior, not transfer general capability
  • No suitable teacher exists

Hybrid (most production):

  1. Distill general capability from teacher → base
  2. SFT/DPO on task-specific labels for specialization
  3. Optional: continue with RL/DPO on user feedback

See QLoRA Fine-Tuning Guide and DPO / ORPO / KTO Guide for the post-distillation steps.


Real Benchmarks {#benchmarks}

R1-Distill family vs same-size baselines (AIME 2024):

| Model | AIME % | MATH-500 | GPQA Diamond |
| --- | --- | --- | --- |
| Llama 3.1 8B (no distill) | 6.7 | 51.9 | 32.0 |
| R1-Distill-Llama-8B | 50.4 | 89.1 | 49.0 |
| Qwen 2.5 7B (no distill) | 11.7 | 70.3 | 36.0 |
| R1-Distill-Qwen-7B | 55.5 | 92.8 | 49.1 |
| Qwen 2.5 32B (no distill) | 16.5 | 76.4 | 49.5 |
| R1-Distill-Qwen-32B | 72.6 | 94.3 | 62.1 |
| DeepSeek R1 (671B teacher) | 79.8 | 97.3 | 71.5 |

Distillation gives 30-60 point boosts on hard reasoning at every size — far more than additional SFT data alone could provide.


Legal / Licensing Considerations {#legal}

Before distilling, check the teacher's license / ToS:

| Teacher | Output Use Allowed for Training? | Notes |
| --- | --- | --- |
| Llama 3.1 / 3.2 | ✓ (with attribution; 700M MAU clause) | Permitted, attribution required |
| Qwen 2.5 | ✓ | Tongyi Qianwen license permits |
| DeepSeek V3 / R1 | ✓ | DeepSeek License permits |
| OLMo 2 | ✓ | Apache 2.0 — fully unrestricted |
| GLM-4.5V | ✓ | MIT — fully unrestricted |
| Mistral (Apache) | ✓ | Apache models permit |
| Mistral (commercial) | Partial | Check specific license |
| GPT-4o (OpenAI) | ✗ | OpenAI ToS prohibits "developing models that compete" |
| Claude (Anthropic) | ✗ | Similar prohibition in ToS |
| Gemini (Google) | ✗ | Similar prohibition |

For open-source distillation, stick with permissively-licensed teachers. For internal-use models that won't be redistributed, the ToS questions get murkier — consult legal counsel for production use.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| Student loss diverges | Temperature too high | Lower temp to 2.0 |
| Student matches teacher on train but fails at inference | Off-policy mismatch | Switch to GKD on-policy |
| Cross-tokenizer KL fails | Vocab mismatch | Use sequence-level only |
| OOM with full teacher loaded | Teacher too large | Pre-generate teacher outputs offline (sequence-level) |
| Filtering discards too much | Verifier too strict | Use partial-credit grading or LLM judge |
| Distilled student ignores chain-of-thought | Missing think tokens in data | Re-format trajectories with explicit thinking blocks |
| Quality plateaus quickly | Insufficient data diversity | Add more prompt domains, sample multiple completions per prompt |
| Distillation slower than fine-tuning | Expected | Both teacher and student passes per step (or pre-generation cost) |



Sources: Hinton et al. (2015) Distilling the Knowledge in a Neural Network | Kim & Rush (2016) Sequence-Level Distillation | Agarwal et al. (2024) GKD | DeepSeek R1 paper (arXiv 2501.12948) | TRL GKDTrainer docs | MiniLLM paper | Internal benchmarks 8x H100.

Written by Pattanaik Ramswarup, creator of Local AI Master.