# Knowledge Distillation: From Supervised Fine-Tuning to On-Policy Distillation
## 1. Motivation

Large language models (LLMs) are often trained in multiple stages:

- pretraining on large corpora,
- supervised fine-tuning (SFT),
- and sometimes reinforcement learning-based alignment.

Distillation plays a central role in this pipeline. It allows a student model to learn from a stronger teacher model, transferring capabilities while reducing cost, latency, or model size.

This tutorial introduces:

- off-policy distillation (including SFT),
- on-policy distillation,
- and the role of forward vs. reverse KL divergence.
## 2. What Is Distillation?

At a high level, distillation trains a student policy \( \pi_\theta \) to imitate a teacher policy \( \pi_T \).

The key questions are:

- Where do training trajectories come from?
- What loss is optimized?

Different answers lead to very different training dynamics.
## 3. Supervised Fine-Tuning (SFT) as Distillation

### 3.1 Next-token prediction loss

In SFT, the student is trained on a dataset of prompt–response pairs:

$$
\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{N}.
$$

The objective is the negative log-likelihood (NLL):

$$
\mathcal{L}_{\text{SFT}}(\theta)
= - \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \right].
$$

This is often called:

- next-token prediction,
- cross-entropy loss,
- teacher forcing.
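In code, this is a masked cross-entropy over the response tokens. Below is a minimal sketch, assuming a Hugging Face-style causal LM whose forward pass returns logits of shape `(batch, seq_len, vocab_size)`; the names `model`, `input_ids`, `attention_mask`, and `labels` are illustrative rather than taken from the text above.

```python
import torch.nn.functional as F

def sft_nll_loss(model, input_ids, attention_mask, labels):
    """Next-token prediction (teacher forcing) loss.

    `labels` is a copy of `input_ids` with prompt and padding positions set
    to -100 so that only response tokens contribute to the loss.
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so that the logits at position t predict the token at position t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```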
### 3.2 SFT as forward KL divergence

SFT can be interpreted as minimizing a forward KL divergence.

Define a teacher distribution that is one-hot on the observed token:

$$
\pi_T(v \mid x, y_{<t}) = \mathbb{1}[v = y_t].
$$

Then, because a one-hot distribution has zero entropy,

$$
D_{\mathrm{KL}}\!\left( \pi_T(\cdot \mid x, y_{<t}) \,\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right)
= - \log \pi_\theta(y_t \mid x, y_{<t}).
$$

Summing over positions and averaging over the dataset recovers exactly the SFT objective above. Thus:

**SFT is forward KL minimization where the teacher distribution is one-hot.**
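A quick numerical check makes this concrete (a sketch; the tensors are made up). With a one-hot teacher the teacher-entropy term is zero, so the forward KL equals the cross-entropy on the observed token:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 8
student_logits = torch.randn(vocab_size)
target = torch.tensor(3)  # the observed token y_t

teacher = F.one_hot(target, vocab_size).float()   # one-hot teacher distribution
log_student = F.log_softmax(student_logits, dim=-1)

# Forward KL(teacher || student); the entropy of a one-hot teacher is zero,
# so the KL reduces to the cross-entropy on the observed token.
forward_kl = -(teacher * log_student).sum()
nll = F.cross_entropy(student_logits.unsqueeze(0), target.unsqueeze(0))

assert torch.allclose(forward_kl, nll)
```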
## 4. Off-Policy Distillation

### 4.1 Definition

Off-policy distillation generalizes SFT by allowing the teacher to provide a soft distribution over tokens rather than a single observed token.

Training data consists of fixed teacher-generated trajectories:

$$
\mathcal{D}_T = \{ (x, y) \}, \qquad y \sim \pi_T(\cdot \mid x).
$$
### 4.2 Objective

The student minimizes the forward KL divergence between teacher and student token distributions along these trajectories:

$$
\mathcal{L}_{\text{off}}(\theta)
= \mathbb{E}_{(x, y) \sim \mathcal{D}_T} \left[ \sum_{t} D_{\mathrm{KL}}\!\left( \pi_T(\cdot \mid x, y_{<t}) \,\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right) \right].
$$

This includes:

- classical knowledge distillation,
- logit matching,
- temperature-scaled distillation.
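A minimal sketch of this objective, assuming the student and teacher share a vocabulary and that per-token logits from both models are available on the same teacher-generated sequences (the function name, tensor shapes, and temperature handling are illustrative):

```python
import torch.nn.functional as F

def forward_kl_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level forward KL: KL(teacher || student), averaged over positions.

    Both logits tensors have shape (batch, seq_len, vocab_size).
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Pointwise p_T * (log p_T - log p_theta), summed over the vocabulary.
    kl_per_token = F.kl_div(
        student_log_probs, teacher_probs, reduction="none"
    ).sum(dim=-1)
    # The t**2 factor keeps gradient magnitudes roughly constant across
    # temperatures, as in classical temperature-scaled distillation.
    return (t ** 2) * kl_per_token.mean()
```

In practice a mask would exclude prompt and padding positions from the average; it is omitted here for brevity.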
### 4.3 Limitations

Off-policy distillation suffers from distributional shift:

- training conditions on teacher trajectories,
- inference runs on student trajectories.

Errors can compound in long-horizon generation: once the student drifts to states the teacher never produced, it has received no supervision for what to do there.
## 5. On-Policy Distillation

### 5.1 Key idea

On-policy distillation flips the data source: the student trains on its own generated trajectories, while the teacher provides feedback on them.

Formally, trajectories are sampled from the student,

$$
y \sim \pi_\theta(\cdot \mid x),
$$

and the teacher \( \pi_T \) is queried at the student-visited states \( (x, y_{<t}) \).
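The data flow can be sketched as follows, assuming Hugging Face-style `generate` and forward APIs; `student`, `teacher`, and `prompt_ids` are placeholder names rather than anything defined above:

```python
import torch

def on_policy_step(student, teacher, prompt_ids, max_new_tokens=128):
    # 1. The student samples its own trajectories (the on-policy data).
    with torch.no_grad():
        sequences = student.generate(
            prompt_ids, max_new_tokens=max_new_tokens, do_sample=True
        )

    # 2. The teacher is evaluated on the student's states: its per-token
    #    distributions along these sequences provide the feedback signal.
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits

    # 3. The student is re-run with gradients on the same sequences so that
    #    a divergence between the two distributions can be minimized
    #    (see the reverse-KL objective in Section 6.2).
    student_logits = student(sequences).logits
    return student_logits, teacher_logits
```

Attention masks and the separation of prompt versus generated tokens are omitted to keep the sketch short.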
### 5.2 Why forward KL no longer works

Forward KL requires expectations over teacher-generated trajectories. But in on-policy training:

- only student-induced states are available,
- teacher trajectories are never sampled, so the outer expectation over \( \pi_T \) cannot be estimated.

This motivates a different objective.
## 6. Reverse KL Divergence

### 6.1 Definition

Reverse KL is defined as:

$$
D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t}) \right)
= \mathbb{E}_{v \sim \pi_\theta(\cdot \mid x, y_{<t})} \left[ \log \frac{\pi_\theta(v \mid x, y_{<t})}{\pi_T(v \mid x, y_{<t})} \right].
$$

Key properties:

- mode-seeking: the student concentrates mass on high-probability teacher behavior,
- it strongly penalizes tokens to which the teacher assigns low probability,
- it is naturally on-policy, because the expectation is taken under the student.
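Written side by side in the same per-token notation (conditioning context suppressed for readability), the two divergences differ only in which distribution the expectation is taken under:

$$
D_{\mathrm{KL}}\!\left( \pi_T \,\|\, \pi_\theta \right)
= \mathbb{E}_{v \sim \pi_T}\!\left[ \log \frac{\pi_T(v)}{\pi_\theta(v)} \right]
\quad \text{(forward: mass-covering)},
$$

$$
D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_T \right)
= \mathbb{E}_{v \sim \pi_\theta}\!\left[ \log \frac{\pi_\theta(v)}{\pi_T(v)} \right]
\quad \text{(reverse: mode-seeking)}.
$$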
### 6.2 On-policy distillation objective

The canonical on-policy distillation loss is:

$$
\mathcal{L}_{\text{on}}(\theta)
= \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t} D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t}) \right) \right].
$$

This loss:

- evaluates the teacher on student-visited states,
- avoids the train/inference distribution mismatch,
- provides dense, per-token supervision.
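A per-token sketch of this loss, assuming (as in the earlier snippets) that `student_logits` carries gradients while `teacher_logits` was computed without gradients on the same student-generated sequences; `response_mask` is an illustrative mask over generated tokens:

```python
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits, response_mask):
    """Token-level reverse KL: KL(student || teacher) on student-sampled sequences.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    response_mask: (batch, seq_len) float mask, 1.0 on generated tokens, 0.0 elsewhere
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    student_probs = student_log_probs.exp()

    # E_{v ~ pi_theta}[ log pi_theta(v) - log pi_T(v) ], summed over the vocabulary.
    kl_per_token = (student_probs * (student_log_probs - teacher_log_probs)).sum(dim=-1)

    # Average only over the student-generated (response) positions.
    return (kl_per_token * response_mask).sum() / response_mask.sum().clamp_min(1.0)
```

A common practical simplification is to backpropagate only through these per-token distributions on already-sampled sequences, not through the sampling step itself.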
## 7. Comparison Summary

| Method | Trajectory source | Teacher signal | Objective |
|---|---|---|---|
| SFT | Teacher / human | One-hot token | NLL (forward KL) |
| Off-policy distillation | Teacher | Soft distribution | Forward KL |
| On-policy distillation | Student | Soft distribution | Reverse KL |
| RLHF / PPO | Student | Scalar reward | Policy gradient |
## 8. Practical Takeaways

- SFT is distillation: it is forward KL minimization with a degenerate, one-hot teacher.
- Off-policy distillation provides richer (soft) supervision but does not remove distribution shift.
- On-policy distillation aligns training with inference behavior.
- Reverse KL is essential for stable on-policy optimization, since its expectation is taken under the student's own distribution.
## 9. Final Perspective

Distillation methods form a continuum:

- from pure imitation (SFT),
- to teacher-guided optimization (on-policy distillation),
- to reward-based learning (RL).

Understanding the objective functions clarifies why each method behaves the way it does, and when to use each one.