Knowledge Distillation
Compress large models into efficient versions using automated knowledge distillation and training workflows
Knowledge Distillation is a community skill for compressing large neural networks into smaller student models, covering teacher-student training, temperature scaling, intermediate layer matching, task-specific distillation, and quality evaluation for model deployment optimization.
What Is This?
Overview
Knowledge Distillation provides patterns for training compact student models that approximate the behavior of larger teacher models. It covers teacher-student training, which optimizes the student to match the teacher's output distributions using soft targets; temperature scaling, which controls the softness of the teacher's probability distributions to expose more information about relative class similarities; intermediate layer matching, which aligns hidden representations between teacher and student at selected layers; task-specific distillation, which focuses the student on the teacher capabilities relevant to the target deployment; and quality evaluation, which compares student performance against the teacher on benchmarks to verify acceptable distillation quality. The skill enables teams to deploy smaller models with near-teacher accuracy.
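As a small illustration of the temperature-scaling idea described above, the sketch below (the logits values are made up for demonstration) shows how dividing logits by a temperature before the softmax exposes relative class similarities that a standard softmax hides:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 4-class problem: the model is confident in
# class 0, but classes 1 and 2 are more similar to it than class 3 is.
logits = torch.tensor([8.0, 4.0, 3.0, -2.0])

hard = F.softmax(logits, dim=-1)        # T=1: nearly one-hot
soft = F.softmax(logits / 4.0, dim=-1)  # T=4: relative similarities visible

# The softened distribution gives classes 1 and 2 noticeably more mass
# than class 3, information the near-one-hot distribution suppresses.
print(hard)
print(soft)
```

This extra structure in the soft targets is what the student learns from during distillation, beyond what the hard labels alone provide.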
Who Should Use This
This skill serves ML engineers compressing models for edge deployment, research teams exploring knowledge transfer techniques, and inference platform teams reducing serving costs through model size reduction.
Why Use It?
Problems It Solves
Large teacher models exceed latency and memory budgets for production serving environments. Training small models from scratch on the same data often produces significantly lower quality than distillation from a strong teacher. Temperature and loss weight hyperparameters require systematic tuning to achieve good distillation results. Intermediate layer alignment between architecturally different teacher and student models needs dimensionality matching.
Core Highlights
Distillation trainer optimizes student models using soft targets from teacher inference. Temperature controller adjusts softmax temperature for optimal knowledge transfer. Layer matcher aligns selected intermediate representations between teacher and student. Benchmark runner compares student against teacher on evaluation datasets.
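A benchmark comparison like the one the highlights describe might look like the following sketch; the `accuracy` and `compare` helpers are illustrative names, not part of the skill's API, and assume a classification task with an `(inputs, labels)` data loader:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def accuracy(model: nn.Module, loader) -> float:
    """Top-1 accuracy of a classifier over an (inputs, labels) loader."""
    model.eval()
    correct = total = 0
    for inputs, labels in loader:
        preds = model(inputs).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

def compare(teacher: nn.Module, student: nn.Module, loader) -> dict:
    """Report teacher and student accuracy plus the retained fraction."""
    t_acc = accuracy(teacher, loader)
    s_acc = accuracy(student, loader)
    return {"teacher": t_acc, "student": s_acc,
            "retention": s_acc / t_acc if t_acc else 0.0}
```

Reporting retention (student accuracy as a fraction of teacher accuracy) gives a size-independent way to judge whether a distillation run met its quality bar.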
How to Use It?
Basic Usage
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature: float = 4.0, alpha: float = 0.7):
        super().__init__()
        self.T = temperature
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, labels):
        # KL divergence between temperature-softened distributions; the
        # T**2 factor keeps soft-target gradients on the same scale as T=1.
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=-1),
            F.softmax(teacher_logits / self.T, dim=-1),
            reduction='batchmean',
        ) * (self.T ** 2)
        hard_loss = self.ce(student_logits, labels)
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss
Real-World Examples
class DistillTrainer:
    def __init__(
        self,
        teacher: nn.Module,
        student: nn.Module,
        temperature: float = 4.0,
        alpha: float = 0.7,
        lr: float = 1e-4,
    ):
        self.teacher = teacher
        self.student = student
        self.criterion = DistillationLoss(temperature, alpha)
        self.optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
        self.teacher.eval()  # keep the teacher in inference mode

    def train_step(self, inputs, labels) -> float:
        # Teacher inference runs without gradients; only the student updates.
        with torch.no_grad():
            teacher_out = self.teacher(inputs)
        student_out = self.student(inputs)
        loss = self.criterion(student_out, teacher_out, labels)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
Advanced Tips
Sweep temperature values from 1 to 20 on a validation set to find the setting that maximizes student performance for the specific task. Add intermediate layer distillation losses using projection layers when teacher and student have different hidden dimensions. Distill from an ensemble of teachers to provide richer soft targets than a single model.
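The projection-layer idea above can be sketched as follows; `HiddenDistillLoss` and the dimension arguments are illustrative names for this sketch, assuming hidden states shaped `(batch, seq, dim)`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenDistillLoss(nn.Module):
    """MSE between teacher and student hidden states at one matched layer.

    A linear projection lifts the student's hidden size to the teacher's,
    so architecturally different models can still be aligned.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # Detach the teacher hidden states so no gradient flows back
        # into the frozen teacher; only the student and the projection
        # layer receive updates.
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())
```

A weighted sum of one such loss per matched layer is typically added to the logit-level distillation loss, with the projection parameters trained jointly with the student.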
When to Use It?
Use Cases
Compress a large classification model into a smaller version for mobile deployment with minimal accuracy loss. Train a fast student model from a powerful but slow teacher for latency-sensitive serving. Distill specific capabilities from a general-purpose model into a task-focused specialist.
Related Topics
Model compression, knowledge transfer, teacher-student training, temperature scaling, model optimization, neural network pruning, and edge deployment.
Important Notes
Requirements
PyTorch for model training and loss computation. Trained teacher model with inference capability. Training dataset for distillation with or without labels.
Usage Recommendations
Do: use a well-trained teacher model since distillation quality is bounded by teacher capability. Tune temperature and alpha loss weights on a validation set rather than using defaults. Evaluate on the target task rather than only on distillation loss metrics.
Don't: expect a very small student to fully match a much larger teacher, since the capacity gap limits distillation effectiveness. Don't skip the hard-label loss component entirely, as it provides useful ground-truth signal. Don't distill from a teacher that has not yet converged, which transfers noise.
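The recommended validation-set tuning of temperature and alpha can be sketched as a simple grid search; `sweep` and its callback signatures are hypothetical names for this sketch, standing in for your own training and evaluation loops:

```python
import itertools

def sweep(train_fn, eval_fn,
          temperatures=(1, 2, 4, 8, 16),
          alphas=(0.3, 0.5, 0.7, 0.9)):
    """Grid-search distillation hyperparameters on a validation metric.

    train_fn(T, alpha) should train a fresh student and return it;
    eval_fn(student) should return a validation score (higher is better).
    """
    best = None
    for T, alpha in itertools.product(temperatures, alphas):
        score = eval_fn(train_fn(T, alpha))
        if best is None or score > best[0]:
            best = (score, T, alpha)
    return {"score": best[0], "temperature": best[1], "alpha": best[2]}
```

Because each grid point retrains a student, in practice the sweep is usually run with a shortened training schedule and the winning setting is then retrained to convergence.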
Limitations
Student models with significantly fewer parameters will not fully recover teacher performance. Optimal distillation hyperparameters vary across tasks and model pairs, requiring empirical tuning. Distillation training requires teacher inference on the full training set, which adds computational cost.