Knowledge Distillation

Compress large models into efficient versions using automated knowledge distillation and training workflows

Knowledge Distillation is a community skill for compressing large neural networks into smaller student models, covering teacher-student training, temperature scaling, intermediate layer matching, task-specific distillation, and quality evaluation for model deployment optimization.

What Is This?

Overview

Knowledge Distillation provides patterns for training compact student models that approximate the behavior of larger teacher models. It covers teacher-student training, which optimizes the student to match the teacher's output distributions using soft targets; temperature scaling, which controls the softness of the teacher's probability distributions to expose more information about relative class similarities; intermediate layer matching, which aligns hidden representations between teacher and student at selected layers; task-specific distillation, which focuses the student on the teacher capabilities relevant to the target deployment; and quality evaluation, which compares student performance against the teacher on benchmarks to verify acceptable distillation quality. The skill enables teams to deploy smaller models with near-teacher accuracy.
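To see why temperature matters, the short sketch below (illustrative logit values, not from any particular model) softens a teacher distribution at several temperatures; higher temperatures spread probability mass onto non-target classes and expose the relative class similarities the student learns from.

import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one example over four classes
logits = torch.tensor([8.0, 4.0, 2.0, 0.5])

for T in (1.0, 4.0, 10.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
# T=1 is nearly one-hot; larger T reveals which wrong classes
# the teacher considers closest to the correct one.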

Who Should Use This

This skill serves ML engineers compressing models for edge deployment, research teams exploring knowledge transfer techniques, and inference platform teams reducing serving costs through model size reduction.

Why Use It?

Problems It Solves

Large teacher models exceed latency and memory budgets for production serving environments. Training small models from scratch on the same data often produces significantly lower quality than distillation from a strong teacher. Temperature and loss weight hyperparameters require systematic tuning to achieve good distillation results. Intermediate layer alignment between architecturally different teacher and student models needs dimensionality matching.

Core Highlights

Distillation trainer optimizes student models using soft targets from teacher inference. Temperature controller adjusts softmax temperature for optimal knowledge transfer. Layer matcher aligns selected intermediate representations between teacher and student. Benchmark runner compares student against teacher on evaluation datasets.
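As a rough sketch of the benchmark comparison, assuming a classification task and a DataLoader that yields (inputs, labels) batches; eval_accuracy and val_loader are illustrative names, not part of any library:

import torch
import torch.nn as nn

@torch.no_grad()
def eval_accuracy(model: nn.Module, loader) -> float:
    # Fraction of correctly classified examples
    model.eval()
    correct, total = 0, 0
    for inputs, labels in loader:
        preds = model(inputs).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# teacher_acc = eval_accuracy(teacher, val_loader)
# student_acc = eval_accuracy(student, val_loader)
# student_acc / teacher_acc gives the fraction of teacher quality retained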

How to Use It?

Basic Usage

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature: float = 4.0, alpha: float = 0.7):
        super().__init__()
        self.T = temperature
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, labels):
        # KL divergence between temperature-softened distributions,
        # scaled by T^2 to keep gradient magnitudes comparable across temperatures
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=-1),
            F.softmax(teacher_logits / self.T, dim=-1),
            reduction='batchmean',
        ) * (self.T ** 2)

        # Standard cross-entropy against the ground-truth labels
        hard_loss = self.ce(student_logits, labels)

        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss
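A quick way to exercise the loss on dummy tensors (shapes chosen only for illustration):

criterion = DistillationLoss(temperature=4.0, alpha=0.7)
student_logits = torch.randn(32, 10)   # batch of 32, 10 classes
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
loss = criterion(student_logits, teacher_logits, labels)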

Real-World Examples

class DistillTrainer:
    def __init__(self, teacher: nn.Module, student: nn.Module,
                 temperature: float = 4.0, alpha: float = 0.7, lr: float = 1e-4):
        self.teacher = teacher
        self.student = student
        self.criterion = DistillationLoss(temperature, alpha)
        self.optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
        self.teacher.eval()  # teacher stays frozen in eval mode

    def train_step(self, inputs, labels) -> float:
        # Teacher inference without gradients produces the soft targets
        with torch.no_grad():
            teacher_out = self.teacher(inputs)
        student_out = self.student(inputs)
        loss = self.criterion(student_out, teacher_out, labels)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
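An epoch loop around train_step might look like the sketch below; teacher, student, num_epochs, and train_loader are placeholders for whatever models, budget, and DataLoader the project uses.

trainer = DistillTrainer(teacher, student, temperature=4.0, alpha=0.7, lr=1e-4)
for epoch in range(num_epochs):
    running = 0.0
    for inputs, labels in train_loader:
        running += trainer.train_step(inputs, labels)
    print(f"epoch {epoch}: mean loss {running / len(train_loader):.4f}")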

Advanced Tips

Sweep temperature values from 1 to 20 on a validation set to find the setting that maximizes student performance for the specific task. Add intermediate layer distillation losses using projection layers when teacher and student have different hidden dimensions. Distill from an ensemble of teachers to provide richer soft targets than a single model.
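For the intermediate-layer tip, one common pattern is to project the student's hidden state into the teacher's dimensionality and penalize the mean-squared difference; the sketch below assumes hidden states are available as [batch, dim] tensors, and the class name is illustrative.

import torch.nn as nn
import torch.nn.functional as F

class HiddenStateDistill(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projection bridges the dimensionality gap between architectures
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)

# The combined objective could be
#   total = distill_loss + beta * hidden_loss
# with beta tuned on a validation set.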

When to Use It?

Use Cases

Compress a large classification model into a smaller version for mobile deployment with minimal accuracy loss. Train a fast student model from a powerful but slow teacher for latency-sensitive serving. Distill specific capabilities from a general-purpose model into a task-focused specialist.

Related Topics

Model compression, knowledge transfer, teacher-student training, temperature scaling, model optimization, neural network pruning, and edge deployment.

Important Notes

Requirements

PyTorch for model training and loss computation. A trained teacher model with inference capability. A training dataset for distillation, with or without labels.

Usage Recommendations

Do: use a well-trained teacher model, since distillation quality is bounded by the teacher's capability. Do tune the temperature and alpha loss weights on a validation set rather than relying on defaults (a sweep is sketched after these notes). Do evaluate on the target task rather than only on the distillation loss.

Don't: expect a very small student to match a much larger teacher, since the capacity gap limits distillation effectiveness. Don't skip the hard-label loss component entirely, as it provides a useful ground-truth signal. Don't distill from a teacher that has not yet converged, which only transfers noise.
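One way to act on the tuning recommendation is a small grid search over temperature and alpha on a validation set, as sketched below; train_loader, val_loader, and the training budget are placeholders, and eval_accuracy is the helper sketched earlier.

import copy

best_score, best_cfg = float("-inf"), None
for T in (1.0, 2.0, 4.0, 8.0, 16.0):
    for a in (0.3, 0.5, 0.7, 0.9):
        candidate = copy.deepcopy(student)            # fresh student per configuration
        trainer = DistillTrainer(teacher, candidate, temperature=T, alpha=a)
        for inputs, labels in train_loader:           # short, fixed training budget
            trainer.train_step(inputs, labels)
        score = eval_accuracy(candidate, val_loader)  # score on the target task
        if score > best_score:
            best_score, best_cfg = score, (T, a)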

Limitations

Student models with significantly fewer parameters will not fully recover teacher performance. Optimal distillation hyperparameters vary across tasks and model pairs, requiring empirical tuning. Distillation training requires teacher inference on the full training set, which adds computational cost.