Model Pruning

Reduce model size and inference latency through automated weight pruning and integration with complementary optimization techniques

Model Pruning is a community skill for reducing neural network size through weight removal techniques, covering structured and unstructured pruning, magnitude-based selection, iterative pruning schedules, and post-pruning fine-tuning.

What Is This?

Overview

Model Pruning provides patterns for reducing model size by removing unnecessary weights while maintaining acceptable performance. It covers magnitude-based weight pruning, structured pruning that removes entire neurons or attention heads, iterative pruning schedules that gradually increase sparsity, lottery ticket identification, and recovery fine-tuning after pruning. The skill enables engineers to deploy smaller, faster models that fit within resource-constrained environments.

Who Should Use This

This skill serves ML engineers deploying models to edge devices with limited compute and memory, teams optimizing inference costs by reducing model size for cloud deployments, and researchers studying which parameters contribute most to model performance.

Why Use It?

Problems It Solves

Large models exceed memory budgets on edge devices and mobile platforms. Inference costs scale with model size, making large deployments expensive at scale. Many model parameters contribute minimally to output quality and can be removed without significant degradation. Dense models have slower inference than necessary for production latency requirements.

Core Highlights

Magnitude pruning removes weights with the smallest absolute values as a simple and effective baseline strategy. Structured pruning eliminates entire neurons, channels, or attention heads for actual speedup on standard hardware. Iterative pruning gradually increases sparsity across multiple cycles with recovery training between rounds. Sparsity analysis tools identify which layers tolerate aggressive pruning and which are sensitive.
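
Structured pruning is easiest to see on whole neurons. The sketch below is a minimal illustration, not part of any library API; l2_norm, prune_neurons, and keep_ratio are names invented here. It scores each weight row by its L2 norm and keeps only the strongest fraction, so the surviving matrix is genuinely smaller:

import math

def l2_norm(row: list[float]) -> float:
    return math.sqrt(sum(w * w for w in row))

def prune_neurons(rows: list[list[float]],
                  keep_ratio: float = 0.75) -> list[list[float]]:
    # Score each neuron (weight row) by L2 norm and keep the strongest
    # fraction; dropped rows shrink the layer's actual dimensions,
    # which is what yields speedups on standard hardware.
    ranked = sorted(range(len(rows)), key=lambda i: l2_norm(rows[i]),
                    reverse=True)
    keep = set(ranked[:max(1, int(len(rows) * keep_ratio))])
    return [row for i, row in enumerate(rows) if i in keep]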

How to Use It?

Basic Usage

from dataclasses import dataclass

@dataclass
class PruningConfig:
    target_sparsity: float = 0.5
    method: str = "magnitude"
    schedule: str = "one_shot"
    num_iterations: int = 1

@dataclass
class LayerStats:
    name: str
    total_params: int
    pruned_params: int = 0
    sparsity: float = 0.0

class MagnitudePruner:
    def __init__(self, config: PruningConfig):
        self.config = config
        self.stats: list[LayerStats] = []

    def prune_layer(self, name: str,
                    weights: list[float]) -> tuple[list[float], LayerStats]:
        # Rank weights by magnitude; the cutoff index marks how many of
        # the smallest weights should be removed.
        sorted_vals = sorted(abs(w) for w in weights)
        cutoff_idx = int(len(sorted_vals) * self.config.target_sparsity)
        # Keep weights at or above the threshold so exactly the smallest
        # cutoff_idx weights are zeroed (barring ties); at full sparsity
        # everything is pruned.
        threshold = (sorted_vals[cutoff_idx]
                     if cutoff_idx < len(sorted_vals) else float("inf"))
        pruned = [w if abs(w) >= threshold else 0.0 for w in weights]
        num_pruned = sum(1 for w in pruned if w == 0.0)
        stat = LayerStats(
            name=name, total_params=len(weights),
            pruned_params=num_pruned,
            sparsity=num_pruned / max(len(weights), 1)
        )
        self.stats.append(stat)
        return pruned, stat
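
A quick usage sketch of the pruner above, with an illustrative layer name and toy weight values:

config = PruningConfig(target_sparsity=0.5)
pruner = MagnitudePruner(config)
pruned, stat = pruner.prune_layer(
    "dense_1", [0.8, -0.05, 0.3, -0.9, 0.01, 0.4])
print(pruned)         # [0.8, 0.0, 0.0, -0.9, 0.0, 0.4]
print(stat.sparsity)  # 0.5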

Real-World Examples

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class IterativePruner:
    initial_sparsity: float = 0.2
    final_sparsity: float = 0.8
    num_iterations: int = 4
    history: list[dict] = field(default_factory=list)

    def get_schedule(self) -> list[float]:
        # Linear ramp from initial to final sparsity across the rounds.
        step = ((self.final_sparsity - self.initial_sparsity)
                / max(self.num_iterations - 1, 1))
        return [self.initial_sparsity + i * step
                for i in range(self.num_iterations)]

    def run(self, weights: dict[str, list[float]],
            eval_fn: Callable[[dict[str, list[float]]], float]) -> dict:
        schedule = self.get_schedule()
        current_weights = dict(weights)
        for iteration, sparsity in enumerate(schedule):
            pruner = MagnitudePruner(
                PruningConfig(target_sparsity=sparsity))
            for name in current_weights:
                pruned, stat = pruner.prune_layer(
                    name, current_weights[name])
                current_weights[name] = pruned
            # In a real pipeline, recovery fine-tuning would run here,
            # before evaluation, as described in the prose above.
            score = eval_fn(current_weights)
            self.history.append({
                "iteration": iteration,
                "sparsity": round(sparsity, 3),
                "score": round(score, 4)
            })
        return {"final_sparsity": schedule[-1],
                "history": self.history}

Advanced Tips

Use sensitivity analysis to determine per-layer pruning ratios rather than applying uniform sparsity across all layers; a minimal sketch follows below. Apply structured pruning to attention heads by evaluating each head's contribution to downstream task performance. Combine pruning with quantization for compounding size reduction benefits.
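
A minimal sketch of that sensitivity analysis, reusing MagnitudePruner from Basic Usage; layer_sensitivity and probe_sparsity are illustrative names, and eval_fn is any callable that scores the full weight dictionary:

def layer_sensitivity(weights: dict[str, list[float]],
                      eval_fn,
                      probe_sparsity: float = 0.5) -> dict[str, float]:
    # Prune each layer in isolation at a fixed probe sparsity and record
    # the quality drop against the unpruned baseline; larger drops mean
    # more sensitive layers that deserve gentler pruning ratios.
    baseline = eval_fn(weights)
    drops: dict[str, float] = {}
    for name in weights:
        probe = dict(weights)  # shallow copy; only probe[name] changes
        pruner = MagnitudePruner(
            PruningConfig(target_sparsity=probe_sparsity))
        probe[name], _ = pruner.prune_layer(name, weights[name])
        drops[name] = baseline - eval_fn(probe)
    return drops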

When to Use It?

Use Cases

Reduce a transformer model to fit within mobile device memory constraints for on-device inference. Lower cloud inference costs by deploying pruned models that require less compute per request. Identify the minimum model size that maintains acceptable quality for a specific production task.

Related Topics

Model quantization, knowledge distillation, neural architecture search, sparse inference engines, and model compression techniques.

Important Notes

Requirements

A trained model with weights accessible for pruning operations. An evaluation dataset for measuring quality after each pruning round. Fine-tuning infrastructure for recovery training to restore accuracy after aggressive pruning.
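
Recovery fine-tuning must keep pruned positions at zero while the surviving weights are updated. Below is a framework-agnostic sketch of that masking step, with a hypothetical gradient list standing in for a real backward pass; masked_update and lr are invented names:

def masked_update(weights: list[float], grads: list[float],
                  lr: float = 0.01) -> list[float]:
    # One masked optimizer step: positions pruned to 0.0 stay frozen,
    # while surviving weights take an ordinary gradient step.
    return [0.0 if w == 0.0 else w - lr * g
            for w, g in zip(weights, grads)]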

Usage Recommendations

Do: evaluate model quality after each pruning iteration to identify the accuracy degradation curve. Use iterative pruning with recovery fine-tuning for better results than one-shot pruning at high sparsity levels. Profile actual inference speedup to confirm that theoretical sparsity translates to real performance gains.

Don't: prune models to high sparsity without post-pruning fine-tuning to recover lost accuracy. Assume that unstructured sparsity automatically translates to faster inference on standard hardware. Apply the same pruning ratio to all layers when sensitivity varies significantly between them.

Limitations

Unstructured pruning requires specialized sparse inference kernels to achieve actual speedups on GPUs. High sparsity levels cause significant quality degradation that recovery fine-tuning cannot fully restore. Pruning decisions are typically irreversible, so keep the original dense model as a checkpoint for further experimentation.