Model Pruning
Reduce model size and inference latency through automated weight pruning and optimization
Model Pruning is a community skill for reducing neural network size through weight removal techniques, covering structured and unstructured pruning, magnitude-based selection, iterative pruning schedules, and post-pruning fine-tuning.
What Is This?
Overview
Model Pruning provides patterns for reducing model size by removing unnecessary weights while maintaining acceptable performance. It covers magnitude-based weight pruning, structured pruning that removes entire neurons or attention heads, iterative pruning schedules that gradually increase sparsity, lottery ticket identification, and recovery fine-tuning after pruning. The skill enables engineers to deploy smaller, faster models that fit within resource-constrained environments.
Who Should Use This
This skill serves ML engineers deploying models to edge devices with limited compute and memory, teams optimizing inference costs by reducing model size for cloud deployments, and researchers studying which parameters contribute most to model performance.
Why Use It?
Problems It Solves
Large models exceed memory budgets on edge devices and mobile platforms. Inference costs scale with model size, making large deployments expensive at scale. Many model parameters contribute minimally to output quality and can be removed without significant degradation. Dense models have slower inference than necessary for production latency requirements.
Core Highlights
Magnitude pruning removes weights with the smallest absolute values as a simple and effective baseline strategy. Structured pruning eliminates entire neurons, channels, or attention heads for actual speedup on standard hardware. Iterative pruning gradually increases sparsity across multiple cycles with recovery training between rounds. Sparsity analysis tools identify which layers tolerate aggressive pruning and which are sensitive.
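The structured variant described above can be sketched as ranking whole neurons (rows of a weight matrix) by L2 norm and dropping the weakest, so the remaining matrix is genuinely smaller rather than merely sparse. A minimal illustration; `prune_neurons` and the toy weights are hypothetical, not part of any library:

```python
import math

def prune_neurons(weight_rows: list[list[float]],
                  keep_ratio: float = 0.5) -> list[list[float]]:
    """Structured pruning sketch: drop whole rows (neurons) with the
    smallest L2 norm, keeping the top keep_ratio fraction."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight_rows]
    keep = max(1, int(len(weight_rows) * keep_ratio))
    # Rank row indices by norm, strongest first, then restore original order
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [weight_rows[i] for i in kept]

rows = [[0.01, 0.02], [1.0, -2.0], [0.5, 0.5], [3.0, 0.1]]
pruned = prune_neurons(rows, keep_ratio=0.5)
# Keeps the two highest-norm neurons: [1.0, -2.0] and [3.0, 0.1]
```

Because entire rows disappear, the output feeds a smaller dense matrix multiply, which is why structured pruning yields real speedups on standard hardware while unstructured sparsity usually does not.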
How to Use It?
Basic Usage
from dataclasses import dataclass

@dataclass
class PruningConfig:
    target_sparsity: float = 0.5   # fraction of weights to zero out
    method: str = "magnitude"
    schedule: str = "one_shot"
    num_iterations: int = 1

@dataclass
class LayerStats:
    name: str
    total_params: int
    pruned_params: int = 0
    sparsity: float = 0.0

class MagnitudePruner:
    def __init__(self, config: PruningConfig):
        self.config = config
        self.stats: list[LayerStats] = []

    def prune_layer(self, name: str,
                    weights: list[float]) -> tuple[list[float], LayerStats]:
        # Threshold at the target-sparsity quantile of weight magnitudes
        sorted_vals = sorted(abs(w) for w in weights)
        cutoff_idx = int(len(sorted_vals) * self.config.target_sparsity)
        if cutoff_idx >= len(sorted_vals):
            pruned = [0.0 for _ in weights]  # full sparsity: drop everything
        else:
            threshold = sorted_vals[cutoff_idx]
            # Keep weights whose magnitude reaches the threshold; ties at
            # the threshold are kept, so realized sparsity can land
            # slightly below the target
            pruned = [w if abs(w) >= threshold else 0.0 for w in weights]
        num_pruned = sum(1 for w in pruned if w == 0.0)
        stat = LayerStats(
            name=name, total_params=len(weights),
            pruned_params=num_pruned,
            sparsity=num_pruned / max(len(weights), 1)
        )
        self.stats.append(stat)
        return pruned, stat

Real-World Examples
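Before wiring a pruner into an iterative schedule, it helps to watch the magnitude criterion act on a toy weight vector: the threshold sits at the sparsity quantile of the absolute values, and weights whose magnitude reaches it survive. A standalone sketch of that rule (`magnitude_prune` is an illustrative helper, not a library function):

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero the smallest-magnitude weights: threshold at the sparsity
    quantile of |w|, keep weights whose magnitude reaches it."""
    ranked = sorted(abs(w) for w in weights)
    k = int(len(ranked) * sparsity)
    if k >= len(ranked):
        return [0.0] * len(weights)   # full sparsity: everything pruned
    threshold = ranked[k]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

print(magnitude_prune([0.9, -0.05, 0.3, -0.7], 0.5))
# → [0.9, 0.0, 0.0, -0.7]
```

At 50% sparsity the two smallest magnitudes (0.05 and 0.3) are zeroed while the signs of the survivors are preserved, which is why magnitude pruning is cheap to apply: it needs only a sort per layer and no gradient information.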
from dataclasses import dataclass, field

@dataclass
class IterativePruner:
    initial_sparsity: float = 0.2
    final_sparsity: float = 0.8
    num_iterations: int = 4
    history: list[dict] = field(default_factory=list)

    def get_schedule(self) -> list[float]:
        # Linear ramp from initial to final sparsity
        step = ((self.final_sparsity - self.initial_sparsity)
                / max(self.num_iterations - 1, 1))
        return [self.initial_sparsity + i * step
                for i in range(self.num_iterations)]

    def run(self, weights: dict[str, list[float]],
            eval_fn) -> dict:
        schedule = self.get_schedule()
        current_weights = dict(weights)
        for iteration, sparsity in enumerate(schedule):
            pruner = MagnitudePruner(
                PruningConfig(target_sparsity=sparsity))
            for name in current_weights:
                pruned, _ = pruner.prune_layer(
                    name, current_weights[name])
                current_weights[name] = pruned
            # Score after each round; recovery fine-tuning would go here
            score = eval_fn(current_weights)
            self.history.append({
                "iteration": iteration,
                "sparsity": round(sparsity, 3),
                "score": round(score, 4)
            })
        return {"final_sparsity": schedule[-1],
                "history": self.history}

Advanced Tips
Use sensitivity analysis to determine per-layer pruning ratios rather than applying uniform sparsity across all layers. Apply structured pruning to attention heads by evaluating each head's contribution to downstream task performance. Combine pruning with quantization for compounding size-reduction benefits.
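Sensitivity analysis can be sketched as pruning one layer at a time at several trial sparsities, leaving every other layer dense, and scoring the model after each trial. Layers whose score barely moves tolerate aggressive pruning; layers whose score collapses should get lower ratios. The names below (`layer_sensitivity`, `magnitude_prune`, `toy_eval`) are illustrative, and the evaluation function here is a stand-in for a real validation metric:

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Threshold at the sparsity quantile of |w|; keep weights at or above."""
    ranked = sorted(abs(w) for w in weights)
    k = int(len(ranked) * sparsity)
    if k >= len(ranked):
        return [0.0] * len(weights)
    threshold = ranked[k]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def layer_sensitivity(weights: dict[str, list[float]],
                      eval_fn,
                      sparsities=(0.3, 0.6, 0.9)) -> dict:
    """Prune one layer at a time at each trial sparsity (others stay
    dense) and record the score, revealing per-layer tolerance."""
    results: dict[str, dict[float, float]] = {}
    for name in weights:
        results[name] = {}
        for s in sparsities:
            trial = dict(weights)       # shallow copy; one layer swapped
            trial[name] = magnitude_prune(weights[name], s)
            results[name][s] = eval_fn(trial)
    return results

# Toy metric: fraction of weights still nonzero (a real run would use
# validation accuracy or loss instead)
def toy_eval(w: dict[str, list[float]]) -> float:
    total = sum(len(v) for v in w.values())
    nonzero = sum(1 for v in w.values() for x in v if x != 0.0)
    return nonzero / total

report = layer_sensitivity({"a": [0.1, 0.2, 0.9, 0.8],
                            "b": [0.5, 0.6]}, toy_eval)
```

The resulting per-layer score curves can then be inverted into per-layer sparsity targets: assign each layer the highest trial sparsity whose score stays above an acceptable floor.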
When to Use It?
Use Cases
Reduce a transformer model to fit within mobile device memory constraints for on-device inference. Lower cloud inference costs by deploying pruned models that require less compute per request. Identify the minimum model size that maintains acceptable quality for a specific production task.
Related Topics
Model quantization, knowledge distillation, neural architecture search, sparse inference engines, and model compression techniques.
Important Notes
Requirements
A trained model with weights accessible for pruning operations. An evaluation dataset for measuring quality after each pruning round. Fine-tuning infrastructure for recovery training to restore accuracy after aggressive pruning.
Usage Recommendations
Do: evaluate model quality after each pruning iteration to identify the accuracy degradation curve. Use iterative pruning with recovery fine-tuning for better results than one-shot pruning at high sparsity levels. Profile actual inference speedup to confirm that theoretical sparsity translates to real performance gains.
Don't: prune models to high sparsity without post-pruning fine-tuning to recover lost accuracy. Assume that unstructured sparsity automatically translates to faster inference on standard hardware. Apply the same pruning ratio to all layers when sensitivity varies significantly between them.
Limitations
Unstructured pruning requires specialized sparse inference kernels to achieve actual speedups on GPUs. High sparsity levels cause significant quality degradation that recovery fine-tuning cannot fully restore. Pruning decisions are typically irreversible, making the original dense model necessary as a checkpoint for experimentation.