TorchForge
Automate and integrate TorchForge for streamlined PyTorch model optimization and deployment workflows
TorchForge is a community skill for streamlining PyTorch model post-training workflows, covering fine-tuning pipelines, quantization, pruning, knowledge distillation, and model optimization for deploying efficient deep learning models.
What Is This?
Overview
TorchForge provides guidance on optimizing and refining trained PyTorch models for production deployment. It covers fine-tuning pipelines that adapt pretrained models to domain-specific tasks with learning rate scheduling and gradient management; quantization workflows that reduce model precision from 32-bit floating point to 8-bit integers for faster inference with minimal accuracy loss; pruning strategies that remove redundant weights and neurons to create smaller models that maintain performance; knowledge distillation that transfers learned representations from large teacher models to compact student architectures; and export pipelines that convert optimized models to formats like ONNX and TorchScript for cross-platform deployment. The skill helps engineers prepare research models for efficient production serving with reduced latency, a lower memory footprint, and compatibility across CPU, GPU, and mobile runtimes.
Who Should Use This
This skill serves ML engineers deploying PyTorch models to production environments, research teams optimizing large models for edge devices, and platform engineers building model serving infrastructure. It is particularly valuable for teams working under strict latency budgets or memory constraints on resource-limited hardware.
Why Use It?
Problems It Solves
Research-trained models are often too large and slow to meet production latency requirements. Quantization and pruning require careful implementation to avoid significant accuracy degradation. Fine-tuning pretrained models on small datasets risks catastrophic forgetting of learned representations. Exporting models to different serving frameworks requires proper conversion pipelines, numerical validation, and operator compatibility checks.
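One common way to limit catastrophic forgetting is to fine-tune with discriminative learning rates: a much smaller rate for the pretrained backbone than for the new task head. A minimal sketch, assuming torchvision is available; the ResNet-18 backbone, 10-class head, and learning rates are illustrative choices, not prescribed values.

import torch
from torchvision import models

# Pretrained backbone with a fresh task-specific head (10 classes assumed)
model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Backbone gets a much smaller learning rate than the new head,
# so adaptation does not overwrite the pretrained representations
backbone_params = [p for n, p in model.named_parameters() if not n.startswith('fc')]
optimizer = torch.optim.AdamW([
    {'params': backbone_params, 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-3},
], weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)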
Core Highlights
The fine-tuner adapts pretrained models with careful learning rate control. The quantizer reduces model precision for significantly faster inference. The pruner removes redundant weights and connections while preserving model accuracy. The exporter converts optimized models to ONNX and TorchScript formats for flexible deployment targets.
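The ONNX path is shown in the examples below; for the TorchScript side, tracing is one common route. A minimal sketch, where the ResNet-18 stand-in, input shape, and output file name are illustrative assumptions rather than part of the skill itself.

import torch
from torchvision import models

# Illustrative stand-in; substitute your own optimized model here
model = models.resnet18(weights=None).eval()

example = torch.randn(1, 3, 224, 224)        # assumed input shape
scripted = torch.jit.trace(model, example)   # trace to TorchScript
scripted.save('model_ts.pt')

# Quick parity check between eager and traced outputs
with torch.no_grad():
    assert torch.allclose(model(example), scripted(example), atol=1e-5)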
How to Use It?
Basic Usage
import os
import torch
from torch.quantization import quantize_dynamic

# Load a trained float32 model and switch to inference mode
model = torch.load('model.pt')
model.eval()

# Dynamically quantize all Linear layers to int8
quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

def model_size(m):
    # Serialized state_dict size in megabytes
    torch.save(m.state_dict(), '/tmp/temp.pt')
    return os.path.getsize('/tmp/temp.pt') / 1e6

orig = model_size(model)
quant = model_size(quantized)
print(f'Original: {orig:.1f}MB')
print(f'Quantized: {quant:.1f}MB')
print(f'Reduction: {(1 - quant / orig) * 100:.0f}%')
Real-World Examples
import torch
import torch.nn.utils.prune as prune

def prune_model(model, amount=0.3):
    # Apply L1-magnitude unstructured pruning to every Linear layer,
    # then make it permanent by removing the pruning reparameterization
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=amount)
            prune.remove(module, 'weight')
    return model

model = prune_model(model)

# Export the pruned model to ONNX with a dynamic batch dimension
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, 'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch'},
        'output': {0: 'batch'},
    })
Advanced Tips
Combine quantization-aware training with pruning for maximum compression while maintaining accuracy. Use calibration datasets that represent production data distribution for static quantization. Validate exported ONNX models against PyTorch outputs to catch conversion discrepancies. When fine-tuning, consider layer-wise learning rate decay to protect earlier representations while adapting later layers more aggressively to the target domain.
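For the ONNX-versus-PyTorch check, comparing outputs on the same input usually surfaces conversion drift early. A hedged sketch, assuming onnxruntime is installed, that model and model.onnx come from the export example above, and that the tolerance values are illustrative rather than required.

import numpy as np
import onnxruntime as ort
import torch

# Same input through both the original PyTorch model and the ONNX artifact
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    torch_out = model(dummy).numpy()

session = ort.InferenceSession('model.onnx')
onnx_out = session.run(None, {'input': dummy.numpy()})[0]

# Fail fast on conversion discrepancies before the artifact ships
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
print('max abs diff:', np.abs(torch_out - onnx_out).max())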
When to Use It?
Use Cases
Quantize a vision transformer for mobile deployment with reduced latency. Prune a language model to fit within edge device memory constraints. Distill a large ensemble into a single compact model for real-time serving. Export an optimized classification model to ONNX for deployment across multiple inference backends without rewriting serving code.
Related Topics
PyTorch, model optimization, quantization, pruning, knowledge distillation, ONNX, and model deployment.
Important Notes
Requirements
PyTorch with quantization and pruning utilities from the torch.quantization and torch.nn.utils.prune modules. A representative calibration dataset for static quantization that captures the typical input distribution. ONNX Runtime or a TorchScript runtime for validating and serving exported, optimized models in production inference environments with hardware-specific acceleration support.
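For static quantization, the calibration pass is where that representative dataset matters. A hedged eager-mode sketch, assuming the float model already wraps its inputs and outputs with torch.quantization.QuantStub/DeQuantStub, and where calibration_loader is a hypothetical DataLoader yielding production-like batches.

import torch

# Eager-mode post-training static quantization (x86 'fbgemm' backend assumed)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
prepared = torch.quantization.prepare(model)        # inserts observers

with torch.no_grad():
    for inputs, _ in calibration_loader:             # representative inputs; labels unused
        prepared(inputs)                             # observers record activation ranges

static_quantized = torch.quantization.convert(prepared)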
Usage Recommendations
Do: benchmark inference latency and accuracy before and after each optimization step to measure real impact (see the benchmark sketch after these notes). Apply optimizations incrementally and validate carefully after each step. Use calibration data that matches production input patterns for quantization.
Don't: apply aggressive pruning ratios without validating accuracy on a held-out test set. Skip ONNX model validation since numerical differences can accumulate across layers. Assume that quantized models will run faster on all hardware since acceleration depends on platform support.
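A minimal CPU latency sketch for the before-and-after comparison, assuming model and quantized from the Basic Usage example; the input shape and iteration counts are illustrative, and on GPU you would synchronize before reading the clock.

import time
import torch

def benchmark(m, example, warmup=10, iters=100):
    # Average wall-clock time per forward pass, in milliseconds
    m.eval()
    with torch.no_grad():
        for _ in range(warmup):
            m(example)
        start = time.perf_counter()
        for _ in range(iters):
            m(example)
    return (time.perf_counter() - start) / iters * 1000

example = torch.randn(1, 3, 224, 224)
print(f'original:  {benchmark(model, example):.2f} ms')
print(f'quantized: {benchmark(quantized, example):.2f} ms')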
Limitations
Dynamic quantization only accelerates linear layers and may not speed up convolution-heavy architectures. Pruned models require sparse tensor support for an actual speedup, which not all deployment runtimes provide. Knowledge distillation requires training a student model from scratch with carefully designed loss functions that balance task performance and teacher alignment, which adds significant computational cost and tuning effort to the optimization pipeline.
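To make that loss design concrete, here is a minimal sketch of one common formulation: temperature-scaled soft targets from the teacher blended with the ordinary hard-label task loss. The temperature T and weighting alpha are illustrative hyperparameters, not prescribed values.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # KL divergence between temperature-softened student and teacher distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean') * (T * T)
    # Ordinary supervised loss on the ground-truth labels
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard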