DeepSpeed
Automate and integrate DeepSpeed for optimized deep learning training at scale
What Is This?
Overview
DeepSpeed provides patterns for scaling deep learning training across multiple GPUs and nodes using Microsoft's DeepSpeed library. It covers the ZeRO optimizer stages, which partition optimizer states, gradients, and parameters across devices to reduce per-GPU memory; mixed-precision training, which uses FP16 or BF16 computation with FP32 master weights for higher throughput; pipeline parallelism, which splits model layers across GPUs with micro-batch scheduling; model sharding, which distributes parameters across devices so that models can exceed single-GPU memory; and inference optimization, which applies kernel fusion and quantization for faster serving. The skill enables researchers to train billion-parameter models on available hardware without expensive specialized infrastructure.
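As a minimal orientation to the ZeRO stages, the comment block below summarizes what each stage partitions; the config fragment is illustrative rather than a complete configuration.

# What each ZeRO stage partitions across data-parallel ranks:
#   stage 1: optimizer states
#   stage 2: optimizer states + gradients
#   stage 3: optimizer states + gradients + model parameters
ds_config = {'zero_optimization': {'stage': 2}}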
Who Should Use This
This skill serves machine learning engineers training large language models, research teams scaling experiments beyond single-GPU capacity, and MLOps engineers optimizing training infrastructure costs and resource utilization.
Why Use It?
Problems It Solves
Large models exceed single-GPU memory, requiring distribution across multiple devices. Naive data parallelism duplicates the full model on each GPU, wasting memory. Training throughput does not scale linearly unless communication overhead between nodes is managed carefully. Moving from training to inference requires separate optimization for serving latency and throughput.
Core Highlights
The ZeRO optimizer partitions optimizer states, gradients, and parameters across GPUs, reducing memory by up to eight times compared to standard data parallelism. The mixed-precision engine manages FP16 forward and backward passes with automatic loss scaling to prevent underflow. The pipeline scheduler overlaps computation with communication using micro-batch interleaving. The inference engine applies kernel fusion and weight quantization for serving.
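As a hedged illustration of the mixed-precision knobs, the fragment below shows DeepSpeed's fp16 config block with dynamic loss scaling enabled. The keys shown (loss_scale, initial_scale_power, loss_scale_window, hysteresis, min_loss_scale) are standard DeepSpeed fp16 options, but defaults vary by release, so consult the configuration docs for authoritative values.

ds_config_fp16 = {
    'fp16': {
        'enabled': True,
        'loss_scale': 0,            # 0 selects dynamic loss scaling
        'initial_scale_power': 16,  # start the scale at 2**16
        'loss_scale_window': 1000,  # raise scale after 1000 clean steps
        'hysteresis': 2,            # tolerate 2 overflows before lowering
        'min_loss_scale': 1,
    },
}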
How to Use It?
Basic Usage
import deepspeed
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('gpt2-large')

# ZeRO Stage 2 with optimizer-state offload to CPU.
ds_config = {
    'train_batch_size': 32,
    'gradient_accumulation_steps': 4,
    'fp16': {'enabled': True},
    'zero_optimization': {
        'stage': 2,
        'offload_optimizer': {'device': 'cpu'},
        'allgather_bucket_size': 5e8,
    },
}

engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=model.parameters(),
)

# `dataloader` is assumed to yield tokenized batches.
for batch in dataloader:
    input_ids = batch['input_ids'].to(engine.device)
    # Causal LMs compute a loss only when labels are supplied.
    outputs = engine(input_ids, labels=input_ids)
    engine.backward(outputs.loss)
    engine.step()

Real-World Examples
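The configuration below targets fine-tuning a model that does not fit in aggregate GPU memory: ZeRO Stage 3 shards parameters as well as gradients and optimizer states, both parameters and optimizer states are offloaded to pinned CPU memory, and activation checkpointing is partitioned across devices.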
# ZeRO Stage 3: shard parameters, gradients, and optimizer states,
# offloading parameters and optimizer states to pinned CPU memory.
ds_config_zero3 = {
    'train_batch_size': 16,
    'fp16': {
        'enabled': True,
        'loss_scale_window': 1000,
    },
    'zero_optimization': {
        'stage': 3,
        'offload_param': {
            'device': 'cpu',
            'pin_memory': True,
        },
        'offload_optimizer': {
            'device': 'cpu',
            'pin_memory': True,
        },
        'overlap_comm': True,
        'contiguous_gradients': True,
        'sub_group_size': 1e9,
        'stage3_prefetch_bucket_size': 5e8,
    },
    'activation_checkpointing': {
        'partition_activations': True,
        'cpu_checkpointing': True,
    },
}

engine, _, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config_zero3,
    model_parameters=model.parameters(),
)

Advanced Tips
Enable activation checkpointing, which recomputes activations during the backward pass instead of storing them, to trade compute for memory when training models that barely fit in GPU memory. This typically increases training time by 30 to 40 percent but can halve memory consumption. Use ZeRO Stage 2 for most training scenarios, and move to Stage 3 only when model parameters do not fit in aggregate GPU memory. Pin CPU memory when offloading to avoid slow paged memory transfers, and set overlap_comm to True to hide communication latency behind computation. A minimal checkpointing sketch follows.
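The fragment below sketches what checkpointing means at the module level using PyTorch's torch.utils.checkpoint; the blocks argument and forward_with_checkpointing helper are hypothetical names for illustration. When activation_checkpointing is set in the DeepSpeed config, deepspeed.checkpointing.checkpoint can be substituted as a drop-in replacement that honors the partitioning and CPU-checkpointing options.

import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, hidden_states):
    # `blocks` is a hypothetical list of transformer layers.
    for block in blocks:
        # Store only this block's input; recompute its activations
        # during the backward pass instead of keeping them resident.
        hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
    return hidden_states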
When to Use It?
Use Cases
Train a billion-parameter language model across multiple GPUs using ZeRO memory optimization. Fine-tune a large pre-trained model on limited GPU hardware using CPU offloading. Optimize a trained model for inference with kernel fusion and INT8 quantization.
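As a concrete illustration of the inference path, the sketch below wraps a model with deepspeed.init_inference. Keyword arguments such as mp_size, dtype, and replace_with_kernel_inject have varied across DeepSpeed releases, so treat this as an assumption-laden outline rather than a definitive recipe.

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('gpt2-large')
tokenizer = AutoTokenizer.from_pretrained('gpt2-large')

# Replace supported modules with fused inference kernels and
# run in FP16 for lower latency.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.half,                 # FP16 weights and activations
    replace_with_kernel_inject=True,  # enable kernel fusion
)

inputs = tokenizer('DeepSpeed makes inference', return_tensors='pt').to('cuda')
outputs = ds_engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))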
Related Topics
Distributed training, DeepSpeed, ZeRO optimizer, mixed precision, model parallelism, and large language models.
Important Notes
Requirements
NVIDIA GPUs with CUDA support for training acceleration. The DeepSpeed library installed with a compatible PyTorch version. Sufficient CPU memory for offloading when using ZeRO Stage 3 with CPU offload enabled.
Usage Recommendations
Do: start with ZeRO Stage 2 and increase to Stage 3 only if memory is still insufficient. Monitor GPU memory utilization and communication overhead to find the optimal batch size. Use gradient accumulation to achieve larger effective batch sizes without increasing per-GPU memory.
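DeepSpeed enforces the batch-size arithmetic train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world size. The fragment below assumes a world size of 8 GPUs for illustration.

# Assuming a world size of 8 GPUs:
#   2 (micro batch) * 8 (accumulation steps) * 8 (GPUs) = 128
ds_config = {
    'train_micro_batch_size_per_gpu': 2,  # what fits in one GPU
    'gradient_accumulation_steps': 8,     # steps before each optimizer update
    'train_batch_size': 128,              # effective global batch
}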
Don't: enable CPU offloading by default, since it significantly reduces training throughput; use it only when GPU memory is insufficient. Don't mix DeepSpeed configuration with manual distributed training code, which can cause conflicts. Don't skip loss-scale window tuning in FP16 mode, as an untuned window can cause training instability.
Limitations
CPU offloading reduces training throughput significantly due to data transfer latency between CPU and GPU memory. Pipeline parallelism introduces bubble time where some GPUs idle while waiting for micro-batches. Multi-node training requires high-bandwidth interconnects and network configuration that adds operational complexity.