Ray Train
Automate distributed model training and integrate Ray Train into scalable machine learning pipelines
Ray Train is a community skill for distributed machine learning training with the Ray Train library, covering data-parallel training, framework integration, checkpoint management, hyperparameter scaling, and fault-tolerant execution.
What Is This?
Overview
Ray Train provides tools for scaling machine learning training across multiple GPUs and nodes using the Ray distributed computing framework. It covers data-parallel training that distributes model replicas across workers with synchronized gradient updates, framework integration that wraps PyTorch, TensorFlow, and HuggingFace training loops for distributed execution, checkpoint management that saves and restores model state during long training runs, hyperparameter scaling that combines training with Ray Tune for parallel experiment search, and fault-tolerant execution that recovers from worker failures without restarting entire training jobs. The skill enables ML engineers to scale training to multi-GPU clusters.
Who Should Use This
This skill serves ML engineers scaling model training beyond single-GPU capacity, research teams running distributed experiments across GPU clusters, and organizations building training platforms that support multiple ML frameworks.
Why Use It?
Problems It Solves
Training large models on a single GPU takes days or weeks when data-parallel distribution could reduce training time proportionally. Each ML framework has different distributed training APIs requiring separate infrastructure for PyTorch and TensorFlow workloads. Long training runs that fail lose all progress without checkpointing and recovery mechanisms. Scaling from single-GPU to multi-node training requires significant code changes with native distributed APIs.
Core Highlights
Trainer wrapper scales existing training loops to multiple GPUs with minimal changes. Framework connector integrates PyTorch, TensorFlow, and HuggingFace transparently. Checkpoint manager saves and restores training state for resilience. Fault handler recovers from worker failures without full restart.
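A minimal sketch of how the checkpoint manager and fault handler are configured, assuming the Ray 2.x RunConfig, CheckpointConfig, and FailureConfig APIs (check the exact parameter names against your installed version):

from ray.train import (CheckpointConfig, FailureConfig, RunConfig,
                       ScalingConfig)
from ray.train.torch import TorchTrainer


def train_fn(config):
    # Placeholder training loop; see Basic Usage below for a full example.
    pass


# Keep the two most recent checkpoints and tolerate up to three worker
# failures before aborting, instead of restarting the whole job.
run_config = RunConfig(
    name='resilient-training',
    checkpoint_config=CheckpointConfig(num_to_keep=2),
    failure_config=FailureConfig(max_failures=3),
)

trainer = TorchTrainer(
    train_fn,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=run_config,
)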
How to Use It?
Basic Usage
import torch
import torch.nn as nn

from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_fn(config):
    # prepare_model wraps the model in DistributedDataParallel and moves it
    # to the device assigned to this worker.
    model = prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])

    for epoch in range(config['epochs']):
        # Dummy batch for illustration; move inputs to the worker's device.
        batch = torch.randn(32, 10).to(get_device())
        loss = model(batch).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Report per-epoch metrics back to the driver.
        train.report({'loss': loss.item()})


trainer = TorchTrainer(
    train_fn,
    train_loop_config={'lr': 1e-3, 'epochs': 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
print(result.metrics)
Real-World Examples
from transformers import TrainingArguments

from ray.train import ScalingConfig
from ray.train.huggingface import TransformersTrainer


class DistributedTrainer:
    def __init__(self, model_name: str, num_gpus: int = 4):
        self.model_name = model_name
        self.num_gpus = num_gpus

    def train_fn(self):
        def _inner(config):
            # Imported inside the worker function so each Ray worker loads
            # the libraries in its own process.
            from transformers import AutoModel, Trainer

            model = AutoModel.from_pretrained(config['model_name'])
            args = TrainingArguments(
                output_dir='./output',
                per_device_train_batch_size=config['batch_size'],
                num_train_epochs=config['epochs'])
            # A transformers.Trainer would be constructed here from the
            # model, args, and a training dataset, then run with its train()
            # method; dataset loading is omitted from this example.
        return _inner

    def run(self, batch_size: int, epochs: int):
        trainer = TransformersTrainer(
            train_loop_per_worker=self.train_fn(),
            train_loop_config={
                'model_name': self.model_name,
                'batch_size': batch_size,
                'epochs': epochs},
            scaling_config=ScalingConfig(
                num_workers=self.num_gpus,
                use_gpu=True))
        return trainer.fit()

Advanced Tips
Use prepare_model and prepare_data_loader to automatically handle distributed data parallel wrapping and data sharding across workers. Enable checkpoint saving at regular intervals to recover from failures without losing all training progress. Combine Ray Train with Ray Tune to run distributed hyperparameter search where each trial uses multiple GPUs.
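As a rough, self-contained sketch of the first two tips, assuming the Ray 2.x checkpoint API (train.get_checkpoint, and train.report with a checkpoint argument) and using a toy model and dataset as stand-ins for real ones:

import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint
from ray.train.torch import prepare_data_loader, prepare_model


def build_model():
    # Toy model standing in for a real network.
    return torch.nn.Linear(10, 1)


def build_loader():
    # Toy dataset standing in for a real one.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(256, 10), torch.randn(256, 1))
    return torch.utils.data.DataLoader(dataset, batch_size=32)


def train_fn(config):
    model = prepare_model(build_model())
    # prepare_data_loader adds a DistributedSampler and moves batches to the
    # worker's device, so each worker sees its own shard of the data.
    loader = prepare_data_loader(build_loader())
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])

    # Resume from the latest checkpoint if this run was restarted.
    start_epoch = 0
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, 'state.pt'))
            model.load_state_dict(state['model'])
            start_epoch = state['epoch'] + 1

    for epoch in range(start_epoch, config['epochs']):
        for batch, target in loader:
            loss = torch.nn.functional.mse_loss(model(batch), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Save a checkpoint every epoch and attach it to the reported metrics
        # so Ray Train can restore from it after a worker failure.
        with tempfile.TemporaryDirectory() as ckpt_dir:
            torch.save({'model': model.state_dict(), 'epoch': epoch},
                       os.path.join(ckpt_dir, 'state.pt'))
            train.report({'loss': loss.item()},
                         checkpoint=Checkpoint.from_directory(ckpt_dir))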
When to Use It?
Use Cases
Scale a PyTorch training loop from single-GPU to multi-GPU execution by wrapping it with the TorchTrainer. Fine-tune a HuggingFace transformer model across multiple GPUs with automatic data sharding. Run fault-tolerant training jobs on preemptible cloud instances with checkpoint recovery.
Related Topics
Ray Train, distributed training, PyTorch, data parallelism, GPU clusters, model training, and HuggingFace.
Important Notes
Requirements
Ray Python package with the train module and ML framework integrations. Multi-GPU setup with CUDA drivers for GPU-accelerated training. Shared storage accessible from all worker nodes for checkpoints.
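For the shared-storage requirement, a minimal sketch of pointing a run at storage every node can reach (the bucket path is a placeholder; storage_path follows the Ray 2.x RunConfig API):

from ray.train import RunConfig

# Write results and checkpoints to shared storage (for example S3 or NFS) so
# any node, including one replacing a failed worker, can read them back.
run_config = RunConfig(storage_path='s3://my-bucket/ray-train-runs')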
Usage Recommendations
Do: use prepare_model to handle distributed wrapping rather than manually configuring DDP or FSDP. Save checkpoints regularly during long training runs to enable recovery from failures. Scale the effective batch size proportionally with the number of workers.
Don't: expect linear speedup with more workers since communication overhead reduces scaling efficiency. Skip learning rate adjustment when increasing effective batch size since this affects training convergence. Run distributed training for small models where single-GPU training is already fast enough.
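To tie the batch-size and learning-rate guidance together, a hedged sketch of linear learning-rate scaling. This is a common heuristic rather than a Ray Train rule, and get_context().get_world_size() assumes the Ray 2.x API and must run inside the training function:

import torch
from ray import train


def train_fn(config):
    # With data parallelism the effective batch size is the per-worker batch
    # size times the number of workers, so scale the base learning rate
    # linearly with the world size as a starting point.
    world_size = train.get_context().get_world_size()
    lr = config['base_lr'] * world_size

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # ... rest of the training loop as in Basic Usage ...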
Limitations
Communication overhead between workers reduces scaling efficiency for small models and batch sizes. All workers must have matching GPU specifications for balanced data-parallel training. Some custom training loop patterns may require adaptation to work with Ray Train's wrapper.