Megatron Core

Scale large language model training using Megatron Core automation and integration

Megatron Core is a community skill for training large-scale language models with Megatron-Core, covering model parallelism, pipeline parallelism, data loading, mixed precision training, and checkpoint management for distributed model training.

What Is This?

Overview

Megatron Core provides tools for distributed training of large language models using the Megatron-Core framework. It covers five areas. Model parallelism splits model layers across multiple GPUs, using tensor parallelism for large matrix operations. Pipeline parallelism distributes model stages across GPU groups, with micro-batch scheduling to optimize throughput. Data loading streams training data with efficient tokenization and sequence packing. Mixed precision training uses BF16 or FP16 computation with FP32 accumulation for memory efficiency without numerical instability. Checkpoint management saves and restores distributed model states across parallelism configurations. The skill enables teams to train billion-parameter models on multi-GPU clusters, making it practical to scale architectures like GPT or LLaMA variants beyond what any single accelerator can accommodate.
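For a fixed cluster size, the data-parallel degree is whatever remains after the tensor and pipeline parallel degrees are chosen. A minimal sketch of that arithmetic (the helper function is illustrative, not part of Megatron-Core):

def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    # The world size must factor cleanly into tp * pp * dp.
    assert world_size % (tp * pp) == 0, 'world size must be divisible by tp * pp'
    return world_size // (tp * pp)

# Example: 16 GPUs with tensor parallel 2 and pipeline parallel 2
# leaves a data-parallel degree of 4.
print(data_parallel_degree(16, 2, 2))  # 4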

Who Should Use This

This skill serves ML engineers training large language models on GPU clusters, infrastructure teams managing distributed training environments, and researchers scaling model architectures beyond single-GPU capacity.

Why Use It?

Problems It Solves

Models with billions of parameters exceed single-GPU memory, requiring model parallelism strategies to distribute layers across devices. Pipeline scheduling across GPU groups needs careful micro-batch management to minimize idle time. Data loading at training throughput rates requires optimized tokenization and buffering pipelines. Checkpointing distributed models across different parallelism configurations requires specialized save and restore logic that accounts for how tensors are sharded across ranks.

Core Highlights

Tensor parallelizer splits large matrix operations across GPUs within a group. Pipeline scheduler distributes model stages with micro-batch interleaving. Data streamer loads tokenized sequences with efficient packing. Checkpoint handler saves distributed states with parallelism-aware restoration.

How to Use It?

Basic Usage

from megatron.core import parallel_state
from megatron.core.transformer.transformer_config import TransformerConfig


class TrainingSetup:
    def __init__(self, tp_size: int = 2, pp_size: int = 2, dp_size: int = 4):
        self.tp = tp_size
        self.pp = pp_size
        self.dp = dp_size

    def init_parallel(self):
        # Requires torch.distributed to be initialized first (e.g. via torchrun).
        parallel_state.initialize_model_parallel(
            tensor_model_parallel_size=self.tp,
            pipeline_model_parallel_size=self.pp,
        )

    def model_config(
        self, hidden: int = 4096, layers: int = 32, heads: int = 32
    ) -> TransformerConfig:
        # TransformerConfig carries both architecture and parallelism settings.
        return TransformerConfig(
            num_layers=layers,
            hidden_size=hidden,
            num_attention_heads=heads,
            tensor_model_parallel_size=self.tp,
            pipeline_model_parallel_size=self.pp,
        )
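
A minimal launch sketch for the class above, assuming the job is started with torchrun so that the process group environment variables are already set (the launch details here are an assumption, not prescribed by the skill):

import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
dist.init_process_group(backend='nccl')

setup = TrainingSetup(tp_size=2, pp_size=2)
setup.init_parallel()
config = setup.model_config(hidden=4096, layers=32, heads=32)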

Real-World Examples

import torch
from pathlib import Path
from typing import Optional


class CheckpointMgr:
    def __init__(self, save_dir: str, save_interval: int = 1000):
        self.dir = Path(save_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.interval = save_interval

    def save(self, model, optimizer, iteration: int):
        # Write model and optimizer state for the given training iteration.
        path = self.dir / f'iter_{iteration}'
        path.mkdir(exist_ok=True)
        torch.save(
            {
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'iteration': iteration,
            },
            str(path / 'checkpoint.pt'),
        )

    def load_latest(self) -> Optional[dict]:
        # Pick the checkpoint directory with the highest iteration number.
        dirs = sorted(
            self.dir.glob('iter_*'),
            key=lambda p: int(p.name.split('_')[1]),
        )
        if not dirs:
            return None
        return torch.load(str(dirs[-1] / 'checkpoint.pt'))
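
In a model-parallel run, each rank holds only its shard of the weights, so each rank needs its own file. A hedged sketch of how the manager above might be pointed at rank-specific directories (the parallel_state rank getters are Megatron-Core calls; the directory layout is an illustrative assumption, not the official distributed checkpoint format):

from megatron.core import parallel_state

# Each rank saves its own shard; the naming scheme below is illustrative only.
tp_rank = parallel_state.get_tensor_model_parallel_rank()
pp_rank = parallel_state.get_pipeline_model_parallel_rank()
mgr = CheckpointMgr(save_dir=f'checkpoints/tp{tp_rank}_pp{pp_rank}')

# Inside the training loop:
# if iteration % mgr.interval == 0:
#     mgr.save(model, optimizer, iteration)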

Advanced Tips

Use interleaved pipeline scheduling to reduce bubble time between micro-batches for better GPU utilization. Enable sequence packing to combine shorter sequences into single training examples, reducing padding waste. Profile communication overhead between tensor parallel groups to identify bandwidth bottlenecks. When tuning the micro-batch count, increasing the number of micro-batches per pipeline flush generally improves throughput but raises memory pressure, so benchmark both dimensions before finalizing your configuration; the arithmetic behind that count is sketched below.
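
The number of micro-batches per pipeline flush follows from the global batch size, the micro-batch size, and the data-parallel degree. A minimal sketch of that relationship (the helper is illustrative, not a Megatron-Core API):

def micro_batches_per_pipeline(global_batch: int, micro_batch: int, dp: int) -> int:
    # Each data-parallel replica processes global_batch / dp samples per step,
    # split into chunks of micro_batch for pipeline scheduling.
    per_replica = global_batch // dp
    assert per_replica % micro_batch == 0, 'micro batch must divide the per-replica batch'
    return per_replica // micro_batch

# Example: global batch 512, micro batch 4, data-parallel degree 8
# gives 16 micro-batches per pipeline flush.
print(micro_batches_per_pipeline(512, 4, 8))  # 16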

When to Use It?

Use Cases

Configure tensor and pipeline parallelism to train a model that exceeds single GPU memory. Set up distributed checkpointing that saves and restores across different parallelism configurations. Optimize training throughput by tuning micro-batch sizes and pipeline scheduling.

Related Topics

Distributed training, Megatron-Core, tensor parallelism, pipeline parallelism, large language models, mixed precision, and GPU cluster training.

Important Notes

Requirements

Multi-GPU cluster with high-bandwidth interconnects. Megatron-Core package with PyTorch and NCCL installed. CUDA-capable GPUs with sufficient aggregate memory.

Usage Recommendations

Do: profile parallelism configurations on a small scale before committing to long training runs. Use gradient accumulation to increase effective batch size without additional memory. Monitor GPU utilization to detect pipeline bubbles and communication stalls.

Don't: set the tensor parallelism degree higher than the number of attention heads, since the head count must be evenly divisible by the tensor parallel degree. Skip checkpoint validation after saving, since corrupted checkpoints waste training compute. Ignore NCCL errors that may indicate interconnect problems affecting training stability.
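
A quick pre-flight check along these lines can catch an invalid configuration before any GPUs are allocated (a hedged sketch; the function name and the layer-divisibility check are illustrative assumptions):

def validate_parallelism(num_heads: int, num_layers: int, tp: int, pp: int) -> None:
    # Attention heads are split across tensor-parallel ranks, so the head count
    # must be divisible by the tensor parallel degree.
    if num_heads % tp != 0:
        raise ValueError(f'{num_heads} heads not divisible by tensor parallel size {tp}')
    # Layers are split into pipeline stages; an even split keeps stages balanced.
    if num_layers % pp != 0:
        raise ValueError(f'{num_layers} layers not divisible by pipeline parallel size {pp}')

validate_parallelism(num_heads=32, num_layers=32, tp=2, pp=2)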

Limitations

Efficient scaling requires high-bandwidth GPU interconnects like NVLink or InfiniBand. Parallelism configuration depends on model architecture details that change between model families. Checkpoint format is tightly tied to the specific parallelism configuration used during the training run, meaning a checkpoint saved with tensor parallelism degree four cannot be loaded directly under a different degree without conversion tooling.