Megatron Core
Scale large language model training using Megatron Core automation and integration
Megatron Core is a community skill for training large-scale language models with Megatron-Core, covering model parallelism, pipeline parallelism, data loading, mixed precision training, and checkpoint management for distributed training.
What Is This?
Overview
Megatron Core provides tools for distributed training of large language models using the Megatron-Core framework. It covers five areas: model parallelism, which splits model layers across multiple GPUs using tensor parallelism for large matrix operations; pipeline parallelism, which distributes model stages across GPU groups with micro-batch scheduling for throughput; data loading, which streams training data with efficient tokenization and sequence packing; mixed precision training, which uses BF16 or FP16 computation with FP32 accumulation for memory efficiency without numerical instability; and checkpoint management, which saves and restores distributed model states across parallelism configurations. The skill enables teams to train billion-parameter models on multi-GPU clusters, making it practical to scale architectures like GPT or LLaMA variants beyond what any single accelerator can accommodate.
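The three parallelism dimensions compose multiplicatively: the product of the tensor, pipeline, and data parallel degrees must equal the number of GPUs in the job. A minimal sketch of that bookkeeping (plain Python, no Megatron required; the degree values are illustrative):

```python
def world_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs required by a (tensor, pipeline, data) parallel layout."""
    return tp * pp * dp


def check_layout(tp: int, pp: int, dp: int, available_gpus: int) -> bool:
    """A layout is feasible only if its product matches the GPU count."""
    return world_size(tp, pp, dp) == available_gpus


# Example: TP=2, PP=2, DP=4 needs a 16-GPU cluster.
print(world_size(2, 2, 4))        # → 16
print(check_layout(2, 2, 4, 16))  # → True
print(check_layout(4, 2, 4, 16))  # → False (would need 32 GPUs)
```

Checking this product up front catches misconfigured launches before any process group is initialized.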
Who Should Use This
This skill serves ML engineers training large language models on GPU clusters, infrastructure teams managing distributed training environments, and researchers scaling model architectures beyond single-GPU capacity.
Why Use It?
Problems It Solves
Models with billions of parameters exceed single-GPU memory, requiring model parallelism strategies to distribute layers across devices. Pipeline scheduling across GPU groups needs careful micro-batch management to minimize idle time. Data loading at training throughput requires optimized tokenization and buffering pipelines. Checkpointing distributed models across different parallelism configurations requires specialized save and restore logic that accounts for how tensors are sharded across ranks.
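To see why billion-parameter models overflow a single device, a back-of-the-envelope memory estimate helps. The sketch below uses illustrative byte counts (BF16 weights and gradients, an Adam-style optimizer with FP32 master weights and two moments) and deliberately ignores activations:

```python
def training_memory_gb(params_billions: float,
                       weight_bytes: int = 2,    # BF16 weights
                       grad_bytes: int = 2,      # BF16 gradients
                       optim_bytes: int = 12) -> float:  # FP32 master + Adam moments
    """Rough per-replica memory (GB) for weights, grads, and optimizer state."""
    per_param = weight_bytes + grad_bytes + optim_bytes
    return params_billions * per_param  # 1e9 params × bytes / 1e9 bytes-per-GB

# A 7B model needs roughly 112 GB before activations -- beyond any single 80 GB GPU.
print(round(training_memory_gb(7)))  # → 112
```

Even this conservative estimate makes the case for sharding state across tensor-parallel and data-parallel ranks.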
Core Highlights
Tensor parallelizer splits large matrix operations across GPUs within a group. Pipeline scheduler distributes model stages with micro-batch interleaving. Data streamer loads tokenized sequences with efficient packing. Checkpoint handler saves distributed states with parallelism-aware restoration.
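The "efficient packing" highlight refers to concatenating short sequences into fixed-length training examples so less of each example is padding. A greedy first-fit sketch (plain Python; a real loader would also insert separator tokens and attention-mask boundaries, which are omitted here):

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing of sequence lengths into bins."""
    bins = []  # each bin is a list of sequence lengths sharing one example
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])
    return bins


packed = pack_sequences([900, 600, 500, 300, 200, 100], max_len=1024)
# Padding waste = unused token slots across all packed examples.
waste = sum(1024 - sum(b) for b in packed)
print(len(packed), waste)  # → 3 472  (vs 6 examples and 3544 wasted slots unpacked)
```

Packing here cuts six padded examples down to three, reducing wasted slots by nearly an order of magnitude.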
How to Use It?
Basic Usage
```python
from megatron.core import parallel_state
from megatron.core.transformer.transformer_config import TransformerConfig


class TrainingSetup:
    """Holds the parallelism layout and builds a matching model config."""

    def __init__(self, tp_size: int = 2, pp_size: int = 2, dp_size: int = 4):
        self.tp = tp_size
        self.pp = pp_size
        self.dp = dp_size  # data-parallel degree follows from world_size / (tp * pp)

    def init_parallel(self):
        # Set up Megatron's tensor- and pipeline-parallel process groups.
        parallel_state.initialize_model_parallel(
            tensor_model_parallel_size=self.tp,
            pipeline_model_parallel_size=self.pp,
        )

    def model_config(self, hidden: int = 4096, layers: int = 32,
                     heads: int = 32) -> TransformerConfig:
        return TransformerConfig(
            num_layers=layers,
            hidden_size=hidden,
            num_attention_heads=heads,
            tensor_model_parallel_size=self.tp,
            pipeline_model_parallel_size=self.pp,
        )
```
Real-World Examples
```python
import torch
from pathlib import Path


class CheckpointMgr:
    """Saves and restores iteration-stamped training checkpoints."""

    def __init__(self, save_dir: str, save_interval: int = 1000):
        self.dir = Path(save_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.interval = save_interval

    def save(self, model, optimizer, iteration: int):
        path = self.dir / f'iter_{iteration}'
        path.mkdir(exist_ok=True)
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'iteration': iteration,
        }, str(path / 'checkpoint.pt'))

    def load_latest(self) -> dict | None:
        # Sort iteration directories numerically so iter_900 sorts before iter_1000.
        dirs = sorted(
            self.dir.glob('iter_*'),
            key=lambda p: int(p.name.split('_')[1]),
        )
        if not dirs:
            return None
        return torch.load(str(dirs[-1] / 'checkpoint.pt'))
```
Advanced Tips
Use interleaved pipeline scheduling to reduce bubble time between micro-batches and improve GPU utilization. Enable sequence packing to combine shorter sequences into single training examples, reducing padding waste. Profile communication overhead between tensor-parallel groups to identify bandwidth bottlenecks. When tuning micro-batch count, remember that increasing the number of micro-batches per pipeline flush generally improves throughput but raises memory pressure, so benchmark both dimensions before finalizing your configuration.
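The bubble-versus-memory trade-off above can be quantified: with p pipeline stages and m micro-batches per flush, the idle "bubble" fraction of a standard 1F1B schedule is approximately (p − 1) / (m + p − 1). A quick sketch with illustrative values:

```python
def bubble_fraction(p: int, m: int) -> float:
    """Approximate pipeline idle fraction: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)


# With 4 stages, going from 4 to 32 micro-batches cuts the bubble sharply,
# at the cost of holding more in-flight activations per stage.
print(round(bubble_fraction(4, 4), 3))   # → 0.429
print(round(bubble_fraction(4, 32), 3))  # → 0.086
```

The formula makes the benchmarking advice concrete: the bubble shrinks hyperbolically in m, so the first few extra micro-batches buy the most throughput.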
When to Use It?
Use Cases
Configure tensor and pipeline parallelism to train a model that exceeds single GPU memory. Set up distributed checkpointing that saves and restores across different parallelism configurations. Optimize training throughput by tuning micro-batch sizes and pipeline scheduling.
Related Topics
Distributed training, Megatron-Core, tensor parallelism, pipeline parallelism, large language models, mixed precision, and GPU cluster training.
Important Notes
Requirements
Multi-GPU cluster with high-bandwidth interconnects. Megatron-Core package with PyTorch and NCCL installed. CUDA-capable GPUs with sufficient aggregate memory.
Usage Recommendations
Do: profile parallelism configurations on a small scale before committing to long training runs. Use gradient accumulation to increase effective batch size without additional memory. Monitor GPU utilization to detect pipeline bubbles and communication stalls.
Don't: set the tensor parallelism degree higher than the number of attention heads, since the head count must divide evenly by the TP degree. Don't skip checkpoint validation after saving; corrupted checkpoints waste training compute. Don't ignore NCCL errors, which may indicate interconnect problems affecting training stability.
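The head-divisibility and checkpoint-validation rules above are easy to automate. A hedged sketch (plain Python; the size-based checkpoint check is a minimal stand-in for a full reload-and-compare pass):

```python
from pathlib import Path


def check_tp_degree(num_heads: int, tp: int) -> None:
    """Attention heads must split evenly across tensor-parallel ranks."""
    if num_heads % tp != 0:
        raise ValueError(f"{num_heads} heads not divisible by TP degree {tp}")


def checkpoint_looks_valid(path: Path) -> bool:
    """Cheap post-save guard: the file exists and is non-trivially sized."""
    return path.is_file() and path.stat().st_size > 0


check_tp_degree(32, 8)         # fine: 32 % 8 == 0
try:
    check_tp_degree(32, 6)     # raises: 32 % 6 != 0
except ValueError as e:
    print(e)
```

Running checks like these at job start and after every save costs microseconds and can prevent a multi-day run from failing late or producing an unloadable checkpoint.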
Limitations
Efficient scaling requires high-bandwidth GPU interconnects like NVLink or InfiniBand. Parallelism configuration depends on model architecture details that change between model families. Checkpoint format is tightly tied to the specific parallelism configuration used during the training run, meaning a checkpoint saved with tensor parallelism degree four cannot be loaded directly under a different degree without conversion tooling.
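The conversion problem in the last point can be pictured with plain lists: a column-sharded weight saved under TP=4 must be merged back into the full matrix and re-split before it can serve TP=2. An illustrative sketch (Python lists of column indices standing in for real weight shards):

```python
def merge_shards(shards):
    """Concatenate column shards back into the full (unsharded) weight."""
    return [col for shard in shards for col in shard]


def reshard(full, tp: int):
    """Split the full weight into tp equal column shards."""
    assert len(full) % tp == 0, "columns must divide evenly by TP degree"
    step = len(full) // tp
    return [full[i * step:(i + 1) * step] for i in range(tp)]


# 8 columns saved under TP=4 (2 columns per rank) ...
saved_tp4 = [[0, 1], [2, 3], [4, 5], [6, 7]]
full = merge_shards(saved_tp4)      # → [0, 1, 2, 3, 4, 5, 6, 7]
loaded_tp2 = reshard(full, tp=2)    # → [[0, 1, 2, 3], [4, 5, 6, 7]]
print(loaded_tp2)
```

Real conversion tooling must repeat this merge-and-resplit for every sharded tensor, with per-layer knowledge of whether each weight is split by rows or columns, which is why cross-configuration loading is not automatic.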