veRL

Automate and integrate veRL for scalable and efficient reinforcement learning training workflows

veRL is a community skill for reinforcement learning from human feedback (RLHF) training of language models. It covers reward model training, PPO optimization, RLHF pipelines, preference data handling, and distributed training orchestration for aligning large language models with human preferences.

What Is This?

Overview

veRL provides guidance on training language models with reinforcement learning from human feedback using the veRL framework. It covers reward model training, which learns a scoring function from human preference comparisons between model outputs to guide policy optimization; PPO optimization, which applies proximal policy optimization to fine-tune the language model on reward signals while constraining divergence from the reference model; preference data handling, which processes human comparison datasets with proper formatting of chosen and rejected response pairs; distributed training orchestration, which coordinates the actor, critic, reward model, and reference model efficiently across multiple GPUs; and evaluation pipelines, which measure alignment quality through win rate comparisons and human preference consistency. The skill helps teams align language models with human values and preferences across deployment contexts ranging from general-purpose assistants to specialized domain applications.
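As a sketch of the reward-modeling objective described above, the standard pairwise (Bradley-Terry) loss pushes each chosen response's score above its rejected counterpart's. This is an illustrative PyTorch sketch, not veRL's internal implementation:

import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores):
    # Maximize the log-probability that each chosen response
    # outscores its rejected counterpart (Bradley-Terry model).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()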

Who Should Use This

This skill serves AI researchers implementing RLHF training pipelines, ML engineers aligning language models for production deployment, and teams building custom reward models for domain-specific preference optimization, including use cases such as code generation quality, customer support tone, and safety-critical response filtering.

Why Use It?

Problems It Solves

RLHF training requires coordinating four separate models that must communicate during training steps. PPO optimization for language models needs careful hyperparameter tuning to avoid reward hacking and mode collapse. Managing GPU memory across actor, critic, reward, and reference models requires efficient sharding strategies. Preference data quality directly affects alignment outcomes but is difficult to validate automatically. Without a structured framework, teams often spend significant engineering effort on infrastructure rather than alignment research.
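For concreteness, the objective at the center of that tuning problem can be sketched as the clipped PPO surrogate plus a KL penalty toward the reference model. This is illustrative rather than veRL's internals, and some pipelines fold the KL term into the reward instead of the loss:

import torch

def ppo_step_loss(logprobs, old_logprobs, ref_logprobs, advantages,
                  clip_range=0.2, kl_coeff=0.1):
    # Clipped surrogate: bound how far one update can move the policy.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    policy_loss = -torch.min(ratio * advantages,
                             clipped * advantages).mean()
    # Naive KL estimate against the frozen reference model; penalizing
    # it discourages reward hacking and mode collapse.
    kl_penalty = kl_coeff * (logprobs - ref_logprobs).mean()
    return policy_loss + kl_penalty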

Core Highlights

The reward trainer learns accurate scoring functions from human preferences. The PPO optimizer fine-tunes the language model policy with KL-constrained updates. The orchestrator coordinates complex multi-model training across GPUs. The evaluator measures alignment quality through win rate comparisons.

How to Use It?

Basic Usage

from verl import RLHFTrainer, PPOConfig

# Four models participate in PPO-based RLHF: the actor being trained,
# a critic for value estimates, a frozen reward model, and a frozen
# reference model for the KL penalty.
ppo_config = PPOConfig(
    actor_model='llama-3-8b',
    critic_model='llama-3-8b',
    reward_model='reward-model',
    ref_model='llama-3-8b',
    kl_coeff=0.1,        # weight of the KL penalty toward the reference model
    clip_range=0.2,      # PPO clipping range for policy updates
    learning_rate=1e-6,
    batch_size=64,
    mini_batch_size=8,
    ppo_epochs=4,
)

trainer = RLHFTrainer(
    config=ppo_config,
    train_dataset=prompts_dataset,
    num_gpus=4,
)

trainer.train(num_episodes=1000)
trainer.save_model('aligned-model')

Real-World Examples

from verl import RewardModelTrainer

# Train the reward model first; PPO quality depends on it.
reward_trainer = RewardModelTrainer(
    base_model='llama-3-8b',
    learning_rate=1e-5,
    num_epochs=3,
)

reward_trainer.train(preference_dataset)  # chosen/rejected pairs
reward_trainer.save('reward-model')

from verl import evaluate

# Score the aligned model on held-out prompts with the trained reward model.
results = evaluate(
    model='aligned-model',
    prompts=eval_prompts,
    reward_model='reward-model',
)
print(f'Win rate: {results.win_rate:.2%}')
print(f'Mean reward: {results.mean_reward:.3f}')

Advanced Tips

Use a KL divergence penalty to keep the policy from diverging too far from the reference model; unconstrained divergence invites reward hacking. Train the reward model on diverse preference data that covers edge cases and ambiguous comparisons. Monitor reward scores during training to detect collapse or score inflation early. Logging per-episode reward distributions rather than only mean values helps identify instability before it becomes severe enough to require restarting training.
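As a sketch of that kind of monitoring (the function and names are illustrative, not part of the veRL API):

import numpy as np

def log_reward_distribution(episode, rewards):
    # Percentiles expose collapse (shrinking spread) and inflation
    # (a runaway upper tail) that a mean alone can hide.
    p10, p50, p90 = np.percentile(rewards, [10, 50, 90])
    print(f'episode {episode}: '
          f'p10={p10:.3f} p50={p50:.3f} p90={p90:.3f} '
          f'spread={p90 - p10:.3f}')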

When to Use It?

Use Cases

Align a language model to be helpful and harmless using human preference feedback. Train a domain-specific reward model for code quality or medical accuracy preferences. Optimize a chat model for user satisfaction metrics using RLHF.

Related Topics

RLHF, PPO, reward modeling, language model alignment, preference learning, distributed training, and AI safety.

Important Notes

Requirements

Multiple GPUs with sufficient combined VRAM to host the actor, critic, reward, and reference models simultaneously during training. A human preference dataset with chosen and rejected response pairs formatted for reward model training; an example record is shown below. PyTorch with the veRL framework installed and configured for distributed multi-model training.
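A typical preference record pairs one prompt with a chosen and a rejected response. The field names below are illustrative; check veRL's data documentation for the exact schema it expects:

{
  "prompt": "Explain what a race condition is.",
  "chosen": "A race condition occurs when two threads access shared state concurrently and the result depends on timing...",
  "rejected": "It is when code runs fast."
}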

Usage Recommendations

Do start with a well-trained reward model before beginning PPO optimization, since reward quality determines alignment outcomes. Do use gradient accumulation to maintain effective batch sizes when GPU memory is limited (a sketch follows these recommendations). Do validate alignment regularly with held-out evaluation prompts to catch degradation before it compounds across training episodes.

Don't train for too many PPO steps without monitoring for reward hacking, where the model exploits reward model weaknesses. Don't use noisy or contradictory preference data, since reward model quality depends on data consistency. Don't skip the KL penalty, since unconstrained optimization causes severe mode collapse.
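A minimal sketch of the gradient accumulation pattern recommended above (generic PyTorch; veRL may expose this through its own configuration instead, and loss_fn here is a hypothetical helper):

def train_with_accumulation(model, optimizer, dataloader, loss_fn,
                            accum_steps=8):
    # Effective batch size = per-step batch size * accum_steps
    # (* number of GPUs under data parallelism), without holding
    # the full batch in memory at once.
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = loss_fn(model, batch) / accum_steps  # average across steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()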

Limitations

RLHF training requires significantly more compute than supervised fine-tuning due to the multi-model architecture. Reward models can develop blind spots that the policy learns to exploit through reward hacking. Human preference data is expensive to collect at scale and inherently noisy, which limits alignment precision.