GRPO RL Training

Automate and integrate GRPO reinforcement learning training workflows

Category: productivity Source: Orchestra-Research/AI-Research-SKILLs

GRPO RL Training is a community skill for implementing Group Relative Policy Optimization to fine-tune language models using reinforcement learning, covering reward modeling, group sampling, advantage estimation, and training loop configuration.

What Is This?

Overview

GRPO RL Training provides patterns for applying Group Relative Policy Optimization to language model alignment. It covers group sampling where multiple outputs are generated per prompt, relative advantage computation, policy gradient updates with KL constraints, and reward function integration. The skill enables practitioners to implement GRPO as a simpler alternative to PPO for aligning language models.

Who Should Use This

This skill serves ML engineers implementing RLHF pipelines who want simpler alternatives to PPO, researchers exploring group-based optimization methods for language model alignment, and teams fine-tuning models where relative ranking of outputs is more natural than absolute reward scoring.

Why Use It?

Problems It Solves

PPO requires a separate critic network that adds memory overhead and training complexity to the alignment pipeline. Absolute reward scores from reward models can be noisy and poorly calibrated. Training instability in standard RL approaches for language models leads to mode collapse or reward hacking. Managing the balance between reward optimization and generation quality requires careful hyperparameter tuning.

Core Highlights

Group sampling generates multiple completions per prompt and computes advantages relative to the group mean. KL divergence penalty keeps the trained policy close to the reference model to prevent reward hacking. Simplified training loop requires fewer components than PPO. Batch-level normalization of advantages reduces variance in policy gradient estimates.
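
For instance, a group of four completions normalizes to mean-zero advantages within its own group. A minimal sketch, using made-up reward values:

```python
import math

# Hypothetical rewards for a group of 4 completions from one prompt
rewards = [1.0, 2.0, 3.0, 2.0]

mean_r = sum(rewards) / len(rewards)
std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / len(rewards))

# Each completion's advantage is its reward relative to its own group
advantages = [(r - mean_r) / std_r for r in rewards]
print([round(a, 3) for a in advantages])  # sums to zero across the group
```

Because every advantage is measured against the group mean, only the relative ordering of completions matters, not the reward model's absolute scale.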

How to Use It?

Basic Usage

from dataclasses import dataclass, field
import math

@dataclass
class GRPOConfig:
    group_size: int = 4        # completions sampled per prompt
    kl_coeff: float = 0.1      # weight of the KL penalty against the reference model
    learning_rate: float = 1e-6
    clip_range: float = 0.2    # PPO-style ratio clipping bound
    max_grad_norm: float = 1.0

@dataclass
class GroupSample:
    prompt: str
    completions: list[str] = field(default_factory=list)
    rewards: list[float] = field(default_factory=list)
    advantages: list[float] = field(default_factory=list)

class GRPOTrainer:
    def __init__(self, config: GRPOConfig):
        self.config = config

    def compute_advantages(self, sample: GroupSample) -> GroupSample:
        # Normalize each reward against its own group's mean and standard
        # deviation; the epsilon floor avoids division by zero when every
        # reward in the group is identical.
        mean_reward = sum(sample.rewards) / len(sample.rewards)
        std_reward = max(
            math.sqrt(sum((r - mean_reward) ** 2
                          for r in sample.rewards) / len(sample.rewards)),
            1e-8,
        )
        sample.advantages = [
            (r - mean_reward) / std_reward for r in sample.rewards
        ]
        return sample

    def compute_loss(self, log_probs: list[float],
                     ref_log_probs: list[float],
                     advantages: list[float]) -> float:
        # Policy-gradient term: raise log-probs of above-average completions,
        # lower those of below-average ones.
        policy_loss = sum(
            -adv * lp for adv, lp in zip(advantages, log_probs)
        ) / len(advantages)
        # KL estimate against the frozen reference model anchors the policy
        # and discourages reward hacking.
        kl_penalty = sum(
            lp - rlp for lp, rlp in zip(log_probs, ref_log_probs)
        ) / len(log_probs)
        return policy_loss + self.config.kl_coeff * kl_penalty
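
The loss above combines the policy-gradient term with the KL estimate. A standalone sketch of the same arithmetic, with hypothetical log-probability values:

```python
# Standalone sketch of the GRPO loss: policy-gradient term plus KL penalty.
# All numbers are illustrative, not from a real model.
kl_coeff = 0.1
advantages = [-1.0, 0.0, 1.0]
log_probs = [-2.3, -1.9, -1.5]       # current policy log-probs
ref_log_probs = [-2.2, -2.0, -1.8]   # frozen reference model log-probs

policy_loss = sum(-a * lp for a, lp in zip(advantages, log_probs)) / len(advantages)
kl_penalty = sum(lp - r for lp, r in zip(log_probs, ref_log_probs)) / len(log_probs)
loss = policy_loss + kl_coeff * kl_penalty
print(loss)  # negative here: the highest-advantage completion is already likely
```

Note that only the KL term involves the reference model; the advantages come entirely from group-relative rewards, which is what lets GRPO drop PPO's learned critic.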

Real-World Examples

from dataclasses import dataclass, field

@dataclass
class TrainingMetrics:
    steps: list[int] = field(default_factory=list)
    rewards: list[float] = field(default_factory=list)
    kl_values: list[float] = field(default_factory=list)
    losses: list[float] = field(default_factory=list)

class GRPOPipeline:
    def __init__(self, config: GRPOConfig, reward_fn):
        self.config = config
        self.reward_fn = reward_fn
        self.trainer = GRPOTrainer(config)
        self.metrics = TrainingMetrics()

    def generate_group(self, prompt: str,
                       generate_fn) -> GroupSample:
        # Sample group_size completions for one prompt, score them, and
        # convert the scores into group-relative advantages.
        completions = [
            generate_fn(prompt)
            for _ in range(self.config.group_size)
        ]
        rewards = [self.reward_fn(prompt, c) for c in completions]
        sample = GroupSample(
            prompt=prompt, completions=completions, rewards=rewards
        )
        return self.trainer.compute_advantages(sample)

    def train_step(self, prompts: list[str],
                   generate_fn, step: int) -> float:
        # Reward-tracking portion of a training step; the loss computation
        # and optimizer update are omitted here for brevity.
        total_reward = 0.0
        for prompt in prompts:
            group = self.generate_group(prompt, generate_fn)
            total_reward += sum(group.rewards) / len(group.rewards)
        avg_reward = total_reward / len(prompts)
        self.metrics.steps.append(step)
        self.metrics.rewards.append(avg_reward)
        return avg_reward
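
To see the group mechanics end to end without a real model, here is a hypothetical sketch using a canned "generator" and a toy length-based reward in place of policy sampling and a reward model:

```python
from itertools import cycle
import math

# Canned generator and toy reward, standing in for a policy model and a
# reward model (both hypothetical)
gen = cycle(["short", "a longer completion", "mid length", "tiny"])

def generate_fn(prompt: str) -> str:
    return next(gen)

def reward_fn(prompt: str, completion: str) -> float:
    return -float(len(completion))  # toy reward: shorter is better

prompt = "Summarize GRPO in one line."
completions = [generate_fn(prompt) for _ in range(4)]
rewards = [reward_fn(prompt, c) for c in completions]

# Group-relative advantages, mirroring compute_advantages above
mean_r = sum(rewards) / len(rewards)
std_r = max(math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / len(rewards)), 1e-8)
advantages = [(r - mean_r) / std_r for r in rewards]
print(rewards, [round(a, 2) for a in advantages])
```

The shortest completion ends up with the largest positive advantage, which is exactly the signal the policy update would amplify.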

Advanced Tips

Increase group size for more stable advantage estimates. Monitor KL divergence throughout training and adjust the coefficient if the policy drifts too far. Use reward model ensembles to reduce reward noise.
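
One common way to act on the KL-monitoring tip is a simple adaptive controller. In this sketch the target, thresholds, and scaling factor are illustrative defaults, not part of the skill:

```python
def adapt_kl_coeff(kl_coeff: float, observed_kl: float,
                   target_kl: float = 0.05, factor: float = 1.5) -> float:
    # Hypothetical proportional controller: raise the penalty when the policy
    # drifts well past the target KL, relax it when comfortably under.
    if observed_kl > 2.0 * target_kl:
        return kl_coeff * factor
    if observed_kl < 0.5 * target_kl:
        return kl_coeff / factor
    return kl_coeff

print(adapt_kl_coeff(0.1, 0.20))  # drifted: coefficient increases
print(adapt_kl_coeff(0.1, 0.01))  # well-anchored: coefficient relaxes
```

Keeping the coefficient in a band rather than fixed trades a small amount of tuning effort for robustness across reward scales.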

When to Use It?

Use Cases

Align a language model with human preferences using a simpler training loop than PPO requires. Fine-tune a code generation model using execution-based rewards where relative ranking within a group is more meaningful than absolute scores. Implement iterative alignment where the model improves through multiple rounds of group sampling and policy updates.

Related Topics

Proximal Policy Optimization for LLMs, reward modeling, RLHF training pipelines, Direct Preference Optimization, and language model alignment methods.

Important Notes

Requirements

A reward function or reward model for scoring generated completions. A pre-trained language model serving as both the policy and reference model. Sufficient compute for generating multiple completions per prompt during training.

Usage Recommendations

Do: start with small group sizes and scale up based on advantage estimate variance. Freeze the reference model throughout training to maintain a stable KL anchor. Log reward distributions per group to verify that the model generates diverse completions.
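
The per-group reward logging can double as a diversity check. A sketch that flags groups whose rewards have near-zero spread (the threshold is a made-up default):

```python
import statistics

def flag_degenerate_groups(group_rewards: list[list[float]],
                           min_std: float = 1e-3) -> list[int]:
    # Indices of groups whose rewards are nearly identical -- often a sign
    # the policy produced near-duplicate completions, leaving no useful
    # group-relative signal.
    return [i for i, rewards in enumerate(group_rewards)
            if statistics.pstdev(rewards) < min_std]

groups = [[1.0, 1.0, 1.0, 1.0], [0.2, 0.9, 0.5, 0.7]]
print(flag_degenerate_groups(groups))  # only the zero-spread group is flagged
```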

Don't: set the KL coefficient to zero, which removes the constraint that keeps outputs natural. Avoid reward functions with high variance that score similar outputs inconsistently. Never skip advantage normalization, which can lead to unstable gradient updates.

Limitations

Group sampling multiplies generation costs by the group size factor for each training prompt. GRPO performance depends on the quality and calibration of the reward function used for scoring. Small group sizes produce noisy advantage estimates that can slow convergence.