SimPO

Align language models with human preferences using Simple Preference Optimization (SimPO), without a separate reward model or frozen reference model

SimPO is a community skill for aligning language models with the Simple Preference Optimization method, covering reward-free alignment, reference-model-free training, length normalization, margin-based objectives, and efficient fine-tuning for improved model behavior.

What Is This?

Overview

SimPO provides tools for aligning language models with human preferences without requiring a separate reward model. It covers reward-free alignment (training directly from preference pairs without an intermediate reward model), reference-model-free optimization (no frozen reference model held during training), length normalization (preventing the model from favoring longer responses regardless of quality), margin-based objectives (enforcing a minimum quality gap between preferred and rejected outputs), and efficient fine-tuning (lower memory and compute requirements than RLHF). The skill helps researchers and engineers align models efficiently.
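
For orientation, the SimPO objective can be sketched as follows; it matches the compute_loss implementation in the Basic Usage example below, with beta scaling the length-normalized log-probability gap and gamma acting as the target reward margin:

\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)\right]

Here y_w and y_l are the chosen and rejected responses for prompt x, and |y| is the response length in tokens.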

Who Should Use This

This skill serves ML researchers training aligned language models, engineers fine-tuning models for specific behavioral preferences, and teams implementing preference optimization on limited compute.

Why Use It?

Problems It Solves

RLHF requires training a separate reward model, which adds complexity and compute cost. DPO keeps a frozen reference model in GPU memory throughout training. Models optimized with standard preference methods can exploit response length as a proxy for quality. Training instability in preference optimization leads to inconsistent alignment results.

Core Highlights

Reference-free trainer eliminates frozen model memory overhead. Length normalizer prevents response length exploitation. Margin enforcer maintains quality gaps between preferred and rejected pairs. Memory optimizer reduces GPU requirements for preference training.

How to Use It?

Basic Usage

import torch
from dataclasses import dataclass


@dataclass
class SimPOConfig:
    beta: float = 2.0     # scales the length-normalized log-probability gap
    gamma: float = 1.0    # target reward margin between chosen and rejected
    lr: float = 1e-6
    batch_size: int = 4


class SimPOTrainer:
    def __init__(self, model, config: SimPOConfig):
        self.model = model
        self.config = config
        self.optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr)

    def compute_loss(
        self,
        chosen_logps: torch.Tensor,
        rejected_logps: torch.Tensor,
        chosen_len: torch.Tensor,
        rejected_len: torch.Tensor,
    ) -> torch.Tensor:
        # Length-normalized average log probability per sequence.
        chosen_avg = chosen_logps / chosen_len
        rejected_avg = rejected_logps / rejected_len
        # SimPO objective: beta-scaled gap minus the gamma margin,
        # pushed through a log-sigmoid.
        logits = self.config.beta * (chosen_avg - rejected_avg) - self.config.gamma
        return -torch.nn.functional.logsigmoid(logits).mean()

    def train_step(self, batch) -> float:
        loss = self.compute_loss(
            batch['chosen_logps'],
            batch['rejected_logps'],
            batch['chosen_len'],
            batch['rejected_len'],
        )
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
        return loss.item()


# `model` is your pre-trained language model, loaded elsewhere.
config = SimPOConfig(beta=2.0, gamma=1.0)
trainer = SimPOTrainer(model, config)

Real-World Examples

import json
from pathlib import Path


class PreferenceDataset:
    """Loads preference pairs from a JSON list of {prompt, chosen, rejected} records."""

    def __init__(self, path: str):
        self.pairs = json.loads(Path(path).read_text())

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx: int) -> dict:
        pair = self.pairs[idx]
        return {
            'prompt': pair['prompt'],
            'chosen': pair['chosen'],
            'rejected': pair['rejected'],
        }

    def stats(self) -> dict:
        # Whitespace token counts, useful for spotting length imbalance
        # between chosen and rejected responses.
        chosen_lens = [len(p['chosen'].split()) for p in self.pairs]
        rejected_lens = [len(p['rejected'].split()) for p in self.pairs]
        return {
            'count': len(self.pairs),
            'avg_chosen_len': sum(chosen_lens) // len(chosen_lens),
            'avg_rejected_len': sum(rejected_lens) // len(rejected_lens),
        }


ds = PreferenceDataset('prefs.json')
print(ds.stats())
sample = ds[0]
print(f'Prompt: {sample["prompt"][:50]}')

Advanced Tips

Tune the gamma margin parameter to control how strongly the model distinguishes between preferred and rejected responses. Use gradient accumulation to achieve larger effective batch sizes on limited GPU memory. Monitor the average log probability gap between chosen and rejected responses during training to detect convergence.
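
As a minimal sketch of the second and third tips, the loop below builds on the SimPOTrainer from Basic Usage, accumulating gradients over several batches and tracking the average chosen-vs-rejected log-probability gap; `accum_steps` and the returned average gap are illustrative choices, not part of the skill's API.

import torch

def train_epoch(trainer, loader, accum_steps=8):
    trainer.optimizer.zero_grad()
    gaps = []
    for step, batch in enumerate(loader, start=1):
        loss = trainer.compute_loss(
            batch['chosen_logps'], batch['rejected_logps'],
            batch['chosen_len'], batch['rejected_len'])
        # Scale so gradients average over the accumulation window.
        (loss / accum_steps).backward()
        with torch.no_grad():
            gap = (batch['chosen_logps'] / batch['chosen_len']
                   - batch['rejected_logps'] / batch['rejected_len']).mean()
            gaps.append(gap.item())  # a flattening gap suggests convergence
        if step % accum_steps == 0:
            trainer.optimizer.step()
            trainer.optimizer.zero_grad()
    return sum(gaps) / len(gaps)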

When to Use It?

Use Cases

Align a fine-tuned language model with human preferences using collected feedback pairs. Train a chat model to produce helpful responses without a separate reward model. Optimize model outputs for quality while preventing length exploitation.

Related Topics

Preference optimization, RLHF, DPO, language model alignment, fine-tuning, reward modeling, and SimPO.

Important Notes

Requirements

PyTorch with GPU support for model training. Pre-trained language model as the starting checkpoint. Dataset of preference pairs with chosen and rejected responses for each prompt.
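
For the preference-pair dataset, a minimal sketch of the file layout the PreferenceDataset example above expects is shown below; the field names mirror that example, while the filename and the record contents are purely illustrative and your own pipeline may use a different schema.

import json
from pathlib import Path

# Illustrative prefs.json: a JSON list of objects with
# "prompt", "chosen", and "rejected" keys.
pairs = [
    {
        "prompt": "Summarize the benefits of length normalization.",
        "chosen": "Length normalization divides the sequence log probability by its token count, so longer responses are not favored by default.",
        "rejected": "It makes responses longer.",
    },
]
Path("prefs.json").write_text(json.dumps(pairs, indent=2))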

Usage Recommendations

Do: use length-normalized log probabilities, as SimPO specifies, rather than raw sequence totals (a sketch of this computation follows these recommendations). Start with the recommended beta and gamma values and tune them based on evaluation results. Validate alignment with held-out preference pairs after training.

Don't: train with very small datasets, since preference optimization needs sufficient examples for stable training. Don't ignore the length distribution in your preference data, as imbalanced lengths can bias the model. Don't skip evaluation against human preferences after training, since that is how alignment quality is verified.
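
The sketch below shows one way to obtain the length-normalized inputs the trainer expects, summing per-token log probabilities over the response span; it assumes a Hugging Face-style causal LM whose forward pass returns .logits, and the `response_mask` convention (1 for response tokens, 0 for prompt tokens) is an assumption to adapt to your tokenization.

import torch
import torch.nn.functional as F

def sequence_logps(model, input_ids, attention_mask, response_mask):
    """Return (sum of response-token log probs, response length) per sequence."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so position t predicts token t+1.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logps = logprobs.gather(-1, targets).squeeze(-1)
    mask = response_mask[:, 1:].float()
    return (token_logps * mask).sum(dim=-1), mask.sum(dim=-1)

Dividing the first return value by the second gives the length-normalized log probability used in compute_loss.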

Limitations

Alignment quality depends on the diversity and accuracy of preference pair annotations. SimPO may not capture complex preference patterns that require multi-turn interaction context. Hyperparameter sensitivity means that beta and gamma values need careful tuning for each model and dataset combination.