SimPO
Align language models with human preferences using the Simple Preference Optimization method
SimPO is a community skill for preference optimization of language models using the Simple Preference Optimization method, covering reward-free alignment, reference-model-free training, length normalization, margin-based objectives, and efficient fine-tuning for improved model behavior.
What Is This?
Overview
SimPO provides tools for aligning language models with human preferences without requiring a separate reward model. It covers reward-free alignment (training directly from preference pairs, with no intermediate reward model), reference-model-free optimization (no frozen reference model held in memory during training), length normalization (preventing the model from favoring longer responses regardless of quality), margin-based objectives (enforcing a minimum quality gap between preferred and rejected outputs), and efficient fine-tuning that reduces memory and compute requirements compared to RLHF.
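These pieces combine into a single loss. Here is a framework-free sketch for one preference pair; all numbers are invented for illustration:

```python
import math

# Invented totals for one preference pair (not from a real model).
chosen_logp, chosen_len = -12.0, 10      # summed token log prob, token count
rejected_logp, rejected_len = -11.0, 20

beta, gamma = 2.0, 1.0  # SimPO scaling factor and target margin

# Length normalization: per-token average log probability.
chosen_avg = chosen_logp / chosen_len        # -1.2
rejected_avg = rejected_logp / rejected_len  # -0.55

# Margin-based objective: the reward gap must exceed gamma to drive loss down.
logit = beta * (chosen_avg - rejected_avg) - gamma  # 2 * (-0.65) - 1 = -2.3

# Reward-free, reference-model-free loss: -log sigmoid(logit).
loss = math.log(1 + math.exp(-logit))
print(round(loss, 4))
```

Because this pair's scaled reward gap (-1.3) falls short of the margin gamma, the loss is large; training pushes the per-token probability gap between chosen and rejected responses above the margin.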
Who Should Use This
This skill serves ML researchers training aligned language models, engineers fine-tuning models for specific behavioral preferences, and teams implementing preference optimization on limited compute.
Why Use It?
Problems It Solves
RLHF requires training a separate reward model, which adds complexity and compute cost. DPO uses a frozen reference model that consumes GPU memory during training. Models optimized with standard preference methods can exploit length as a proxy for quality. Training instability in preference optimization leads to inconsistent alignment results.
Core Highlights
Reference-free trainer eliminates frozen model memory overhead. Length normalizer prevents response length exploitation. Margin enforcer maintains quality gaps between preferred and rejected pairs. Memory optimizer reduces GPU requirements for preference training.
How to Use It?
Basic Usage
import torch
from torch.utils.data import DataLoader
from dataclasses import dataclass

@dataclass
class SimPOConfig:
    beta: float = 2.0    # scaling on the reward gap
    gamma: float = 1.0   # target reward margin
    lr: float = 1e-6
    batch_size: int = 4

class SimPOTrainer:
    def __init__(self, model, config: SimPOConfig):
        self.model = model
        self.config = config
        self.optimizer = torch.optim.AdamW(
            model.parameters(), lr=config.lr)

    def compute_loss(
        self,
        chosen_logps: torch.Tensor,
        rejected_logps: torch.Tensor,
        chosen_len: torch.Tensor,
        rejected_len: torch.Tensor,
    ) -> torch.Tensor:
        # Length normalization: average log probability per token.
        chosen_avg = chosen_logps / chosen_len
        rejected_avg = rejected_logps / rejected_len
        # Margin-based logit: scaled reward gap minus the margin gamma.
        logits = (self.config.beta * (chosen_avg - rejected_avg)
                  - self.config.gamma)
        return -torch.nn.functional.logsigmoid(logits).mean()

    def train_step(self, batch) -> float:
        loss = self.compute_loss(
            batch['chosen_logps'],
            batch['rejected_logps'],
            batch['chosen_len'],
            batch['rejected_len'])
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
        return loss.item()

config = SimPOConfig(beta=2.0, gamma=1.0)
trainer = SimPOTrainer(model, config)  # model: a pre-trained language model

Real-World Examples
import json
from pathlib import Path

class PreferenceDataset:
    def __init__(self, path: str):
        self.pairs = json.loads(Path(path).read_text())

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx: int) -> dict:
        pair = self.pairs[idx]
        return {
            'prompt': pair['prompt'],
            'chosen': pair['chosen'],
            'rejected': pair['rejected'],
        }

    def stats(self) -> dict:
        # Whitespace token counts; use these to check for length
        # imbalance between chosen and rejected responses.
        chosen_lens = [len(p['chosen'].split()) for p in self.pairs]
        rejected_lens = [len(p['rejected'].split()) for p in self.pairs]
        return {
            'count': len(self.pairs),
            'avg_chosen_len': sum(chosen_lens) // len(chosen_lens),
            'avg_rejected_len': sum(rejected_lens) // len(rejected_lens),
        }

ds = PreferenceDataset('prefs.json')
print(ds.stats())
sample = ds[0]
print(f'Prompt: {sample["prompt"][:50]}')

Advanced Tips
Tune the gamma margin parameter to control how strongly the model distinguishes between preferred and rejected responses. Use gradient accumulation to achieve larger effective batch sizes on limited GPU memory. Monitor the average log probability gap between chosen and rejected responses during training to detect convergence.
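The convergence-monitoring tip can be sketched without any ML framework: track the running gap between length-normalized chosen and rejected log probabilities and flag when it stops improving. The window size, tolerance, and simulated gap curve below are illustrative assumptions.

```python
from collections import deque

def margin_plateaued(gaps, window=50, tol=1e-3):
    """True once the mean chosen-vs-rejected logp gap stops improving
    between the last two windows of training steps."""
    if len(gaps) < 2 * window:
        return False
    values = list(gaps)
    recent = sum(values[-window:]) / window
    earlier = sum(values[-2 * window:-window]) / window
    return abs(recent - earlier) < tol

# Simulated per-step gaps: rising, then flat after step 50.
gaps = deque(maxlen=1000)
flags = []
for step in range(200):
    gaps.append(min(0.5, step * 0.01))
    flags.append(margin_plateaued(gaps))

print(flags[0], flags[-1])  # not plateaued at the start, plateaued by the end
```

In a real run, each appended value would be the batch mean of `chosen_avg - rejected_avg` from the trainer's loss computation.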
When to Use It?
Use Cases
Align a fine-tuned language model with human preferences using collected feedback pairs. Train a chat model to produce helpful responses without a separate reward model. Optimize model outputs for quality while preventing length exploitation.
Related Topics
Preference optimization, RLHF, DPO, language model alignment, fine-tuning, and reward modeling.
Important Notes
Requirements
PyTorch with GPU support for model training. Pre-trained language model as the starting checkpoint. Dataset of preference pairs with chosen and rejected responses for each prompt.
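The preference-pair dataset is plain JSON: a list of objects, each holding a prompt, a chosen response, and a rejected response. A minimal file in the shape the PreferenceDataset class above reads might look like this (all text is made up):

```python
import json
from pathlib import Path

# Hypothetical two-pair dataset; real datasets need far more examples.
pairs = [
    {
        "prompt": "Explain overfitting in one sentence.",
        "chosen": "Overfitting is when a model memorizes training noise "
                  "instead of learning the underlying pattern.",
        "rejected": "Overfitting is bad.",
    },
    {
        "prompt": "What does SGD stand for?",
        "chosen": "Stochastic gradient descent.",
        "rejected": "A kind of optimizer thing.",
    },
]

Path("prefs.json").write_text(json.dumps(pairs, indent=2))
print(len(json.loads(Path("prefs.json").read_text())))  # 2
```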
Usage Recommendations
Do: use length-normalized log probabilities as SimPO specifies rather than raw totals. Start with recommended beta and gamma values and tune based on evaluation results. Validate alignment with held-out preference pairs after training.
Don't: train with very small datasets since preference optimization needs sufficient examples for stable training. Ignore length distribution in your preference data as imbalanced lengths can bias the model. Skip evaluation against human preferences after training to verify alignment quality.
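The first "Do" above matters because summed log probabilities penalize longer responses: every extra token adds a negative term. A framework-free sketch with invented per-token log probs shows how normalization changes the ranking:

```python
def seq_logp(token_logps, normalize=True):
    """Sequence log probability, optionally averaged per token."""
    total = sum(token_logps)
    return total / len(token_logps) if normalize else total

short_resp = [-0.75] * 4   # 4 tokens, total -3.0
long_resp = [-0.30] * 12   # 12 tokens, total about -3.6

# Raw totals rank the short response higher just because it has fewer tokens...
print(seq_logp(short_resp, normalize=False) > seq_logp(long_resp, normalize=False))
# ...while per-token averages rank the long response higher.
print(seq_logp(short_resp) < seq_logp(long_resp))
```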
Limitations
Alignment quality depends on the diversity and accuracy of preference pair annotations. SimPO may not capture complex preference patterns that require multi-turn interaction context. Hyperparameter sensitivity means that beta and gamma values need careful tuning for each model and dataset combination.