Stable Baselines3

Implementing reinforcement learning with Stable Baselines3 and integrating trained agents into AI workflows

This community skill covers reinforcement learning with the Stable Baselines3 library: algorithm selection, environment setup, policy training, hyperparameter tuning, and model evaluation for RL agent development.

What Is This?

Overview

Stable Baselines3 provides guidance on training reinforcement learning agents using the SB3 library built on PyTorch. It covers algorithm selection (matching PPO, SAC, or DQN to environment characteristics), environment setup (wrapping custom and Gymnasium environments with proper observation and action spaces), policy training (configuring neural network architectures and learning schedules), hyperparameter tuning (systematic search to optimize reward performance), and model evaluation (measuring agent performance with statistical rigor). The skill helps engineers build effective RL agents by providing tested implementations that avoid common pitfalls of custom algorithm development.

Who Should Use This

This skill serves ML engineers building RL agents for control tasks, researchers experimenting with RL algorithms, and robotics teams training policies for simulation and real-world deployment.

Why Use It?

Problems It Solves

Implementing RL algorithms from scratch introduces subtle bugs that cause training instability. Choosing the wrong algorithm for an environment type wastes training compute on methods that will not converge. Default hyperparameters rarely produce optimal performance for specific tasks. Evaluating agent performance without proper statistical methods leads to unreliable conclusions about whether one approach genuinely outperforms another.

Core Highlights

Algorithm selector matches PPO, SAC, or DQN to environment types. Environment wrapper configures observation and action spaces. Training manager runs learning with callbacks and checkpoints. Evaluation runner measures agent performance across episodes.

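As a rough sketch of the selection logic, the action space type is usually the first thing to check. The suggest_algorithm helper below is hypothetical and not part of the library; it only illustrates the heuristic.

import gymnasium as gym

def suggest_algorithm(env: gym.Env) -> str:
    # Hypothetical heuristic: pick an algorithm family from the action space type
    if isinstance(env.action_space, gym.spaces.Discrete):
        return 'DQN or PPO'  # discrete actions
    if isinstance(env.action_space, gym.spaces.Box):
        return 'SAC or PPO'  # continuous control
    return 'PPO'  # MultiDiscrete and other spaces

print(suggest_algorithm(gym.make('CartPole-v1')))  # DQN or PPO
print(suggest_algorithm(gym.make('Pendulum-v1')))  # SAC or PPO
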
How to Use It?

Basic Usage

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Run four CartPole environments in parallel for faster rollout collection
env = make_vec_env('CartPole-v1', n_envs=4)

model = PPO(
    'MlpPolicy', env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    verbose=1)

model.learn(total_timesteps=100000)

# Evaluate the trained policy over 20 episodes
mean_reward, std = evaluate_policy(model, env, n_eval_episodes=20)
print(f'Reward: {mean_reward:.1f} +/- {std:.1f}')

# Save to disk and reload later
model.save('ppo_cartpole')
loaded = PPO.load('ppo_cartpole')

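Once saved and reloaded, the policy can be queried directly. A minimal sketch that reuses the vectorized environment from above:

# Query the reloaded policy; VecEnv reset/step use the vectorized API
obs = env.reset()
for _ in range(100):
    action, _states = loaded.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
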
Real-World Examples

import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import BaseCallback, EvalCallback
from stable_baselines3.common.monitor import Monitor


class RewardLogger(BaseCallback):
    """Collects episode rewards reported by the Monitor wrapper."""

    def __init__(self):
        super().__init__()
        self.rewards = []

    def _on_step(self) -> bool:
        # Monitor adds an 'episode' entry to the info dict when an episode ends
        if self.locals.get('dones', [False])[0]:
            info = self.locals['infos'][0]
            ep_reward = info.get('episode', {}).get('r', 0)
            self.rewards.append(ep_reward)
        return True


env = Monitor(gym.make('Pendulum-v1'))
eval_env = Monitor(gym.make('Pendulum-v1'))

# Periodically evaluate on a separate environment and keep the best model
eval_cb = EvalCallback(
    eval_env,
    best_model_save_path='./best_model/',
    eval_freq=5000,
    n_eval_episodes=10)

logger = RewardLogger()

model = SAC(
    'MlpPolicy', env,
    learning_rate=3e-4,
    buffer_size=100000,
    batch_size=256)

model.learn(total_timesteps=50000, callback=[eval_cb, logger])

print(f'Episodes: {len(logger.rewards)}')
if logger.rewards:
    # Average reward over the last 10 completed episodes
    avg = sum(logger.rewards[-10:]) / min(len(logger.rewards), 10)
    print(f'Avg reward: {avg:.1f}')

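If evaluation performance improved during training, EvalCallback keeps the best-scoring checkpoint under the configured path (as best_model.zip), which can be restored afterwards:

# Restore the best checkpoint written by EvalCallback
best_model = SAC.load('./best_model/best_model.zip', env=env)
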
Advanced Tips

Use PPO for discrete and continuous action spaces as a reliable default algorithm. Use SAC for continuous control tasks where sample efficiency matters, such as robotic arm manipulation or locomotion. Normalize observations and rewards with VecNormalize wrappers to stabilize training across environments. Log training metrics using TensorBoard integration to identify reward plateaus and diagnose convergence issues early.

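A minimal sketch combining the normalization and logging tips; the paths and step counts are placeholders, and TensorBoard must be installed for the log directory to be written.

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

venv = make_vec_env('Pendulum-v1', n_envs=4)
# Normalize observations and rewards with running statistics
venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_obs=10.0)

# tensorboard_log enables monitoring with: tensorboard --logdir ./tb_logs/
model = PPO('MlpPolicy', venv, tensorboard_log='./tb_logs/', verbose=1)
model.learn(total_timesteps=200000)

# The normalization statistics must be saved alongside the model weights
model.save('ppo_pendulum')
venv.save('vec_normalize.pkl')
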
When to Use It?

Use Cases

Train a control agent for a robotics simulation using PPO with vectorized environments. Compare algorithm performance across multiple seeds with evaluation callbacks. Fine-tune hyperparameters for a custom environment using Optuna integration.

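A sketch of the Optuna tuning workflow mentioned above; the objective, search ranges, and trial budget are illustrative rather than prescribed by the skill.

import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space; adapt the ranges to the target environment
    lr = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical('n_steps', [512, 1024, 2048])
    env = make_vec_env('CartPole-v1', n_envs=4)
    model = PPO('MlpPolicy', env, learning_rate=lr, n_steps=n_steps, verbose=0)
    model.learn(total_timesteps=50000)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
    return mean_reward

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)
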
Related Topics

Reinforcement learning, PPO, SAC, DQN, Gymnasium, policy gradient, reward shaping, and agent evaluation.

Important Notes

Requirements

Python with stable-baselines3 and PyTorch installed for neural network policy training. A Gymnasium-compatible environment with properly defined observation and action spaces including bounds and data types. Sufficient compute resources with GPU acceleration recommended for complex high-dimensional environments and long training runs.

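For custom environments, the observation and action spaces need explicit bounds and dtypes. A minimal sketch of a hypothetical environment, validated with the built-in checker:

import gymnasium as gym
import numpy as np
from stable_baselines3.common.env_checker import check_env

class ToyEnv(gym.Env):
    # Hypothetical environment with explicit bounds and dtypes on both spaces
    def __init__(self):
        super().__init__()
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(3, dtype=np.float32), {}

    def step(self, action):
        obs = self.observation_space.sample()
        reward = float(-np.abs(action).sum())
        return obs, reward, False, False, {}  # obs, reward, terminated, truncated, info

check_env(ToyEnv())  # raises if spaces or return signatures are malformed
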
Usage Recommendations

Do: run multiple training seeds and report mean performance with standard deviation for reliable comparisons. Use vectorized environments to increase training throughput with parallel rollouts. Save checkpoints during training to recover from interruptions.

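A sketch of the multi-seed and checkpointing pattern; the save frequency, paths, and seed count are placeholders.

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

results = []
for seed in (0, 1, 2):
    env = make_vec_env('CartPole-v1', n_envs=4, seed=seed)
    # Write a checkpoint every 10k callback steps so interrupted runs can resume
    checkpoint_cb = CheckpointCallback(save_freq=10000, save_path='./checkpoints/', name_prefix=f'ppo_seed{seed}')
    model = PPO('MlpPolicy', env, seed=seed, verbose=0)
    model.learn(total_timesteps=100000, callback=checkpoint_cb)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    results.append(mean_reward)

print(f'{np.mean(results):.1f} +/- {np.std(results):.1f} across {len(results)} seeds')
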
Don't: compare algorithms trained with different numbers of timesteps, since this produces unfair evaluations. Don't use on-policy algorithms like PPO for tasks requiring high sample efficiency, since off-policy methods like SAC are more sample efficient. Don't skip reward normalization for environments with large reward ranges, since this causes training instability.

Limitations

RL training is highly sensitive to hyperparameters and may require extensive tuning for each new environment and reward structure. Sample efficiency remains a fundamental challenge for complex tasks requiring millions of environment interactions to converge. Sim-to-real transfer requires careful domain randomization and accurate environment modeling to bridge the reality gap.