SAELens
Seamlessly automate and integrate SAELens into your existing workflows
SAELens is a community skill for training and analyzing sparse autoencoders on language model activations, covering feature extraction, activation analysis, interpretability visualization, dictionary learning, and mechanistic interpretability research.
What Is This?
Overview
SAELens provides tools for training sparse autoencoders that decompose neural network activations into interpretable features. It covers feature extraction, which identifies monosemantic features in language model hidden states using learned dictionaries; activation analysis, which measures feature activation patterns across different input types and contexts; interpretability visualization, which displays feature activation maps and distribution plots; dictionary learning, which trains sparse autoencoders with configurable sparsity penalties and architecture choices; and mechanistic interpretability research, which connects discovered features to model behavior and circuits. The skill helps researchers understand language model internals.
Who Should Use This
This skill serves interpretability researchers studying language model representations, ML safety teams analyzing model behavior through feature decomposition, and AI researchers building tools for understanding neural network internals.
Why Use It?
Problems It Solves
Language model neurons activate for multiple unrelated concepts, making individual neuron analysis uninformative. Understanding which features a model uses for a specific task requires decomposition beyond raw activation values. Training sparse autoencoders from scratch requires collecting activations, configuring training loops, and evaluating reconstruction quality. Visualizing and interpreting learned features across thousands of dictionary elements needs specialized tooling.
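For orientation, here is a minimal sketch of the dictionary-learning objective in plain PyTorch. It illustrates the loss being optimized (reconstruction error plus an L1 sparsity penalty), not the SAELens trainer API, and every hyperparameter here is illustrative rather than recommended.
import torch
import torch.nn as nn

# Toy sparse autoencoder: a wide ReLU encoder, a linear decoder, and a loss
# combining reconstruction MSE with an L1 penalty on feature activations.
d_model, dict_size, l1_coef = 768, 768 * 8, 1e-3  # illustrative values
encoder = nn.Linear(d_model, dict_size)
decoder = nn.Linear(dict_size, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-4)

acts = torch.randn(4096, d_model)  # stand-in for cached model activations
for batch in acts.split(256):
    feats = torch.relu(encoder(batch))   # sparse feature activations
    recon = decoder(feats)               # reconstruction from the dictionary
    loss = (recon - batch).pow(2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()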
Core Highlights
SAE trainer fits sparse autoencoders on model activation datasets. Feature finder identifies monosemantic features in learned dictionaries. Activation analyzer measures feature patterns across inputs and layers. Visualization dashboard displays feature activation maps and statistics.
How to Use It?
Basic Usage
from sae_lens import SAE
from transformer_lens import HookedTransformer

# Load GPT-2 small and a pre-trained SAE for its layer-8 residual stream.
# 'gpt2-small-res-jb' is the standard SAELens release for these hooks.
model = HookedTransformer.from_pretrained('gpt2-small')
sae = SAE.from_pretrained(
    release='gpt2-small-res-jb',
    sae_id='blocks.8.hook_resid_pre',
)

# Cache activations for a prompt and encode them into sparse features.
text = 'The capital of France is Paris'
logits, cache = model.run_with_cache(text)
acts = cache['blocks.8.hook_resid_pre']
features = sae.encode(acts)
print(f'Active features: {(features > 0).sum().item()}')

# Inspect the strongest features at the final token position.
top_feats = features[0, -1].topk(10)
for idx, val in zip(top_feats.indices, top_feats.values):
    print(f'Feature {idx.item()}: {val.item():.3f}')

Real-World Examples
from sae_lens import SAE
import torch


class FeatureAnalyzer:
    """Summarizes SAE feature behavior on a batch of activations."""

    def __init__(self, sae: SAE):
        self.sae = sae

    def top_features(self, activations: torch.Tensor, k: int = 20) -> list[dict]:
        # Rank features by mean activation across the batch.
        encoded = self.sae.encode(activations)
        mean_acts = encoded.mean(dim=0)
        top = mean_acts.topk(k)
        return [
            {'feature': idx.item(), 'activation': val.item()}
            for idx, val in zip(top.indices, top.values)
        ]

    def feature_density(self, activations: torch.Tensor) -> float:
        # Fraction of feature activations that are nonzero (lower = sparser).
        encoded = self.sae.encode(activations)
        return (encoded > 0).float().mean().item()

    def reconstruct(self, activations: torch.Tensor) -> dict:
        # Round-trip through the SAE and report reconstruction error.
        encoded = self.sae.encode(activations)
        decoded = self.sae.decode(encoded)
        error = (activations - decoded).pow(2).mean().item()
        return {
            'mse': error,
            'sparsity': self.feature_density(activations),
        }
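A usage sketch, assuming the sae and cache objects from the Basic Usage example above are in scope (the slice taking one prompt's activations is illustrative):
analyzer = FeatureAnalyzer(sae)
prompt_acts = cache['blocks.8.hook_resid_pre'][0]  # [seq_len, d_model]
print(analyzer.top_features(prompt_acts, k=5))
print(f'Density: {analyzer.feature_density(prompt_acts):.4f}')
print(analyzer.reconstruct(prompt_acts))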
Advanced Tips
Compare feature activations across different prompts to identify features that correspond to specific semantic concepts. Use reconstruction error as a quality metric for evaluating how well the sparse autoencoder captures the original activation space. Train SAEs on multiple layers to study how features evolve through the model.
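As a sketch of the first tip, reusing the model and sae objects from the Basic Usage example (the prompts are illustrative):
# Compare mean feature activations across two prompts to surface
# features specific to one topic.
prompts = ['The stock market fell sharply today',
           'The cat chased the mouse across the garden']
means = []
for text in prompts:
    _, cache = model.run_with_cache(text)
    feats = sae.encode(cache['blocks.8.hook_resid_pre'])[0]  # [seq, d_sae]
    means.append(feats.mean(dim=0))
diff = (means[0] - means[1]).topk(5)
print('Features most specific to the first prompt:', diff.indices.tolist())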
When to Use It?
Use Cases
Identify interpretable features in GPT-2 activations that correspond to specific linguistic patterns. Measure feature sparsity and reconstruction quality for a trained sparse autoencoder. Analyze how specific features activate differently across prompts containing different topics.
Related Topics
SAELens, sparse autoencoders, mechanistic interpretability, language models, feature analysis, dictionary learning, and neural network interpretability.
Important Notes
Requirements
PyTorch and TransformerLens for model loading and activation extraction. Pre-trained SAE weights or training data for fitting new sparse autoencoders. GPU with sufficient memory for processing model activations.
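A typical environment setup, assuming the current PyPI package names (pin versions as needed):
pip install sae-lens transformer-lens torch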
Usage Recommendations
Do: evaluate reconstruction quality to ensure the SAE captures meaningful information from the activations; compare feature activations across diverse prompts to validate that features are genuinely monosemantic; and use pre-trained SAEs when available to save training time and resources.
Don't: assume all learned features are interpretable, since some dictionary elements may capture noise or polysemantic patterns; train SAEs with insufficient data, since small datasets produce unreliable feature dictionaries; or interpret feature activations without statistical validation across multiple examples.
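A sketch of the "evaluate reconstruction quality" recommendation above, using fraction of variance explained; the function name and structure are illustrative, and sae and activations are assumed to be as in the earlier examples:
import torch

def variance_explained(sae, activations: torch.Tensor) -> float:
    # 1 - residual variance / total variance; values near 1.0 indicate
    # the SAE preserves most of the information in the activations.
    recon = sae.decode(sae.encode(activations))
    resid = (activations - recon).pow(2).mean()
    total = (activations - activations.mean(dim=0)).pow(2).mean()
    return (1.0 - resid / total).item()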
Limitations
Sparse autoencoder quality depends on the choice of dictionary size and sparsity penalty, which requires experimentation. Feature interpretability is not guaranteed, and some features may not correspond to human-understandable concepts. Pre-trained SAEs are available for a limited set of models and layers.