SAELens
Seamlessly automate and integrate SAELens into your existing workflows
SAELens is a community skill for training and analyzing sparse autoencoders on language model activations, covering feature extraction, activation analysis, interpretability visualization, dictionary learning, and mechanistic interpretability research.
What Is This?
Overview
SAELens provides tools for training sparse autoencoders that decompose neural network activations into interpretable features. It covers feature extraction, which identifies monosemantic features in language model hidden states using learned dictionaries; activation analysis, which measures feature activation patterns across different input types and contexts; interpretability visualization, which displays feature activation maps and distribution plots; dictionary learning, which trains sparse autoencoders with configurable sparsity penalties and architecture choices; and mechanistic interpretability research, which connects discovered features to model behavior and circuits. The skill helps researchers understand language model internals.
Who Should Use This
This skill serves interpretability researchers studying language model representations, ML safety teams analyzing model behavior through feature decomposition, and AI researchers building tools for understanding neural network internals.
Why Use It?
Problems It Solves
Language model neurons activate for multiple unrelated concepts, making individual neuron analysis uninformative. Understanding which features a model uses for a specific task requires decomposition beyond raw activation values. Training sparse autoencoders from scratch requires collecting activations, configuring training loops, and evaluating reconstruction quality. Visualizing and interpreting learned features across thousands of dictionary elements needs specialized tooling.
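For orientation, here is a minimal sketch of the dictionary-learning objective in plain PyTorch. It illustrates the loss being optimized (reconstruction error plus an L1 sparsity penalty), not the SAELens trainer API, and every hyperparameter here is illustrative rather than recommended.
import torch
import torch.nn as nn

# Toy sparse autoencoder: a wide ReLU encoder, a linear decoder, and a loss
# combining reconstruction MSE with an L1 penalty on feature activations.
d_model, dict_size, l1_coef = 768, 768 * 8, 1e-3  # illustrative values
encoder = nn.Linear(d_model, dict_size)
decoder = nn.Linear(dict_size, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-4)

acts = torch.randn(4096, d_model)  # stand-in for cached model activations
for batch in acts.split(256):
    feats = torch.relu(encoder(batch))   # sparse feature activations
    recon = decoder(feats)               # reconstruction from the dictionary
    loss = (recon - batch).pow(2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()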
Core Highlights
SAE trainer fits sparse autoencoders on model activation datasets. Feature finder identifies monosemantic features in learned dictionaries. Activation analyzer measures feature patterns across inputs and layers. Visualization dashboard displays feature activation maps and statistics.
How to Use It?
Basic Usage
from sae_lens import SAE
from transformer_lens import HookedTransformer

# Load GPT-2 small and a pre-trained SAE for its layer-8 residual stream.
# 'gpt2-small-res-jb' is the standard SAELens release for these hooks.
model = HookedTransformer.from_pretrained('gpt2-small')
sae = SAE.from_pretrained(
    release='gpt2-small-res-jb',
    sae_id='blocks.8.hook_resid_pre',
)

# Cache activations for a prompt and encode them into sparse features.
text = 'The capital of France is Paris'
logits, cache = model.run_with_cache(text)
acts = cache['blocks.8.hook_resid_pre']
features = sae.encode(acts)
print(f'Active features: {(features > 0).sum().item()}')

# Inspect the strongest features at the final token position.
top_feats = features[0, -1].topk(10)
for idx, val in zip(top_feats.indices, top_feats.values):
    print(f'Feature {idx.item()}: {val.item():.3f}')

Real-World Examples
from sae_lens import SAE
import torch


class FeatureAnalyzer:
    """Summarizes SAE feature behavior on a batch of activations."""

    def __init__(self, sae: SAE):
        self.sae = sae

    def top_features(self, activations: torch.Tensor, k: int = 20) -> list[dict]:
        # Rank features by mean activation across the batch.
        encoded = self.sae.encode(activations)
        mean_acts = encoded.mean(dim=0)
        top = mean_acts.topk(k)
        return [
            {'feature': idx.item(), 'activation': val.item()}
            for idx, val in zip(top.indices, top.values)
        ]

    def feature_density(self, activations: torch.Tensor) -> float:
        # Fraction of feature activations that are nonzero (lower = sparser).
        encoded = self.sae.encode(activations)
        return (encoded > 0).float().mean().item()

    def reconstruct(self, activations: torch.Tensor) -> dict:
        # Round-trip through the SAE and report reconstruction error.
        encoded = self.sae.encode(activations)
        decoded = self.sae.decode(encoded)
        error = (activations - decoded).pow(2).mean().item()
        return {
            'mse': error,
            'sparsity': self.feature_density(activations),
        }
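A usage sketch, assuming the sae and cache objects from the Basic Usage example above are in scope (the slice taking one prompt's activations is illustrative):
analyzer = FeatureAnalyzer(sae)
prompt_acts = cache['blocks.8.hook_resid_pre'][0]  # [seq_len, d_model]
print(analyzer.top_features(prompt_acts, k=5))
print(f'Density: {analyzer.feature_density(prompt_acts):.4f}')
print(analyzer.reconstruct(prompt_acts))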
Advanced Tips
Compare feature activations across different prompts to identify features that correspond to specific semantic concepts. Use reconstruction error as a quality metric for evaluating how well the sparse autoencoder captures the original activation space. Train SAEs on multiple layers to study how features evolve through the model.
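As a sketch of the first tip, reusing the model and sae objects from the Basic Usage example (the prompts are illustrative):
# Compare mean feature activations across two prompts to surface
# features specific to one topic.
prompts = ['The stock market fell sharply today',
           'The cat chased the mouse across the garden']
means = []
for text in prompts:
    _, cache = model.run_with_cache(text)
    feats = sae.encode(cache['blocks.8.hook_resid_pre'])[0]  # [seq, d_sae]
    means.append(feats.mean(dim=0))
diff = (means[0] - means[1]).topk(5)
print('Features most specific to the first prompt:', diff.indices.tolist())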
When to Use It?
Use Cases
Identify interpretable features in GPT-2 activations that correspond to specific linguistic patterns. Measure feature sparsity and reconstruction quality for a trained sparse autoencoder. Analyze how specific features activate differently across prompts containing different topics.
Related Topics
SAELens, sparse autoencoders, mechanistic interpretability, language models, feature analysis, dictionary learning, and neural network interpretability.
Important Notes
Requirements
PyTorch and TransformerLens for model loading and activation extraction. Pre-trained SAE weights or training data for fitting new sparse autoencoders. GPU with sufficient memory for processing model activations.
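A typical environment setup, assuming the current PyPI package names (pin versions as needed):
pip install sae-lens transformer-lens torch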
Usage Recommendations
Do: evaluate reconstruction quality to ensure the SAE captures meaningful information from the activations; compare feature activations across diverse prompts to validate that features are genuinely monosemantic; and use pre-trained SAEs when available to save training time and resources.
Don't: assume all learned features are interpretable, since some dictionary elements may capture noise or polysemantic patterns; train SAEs with insufficient data, since small datasets produce unreliable feature dictionaries; or interpret feature activations without statistical validation across multiple examples.
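A sketch of the "evaluate reconstruction quality" recommendation above, using fraction of variance explained; the function name and structure are illustrative, and sae and activations are assumed to be as in the earlier examples:
import torch

def variance_explained(sae, activations: torch.Tensor) -> float:
    # 1 - residual variance / total variance; values near 1.0 indicate
    # the SAE preserves most of the information in the activations.
    recon = sae.decode(sae.encode(activations))
    resid = (activations - recon).pow(2).mean()
    total = (activations - activations.mean(dim=0)).pow(2).mean()
    return (1.0 - resid / total).item()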
Limitations
Sparse autoencoder quality depends on the choice of dictionary size and sparsity penalty, which requires experimentation. Feature interpretability is not guaranteed, and some features may not correspond to human-understandable concepts. Pre-trained SAEs are available for a limited set of models and layers.