Pyvene

Advanced PyVene automation and integration for intervening on internal model representations

Pyvene is a community skill for mechanistic interpretability of neural networks using the pyvene Python library, covering activation interventions, causal tracing, representation editing, circuit discovery, and interchange experiments for understanding model behavior.

What Is This?

Overview

Pyvene provides tools for understanding how neural networks process information by intervening on internal activations during forward passes. It covers activation interventions that replace, add to, or zero out hidden-state values at specific model layers; causal tracing that identifies which components contribute to specific model outputs by measuring intervention effects; representation editing that modifies internal representations to change model behavior in targeted ways; circuit discovery that maps the computational subgraphs responsible for specific capabilities; and interchange experiments that swap activations between different inputs to test causal hypotheses. Together, these let researchers probe a network's decision process causally rather than through input-output observation alone.

Who Should Use This

This skill serves interpretability researchers studying how language models encode and process information, ML scientists investigating failure modes through causal intervention analysis, and alignment researchers mapping computational circuits in transformer models.

Why Use It?

Problems It Solves

Understanding why neural networks produce specific outputs requires tools beyond input-output observation. Identifying which model components contribute to specific behaviors needs causal intervention rather than correlation analysis. Editing model behavior without retraining requires precise activation-level modifications. Mapping computational circuits through manual ablation studies is tedious without automation.

Core Highlights

Intervention engine modifies activations at specific layers during forward passes. Causal tracer measures component contributions to model outputs. Representation editor changes model behavior through targeted activation modifications. Circuit finder maps computational subgraphs for specific capabilities.
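
The core mechanic behind all four components can be sketched without pyvene at all: an interchange intervention is just capturing an activation from one forward pass and patching it into another. The toy model and forward hooks below are illustrative only, not pyvene API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a stack of transformer blocks.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

base = torch.randn(1, 4)
source = torch.randn(1, 4)

# 1. Capture the hidden activation produced by the source input.
captured = {}
hook = model[0].register_forward_hook(
    lambda m, i, o: captured.__setitem__('h', o.detach().clone()))
model(source)
hook.remove()

# 2. Re-run on the base input, swapping in the source activation
#    (a non-None value returned from a forward hook replaces the
#    layer's output).
hook = model[0].register_forward_hook(lambda m, i, o: captured['h'])
patched = model(base)
hook.remove()

clean = model(base)
# For random inputs the interchange almost surely changes the output.
print(torch.allclose(patched, clean))
```

Pyvene's intervention engine wraps this pattern behind a declarative config so the capture/patch bookkeeping, component naming, and multi-layer coordination are handled for you.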

How to Use It?

Basic Usage

import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Intervene on the residual-stream output of transformer block 6,
# replacing the base run's activations with those from a source input.
config = pv.IntervenableConfig(
    representations=[{
        'layer': 6,
        'component': 'block_output',
        'intervention_type': pv.VanillaIntervention,
    }])

intervenable = pv.IntervenableModel(config, model)

base_input = tokenizer('The cat sat on', return_tensors='pt')
source_input = tokenizer('The dog ran to', return_tensors='pt')

# Returns a (base_outputs, counterfactual_outputs) pair; the second
# element carries the outputs of the intervened forward pass.
outputs = intervenable(base_input, [source_input])

Real-World Examples

import pyvene as pv

class CausalTracer:
  def __init__(self, model, tokenizer):
    self.model = model
    self.tok = tokenizer
    self.n_layers = model.config.n_layer

  def trace_layer(self, base_text: str, source_text: str,
                  layer: int) -> float:
    # Swap the given layer's block output from the source run
    # into the base run.
    config = pv.IntervenableConfig(
        representations=[{
            'layer': layer,
            'component': 'block_output',
            'intervention_type': pv.VanillaIntervention,
        }])
    iv = pv.IntervenableModel(config, self.model)
    base = self.tok(base_text, return_tensors='pt')
    src = self.tok(source_text, return_tensors='pt')
    # The call returns (base_outputs, counterfactual_outputs);
    # the counterfactual run carries the intervened logits.
    _, counterfactual = iv(base, [src])
    logits = counterfactual.logits[0, -1]
    return float(logits.max())

  def full_trace(self, base_text: str,
                 source_text: str) -> list[float]:
    # Sweep the intervention over every layer to localize the effect.
    return [
        self.trace_layer(base_text, source_text, i)
        for i in range(self.n_layers)
    ]

Advanced Tips

Use interchange interventions to test causal hypotheses about which model components encode specific information by swapping activations between inputs that differ in one feature. Combine interventions across multiple layers and components to trace information flow through the entire network. Cache base model activations to speed up experiments that test many intervention configurations on the same inputs.
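
The caching tip can be sketched with plain forward hooks. The ActivationCache class below is a hypothetical helper, not part of pyvene's API: it records each named layer's output once per base input and serves the stored tensors on every subsequent lookup.

```python
import torch
import torch.nn as nn

class ActivationCache:
    """Hypothetical helper (not pyvene API): memoize per-layer
    activations so repeated experiments on the same base input
    skip redundant forward passes."""

    def __init__(self, model, layers):
        self.model = model
        self.layers = layers   # name -> module to record
        self._store = {}       # key -> {name: activation}

    def activations(self, key, x):
        if key not in self._store:
            acts = {}
            handles = [
                module.register_forward_hook(
                    lambda m, i, o, name=name: acts.__setitem__(name, o.detach()))
                for name, module in self.layers.items()
            ]
            self.model(x)      # one forward pass records every layer
            for h in handles:
                h.remove()
            self._store[key] = acts
        return self._store[key]

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
cache = ActivationCache(model, {'layer0': model[0]})
x = torch.randn(1, 4)
first = cache.activations('base-prompt', x)
second = cache.activations('base-prompt', x)  # served from cache
```

When testing many intervention configurations against the same base inputs, this turns N forward passes into one per unique input, which matters once the model is large.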

When to Use It?

Use Cases

Trace which transformer layers contribute most to factual recall by intervening on residual stream activations. Discover computational circuits responsible for specific linguistic capabilities in language models. Edit model behavior by replacing activations associated with incorrect outputs to fix specific failure cases.

Related Topics

Mechanistic interpretability, pyvene, causal tracing, activation intervention, circuit discovery, representation engineering, and AI safety.

Important Notes

Requirements

Pyvene Python package with PyTorch and HuggingFace transformers. GPU memory sufficient for the target model plus activation storage during interventions. HuggingFace model compatible with pyvene's intervention hooks.

Usage Recommendations

Do: start with single-layer interventions before building complex multi-component experiments. Use control experiments with random interventions to establish baselines for causal effect measurements. Verify intervention results across multiple input examples to confirm they generalize.
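
The control-experiment recommendation can be illustrated with a hook-patching sketch: compare the effect of the real interchange against a random vector of the same shape. The toy model and helper below are hypothetical, not pyvene API; in practice you would run both conditions through the same IntervenableModel setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer; illustrative only.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

def patched_output(x, replacement):
    """Run model(x) with layer 0's output replaced by `replacement`."""
    handle = model[0].register_forward_hook(lambda m, i, o: replacement)
    out = model(x)
    handle.remove()
    return out

base, source = torch.randn(1, 8), torch.randn(1, 8)
clean = model(base)
source_act = model[0](source)

# Effect size of the real interchange vs. a random control
# of matched shape.
real_effect = (patched_output(base, source_act) - clean).norm().item()
control_effect = (
    patched_output(base, torch.randn_like(source_act)) - clean
).norm().item()
```

A causal claim is only interesting when `real_effect` stands out against the distribution of `control_effect` over many random draws, not when it merely differs from zero.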

Don't: interpret intervention effects from a single example, since model behavior varies across inputs. Don't apply interventions to all layers simultaneously, since this makes it impossible to attribute effects to specific components. Don't assume intervention effects are additive when combining multiple component modifications.

Limitations

Intervention experiments on large models require significant GPU memory for storing activations across all layers. Causal tracing results can be sensitive to the choice of source and base inputs used in experiments. Not all model architectures are supported since pyvene relies on specific hook points in transformer implementations.