nnsight

nnsight automation and integration for neural network interpretability and inspection

nnsight is a community skill for interpreting and intervening on neural network internals using the nnsight Python library. It covers activation inspection, hidden state manipulation, intervention experiments, computational graph tracing, and mechanistic interpretability research for understanding deep learning models.

What Is This?

Overview

nnsight provides tools for inspecting and modifying the internal computations of neural networks during forward passes. It covers activation inspection, which captures intermediate layer outputs and attention patterns during inference; hidden state manipulation, which modifies internal representations at specified layers to study causal effects on behavior; intervention experiments, which apply targeted changes to specific components and measure downstream impact; computational graph tracing, which records the flow of information through model layers; and mechanistic interpretability, which identifies functional circuits within trained models. Together these capabilities let researchers understand how neural networks process information internally.

Who Should Use This

This skill serves interpretability researchers studying neural network mechanisms, ML scientists investigating model behavior through intervention experiments, and alignment researchers analyzing language model internals.

Why Use It?

Problems It Solves

Standard model APIs only expose input-output behavior without visibility into internal computations. Modifying model internals requires custom hooks that are tedious to write and maintain across different architectures. Intervention experiments need careful bookkeeping to track which components were modified and their downstream effects. Large language models have complex architectures where manual inspection code becomes unwieldy.
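
For comparison, capturing a single intermediate activation with raw PyTorch forward hooks looks roughly like the sketch below (a hypothetical GPT-2 example using standard transformers and PyTorch APIs); the registration and cleanup bookkeeping is exactly what such libraries abstract away.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
captured = {}

def grab_hidden(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are element 0
    captured['h5'] = output[0].detach()

handle = model.transformer.h[5].register_forward_hook(grab_hidden)
inputs = tokenizer('The capital of France', return_tensors='pt')
with torch.no_grad():
    model(**inputs)
handle.remove()  # forgetting this leaks the hook into later runs
print(captured['h5'].shape)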

Core Highlights

Activation tracer captures intermediate outputs from any model layer during inference. State editor modifies hidden representations at targeted positions and layers. Intervention runner executes controlled experiments with automatic effect measurement. Graph inspector traces information flow through the computational graph.

How to Use It?

Basic Usage

from nnsight import LanguageModel
import torch

model = LanguageModel('gpt2', device_map='auto')

with model.trace('The capital of France') as tracer:
    # Hidden states from block 5
    hidden_saved = model.transformer.h[5].output[0].save()

    # Attention output from block 3 (element 0 of the attn module's
    # output tuple is the attention output, not the attention weights)
    attn_saved = model.transformer.h[3].attn.output[0].save()

print(f'Hidden shape: {hidden_saved.shape}')
print(f'Attn shape: {attn_saved.shape}')
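
Building on the snippet above, a minimal intervention sketch under the same assumed GPT-2 layout (the choice of block and the zero-ablation are illustrative, not prescribed): zero a block's hidden states inside the trace and save the logits to observe the downstream effect.

with model.trace('The capital of France') as tracer:
    # Simple ablation: zero block 5's hidden states in place
    model.transformer.h[5].output[0][:] = 0
    ablated_logits = model.lm_head.output.save()

# Inspect how the next-token distribution shifts under ablation
probs = torch.softmax(ablated_logits[0, -1], dim=-1)
print(torch.topk(probs, 5).indices)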

Real-World Examples

from nnsight import LanguageModel
import torch

model = LanguageModel('gpt2', device_map='auto')

class InterventionExp:
    def __init__(self, model):
        self.model = model

    def patch_layer(self, prompt: str, source_prompt: str,
                    layer: int, pos: int) -> dict:
        # Capture the source activation at the given layer and position
        with self.model.trace(source_prompt) as src:
            src_hidden = (self.model.transformer.h[layer]
                          .output[0][:, pos, :].save())

        # Patch it into the target run and save the resulting logits
        with self.model.trace(prompt) as tgt:
            self.model.transformer.h[layer].output[0][:, pos, :] = src_hidden
            logits = self.model.lm_head.output.save()

        # Summarize the patched run's next-token distribution
        probs = torch.softmax(logits[0, -1], dim=-1)
        top = torch.topk(probs, 5)
        return {
            'top_probs': top.values.tolist(),
            'top_tokens': [self.model.tokenizer.decode(t)
                           for t in top.indices]}
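
A hypothetical invocation of the class (the prompts, layer, and position here are illustrative choices, not prescribed values):

exp = InterventionExp(model)
result = exp.patch_layer(
    prompt='The capital of France is',
    source_prompt='The capital of Italy is',
    layer=8,
    pos=-1)
print(result['top_tokens'])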

Advanced Tips

Use activation patching between clean and corrupted prompts to identify which layers and positions are causally responsible for specific model behaviors. Combine nnsight tracing with dimensionality reduction to visualize how representations evolve across layers. Cache saved activations across multiple traces to build datasets for probing-classifier experiments, as in the sketch below.
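
One way to realize the caching tip, sketched under the same GPT-2 assumptions as the earlier snippets (the filename is illustrative; on some nnsight versions saved proxies must be unwrapped via .value before use):

import torch
from nnsight import LanguageModel

model = LanguageModel('gpt2', device_map='auto')

# Save the final-token hidden state of every block in one trace
with model.trace('The capital of France is') as tracer:
    saves = [
        model.transformer.h[i].output[0][:, -1, :].save()
        for i in range(12)]  # GPT-2 small has 12 blocks

# Stack into (num_layers, batch, hidden) and persist for probing
acts = torch.stack(saves)
torch.save(acts, 'gpt2_last_token_acts.pt')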

When to Use It?

Use Cases

Inspect attention patterns across layers to understand how a language model processes factual queries. Run activation patching experiments to localize which components store specific knowledge. Trace information flow through a model to identify functional circuits that implement particular capabilities.

Related Topics

Mechanistic interpretability, activation patching, neural network inspection, language model internals, causal intervention, and AI safety research.

Important Notes

Requirements

nnsight Python package with PyTorch backend. Pre-trained model weights accessible through HuggingFace or local storage. GPU memory sufficient for the target model plus activation storage.
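
A typical setup, assuming the standard PyPI package names (pin versions as your environment requires):

pip install nnsight torch transformers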

Usage Recommendations

Do: start with small models like GPT-2 to develop intervention techniques before scaling to larger architectures. Save traced activations to disk for repeated analysis without rerunning inference. Validate intervention effects across multiple input examples to ensure findings generalize, as in the sketch below.
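
For instance, the validation advice could look like the following sketch, reusing the InterventionExp class from above (prompts, layer, and position are illustrative):

prompts = [
    'The capital of France is',
    'The capital of Germany is',
    'The capital of Japan is']

exp = InterventionExp(model)
results = [
    exp.patch_layer(p, 'The capital of Italy is', layer=8, pos=-1)
    for p in prompts]

# A robust finding should hold across prompts, not just one
for p, r in zip(prompts, results):
    print(p, '->', r['top_tokens'][0])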

Don't: draw conclusions from single-example interventions, since individual inputs may trigger atypical model behavior; modify multiple model components simultaneously when trying to isolate causal mechanisms; or assume layer numbering is consistent across model families, since architectures vary.

Limitations

Activation storage for large models consumes significant GPU and system memory, especially when saving across many layers. Intervention results on small models may not generalize to larger versions of the same architecture family. The tracing API depends on model implementation details that may change across library versions.