Mechanistic Interpretability — Opening the Black Box
We can train neural networks to extraordinary capability without understanding what they're computing. Mechanistic interpretability tries to reverse-engineer the algorithms encoded in weights. The early results are surprising and the field is young.
The Opacity Problem
A trained neural network is, in principle, a fully specified mathematical function. Every weight is a number; every computation is a matrix multiplication or an activation function. There is no mystery about what operations occur. The opacity is not at the level of operations but at the level of algorithms: what high-level procedure is the network actually implementing?
A human programmer writing code to perform a task can describe what the code is doing at multiple levels — the line level (this statement assigns this value), the function level (this function finds the minimum), the algorithm level (this sorts by insertion). A trained network has the equivalent of line-level description (this matrix multiply, this ReLU) but typically lacks the higher-level description. The algorithm it implements was not written by a programmer — it was found by gradient descent in a vast parameter space, and what it found is not labeled.
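To make the levels concrete, here is a hand-written sort (an illustrative Python sketch, not drawn from the interpretability literature) annotated at all three levels; a trained network arrives with only the first.

```python
# Algorithm level: this is insertion sort.
# Function level: returns the list sorted in ascending order.
def insertion_sort(xs):
    for i in range(1, len(xs)):
        key = xs[i]              # line level: copy the i-th element
        j = i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]    # line level: shift a larger element right
            j -= 1
        xs[j + 1] = key          # line level: insert into the opened slot
    return xs
```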
This matters for safety and reliability. A network that performs well on a benchmark may be exploiting statistical regularities in the benchmark rather than the underlying capability the benchmark is intended to measure. Without understanding what the network is doing, you can’t distinguish robust generalization from benchmark overfitting. And you can’t predict when the network will fail — at what inputs, in what ways, and for what reasons.
The Circuits Framework
The dominant research program in mechanistic interpretability is circuits, developed by Chris Olah and collaborators beginning around 2020, first at OpenAI and later at Anthropic. The core idea: neural networks implement algorithms in the form of circuits — subgraphs of neurons that work together to compute specific functions. The program attempts to find these circuits, understand what they compute, and verify the understanding against the network’s behavior.
The first clean result concerned curve detectors in convolutional networks (Olah et al., 2020): networks trained on image classification learn curve-detector units in early layers, and those detectors are assembled from a small number of simpler line and edge detectors in the layer below, through weights that implement a geometric computation. The circuit was identified, its computation was understood in closed form, and the understanding predicted where the circuit would activate and how it would respond to perturbations.
Work on transformers followed. Elhage et al. (2021) showed that two-layer attention-only transformers implement induction heads — attention heads that work together to implement in-context learning of repeated patterns. One head (the previous-token head) attends to the previous token and copies information into the current position. A second head (the induction head) uses that information to attend to positions that previously followed the same token, implementing the algorithm “what came after the last time I saw this token?” This is a simple form of in-context learning, implemented by a two-head circuit.
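The behavior is easy to state outside the network. A minimal sketch, in plain Python rather than attention weights, of the rule the two-head circuit approximates:

```python
def induction_prediction(tokens):
    """Sketch of the induction-head rule (the algorithm, not the attention
    implementation): find the most recent earlier occurrence of the current
    token and predict whatever followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over earlier positions
        if tokens[i] == current:
            return tokens[i + 1]               # what came after it last time
    return None                                # no earlier occurrence to copy from

# The pattern "A B ... A" is completed with "B".
assert induction_prediction(["A", "B", "C", "A"]) == "B"
```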
Induction heads were subsequently found in virtually every transformer model, suggesting they are a universal feature of the architecture rather than a model-specific accident. Their formation during training coincides with a phase transition in in-context learning ability — a result consistent with induction heads being the mechanism for in-context learning.
Superposition
A complication that has shaped mechanistic interpretability research: superposition. The hypothesis is that neural networks represent more features than they have neurons, by encoding multiple features in overlapping patterns across neurons.
The classical view of neural network representations is that each feature is encoded in one neuron or a small set of neurons — one neuron fires for cat images, another for dog images, another for edges. Such monosemantic features would allow clean circuit analysis. But probing studies of large networks consistently find polysemantic neurons — neurons that respond to semantically unrelated inputs. A single neuron in a language model might activate for strings related to “python the programming language,” “python the snake,” “Monty Python,” and certain legal terminology. The neuron is not computing a coherent single concept.
Elhage et al. (2022) showed that networks trade off between efficiency and interpretability: representing more features with a fixed number of neurons requires superposition — encoding features as directions in activation space that are not aligned with individual neurons, accepting controlled interference between features. The model learns encodings whose interference is low enough not to significantly degrade performance while representing far more features than it has neurons.
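A toy illustration of the idea (not the Elhage et al. experimental setup): pack many sparse features into fewer dimensions as random, nearly orthogonal directions and measure the interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 400, 64            # more features than dimensions

# Each feature is a unit direction in activation space, not a neuron.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

x = directions[0]                        # activate feature 0 with strength 1
readout = directions @ x                 # read out every feature by dot product

print(readout[0])                        # ~1.0: the active feature is recovered
interference = np.abs(readout[1:])       # leakage into the other features
print(interference.mean(), interference.max())   # small but nonzero
```

If features are active only rarely, this small interference costs the network little, which is why superposition buys capacity almost for free on sparse data.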
Superposition makes circuit analysis harder. If features aren’t neurons, the circuits aren’t paths through individual neurons — they’re paths through directions in activation space, which are harder to identify and harder to visualize.
Sparse Autoencoders
The current technical approach to resolving superposition is sparse autoencoders (SAEs). The idea: train an autoencoder with a wide, overcomplete hidden layer on the activations of a network layer, with a sparsity penalty that encourages each input to be reconstructed from a small number of dictionary elements. The dictionary elements — the learned features — should correspond to the underlying monosemantic features the network has superposed.
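A minimal sketch of this setup, assuming PyTorch and hypothetical sizes; real SAE training runs differ in details such as weight normalization, bias handling, and the exact sparsity penalty:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: ReLU encoder, linear decoder, L1 sparsity."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # weight columns are feature directions

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    recon_loss = (recon - acts).pow(2).mean()       # reconstruct the layer's activations
    sparsity = features.abs().mean()                # encourage few active features
    return recon_loss + l1_coeff * sparsity

# Hypothetical usage on cached activations from one layer of a model.
sae = SparseAutoencoder(d_model=512, d_dict=8192)
acts = torch.randn(32, 512)                         # stand-in for real cached activations
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
```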
Anthropic’s 2023-2024 work on SAEs applied to language models found millions of interpretable features — tokens, contexts, and concepts that activate specific dictionary elements. Features for specific people, places, programming constructs, emotional states, grammatical roles. The features are interpretable in the sense that you can often describe what a feature responds to by examining the inputs that activate it.
This is progress but not completion. Identifying what features a network represents is the first step toward understanding what algorithms it implements. The circuits analysis program requires connecting features to the computations that operate on them — how features from one layer interact with attention patterns and weight matrices to produce features in the next layer. That analysis is tractable for small models and specific circuits; it’s still far from a complete picture for large frontier models.
What Has Been Learned So Far
The main findings from mechanistic interpretability as of 2025:
Transformers implement recognizable algorithms in identifiable circuits. Induction heads for in-context pattern completion, previous-token heads for local tracking, name-mover heads that retrieve specific named entities, and others have been identified and characterized. The circuits implement legible computations, not inscrutable magic.
The algorithms implemented are often surprisingly simple compared to the tasks they enable. The in-context learning ability of large language models — which looks like generalization and flexible reasoning from the outside — is partly implemented by circuits that do something closer to sophisticated pattern matching. Whether this means in-context learning is “merely” pattern matching or whether sophisticated pattern matching implemented at sufficient scale constitutes reasoning is a live conceptual question.
Superposition is real and pervasive. Individual neurons are not the right unit of analysis; features encoded as directions in activation space are. This makes circuits harder to find but doesn’t make them nonexistent.
Universality — the observation that similar features and circuits appear across different model architectures and training regimes — suggests that mechanistic findings may generalize. If curve detectors and induction heads appear in every model, the circuits framework is pointing at something about what neural networks learn, not just about the specific models analyzed.
Why This Matters Beyond Science
The motivation for mechanistic interpretability is partly intellectual and partly safety-critical. If you’re deploying AI systems in high-stakes settings — medical diagnosis, legal reasoning, critical infrastructure — understanding what the system is actually computing matters. A system you can’t understand is a system you can’t reliably predict, and a system you can’t predict is a system you can’t fully trust.
More specifically, interpretability is necessary for verification. If a system claims to have performed a reasoning process — checked the math, verified the sources, followed the rules — you can’t currently verify that claim by inspecting the computation. The model outputs a plausible description of its reasoning, which may or may not correspond to what it actually computed. Mechanistic interpretability aims to close this gap: to make it possible to verify what computation was performed, not just what was reported.
The field is young enough that complete circuit analyses exist only for small models and specific circuits in larger ones. Whether the approach scales to frontier models with hundreds of billions of parameters is uncertain. The superposition problem may be tractable with SAEs or may require architectural changes that reduce superposition directly. The gap between what the field can do now and what safety applications require is large. The direction is clear; the distance is not.