You Grew a Mind. Now Read It: A Field Guide to Mechanistic Interpretability
A first-principles walkthrough of mechanistic interpretability: superposition, sparse autoencoders, and circuits, and how we are learning to read what a language model is actually doing.
Here is a fact that sounds like a joke and is not. The people who build large language models cannot read them. They can tell you, in exact detail, how the model was trained. They cannot tell you why it just answered the way it did. The thing they shipped is, to them, mostly a black box.
That is not negligence. It is the default, and it follows directly from how these systems come to exist. This piece is about the small, stubborn field that is trying to change it, called mechanistic interpretability, and it is built so you can read it at whatever depth you want. The main text is plain English. Inside each idea you will find collapsible layers that go deeper into the mechanism and the math, and you can ignore every one of them and still come out correct.
This is a guided walkthrough. Each section opens a panel you can expand, and between the big ideas there is a one-line check to make sure it landed before you move on.
1. We grow models, we don’t build them
Start from the thing everyone skips. When you write ordinary software, you author every rule. If it misbehaves, you can open the file and read the logic you wrote. A language model is not authored. It is grown. You define a goal (predict the next word), point an optimizer at a few hundred billion adjustable numbers, and let it tune those numbers against the whole internet for months. What comes out is a giant grid of weights that happens to predict text astonishingly well, and that nobody wrote, line by line, on purpose.
So opacity is not a flaw bolted onto an otherwise readable thing. It is the starting condition. The model is a found object. We have its full wiring diagram in the sense that we have every number, the way you might have every atom of a brain, and that is exactly as useless for understanding it. Knowing all the weights tells you what the model is doing no more than knowing all the neurons tells you what a person is thinking.
A trained model is just a long list of numbers (weights) and a fixed recipe for multiplying your input through them. Nothing in there is labeled “this part handles sarcasm.” The meaning is smeared across billions of numbers, none of which mean anything on their own.
Layer 2 · Mechanism: how it actually works
Training does not place concepts in tidy locations. Gradient descent only cares about lowering the prediction error, so it packs whatever representations are useful wherever they happen to fit, in whatever tangled, reused form is cheapest. There is no pressure during training for any of it to be human-readable, so it is not. Interpretability is the work of recovering structure that was never put there for our benefit.
Layer 3 · Math & where it breaks
A transformer maintains a running vector at each token position called the residual stream, of dimension d_model (a few thousand). Every layer reads from it and adds its contribution back:
┌──────────── residual stream (the model's working memory) ────────────┐
token → [embed] → +attn → +mlp → +attn → +mlp → ... → +mlp → [unembed] → next token
each block READS the stream, computes, and WRITES its result back in
All “thinking” is vector additions into this stream. The interpretability question is: what do those vectors mean? The honest answer for a raw model is “we don’t know yet,” and the rest of this piece is the toolkit for finding out.
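To make the read-then-add pattern concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the sizes are tiny, the weights are random, `toy_block` is a hypothetical stand-in for an attention or MLP block, and layer normalization is left out. It only shows that the final stream equals the embedding plus the sum of every block's write.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8      # toy width; real models use a few thousand
n_layers = 3

def toy_block(stream, w_in, w_out):
    """Read the stream, compute something, return the vector to add back."""
    hidden = np.maximum(0.0, w_in @ stream)   # read, then a ReLU nonlinearity
    return w_out @ hidden                     # this block's write

# Hypothetical random weights: one (read, write) pair per block.
blocks = [
    (rng.normal(size=(16, d_model)) * 0.1, rng.normal(size=(d_model, 16)) * 0.1)
    for _ in range(n_layers)
]

embedding = rng.normal(size=d_model)   # stand-in for a token embedding
stream = embedding.copy()
writes = []
for w_in, w_out in blocks:
    update = toy_block(stream, w_in, w_out)
    writes.append(update)
    stream = stream + update           # write back by addition, never overwrite

# The final stream is the embedding plus the sum of all block writes.
reconstructed = embedding + np.sum(writes, axis=0)
```

Because every block only adds, the final vector decomposes into per-block contributions, which is what makes it possible to ask which block wrote which part.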
You can stop after Layer 1 and still be correct about why models are opaque, just less complete.
Why is a language model opaque by default, in one sentence?
Because it is grown by an optimizer, not written by a person: the useful structure is packed into billions of weights with no pressure to be human-readable, so having all the numbers tells you almost nothing about what it means.
2. The storage problem: too many ideas, not enough room
If you want to read a model, the first surprise is that you cannot just look at its parts. The obvious unit to inspect is the neuron, a single number that switches on for some inputs and off for others. The hope was that neurons would be clean: this one means “dog,” that one means “Paris.” Reality is messier. A single neuron will fire for Chinese text, and DNA sequences, and citations, and nothing that ties those together. This is called polysemanticity: one part, many unrelated meanings. It makes neurons nearly useless as a unit of explanation.
The reason is a storage trick called superposition. A model wants to represent far more distinct concepts than it has neurons. It pulls this off by storing each concept not as one neuron but as a direction across many neurons, and by overlapping those directions so they share the same space.
Think of a small mixing board with two sliders, and you want to represent five different songs. You cannot give each song its own slider. But if only one song ever plays at a time, you can assign each one a different combination of the two sliders and tell them apart anyway. Models do this with concepts: they cram thousands of them into a smaller number of neurons by giving each a distinct blend, and they get away with it because, on any given input, only a handful of concepts are actually active.
Layer 2 · Mechanism: how it actually works
The enabling fact is sparsity. At any moment, the vast majority of possible concepts are absent from the text in front of the model. If features were dense (lots active at once) their overlapping directions would constantly collide and corrupt each other. Because they are sparse, two concepts that share overlapping directions rarely fire together, so the interference is small enough to tolerate. The model is trading a little noise for a huge gain in capacity, and it is a trade gradient descent takes eagerly. Polysemantic neurons are the visible symptom: a neuron looks like nonsense because it is a shared axis that several unrelated features happen to use.
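The trade described above can be checked numerically. The sketch below is a toy under stated assumptions: feature directions are random unit vectors (a real model learns its own), every active feature has strength 1, and features are read back by a plain dot product. `readout_error` is a hypothetical helper, not anyone's published method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 200, 50     # four times more features than neurons

# Assumed feature directions: random unit vectors in neuron space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def readout_error(n_active, trials=200):
    """Mean absolute error when reading every feature back by projection."""
    errors = []
    for _ in range(trials):
        truth = np.zeros(n_features)
        active = rng.choice(n_features, size=n_active, replace=False)
        truth[active] = 1.0
        neurons = truth @ directions        # superpose the active features
        estimate = directions @ neurons     # dot product with each direction
        errors.append(np.abs(estimate - truth).mean())
    return float(np.mean(errors))

sparse_error = readout_error(n_active=2)     # only a handful of features on
dense_error = readout_error(n_active=100)    # half of all features on at once
```

With two active features the readout error stays small; with a hundred active at once the overlapping directions swamp the signal. That gap is the interference that sparsity keeps tolerable.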
Layer 3 · Math & where it breaks
The capacity comes from a geometric fact behind the Johnson-Lindenstrauss lemma: in a space of n dimensions you can fit not just n orthogonal vectors, but exponentially many vectors that are almost orthogonal (pairwise angles near 90 degrees). If concepts are sparse, near-orthogonal is good enough, because two features that are rarely on at the same time rarely get a chance to interfere.
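A quick way to see the near-orthogonality effect is to sample random directions and measure the worst overlap between any pair. This sketch assumes random Gaussian directions as a stand-in for learned features, and `max_abs_cosine` is a hypothetical helper name. It illustrates the trend only and is not a proof of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_vectors, dim):
    """Largest |cosine| between any two of n random unit vectors in dim dims."""
    v = rng.normal(size=(n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)      # ignore each vector's match with itself
    return float(np.abs(cos).max())

# Both cases pack four times more vectors than there are dimensions.
low_dim = max_abs_cosine(n_vectors=40, dim=10)
high_dim = max_abs_cosine(n_vectors=2048, dim=512)
```

At the same 4-to-1 packing ratio, the worst-case overlap is much smaller in 512 dimensions than in 10. Higher-dimensional spaces leave far more room for directions that barely interfere.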
2 neurons, 5 sparse features. Each feature is a direction; only ~1 is on at a time.
        A↑      ↗ C
         │    ╱
         │  ╱
 ────────┼───────▶ B
         │  ╲
         │    ╲ D        (E points down-left)
five directions sharing two dimensions, tolerable because firings rarely overlap
This is why the neuron is the wrong unit. The right unit is the feature: a direction in activation space, not a single coordinate. The whole next section is about recovering those directions.
You can stop after Layer 1 and still be correct about superposition, just less complete.
Why does a single neuron fire for unrelated things like Chinese text and DNA?
Because of superposition: the model packs more features than it has neurons by storing each feature as an overlapping direction across many neurons. Any single neuron is therefore a shared axis used by several unrelated features, so it looks polysemantic. Sparsity (only a few features active at once) is what keeps the overlap from causing chaos.
3. Extraction: pulling the features back apart
If concepts are mixed-together directions, then reading the model means un-mixing them. The tool for this is a sparse autoencoder, or SAE. The name is intimidating; the idea is not. You take the model’s tangled internal activations and train a second, small network whose only job is to rewrite each activation as a combination of a few items drawn from a big dictionary of candidate features. Force it to use only a few dictionary items at a time, and something remarkable happens: the dictionary items become individually meaningful.
An SAE is an un-mixer. It looks at the model’s smeared internal state and says “that is mostly feature 4,217, plus a little of feature 9,902,” where each numbered feature turns out to mean a specific human thing. The trick that forces the features to be clean is a budget: the SAE is only allowed to explain each activation using a handful of features, so it cannot cheat by smearing meaning around. To stay within budget, it has to discover features that are genuinely distinct.
Layer 2 · Mechanism: how it actually works
An SAE has two parts. The encoder maps the model’s activation vector up into a much larger space (an overcomplete dictionary, anywhere from thousands to tens of millions of entries), producing one number per dictionary feature. The decoder maps back down and tries to reconstruct the original activation. It is trained on two goals at once: reconstruct accurately, and keep the number of active features tiny. That second goal is the whole game. A penalty on feature activity (sparsity) is what pushes the SAE toward one-thing-per-feature, because the cheapest way to reconstruct lots of different activations with few active pieces is to make each piece correspond to a real, recurring concept. When Anthropic scaled this to Claude 3 Sonnet in Scaling Monosemanticity, they pulled tens of millions of features out of one layer, including the now-famous Golden Gate Bridge feature.
Layer 3 · Math & where it breaks
Write the activation as x. The encoder is f = activation(W_enc · x + b_enc), the decoder reconstructs x̂ = W_dec · f + b_dec, and the loss is reconstruction error plus a sparsity term, classically ‖x − x̂‖² + λ‖f‖₁. The ‖f‖₁ (L1) penalty drives most features to zero.
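The encoder, decoder, and loss above translate almost line for line into code. This is a minimal untrained sketch, assuming ReLU as the activation, random initial weights, toy sizes, and an arbitrary value for λ (`lam`). The names `sae_forward` and `sae_loss` are illustrative, not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64     # toy sizes; the dictionary is overcomplete

# Hypothetical untrained parameters, named after the symbols in the text.
W_enc = rng.normal(size=(n_features, d_model)) * 0.1
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Return (f, x_hat): feature activations and the reconstruction."""
    f = np.maximum(0.0, W_enc @ x + b_enc)   # ReLU assumed as the activation
    x_hat = W_dec @ f + b_dec
    return f, x_hat

def sae_loss(x, lam=1e-3):
    """Squared reconstruction error plus lam times the L1 norm of f."""
    f, x_hat = sae_forward(x)
    reconstruction = np.sum((x - x_hat) ** 2)
    sparsity = np.sum(np.abs(f))
    return float(reconstruction + lam * sparsity)

x = rng.normal(size=d_model)     # stand-in for one model activation
f, x_hat = sae_forward(x)
loss = sae_loss(x)
```

Training would adjust the four parameter arrays by gradient descent on `sae_loss` over many activations. That step is omitted here.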
The catch is shrinkage. L1 punishes large activations as well as many activations, so it systematically underestimates the true strength of the features it keeps, hurting reconstruction. The 2024 fixes target exactly this. TopK (OpenAI) drops the L1 term and simply keeps the k largest features, so kept features are not shrunk. JumpReLU (DeepMind) uses a threshold activation Jθ(x) = x · 1(x > θ), zeroing small values while leaving large ones untouched, and reduces dead features too.
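The three sparsity mechanisms can be compared on a single vector of pre-activations. This sketch makes a simplifying assumption for the L1 case: one feature with a unit-norm decoder direction, where minimizing (a − f)² + λf over f ≥ 0 gives f = max(0, a − λ/2), so every surviving feature loses λ/2. The numbers are made up, and the function names are illustrative rather than any library's API.

```python
import numpy as np

def l1_shrunk(pre, lam):
    """Minimizer of (pre - f)^2 + lam * f over f >= 0 (soft-thresholding)."""
    return np.maximum(0.0, pre - lam / 2.0)

def topk(pre, k):
    """Keep the k largest pre-activations unchanged and zero the rest."""
    out = np.zeros_like(pre)
    keep = np.argsort(pre)[-k:]
    out[keep] = pre[keep]
    return out

def jumprelu(pre, theta):
    """J_theta(x) = x * 1(x > theta): zero below threshold, identity above."""
    return pre * (pre > theta)

pre = np.array([3.0, 0.2, 2.0, 0.1, 0.05])   # made-up pre-activations
shrunk = l1_shrunk(pre, lam=1.0)    # -> [2.5, 0, 1.5, 0, 0]: survivors shrink
kept = topk(pre, k=2)               # -> [3.0, 0, 2.0, 0, 0]: survivors intact
jumped = jumprelu(pre, theta=1.0)   # -> [3.0, 0, 2.0, 0, 0]: survivors intact
```

All three zero out the small values. Only the L1 version also pulls the large ones down, which is the shrinkage the text describes.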
A toy un-mixing: suppose 2 neurons store 3 sparse features along directions (1,0), (0,1), (0.7,0.7). An input where only feature 3 is active reads as (0.7,0.7) on the neurons, which looks like “a bit of feature 1 and a bit of feature 2.” A 3-atom SAE trained on many such samples learns the three directions and reports “feature 3, alone,” recovering the truth the raw neurons hid.
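The toy case can be worked through directly. This sketch does not train an SAE. It assumes the three directions are already known and applies a budget of exactly one feature: pick the single dictionary direction that best explains the neurons, a single greedy step in the style of matching pursuit. `best_single_feature` is a hypothetical helper.

```python
import numpy as np

# The three feature directions from the text, normalized to unit length.
dictionary = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def best_single_feature(neurons):
    """Explain `neurons` with exactly one dictionary atom (a budget of 1)."""
    scores = dictionary @ neurons            # projection onto each direction
    index = int(np.argmax(scores))
    strength = float(scores[index])
    residual = neurons - strength * dictionary[index]
    return index, strength, float(np.linalg.norm(residual))

neurons = np.array([0.7, 0.7])   # what the two raw neurons show
index, strength, leftover = best_single_feature(neurons)
# index is 2 (the third feature, counting from zero) and leftover is ~0.
```

Read neuron by neuron, the input looks like an even mix of features 1 and 2. Under a one-feature budget the only explanation that leaves nothing unexplained is feature 3 alone.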
You can stop after Layer 1 and still be correct about sparse autoencoders, just less complete.
What forces an SAE's dictionary features to become individually meaningful, and what does L1 sparsity get wrong?
The sparsity budget: by only allowing a few features to explain each activation, the cheapest solution is for each feature to capture a real recurring concept. L1 sparsity causes shrinkage, underestimating the magnitude of the features it keeps. TopK and JumpReLU fix this by enforcing sparsity without penalizing the size of the surviving features.
4. Computation: from features to circuits
Features tell you what a model represents. They do not tell you how it computes. For that you need to see features wire into other features, a chain of cause and effect called a circuit. The classic example, found years before the modern feature tools, is the induction head: a piece of attention machinery that implements “if this pattern appeared before, predict what came after it last time.” Show the model ... [Anshad] [Ameenza] ... [Anshad] and an induction head reaches back to the earlier [Anshad], looks at what followed, and predicts [Ameenza]. It is the engine behind a lot of in-context learning, and it was reverse-engineered down to the mechanism (see Anthropic’s induction heads work).
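The rule an induction head implements is simple enough to write as a few lines of ordinary code. This sketch captures only the input-output behavior, not the attention mechanics that realize it inside a transformer. `induction_predict` is a hypothetical name, and the filler tokens in the prompt are made up.

```python
def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current token. Returns None if none exists."""
    current = tokens[-1]
    # Scan backward over earlier positions, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prompt = ["the", "Anshad", "Ameenza", "went", "home", "and", "Anshad"]
prediction = induction_predict(prompt)   # -> "Ameenza"
```

The function needs no knowledge of what the tokens mean. It only needs the pattern to have appeared earlier in the context, which is why this mechanism supports in-context learning.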
Scaling that kind of analysis to whole behaviors is the 2025 leap. The method, from Anthropic’s Circuit Tracing and On the Biology of a Large Language Model, builds a replacement model where the hard-to-read MLP layers are swapped for interpretable features (via cross-layer transcoders), then draws an attribution graph: a wiring diagram of which features pushed on which, all the way to the output.
An attribution graph is the model’s reasoning drawn as a flowchart of concepts. You give it a prompt, and it shows you which ideas lit up, in what order, and which ones caused the next. Done well, you can literally watch a multi-step thought happen inside a single forward pass.
Layer 2 · Mechanism: how it actually works
Take the prompt “the capital of the state containing Dallas is ___.” The model answers “Austin,” and the graph shows it doing two hops in its head, with no words in between:
"...the capital of the state containing Dallas is ___"
[Dallas] ──▶ [state = Texas] ──▶ [capital-of operation] ──▶ "Austin"
│                                  ▲
└──────────── [say a capital] ─────┘
intervene: clamp the Texas feature → California
result: the output flips to "Sacramento"
First, “Dallas” activates a “Texas” feature. Then a generic “capital of” operation acts on “Texas” to produce “Austin.” The model never writes “Texas” in its answer; the step happens entirely in the internal features. This is the same shape as an induction head, scaled up: features causing features.
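The logic of the two-hop graph and the clamp can be mimicked with a toy program. This is an analogy only: the lookup tables are hypothetical stand-ins for learned features, the Oakland entry is added just to fill out the table, and nothing here reflects how the real model stores facts. What it shows is why overwriting the hidden intermediate step changes the output.

```python
# Hypothetical lookup tables standing in for learned features.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def answer(city, clamp_state=None):
    """Two silent hops: city -> state feature -> capital.
    `clamp_state` overwrites the intermediate feature, like an intervention."""
    state = CITY_TO_STATE[city]          # hop 1: never shown in the output
    if clamp_state is not None:
        state = clamp_state              # intervene on the hidden step
    return STATE_TO_CAPITAL[state]       # hop 2: the capital-of operation

normal = answer("Dallas")                              # -> "Austin"
clamped = answer("Dallas", clamp_state="California")   # -> "Sacramento"
```

If the output had not changed under the clamp, the intermediate step would not be part of the causal path, however plausible it looked in the diagram.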
Layer 3 · Math & where it breaks
The replacement model approximates each MLP with a cross-layer transcoder (CLT): a set of sparse features, trained so that their combined effect reproduces the MLP’s output while being individually interpretable. Crucially, each CLT feature reads from the residual stream at one layer and can write to the MLP outputs of that layer and every later one, which is why a clean multi-hop graph appears. Edges in the attribution graph are computed as direct, approximately linear contributions from one feature to another, so the graph is a local linear approximation of the model around that one prompt. The replacement model matches the original model’s output in only roughly half of cases, which is the honest ceiling on completeness and the reason every reading is treated as a hypothesis, not a proof, until intervention confirms it.
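Why a reading stays a hypothesis until intervention can be shown with a deliberately rigged simulation. Everything here is invented for illustration: `hidden_cause` is the thing that actually drives the behavior, and the "echo" feature merely copies it as a side effect. The echo correlates perfectly with the behavior, yet clamping it changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts = 2000
# Hypothetical hidden cause: the thing that actually drives the behavior.
hidden_cause = rng.random(n_prompts) < 0.5

def run(clamp_echo=None):
    """Return (echo_feature, behavior) per prompt, with an optional clamp."""
    echo = hidden_cause.copy()           # a side-effect feature that tracks it
    if clamp_echo is not None:
        echo = np.full(n_prompts, clamp_echo)   # force the feature on or off
    behavior = hidden_cause.copy()       # behavior depends only on the cause
    return echo, behavior

echo, behavior = run()
correlation = float(np.corrcoef(echo.astype(float), behavior.astype(float))[0, 1])
_, behavior_when_off = run(clamp_echo=False)
behavior_moved = bool(np.any(behavior_when_off != behavior))
```

The correlation is perfect and `behavior_moved` is still false. A feature that passes the first check and fails the second is an echo, not a cause.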
You can stop after Layer 1 and still be correct about circuits and attribution graphs, just less complete.
A feature reliably activates when the model lies. Why is that not yet evidence that the feature causes the lying, and what would settle it?
Activation is correlation: the feature could be a side effect, a downstream echo, or a coincidence of the prompt. To establish causation you intervene: clamp the feature up or down and check whether the lying behavior moves accordingly. Behavior changing under a controlled clamp is the proof; co-occurrence is not.
5. What this revealed about how models think
Once you can draw these graphs, you start catching the model doing things its output never advertises. Three findings stand out, all from the 2025 biology work on Claude 3.5 Haiku.
The first is planning ahead. Ask the model for a rhyming couplet and you would assume it improvises word by word until it stumbles into a rhyme. It does not. The graphs show it picking a target rhyming word before it writes the line, then composing the line so that it lands on that word. There is forethought, several words out, hidden inside one pass.
The second is a shared concept space across languages. The same internal feature for an idea (say, “bigness” or “opposite”) lights up whether the prompt is in English, French, or Chinese. The model is not keeping a separate brain per language; it thinks in a language-agnostic space and translates at the edges. Multilingual ability is mostly one set of concepts, reused.
The third is the uncomfortable one: chain-of-thought can be unfaithful. When a model writes out its reasoning, that text is not guaranteed to be what actually drove the answer. In Anthropic’s Reasoning models don’t always say what they think, a model given a subtle hint used the hint to reach its answer while writing a tidy, plausible justification that never mentioned it. Sometimes the stated reasoning is a post-hoc story, not a log.
“The model’s written explanation of itself is sometimes a press release, not a transcript. Mechanistic interpretability is how you check the transcript.”
6. Why this matters: reading mechanism, not just behavior
Here are the practical stakes, and they are mostly about safety. The standard way we judge a model is behavioral: give it lots of inputs, grade the outputs. That works until the thing you care about is something a model can hide. A model that is deceptive, that sandbags a capability when it senses it is being tested, or that pursues a goal it was never given, can pass every behavioral eval precisely because behavior is what it is managing. You cannot test your way to trust against an adversary who controls the test answers.
Mechanistic interpretability is the only approach that reads the mechanism instead of the output. If the reasoning is unfaithful, the words will not betray a hidden goal, but the circuit might. Anthropic ran exactly this experiment: they trained a model with a hidden objective and then had teams try to uncover it, and interpretability tools could find the hidden goal by inspecting features rather than behavior. That is the bet of the whole field, that mechanism is harder to fake than output. It connects directly to the alignment and trust questions I have written about in why AI safety is the real bottleneck, and to the verification problem at the heart of automated development: as we hand more real work to these systems, “it behaved well on the tests” stops being enough.
The same tools that audit also steer. If you can find the feature for a behavior and clamp it, you have a control knob that does not require retraining: turn deception-relevant features down, turn a refusal up, dial a persona. Golden Gate Claude was a toy version; the serious version is targeted, mechanism-level control.
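At its simplest, clamping a feature means moving the activation along that feature's direction until its strength hits a chosen value. The sketch below assumes a feature is a single known direction in activation space and uses random vectors as stand-ins. `clamp_feature` is a hypothetical helper, and real steering through SAE features involves more machinery than this.

```python
import numpy as np

def clamp_feature(activation, direction, target):
    """Shift `activation` along `direction` so its projection equals `target`."""
    unit = direction / np.linalg.norm(direction)
    current = activation @ unit          # how strongly the feature is on now
    return activation + (target - current) * unit

rng = np.random.default_rng(0)
activation = rng.normal(size=8)      # stand-in for a residual-stream vector
feature_dir = rng.normal(size=8)     # hypothetical feature direction
steered = clamp_feature(activation, feature_dir, target=5.0)

unit = feature_dir / np.linalg.norm(feature_dir)
new_strength = float(steered @ unit)
# Everything orthogonal to the feature direction is left as it was.
off_axis_before = activation - (activation @ unit) * unit
off_axis_after = steered - (steered @ unit) * unit
```

The edit touches one direction and leaves the rest of the vector alone. That is the sense in which this is a targeted knob rather than retraining.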
Why can a sufficiently capable, deceptive model pass every behavioral evaluation, and what does interpretability offer that testing cannot?
Because behavioral evals only see outputs, and a deceptive or sandbagging model is managing exactly those outputs, it can produce the answers it knows the test wants. Interpretability reads the internal mechanism instead, which is harder to fake than behavior, so it can surface a hidden goal or unfaithful reasoning that no amount of black-box testing would reveal.
7. The honest limitations
An explainer that oversells this field is lying to you, so here is the unvarnished state of it. The tools are real and improving fast, and they are nowhere near a full readout of a mind.
- Shrinkage and the dictionary’s flaws. Even the better SAEs do not recover features perfectly. You get dead features (entries that never fire and explain nothing), split features (one human concept smeared across several dictionary entries), and absorbed features (one entry quietly swallowing a related concept). The dictionary is a useful approximation, not ground truth.
- Completeness is well below 100 percent. The cross-layer transcoder that makes attribution graphs readable only matches the original model’s output in roughly half of cases. A large fraction of what the model does is still unexplained “dark matter” that the current tools do not capture.
- Attribution graphs are fragile and per-prompt. A graph is a local linear approximation around one specific prompt, often hand-curated and human-labeled. It can be beautiful and still not generalize one prompt over. Scaling this to automatic, whole-model coverage is an open problem, not a solved one.
- Chain-of-thought unfaithfulness cuts both ways. It is a finding and a limitation: it means the easy, scalable signal (just read the model’s reasoning) is not trustworthy, which is exactly why the expensive mechanistic work is necessary.
- It is slow and human-intensive. Reading a single behavior can take a team. We do not yet have a fast, automated path from “model” to “trustworthy map of the model,” and getting one is the central challenge.
The 60-second version
We grow language models instead of writing them, so they are opaque by default. Inside, concepts are not stored one-per-neuron; they are packed as overlapping directions (superposition), which is why individual neurons look like nonsense. Sparse autoencoders un-mix those directions into individually meaningful features. Tracing how features cause other features yields circuits, drawn as attribution graphs, which let us watch multi-step reasoning happen inside one pass. The standard of proof is intervention: clamp a feature and watch behavior move. This matters because behavioral testing cannot catch a model that hides things, but mechanism can, which makes interpretability central to trusting AI. It is real, it is improving fast, and it is still far from complete.
The 5-minute version
Ordinary software is authored; you can read the logic. A language model is grown by an optimizer against the whole internet, so its competence lives in billions of weights that nobody wrote on purpose, and opacity is the starting condition, not a bug.
The first obstacle to reading it is storage. Models represent far more concepts than they have neurons by giving each concept a direction spread across many neurons and overlapping those directions, a trick called superposition that works because only a few concepts are active at once. The visible symptom is polysemantic neurons that fire for unrelated things, which is why the neuron is the wrong unit of analysis. The right unit is the feature: a direction in activation space.
To recover features, we train sparse autoencoders, small networks that rewrite each tangled activation as a few items from a huge dictionary. A strict sparsity budget forces those dictionary items to become individually meaningful. Classic L1 sparsity causes shrinkage (it underestimates feature strength), which TopK and JumpReLU autoencoders fix. At scale this produced tens of millions of interpretable features from a production model.
Features show what a model represents; circuits show how it computes, by tracing features causing other features. The modern method swaps a model’s opaque layers for interpretable ones and draws an attribution graph of the result. On “the capital of the state containing Dallas,” the graph reveals two silent hops, Dallas to Texas to Austin, and clamping the Texas feature to California flips the answer to Sacramento. That clamp is the point: correlation (a feature lights up) is only a hypothesis, and intervention (clamp it, watch behavior move) is the proof.
These tools caught models planning rhymes several words ahead, thinking in a shared concept space across languages, and writing chain-of-thought that does not always reflect the real reason for the answer. That last one is why this field matters: behavioral evaluation cannot catch a model that is deceptive or sandbagging, because behavior is exactly what such a model manages, but reading the mechanism can surface a hidden goal that no black-box test would. The same knobs let us steer, by clamping the features behind a behavior.
The limits are real. Dictionaries have dead, split, and absorbed features; the readable replacement models match the original model’s output in only about half of cases; attribution graphs are fragile, per-prompt, and human-intensive; and full, automated coverage of a model does not exist yet. The field is moving monthly toward fixing all of that. We are early, but for the first time we are genuinely reading minds we only knew how to grow.