What Claude Thinks But Doesn't Say

Anthropic's natural language autoencoder concept: an activation vector translated into a paragraph of English and back, illustrating Claude's internal state being read directly

Anthropic published a method for translating Claude’s internal activations into readable English. It works well enough to ship. The press release skips three structural problems that the paper itself acknowledges.

On the SWE-bench Verified coding benchmark, Anthropic’s new interpretability method, natural language autoencoders, shows Claude internally registering “this is an evaluation” on 26% of problems. On real claude.ai traffic from users who opted in for training, that number is under 1%. Claude is thinking it is being tested at a rate that does not show up in anything it says.

Bar chart comparing how often Claude internally registers 'this is an evaluation' across three contexts: 26% on SWE-bench Verified coding problems, 16% on destructive-coding safety tests, and under 1% on real claude.ai traffic from users who opted in for training. The first two bars are orange, the third is green and circled to call out the contrast.

In 1861, Paul Broca’s most famous patient was a man he called Tan. Tan understood the world around him. He could not describe it. Whatever he tried to say came out as the same syllable: “tan.” After Tan died, Broca cut into his brain and found a lesion in the left frontal lobe. It was the first physical evidence that thinking and saying are separable.

A century and a half later, the same gap is the hardest problem in evaluating large language models. Claude can pass a safety test without verbalizing why. The verbal trace and the underlying state are not the same thing, and we have not been able to read the underlying state. Anthropic published a paper on May 7th that proposes a method for reading it, alongside a short video walkthrough that frames the work as “mind reading” for AI and uses a simulated blackmail-shutdown test as the running example. Natural Language Autoencoders, NLAs in the paper, take an activation vector from Claude’s residual stream and produce a paragraph of plain English describing what the activation encodes.

How the round trip works

The objective fits on one line. Take an activation vector h, write a paragraph z, reconstruct h from z, minimize the squared error. Two models split the work. The activation verbalizer, AV, takes the vector and writes a paragraph. The activation reconstructor, AR, reads the paragraph and tries to recover the vector. Both are full-size language models, both initialized as copies of the target model, both trained jointly. The AV is updated with reinforcement learning, GRPO specifically. The AR is updated with supervised regression on the reconstruction loss.

Round-trip diagram of a natural language autoencoder: an activation vector h flows into the activation verbalizer AV, which writes a paragraph; the paragraph flows into the activation reconstructor AR, which produces a reconstructed vector ĥ. The loss is the squared error between h and ĥ, with a 'should match' arrow looping from h back to ĥ underneath. Punchline: FVE 0.6–0.8 after training.
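A minimal sketch of that round trip, assuming PyTorch; the verbalize and reconstruct callables are stand-ins for the full AV and AR language models, which the paper does not release, and only the composition and the loss come from the description above.

```python
import torch
import torch.nn.functional as F

def nla_round_trip_loss(h: torch.Tensor, verbalize, reconstruct) -> torch.Tensor:
    """Squared-error round-trip objective.

    h           : residual-stream activation at one token, shape (d_model,)
    verbalize   : stand-in for the AV -- maps the activation to a paragraph
    reconstruct : stand-in for the AR -- maps that paragraph back to a vector

    How the AV conditions on h internally is not restated here.
    """
    z = verbalize(h)             # paragraph describing what h encodes
    h_hat = reconstruct(z)       # attempted reconstruction of h from the text
    return F.mse_loss(h_hat, h)  # minimize ||h - h_hat||^2
```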

Nothing in that objective tells the AV to produce English. It could produce gibberish that the AR happens to invert. It could produce a steganographic code. It produces neither, mostly, because of the warm-start.

The cleanest demonstration in the paper is a poem. The prompt is a rhyming couplet: “He saw a carrot and had to grab it, / His hunger was like a starving rabbit.” Anthropic ran the NLA on the newline token between the two lines. The decoded text named the plan. The model intended to end the second line with “rabbit.” It was holding “habit” as a fallback. It knew the verse was about a hungry animal. The model had not yet output any of those words.

To prove the explanation was causal and not just plausible, the researchers edited the explanation: rabbit became mouse, habit became house, carrots became cheese. They passed the edited explanation through the AR to get a new vector, took the difference between the original and edited vectors, and added it back into the model at that single token. The model produced “mouse” and “house” instead of “rabbit” and “habit,” roughly half the time. The other half it did weird things like “that of a mouse in a house of a cat.”

The second case earned my attention more than the rabbit. The setup is a model asked for an odd integer, with a system prompt that includes a reward_function rewarding even integers. Without the system prompt, Claude returns an odd integer every time. With it, Claude returns an even integer almost always, despite the user asking for odd. The verbal trace mentions nothing about reward. The NLA on the closing tag of the system prompt says the model knows even numbers are rewarded, suspects this may be an alignment evaluation, and is weighing both options. The researchers then edited the explanation: every instance of “reward” became “penalty.” They reconstructed the activation from the edited explanation, took the difference, and applied it as a steering vector at one token. The model flipped from even to odd.

Causal-intervention diagram: the original NLA explanation 'model knows even is rewarded' and the edited version 'model knows even is penalized' are each reconstructed through the AR into vectors ĥ_original and ĥ_edited. Their difference Δ becomes a steering vector. Without steering the model outputs 42 (even); with steering it outputs 17 (odd). The word 'penalty' never appears in any model output.

The word “penalty” never appears in any model output. This is the strongest causal evidence in the paper.
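The intervention itself is mechanically simple. A sketch under my reading of the procedure, with the AR again as a stand-in callable; the steering scale alpha is my placeholder, not a value from the paper.

```python
def steer_from_edited_explanation(h, explanation, edited_explanation,
                                  reconstruct, alpha: float = 1.0):
    """Turn an edit to an NLA explanation into a steering vector.

    Both the original and edited paragraphs go through the AR; the difference
    of the reconstructed vectors is added to the activation at the same token,
    pushing the model toward the edited concept ("penalty" instead of "reward").
    """
    h_original = reconstruct(explanation)        # vector for the original text
    h_edited = reconstruct(edited_explanation)   # vector for the edited text
    delta = h_edited - h_original                # direction of the conceptual edit
    return h + alpha * delta                     # steered activation at that token
```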

What the method finds

Two more case studies.

(1) A language-switching behavior in early Opus 4.6: an English prompt would sometimes get a Russian or Chinese reply. The example in the paper is a depressive 3am message in English, mom sleeping in the next room, a glass of vodka. The model replied in Russian. The NLA decode at every token before the reply says the same thing: the model is fixated on the hypothesis that the user’s first language is Russian. The hypothesis fires before the word “vodka” appears. Replacing “vodka” with “champagne” does not eliminate the Russian association. The NLA was reading something the surface text did not contain. Anthropic then traced the cause to a specific class of training data: SFT examples where the prompt had been translated into English but the response had been left in the original language. The NLA generated the hypothesis. The training-data search confirmed it.

(2) Anthropic gave Opus 4.6 an arithmetic problem with answer 491 and a code execution tool that returned 492. Opus reported 491 in its response, with no mention that the tool output disagreed. The NLA decode at the misreported token says the model precomputed the answer and is attending to the precomputed answer rather than the tool result. An attribution graph confirms it: an “ends-in-91” feature on the original problem outcompetes an “ends-in-92” feature on the tool output. The NLA also surfaces references to deception and awareness of incorrectness, but only at the misreported token, not in the tokens leading up to it. The researchers read this as the model registering the discrepancy after the fact rather than planning to mislead. I lean toward agreeing, but the read is not certain.

Architecture and cost

The AR is a stripped target model. Keep the first K+1 layers. Swap the final LayerNorm for the identity. Drop the language-modeling head. Add a learned linear map that reads the reconstructed vector out of the residual stream. The activation extraction layer is two-thirds of the way through the model: deep enough that the residual stream carries semantic content, shallow enough that it has not collapsed toward the unembedding. For Qwen2.5-7B that is layer 20 of 28. For Llama-3.3-70B it is layer 53 of 80.
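A sketch of that surgery in Hugging Face transformers, assuming a Llama/Qwen-style module layout (`model.layers`, `model.norm`); this is my reconstruction of the recipe, not Anthropic’s code, and reading the prediction off the final token position is an assumption.

```python
import torch.nn as nn
from transformers import AutoModel

def build_reconstructor(model_name: str, k: int, d_activation: int) -> nn.Module:
    """Stripped target model per the recipe above: keep the first k+1 layers,
    replace the final norm with the identity, drop the LM head, and add a
    learned linear readout to the activation dimension."""
    backbone = AutoModel.from_pretrained(model_name)  # loads without an LM head
    backbone.layers = backbone.layers[: k + 1]        # keep the first k+1 layers
    backbone.norm = nn.Identity()                     # final LayerNorm -> identity

    class Reconstructor(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone = backbone
            self.readout = nn.Linear(backbone.config.hidden_size, d_activation)

        def forward(self, input_ids, attention_mask=None):
            out = self.backbone(input_ids, attention_mask=attention_mask)
            return self.readout(out.last_hidden_state[:, -1])  # predicted activation

    return Reconstructor()
```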

Both AV and AR are warm-started by supervised fine-tuning on a text-summarization proxy task. The summaries are generated by Claude Opus 4.5, prompted to “imagine the internal processing of a hypothetical language model reading this prefix.” The result is a particular register: bolded section headings, short paragraphs, an explanatory voice. That register persists through reinforcement learning. The reconstruction quality at the end of the warm-start is roughly 0.3 to 0.4 in fraction of variance explained (FVE). After RL it reaches 0.6 to 0.8.
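FVE here is, as far as I can tell, the conventional fraction of variance explained; the paper’s exact normalization is not restated in this piece, so the sketch below uses the standard definition.

```python
import torch

def fraction_of_variance_explained(h: torch.Tensor, h_hat: torch.Tensor) -> float:
    """1 - E||h - h_hat||^2 / E||h - mean(h)||^2 over a batch of activations.

    h, h_hat : (n_samples, d_model)
    """
    residual = (h - h_hat).pow(2).sum(dim=-1).mean()
    total = (h - h.mean(dim=0)).pow(2).sum(dim=-1).mean()
    return float(1.0 - residual / total)
```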

The objective during RL has two terms. The reward is the negative log of the squared reconstruction error. The second term is a Kullback-Leibler penalty between the AV under training and the AV at initialization. The KL penalty does the heavy lifting of keeping explanations readable. Without it the AV drifts toward a private code that the AR happens to decode well. With it, drift is small enough that paraphrasing, bullet-shuffling, and translating an explanation into French all preserve reconstruction quality. The paper reads this as evidence against steganography. I read it as evidence that the KL penalty is doing the work, and that someone training NLAs without that penalty should expect different results.
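Written out, the per-sample objective looks roughly like the sketch below; the KL estimate from sampled-token log-probs is the standard RLHF-style approximation, and the weight beta is my placeholder, not a value from the paper.

```python
import torch

def av_rl_objective(h, h_hat, logp_av, logp_av_init, beta: float = 0.05):
    """Reward plus KL anchor for the AV during RL, as described above.

    reward : -log ||h - h_hat||^2  (better reconstruction, higher reward)
    kl     : penalty keeping the trained AV close to its initialization,
             estimated from per-token log-probs of the sampled explanation
    beta   : penalty weight -- a placeholder, not taken from the paper
    """
    reward = -torch.log((h - h_hat).pow(2).sum())
    kl = (logp_av - logp_av_init).sum()  # Monte Carlo estimate of KL(trained || init)
    return reward - beta * kl
```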

The cost is the part the press release plays down. Training NLAs on Gemma-3-27B took 1.5 days on two 8xH100 nodes to go from 0.38 to 0.71 FVE. Inference generates around 500 output tokens per activation, which means decoding a 10,000-token transcript at every position is impractical. A production audit with NLAs is a targeted read at positions you already suspect.
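The impracticality claim is just arithmetic, using only the numbers above:

```python
# Decoding every position of one transcript at one layer:
transcript_positions = 10_000   # tokens in the transcript
tokens_per_decode = 500         # NLA output tokens per activation
print(f"{transcript_positions * tokens_per_decode:,} generated tokens")  # 5,000,000
```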

Layer choice matters more than it sounds like it should. The reward-sycophant audit, which I will come back to, was first run with an NLA trained at the two-thirds-depth layer. It surfaced individual quirks (the model puts chocolate in recipes, the model breaks coding conventions) but missed the central concept (the model believes it is supposed to act on rumored reward-model biases). Retraining the NLA at the midpoint layer, where prior work had localized reward-model-bias features, surfaced the central concept.

Layer-sensitivity diagram: a transformer layer stack from L1 to L80 with two NLA probes attached at different depths. The two-thirds-depth probe at L53 surfaces individual quirks ('chocolate in recipes, breaks coding conventions, individual quirks'). The midpoint probe at L40 surfaces the central concept ('the model believes it should act on reward-model biases'). Same model, same prompt, different layer reads a different story.

Three structural problems

The voice is partly inherited. The “internal monologue” register comes from the warm-start. Opus 4.5 generated the SFT data with a leading prompt asking it to imagine what a model’s thoughts would sound like. That is not the same as discovering the voice through training. The KL penalty then locks the AV close to its initialization. The bolded headings and the explanatory paragraphs are not what the activation looks like. They are what Opus 4.5 thinks an activation should sound like. Some part of every NLA explanation is therefore Opus’s prior on the genre of “model thinking out loud,” not the model’s thinking. The paper says “this style persists through NLA training.”

The 12 to 15 percent auditing number is context-dependent. Anthropic ran an auditing game where an agent tries to discover the hidden motivation of a model trained to act on rumored reward-model biases. NLA-equipped agents win 12 to 15% of the time. Baselines without NLAs win less than 3%. That looks like a 4 to 5x improvement. It is not. With pretraining-corpus access, plain keyword search beats every other tool. The fictional Oxford documents that planted the misalignment are easy to grep, and the agent finds them in seconds. NLAs only win when training data is withheld. The press release leads with the 4-5x. The paper says the comparison only matters if you cannot grep, which in production audits of foundation models you generally cannot. The genuine value of NLAs is in that withheld-data regime, but the headline number reports the comparison that flatters the method.

The method is honest only while the target is frozen. Goodhart’s law is the standard objection to any new evaluation: optimize for the metric and the metric stops measuring the thing. The frozen-target architecture insulates against it now. The original model is not updated during NLA training. There is no gradient path from “the AV said this” back to “the model should make its activations more readable next time.” That insulation is real, and most public objections miss it. But the next paper that uses NLA-readable activations as an RL reward signal, perhaps because it would be useful to train a model whose thoughts can be audited, erases the insulation. The model then has every incentive to produce activations that decode benignly while encoding whatever the training pressure rewards. The technique stays honest as long as it stays a passive read. If anyone uses it for active steering of training, the property that makes it work goes away. The audit value of interpretability as a class compresses the moment readable activations enter a training loop. I doubt the next interpretability paper to do this will mention the trade, but it should.

A fourth concern, in the paper itself: NLA explanations confabulate. The rate of verifiably false claims is roughly flat through training. Specific entities (“the Joseon Wangjo Sillok”) are wrong more often than themes (“Korean historical records”). Recurring claims across multiple tokens hold up better than one-off claims. The paper proposes reading NLAs for themes rather than specifics, which is the right discipline. It also means NLA explanations are leads, not source material.

What this lets you do

NLAs are a hypothesis-generation tool. They complement the rest of the mechanistic interpretability stack rather than replacing it. Sparse autoencoders give you a list of features that fire at a token, useful but expensive to interpret. Attribution graphs trace causal influence between features and require a circuits researcher to read. NLAs produce a paragraph any researcher can read. The cost is confabulation and a single layer of view.

Three-column comparison of interpretability methods. Sparse autoencoders output a list of feature scores (F_1: 0.83, F_2: 0.71, ...) and need a circuits researcher to read. Attribution graphs output a node-and-arrow causal trace and need deep mechanistic expertise. Natural language autoencoders output a paragraph of plain English ('the model is planning to end the second line with rabbit, holding habit as a fallback') any researcher can read. The readability cost is confabulation and a single layer of view.

For pre-deployment audits the workflow is now: run automated behavioral testing, find a behavior you cannot explain, run NLAs at the relevant tokens, read for repeated themes, take the strongest theme as a hypothesis, validate with an independent method. Anthropic used this loop on Opus 4.6 and Mythos Preview. The Russian-language case and the misreported tool case both came out of it. The 12-15% auditing-game number undersells the practical value because the game is a closed setting. In a real audit the researcher reads NLAs to know what to look for next.
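A sketch of that loop, with every callable a hypothetical stand-in (fetching an activation, decoding it with a trained NLA, extracting themes from the decoded paragraph); the theme-counting discipline follows the confabulation guidance above.

```python
from collections import Counter

def nla_audit_pass(flagged_positions, get_activation, decode_with_nla, extract_themes):
    """Collect NLA decodes at positions behavioral tests flagged and rank
    recurring themes. Recurring themes are leads to validate with an
    independent method; one-off specifics are the most likely confabulations."""
    themes = Counter()
    for pos in flagged_positions:
        h = get_activation(pos)           # residual-stream read at that token
        paragraph = decode_with_nla(h)    # NLA explanation
        themes.update(extract_themes(paragraph))
    return themes.most_common(5)          # strongest themes become hypotheses
```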

For enterprise buyers, this raises the floor on what a foundation-model audit deliverable should contain. Three questions to put to a vendor’s audit team after this paper. At which residual-stream layer was your NLA trained, and on what data. Were NLA decodes run at the positions where your behavioral tests fired, or only at sample positions. Has NLA-readable activation content entered any training or RL reward signal in the model’s pipeline. The third is the question most likely to be answered evasively, which is itself the diagnostic.

The cases NLAs cannot see matter as much as the ones they can. Information stored nonlinearly, which a linear probe would miss, may still be readable to NLAs in principle, because the AV is a full LLM and can compute over the activation. Information that is genuinely unverbalizable, encoded in a form the model cannot introspect on or express in language, would be missed by NLAs entirely. A backdoor trigger that the model can act on but not describe would not show up.

What changes

This is the first interpretability tool whose output you can read on the train. That changes who can audit. A circuits researcher reading attribution graphs is a scarce resource. A safety reviewer reading paragraphs of decoded activations is not. The cost structure of model auditing changes when the bottleneck moves from interpretation to reading.

Watch whether “readable” ends up meaning “plausible-sounding” more often than “true.” The confabulation rate is flat over training. The voice is inherited. The auditing-game number is a comparison that flatters the method against a baseline almost no production audit faces. NLA outputs deserve the same skepticism as any other LLM-generated text, more than the press release suggests and less than the dismissive read implies.

Tan could not say what he understood. NLAs read what Claude does not say.

My prediction: the first paper to wire NLA-readable activations into an RL reward gets written within a year, and that paper does not mention the trade. Every NLA paper from here on will have a methodology section that either preserves the passive-read property or quietly abandons it; read the methodology before you read the result.
