It's asking if you can *auto* encode activations. The AV decodes activations to ...

jsmith45 · 2026-05-08T17:08:04 1778260084

I must be missing something, since I'm not really sure that follows. Initially, neither the AV nor the AR model knows anything about how activations map to explanations or how explanations map to activations.

As far as I can tell, the only reason the explanations even resemble human speech is that the AV and AR start from a trained language model. If we instead trained the same model architecture from scratch as AV and AR, they would eventually converge on some round-trip format for activations, but it would probably be completely unintelligible, resembling human speech only insofar as many of the tokenizer's tokens look like words or word fragments.

This whole process seems to rely on the fact that the AV's text output will still strongly favor sentences that seem to make sense, rather than ones that contradict learned facts, etc. So it will favor mapping activations to plausible-sounding text in ways where patterns can hold consistently across most of the training data. There is absolutely a risk that it will learn the wrong things for certain activation subpatterns, such as swapping two concepts, especially if none of the training data included a set of activation subpatterns that would help distinguish them the right way around.

psb217 · 2026-05-08T10:32:40 1778236360

It seems like they're doing RL to minimize the reconstruction error over the loop: activation -> encoder -> "verbal" description of activation -> decoder -> reconstructed activation. Depending on how aggressively they optimize the weights of the AV and AR, they could move well away from the initial base LLM and learn an arbitrary encoding scheme.

If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language since it inherits that from the base LLM, and it will produce descriptions aligned with the input to the base LLM that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context which produced them).
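To make the loop above concrete, here is a minimal toy sketch of the quantity being minimized. Everything in it is a stand-in I made up for illustration: the three-word vocabulary, `verbalize`, and `reconstruct` are hypothetical, and the real AV and AR would be fine-tuned language models rather than lookup tables.

```python
# Toy stand-ins only: the real AV and AR are language models, not lookup tables.
# The vocabulary and its numeric levels are hypothetical.
LEVELS = {"low": -1.0, "mid": 0.0, "high": 1.0}


def verbalize(activation):
    """Toy AV: describe each coordinate with the word whose level is nearest."""
    words = [min(LEVELS, key=lambda w: abs(LEVELS[w] - x)) for x in activation]
    return " ".join(words)


def reconstruct(description):
    """Toy AR: map each word of the description back to its level."""
    return [LEVELS[w] for w in description.split()]


def reconstruction_error(activation):
    """Mean squared error over the activation -> text -> activation loop."""
    recon = reconstruct(verbalize(activation))
    return sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
```

The only thing the training signal sees here is `reconstruction_error`; nothing in it checks whether the intermediate text is meaningful to a human.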

kraddypatties · 2026-05-08T16:41:05 1778258465

I believe that’s _part_ of the point (or at least a side-effect) of the KL divergence loss term they have on the AV. That and training stability.
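A rough sketch of what such a regularized objective could look like, assuming the KL term compares the AV's next-token distribution with the base model's. The direction of the KL, the weight `beta`, and the function names are my assumptions, not details taken from the work being discussed.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same token set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_loss(recon_error, av_probs, base_probs, beta=0.1):
    """Reconstruction error plus a penalty for drifting from the base model.

    beta is a hypothetical weight: larger values keep the AV's text closer
    to what the base LLM would have written.
    """
    return recon_error + beta * kl_divergence(av_probs, base_probs)
```

With `av_probs` equal to `base_probs` the penalty vanishes, so the term only costs something once the AV starts writing text the base model finds unlikely.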

rao-v · 2026-05-08T15:58:08 1778255888

Think of it another way: can I do this exact training process with an additional requirement that the activation decoder subtly shill for obscure 80s sodas?

I could, and I would not lose much reconstruction accuracy.

So any researcher or ambient biases in the model will impact the general thrust of the textual decodings (and not in ways that reflect the actual model’s process; thinking about X and doing X in a model are very different things).

So how do we tell that the “spirit” is reflective of the model’s thinking and not biased toward Jolt being better than Surge?
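The claim above that a skewed decoding costs little reconstruction accuracy can be illustrated with a toy example. This is a sketch under made-up assumptions: the vocabulary, the `SWAP` relabeling, and all function names are hypothetical, and real models would not be lookup tables.

```python
# Hypothetical toy vocabulary: each word stands for a numeric level.
LEVELS = {"low": -1.0, "mid": 0.0, "high": 1.0}

# A relabeling that changes what the text says without losing information.
SWAP = {"low": "high", "mid": "mid", "high": "low"}
UNSWAP = {v: k for k, v in SWAP.items()}


def verbalize(activation):
    """Honest toy AV: nearest-level word per coordinate."""
    return [min(LEVELS, key=lambda w: abs(LEVELS[w] - x)) for x in activation]


def reconstruct(words):
    """Honest toy AR: word back to level."""
    return [LEVELS[w] for w in words]


def verbalize_swapped(activation):
    """Skewed AV: same information, misleading surface words."""
    return [SWAP[w] for w in verbalize(activation)]


def reconstruct_swapped(words):
    """AR co-trained with the skewed AV: it undoes the relabeling."""
    return reconstruct([UNSWAP[w] for w in words])


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Both pairs reach exactly the same reconstruction error on any input, yet the two descriptions disagree about which coordinates are "high", so the reconstruction score alone cannot tell them apart.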

mike_hearn · 2026-05-08T17:57:05 1778263025

Where would such biases come from?

rao-v · 2026-05-08T19:14:14 1778267654

From what the three models involved understand to be the sort of just-so stories (cf. Kipling) that humans like to see.