> Can a game state be encoded as a set of weights?
It's not in the weights, because the weights don't change as the game is played.
> These can be in the MLP part after the LLM. Sure
I'm not even sure what this means. The MLPs are not used by the network at all.
> I don't find it completely surprising that you could get some probabilistic map of how tokens interact (game pieces) and what the next likely token is just from training it as an LLM.
You might not, but the idea that they are just outputting based on sequences, without having an internal model of the world, is a common one. This experiment was a test to get more information on that question.
> After all tokens are just placeholders and the relationships between them are encoded in the text.
Sorry, by weights I really meant the pattern of activations... I should have made that clearer. But the weights are trained by the game transcripts to produce activation patterns that could represent the board state. Or it could be local position patterns learnt during training: a positional representation (attention) of the N-1 tokens in the autoregressive task. Did they look at the attention patterns? Anyway, there is a recent PhD thesis from Stanford that looked at CNNs on SAT in a similar way and presented some evidence that the activation patterns can be decoded to determine the satisfying solution.
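For what it's worth, pulling the attention patterns out is not hard if you have the checkpoint. A rough sketch of what I mean, assuming a standard decoder-only architecture; the module names (`othello_gpt.blocks`, `.attn`) and the idea that the attention module returns its weights are guesses on my part, not their actual code:

```python
import torch

attn_patterns = {}

def save_attention(layer_idx):
    def hook(module, inputs, output):
        # Assumes the attention module returns (output, attn_weights),
        # with attn_weights shaped (batch, heads, seq_len, seq_len).
        attn_patterns[layer_idx] = output[1].detach()
    return hook

# Attach a standard PyTorch forward hook to each block's attention module.
handles = [
    block.attn.register_forward_hook(save_attention(i))
    for i, block in enumerate(othello_gpt.blocks)
]

with torch.no_grad():
    othello_gpt(move_tokens)  # move_tokens: a tokenized game prefix

for h in handles:
    h.remove()

# attn_patterns[layer] now holds, per head, how much each of the N-1
# previous move tokens is attended to when predicting the next move.
```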
> But the weights are trained by the game transcripts to produce activation patterns that could represent the board state
A slight phrasing thing here, just to be clear: the model is not explicitly trained to produce a representation of the board state. It is never given [moves] = [board state], and it is not trained to predict the board state by being passed something like [state] + move. The only things trained on that are the probes, and probe training happens after OthelloGPT's training and does not affect what the model does.
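To make the separation concrete, probe training looks roughly like this. This is only a sketch: `othello_gpt`, `get_residual_stream` and `probe_dataloader` are made-up placeholders, not the paper's actual code.

```python
import torch
import torch.nn as nn

# Placeholder setup: othello_gpt is the frozen, already-trained model;
# probe_dataloader yields (game_moves, board_labels), where board_labels
# come from simply replaying the moves under the rules of Othello.
# get_residual_stream() stands in for whatever hook returns the
# activations at a chosen layer.

NUM_SQUARES = 64   # Othello board squares
NUM_STATES = 3     # empty / mine / theirs, per square
D_MODEL = 512      # residual stream width (assumed)

# The probe is a separate map trained on top of frozen activations.
probe = nn.Linear(D_MODEL, NUM_SQUARES * NUM_STATES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for game_moves, board_labels in probe_dataloader:
    with torch.no_grad():  # OthelloGPT itself never receives gradients
        acts = get_residual_stream(othello_gpt, game_moves, layer=6)  # (batch, d_model)

    logits = probe(acts).view(-1, NUM_SQUARES, NUM_STATES)
    loss = loss_fn(logits.permute(0, 2, 1), board_labels)  # board_labels: (batch, 64)

    optimizer.zero_grad()
    loss.backward()        # only the probe's weights change
    optimizer.step()
```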
Their argument is that the state is represented in the activation patterns and that this is then used to determine the next move. Are you countering that to suggest it may instead be "local position patterns learnt during training: a positional representation (attention) of the N-1 tokens in the autoregressive task"?
If the pattern of activations did not correspond to the current board state, modifying those activations to represent a different board wouldn't cause the model to start predicting moves that are legal for that different board, which is what the intervention experiments show. I also don't follow how, under your explanation, the activations would end up mirroring the expected board state.
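The intervention they describe is roughly the following kind of thing. Again just a sketch: `run_with_edit`, `tokenize_game` and the layer/strength numbers are made up, and I'm assuming the probe's weight rows can be read as per-square directions.

```python
import torch

# Sketch of an intervention, assuming:
#  - run_with_edit() reruns the model but overwrites one layer's
#    activations mid-forward (a stand-in for a forward hook),
#  - probe is the trained board-state probe from the sketch above,
#  - probe.weight rows can be read as per-(square, state) directions.

def flip_square(acts, probe, square, new_state, strength=5.0):
    """Push the activations toward the probe direction for (square, new_state)."""
    direction = probe.weight.view(64, 3, -1)[square, new_state]
    return acts + strength * direction / direction.norm()

moves = tokenize_game(game_prefix)  # tokenize_game / game_prefix are hypothetical

with torch.no_grad():
    original_logits = othello_gpt(moves)   # next-move distribution as-is
    edited_logits = run_with_edit(
        othello_gpt, moves, layer=6,
        edit_fn=lambda acts: flip_square(acts, probe, square=27, new_state=2),
    )

# If the activations really encode the board, the edited run should now
# favour moves that are only legal on the *edited* board, not the real one.
print(original_logits.topk(3).indices, edited_logits.topk(3).indices)
```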
What I am trying to say is that the game state is encoded as patterns in the attention matrices over the N-1 tokens. So yes, it's not explicitly trained to represent the game state, but that game state is encoded in the tokens and their positions.
The tokens and their positions by themselves don't tell you the state of the board.