It's not surprising, but it answers the question "Do Large Language Models learn world models or just surface statistics?": OthelloGPT is not using some weird trick to come up with the next move "G4". You can imagine some sort of shortcut trick where you say "use a letter that's near the middle of the bell curve of letters you've seen so far, and a number that's a bit to the left of the bell curve" or something. It's not using a weird trick; it's actually modelling the board, the counters, and the rules about where the black and white discs are allowed to go, and keeping track of the game state. It derived all of that from the input.
But the point is that Othello notation is basically 64 tokens which map 1:1 to positions on an Othello board, and the "grammar" of whether one token is a valid continuation is basically how the previous sequence of moves updates the game state, so surface statistics absolutely do lead inexorably towards a representation of the game board. Whether a move is a suitable continuation absolutely is a matter of probability contingent on previous inputs (some moves are common, some are uncommon, and many others never appear in the training set because they're impossible). Translating inputs into an array of game state has a far higher accuracy rate than "weird tricks" like outputting the most common numbers and letters in the set, so it's not surprising that an optimisation process involving a large array converges on that to generate its outputs. Indeed, I'd expect a dumb process involving a big array of numbers to be more likely to converge on that solution from a lot of data than a sentient being with a priori ideas about bell curves of letters...
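To make concrete how little machinery that "grammar" actually involves, here's a minimal sketch in plain Python (the function names, the coordinate convention, and the choice to ignore pass moves are all mine, not anything from the paper): each token names a square, and whether a token is a legal continuation falls straight out of replaying the previous tokens through the disc-flipping rules.

```python
# Minimal sketch of Othello-as-token-grammar. Assumptions: tokens are
# square names like "G4"; pass moves are ignored for brevity;
# +1 is white, -1 is black, and black moves first.

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def token_to_square(token):
    """1:1 mapping from a move token like "G4" to a board coordinate."""
    return ord(token[0]) - ord('A'), int(token[1]) - 1

def flips(board, player, x, y):
    """Discs captured if `player` plays at (x, y); empty list => illegal."""
    if (x, y) in board:
        return []
    captured = []
    for dx, dy in DIRECTIONS:
        run, cx, cy = [], x + dx, y + dy
        while board.get((cx, cy)) == -player:   # walk over opponent discs
            run.append((cx, cy))
            cx, cy = cx + dx, cy + dy
        if run and board.get((cx, cy)) == player:  # bracketed by own disc
            captured.extend(run)
    return captured

def legal_tokens(board, player):
    """The "grammar": which of the square tokens may come next."""
    return {f"{chr(ord('A') + x)}{y + 1}"
            for x in range(8) for y in range(8)
            if flips(board, player, x, y)}

def replay(tokens):
    """Fold a move sequence into game state, token by token."""
    board = {(3, 3): 1, (4, 4): 1, (3, 4): -1, (4, 3): -1}  # opening position
    player = -1  # black plays first
    for t in tokens:
        x, y = token_to_square(t)
        for sq in flips(board, player, x, y):
            board[sq] = player
        board[(x, y)] = player
        player = -player
    return board, player

board, to_move = replay(["D3", "C5", "F6"])
print(sorted(legal_tokens(board, to_move)))
```

Point being: the entire "world" here is a 64-entry array plus the flip rule, so an optimiser squeezing loss out of move sequences has a very short path to representing exactly that.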
I think some of the stuff ChatGPT can actually do, like rejecting the possibility of Magellan circumnavigating my living room, is much more surprising than a specialist NN learning how to play Othello from a DSL providing a perfect representation of Othello games. But there's still a big difference between acquiring through training a very basic model of time periods and the relevance of verbs to them, such that it can conclude an assertion of the form "it was impossible for X to have [Verb]ed Y because X lived in V and Y lived in Q" is a suitable continuation, and having a high-fidelity, well-rounded world model. It has some sort of world model, but it's tightly bound to syntax and approval and very loosely bound to the actual world. The rest of the world doesn't have a neat 1:1 mapping to sentence structure the way Othello maps to Othello notation, which is why LLMs appear to have quite limited and inadequate internal representations even of things which computers can excel at (and humans can be taught from considerably fewer textbooks), like mathematics, never mind being able to deduce what it's like to have an emotional state from tokens typically combined with the string "sad".