What does it mean to “not be given the rules”? If you set a child down in front of a chess board with the pieces nearby and they are not aware of the rules, I doubt they’d ever figure out how to play even a single correct game of chess. Heck, the child may decide to put the pieces in their mouth or dress them up as make-believe characters.
Without any concept of the rules you have no way of even knowing that you’ve set up the pieces for a legal starting position, never mind executing a legal move to open the game.
Basically, AlphaZero was provided with a simulator that was able to distinguish legal and illegal moves and determine which future game states would be wins or losses. This was used to generate the search tree of possible states and actions.
MuZero doesn't have access to a simulator; it only has access to its direct environment. MuZero excludes actions that are immediately illegal, which solves the problem you mention in your penultimate paragraph, but it needs to learn the game's dynamics in order to determine which future moves and states are possible.
> AlphaZero used the set of legal actions obtained from the simulator to mask the policy network at interior nodes. MuZero does not perform any masking within the search tree, but only masks legal actions at the root of the search tree where the set of available actions is directly observed. The policy network rapidly learns to exclude actions that are unavailable, simply because they are never selected.
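To make the quoted distinction concrete, here is a minimal sketch of root-only masking. It is not MuZero's actual code: the function names, the 4-action toy game, and the logits are all made up for illustration. The idea is just that the prior at the root is a softmax restricted to the observed legal actions, while interior nodes use the raw softmax over every action.

```python
import math


def masked_policy(logits, legal_actions):
    """Softmax over `logits` restricted to `legal_actions`; all others get 0."""
    legal = set(legal_actions)
    if not legal:
        raise ValueError("no legal actions to choose from")
    # Subtract the largest legal logit for numerical stability.
    top = max(logits[a] for a in legal)
    exps = [math.exp(l - top) if a in legal else 0.0
            for a, l in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]


def unmasked_policy(logits):
    """Plain softmax: what an interior node uses when legality is not observed."""
    return masked_policy(logits, range(len(logits)))


# Hypothetical 4-action game where only actions 0 and 2 are legal at the root.
root_logits = [2.0, 1.0, 0.5, -1.0]
root_prior = masked_policy(root_logits, legal_actions=[0, 2])
interior_prior = unmasked_policy(root_logits)
```

At the root, actions 1 and 3 get exactly zero probability, so they are never selected and never show up as training targets. Inside the tree nothing forces that; the network has to learn to put low probability on them.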
MuZero still masks out illegal moves, but only at the root.
All its parts are eventually trained on the output of its root, and so learn the legal moves.
They justify this root-level masking by noting that the Atari environment will only allow you to perform legal moves, while a weak enough player may consider illegal moves while planning in your head.
The main thing that's slightly "hidden under the rug" is that for "masking" to make sense in the first place, MuZero needs to know the set of all moves that may be legal at some point in the game.
> while a weak enough player may consider illegal moves while planning in your head
This isn't just weak players. E.g. strong chess players often consider moves as if blocking pawns weren't there. They might consider a bishop to be on a strong diagonal despite there being a blocking pawn, because they can imagine the moves that would follow if that pawn disappeared.
No, it doesn't. MuZero does its planning entirely in its own latent space (it may not even think of the game in terms of 'moves', but in whatever steps it considers relevant instead); only the output is filtered for legal moves.
It's no different than a monkey operating a chess computer that makes sure the monkey only performs legal moves. Your suggestion would be akin to suggesting that the chess computer would be affecting the monkey's mind so that it can only think in terms of legal chess moves.
Seems you could equivalently treat rule breaking as a loss, and any algorithm sophisticated enough to learn how to win will also learn to avoid breaking the rules.
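A rough sketch of what "treat rule breaking as a loss" could look like, assuming a hypothetical environment wrapper. The class name, the callback signatures, and the toy counting game are invented for illustration and do not come from any of the papers. Note that something still has to know the rules: here that is the environment, not the agent.

```python
class IllegalMoveIsLoss:
    """Wrap a game so that an illegal action ends the episode as a loss."""

    def __init__(self, legal_actions_fn, step_fn, loss_reward=-1.0):
        self.legal_actions_fn = legal_actions_fn  # state -> iterable of legal actions
        self.step_fn = step_fn  # (state, action) -> (next_state, reward, done)
        self.loss_reward = loss_reward

    def step(self, state, action):
        # The environment, not the agent, checks the rules.
        if action not in self.legal_actions_fn(state):
            return state, self.loss_reward, True
        return self.step_fn(state, action)


# Toy game (hypothetical): add 1 or 2 to a counter; reaching exactly 3 wins.
def legal(state):
    return [a for a in (1, 2) if state + a <= 3]


def step(state, action):
    nxt = state + action
    won = nxt == 3
    return nxt, (1.0 if won else 0.0), won


env = IllegalMoveIsLoss(legal, step)
overshoot = env.step(2, 2)  # illegal: would overshoot 3
winning = env.step(2, 1)    # legal and winning
```

With this setup the agent is never told which moves are legal; it only sees that some actions end the episode with a losing reward, which is the same signal it uses to learn everything else.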
> Without any concept of the rules you have no way of even knowing that you’ve set up the pieces for a legal starting position, never mind executing a legal move to open the game.
This is really bizarre.