> we wanted to verify whether the model is actually capable of reasoning by building a simulation for a much simpler game - Connect 4 (see 'llmc4.py').
> When asked to play Connect 4, all LLMs fail to do so, even at the most basic level. This should not be the case, as the rules of the game are simpler and widely available.
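For reference, the actual 'llmc4.py' isn't shown here, but I'd guess the harness looks roughly like this: a 6x7 board, moves given as 1-indexed column numbers, and a four-in-a-row check (the names below are just my assumptions, not the real script):

```python
# Rough sketch of a Connect 4 harness, not the actual llmc4.py.
ROWS, COLS = 6, 7

def new_board():
    return [[" "] * COLS for _ in range(ROWS)]

def drop(board, col, piece):
    """Drop `piece` ('X' or 'O') into 1-indexed column `col`; return True if the move was legal."""
    c = col - 1
    for row in reversed(board):            # fill from the bottom row up
        if row[c] == " ":
            row[c] = piece
            return True
    return False                           # column already full -> illegal move

def winner(board, piece):
    """Check every horizontal, vertical and diagonal line of four for `piece`."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == piece
                       for rr, cc in cells):
                    return True
    return False

if __name__ == "__main__":
    board = new_board()
    moves = [4, 5, 4, 5, 4, 5, 4]          # X stacks column 4 -> vertical win
    for i, col in enumerate(moves):
        drop(board, col, "X" if i % 2 == 0 else "O")
    print(winner(board, "X"))              # True
```

In a setup like that, the model's side of the game would just be the column number it picks each turn, with the harness checking legality and the win condition.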
Wouldn't there have to be historical matches to train on? There are tons of chess games out there, but I doubt there are many Connect 4 games. Is there even an official notation for that?
My assumption is that ChatGPT can play chess because it has studied actual games rather than just reading the rules.
Good point. It would be interesting to have one public dataset and one hidden one as well, just to see how the scores compare and to understand whether any of it might actually have ended up in a training dataset somewhere.
I would assume it goes over all the public GitHub codebases, but I have no clue if there's some sort of filtering by filetype, size, or number of stars on a repo, etc.