The other commenter is more articulate, but you simply can't draw the conclusion from this paper that reasoning models don't work well. The authors trained tiny models and showed that those tiny models don't reason well. Big surprise! Meanwhile, every other piece of evidence available shows that reasoning models are more reliable on sophisticated problems. Here are just a few examples:
"Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques."
- https://arcprize.org/leaderboard
- https://aider.chat/docs/leaderboards/
- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
Surely the IMO problems weren't "within the bounds" of Gemini's training data, yet it solved them anyway.