Is it possible the people administering these tests are less than honest about the outcome? Perhaps they have a vested interest in the results they've shown?
It's also not the case that all (or even most) AIs ace all (or even most) meaningful tests that aren't pattern- or memory-based. Adding uncommon multi-digit integers, for example, is too hard for many LLMs.
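If anyone wants to check the addition point themselves, here's a rough sketch of how I'd build a test that can't be passed from memory: fresh random operands every run, scored against the exact sum. The names (make_addition_problems, score, answer_fn) are my own for illustration, and answer_fn is just a placeholder for however you query the model and parse its reply into an integer.

```python
import random

def make_addition_problems(n, digits=12, seed=0):
    """Return n (a, b, a + b) triples with random `digits`-digit operands."""
    rng = random.Random(seed)  # change the seed to get a fresh, unseen set
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    problems = []
    for _ in range(n):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        problems.append((a, b, a + b))
    return problems

def score(problems, answer_fn):
    """Fraction of problems where answer_fn(a, b) equals the true sum."""
    correct = sum(1 for a, b, total in problems if answer_fn(a, b) == total)
    return correct / len(problems)

problems = make_addition_problems(100)
```

Plugging in ordinary integer addition, score(problems, lambda a, b: a + b), is a sanity check on the harness; anything that only memorised published answers has nothing to regurgitate here.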
My fellow uni students could pass tests on one of my courses by regurgitating the (wrong) answers that had been published and reused for years. I was not popular with faculty for pointing that out, and had most of a year's papers cancelled after the fact...