I agree that existing benchmarks are no longer useful now that there's basically...

I agree that existing benchmarks are no longer useful now that there's basically nothing left in them that seems to stump LLMs.

But when I hear that models are failing to meet expectations, I imagine what they're saying is that the researchers had some sort of eval in mind with room to grow and a target, and that the model in question failed to hit the target they had in mind.

Honestly, problem with sentiments like these is on Twitter is that you can't tell if they're being sincere or just making a snarky, useless remark. Probably a mix of both.