
I don't know about groundbreaking. It's just more academic questions. We already have a lot of those benchmarks; this one is just a bit harder. At this point these models are so glaringly bad in so many areas APART from academic questions. Benchmarks for spatial reasoning or theory of mind are more interesting now, for example. Those kinds of understanding are far more important if we expect to integrate AI into our everyday lives. I suspect even our most distant primate cousins could outperform multi-modal models on tests like these.


It does feel a bit like the early days of AI:

"We want to make computers do what smart people do. What do smart people do? They play chess! Once we've solved that, everything else will be easier."

It has been remarkable how much of the "easier" stuff they've made progress on -- like natural language and images. But after a huge leap forward, they don't seem very good at adapting to a lot of the things we really need them for.


Exactly!

Whatever world model LLMs have is a crippled view through the lens of the internet. They are really like savants.

It's annoying that AI companies are still touting their performance on all these metrics for domain knowledge in white-collar jobs, when in truth they will fail in all but the narrowest applications in those domains, because they can't understand basic human behaviour.



