
SWE-Bench Verified score, with thinking, ties Opus 4.1 without thinking.

AIME scores do not appear too impressive at first glance.

They are downplaying benchmarks heavily in the live stream. This is the lab that has been flexing benchmarks as headline figures since forever.

This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.



What does it mean for a benchmark to be unimpressive when it's saturated?


They aren't downplaying anything.



