AIME scores do not appear too impressive at first glance.
They are downplaying benchmarks heavily in the live stream. This was the lab that has been flexing benchmarks as headline figures since forever.
This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.
AIME scores do not appear too impressive at first glance.
They are downplaying benchmarks heavily in the live stream. This was the lab that has been flexing benchmarks as headline figures since forever.
This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.