

thegeomaster 11 months ago | on: GPT-5

SWE-Bench Verified score, with thinking, ties Opus 4.1 without thinking.

AIME scores do not appear too impressive at first glance.

They are downplaying benchmarks heavily in the live stream. This is the lab that has been flexing benchmarks as headline figures since forever.

This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.

Davidzheng 11 months ago

what does it mean for a benchmark to be unimpressive when it's saturated?

byyoung3 11 months ago

they aren't downplaying anything.
