The variance is way too high for this test to have any value at all. I ran it 10... | Hacker News


Stevvo 11 days ago | on: GPT-5.2

The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that; you could say about half of them were perfect.

golly_ned 11 days ago

Compared to the other benchmarks, which are much more gameable, I trust PelicanBikeEval way more.

getnormality 11 days ago

Well, the variance is itself interesting.
