Hacker News
mewpmewp2 on Nov 10, 2024 | on: FrontierMath: A benchmark for evaluating advanced ...
Ideally they would have batches of those exercises, where they only use the next batch when someone has solved a suspiciously large number of the exercises in the current one. If the model then performs much worse on the next batch, that is a tell of leakage.
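A rough sketch of how that check could look, assuming each batch is scored as solved/total and that the batches are of comparable difficulty. The function name, the one-sided two-proportion z-test, and the 0.01 threshold are all my own illustrative choices, not anything the benchmark authors describe:

```python
import math

def leakage_suspected(solved_old, total_old, solved_new, total_new, alpha=0.01):
    """Hypothetical helper: flag a significant accuracy drop on the fresh batch.

    Uses a one-sided two-proportion z-test (an assumption for illustration).
    """
    p_old = solved_old / total_old
    p_new = solved_new / total_new
    # Pooled success rate under the null hypothesis of equal accuracy.
    pooled = (solved_old + solved_new) / (total_old + total_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_old + 1 / total_new))
    if se == 0:
        # Both batches fully solved or fully unsolved: no drop to detect.
        return False
    z = (p_old - p_new) / se
    # Upper-tail probability P(Z >= z) for a standard normal.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_value < alpha

# Made-up numbers: 45/50 on the exposed batch vs 10/50 on the held-back one.
print(leakage_suspected(45, 50, 10, 50))
# Nearly identical scores on both batches should not raise a flag.
print(leakage_suspected(10, 50, 9, 50))
```

A drop could also just mean the new batch is harder, so this only counts as a tell if the batches were drawn from the same pool of problems.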