
Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the training data. Only the private problems actually matter.


In this case the code is public and you can see they are not cheating in that sense.


The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. This isn't a qualifying score for that reason.

Here is the ARC-AGI-3-specific harness, by the way; there's a lot of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


I agree it's not cheating in that restricted sense. But I'm not convinced it can't be cheating in a more general sense. You can try something like 10^10 variations of harnesses and select the one that performs best. If you then look at the winner, it probably won't look like cheating. But you have biased the estimator by selecting the harness according to its measured score.
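A toy simulation makes the selection-bias point concrete (all numbers here are hypothetical, not from the actual benchmark): if every candidate harness has the same true pass rate, scoring each on the same finite test set and keeping the best one still yields a measured score well above the true rate.

```python
# Hypothetical illustration of selection / multiple-comparisons bias:
# many equally good harnesses, one reused evaluation set, keep the best.
import random

random.seed(0)

TRUE_PASS_RATE = 0.30   # every harness is equally good in reality
N_TASKS = 100           # size of the (reused) evaluation set
N_HARNESSES = 1000      # candidate harness variations tried

def measured_score(p: float, n: int) -> float:
    """Pass rate observed on n tasks for a harness with true rate p."""
    return sum(random.random() < p for _ in range(n)) / n

scores = [measured_score(TRUE_PASS_RATE, N_TASKS) for _ in range(N_HARNESSES)]
best = max(scores)

print(f"true pass rate:              {TRUE_PASS_RATE:.2f}")
print(f"best measured of {N_HARNESSES} tries: {best:.2f}")
# The selected harness's measured score lands noticeably above 0.30,
# even though no harness is actually better than any other.
```

The same effect applies whether the selection is deliberate or just the natural result of iterating on a harness until the leaderboard number looks good.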


Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.


They aren't training new models for this. This is an agent harness for Opus 4.6.


All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.


OK! So if someone uses an existing, checkpointed, open-source model, then the answer is yes: the results are valid, and it doesn't matter that the tests are public.


Yes, assuming the checkpoint was before the announcement & public availability of the test set.


You live in a conspiracy world. Those AI providers don't update their models that fast. You can try asking them to solve ARC-AGI-3 without a harness yourself and watch them struggle, same as yesterday.


Which part is the conspiracy? Be as concrete as possible.


They are definitely cheating: they have crafted prompts[1] that explain the game rules rather than having the model explore and learn them.

1. https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


Where do you see that? I only skimmed the prompts, but I don't see any aspects of any of the games explained in there. There are a few hints that count as legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a stated requirement of the ARC series; read the reports.



