First of all, Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly. Secondly, removing it from internet data is never 100% accurate: there are translations of the problems and solutions, as well as references to them, so direct matching is not enough. Some previous research has found MMLU and other test-set benchmarks to show more resilience to this, though.
OpenAI is extremely cagey about what's in their training data generally, but absent more specific info, they're widely assumed to be grabbing whatever they can. (Notably including copyrighted material used without explicit authorization; I'll take no position on the legal issues in the New York Times's lawsuit against OpenAI, but at the very least, getting their models to regurgitate NYT articles verbatim demonstrates pretty clearly that those articles are in the training set.)
> Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly
What exactly is the source of your belief that the Putnam would not be in the test data? Didn’t they train on everything they could get their hands on?
These models are trained in two steps: first a base model is trained, and then it is post-trained (fine-tuned). The first step uses as much data as possible, everything the company can find. For Llama models that's 15T tokens, which is ~40 TB of data. No one really puts effort into splitting this data into train/test/eval sets (and it isn't really achievable anyway). It's just as much data as possible.
So it's, like, 99.9999999% wrong to assume that something public, such as the Putnam problems in this case, isn't in the training set. That's about it.
There are benchmarks that are decided on beforehand, and sentences similar to them are removed even from the first stage of training. This is useful for tracking model performance and comparing different design choices; see e.g. the section 'Contamination of downstream tasks' in [1].
Every decent AI lab does this, otherwise the benchmark results couldn't be trusted. OpenAI publishes results on ~20 benchmarks [2], and it's safe to assume they have made a reasonable attempt to remove those from the training set.
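For a concrete picture, this kind of decontamination is usually done with something like word n-gram overlap against the benchmark items. A minimal sketch of that general approach (not any particular lab's actual pipeline; the n-gram length here is arbitrary):

    # Sketch of n-gram-based decontamination: drop any training document
    # that shares a long enough word n-gram with a benchmark item.

    def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
        """Word-level n-grams of a lowercased text."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def build_benchmark_index(items: list[str], n: int = 8) -> set[tuple[str, ...]]:
        """Every n-gram appearing in any benchmark question or answer."""
        index: set[tuple[str, ...]] = set()
        for item in items:
            index |= ngrams(item, n)
        return index

    def is_contaminated(doc: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
        """Flag a document that shares at least one n-gram with the benchmark."""
        return not ngrams(doc, n).isdisjoint(index)

    # Usage: filter a training corpus against a benchmark before pretraining.
    benchmark = ["Prove that every continuous function on [0, 1] attains its maximum."]
    corpus = [
        "some unrelated web page about gardening",
        "solution: prove that every continuous function on [0, 1] attains its maximum by ...",
    ]
    index = build_benchmark_index(benchmark)
    clean_corpus = [doc for doc in corpus if not is_contaminated(doc, index)]

Exact token matching like this is also why, as noted upthread, translations or paraphrases of the same problems can still slip through.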
The point is that the Putnam was never a test/benchmark used by OpenAI or anyone else, so there is no smoking gun if you find Putnam problems in the training set, nor is it cheating or nefarious, because nobody ever claimed otherwise.
This whole notion of the Putnam being a test that was trained on is a fully invented grievance.
I've read the thread and I think it's not very coherent overall; I'm also not sure we actually disagree =)
I agree that having Putnam problems in OpenAI's training set is not a smoking gun; however, it's (almost) certain they are in the training set, and having them there would affect the model's performance on them too. Hence research like this is important, since it shows that the observed behavior of the models is to a large extent memorization, and not necessarily the generalization we would like it to be.
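Research like that typically makes the point concrete by comparing accuracy on the original problems against minimally perturbed variants (renamed variables, changed constants); a large drop suggests memorization rather than generalization. A toy sketch of that comparison (the `ask_model` callable and the example data are placeholders, not the actual evaluation setup):

    from typing import Callable

    # Hypothetical memorization check: score a model on original problems
    # and on lightly perturbed variants, then look at the accuracy gap.

    def accuracy(problems: list[dict], ask_model: Callable[[str], str]) -> float:
        """Fraction of problems where the model's answer matches the reference."""
        correct = sum(ask_model(p["question"]).strip() == p["answer"] for p in problems)
        return correct / len(problems)

    def memorization_gap(originals: list[dict],
                         variants: list[dict],
                         ask_model: Callable[[str], str]) -> float:
        """Accuracy drop when moving from original problems to perturbed variants."""
        return accuracy(originals, ask_model) - accuracy(variants, ask_model)

    # Toy usage; a real study would perturb the problems systematically
    # and call an actual model instead of this stub.
    originals = [{"question": "What is 2 + 2?", "answer": "4"}]
    variants = [{"question": "What is 3 + 3?", "answer": "6"}]
    gap = memorization_gap(originals, variants, ask_model=lambda q: "4")
    print(f"accuracy drop on perturbed problems: {gap:.0%}")  # 100% for this stub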
Nobody serious (like OAI) was using the Putnam problems to claim generalization. This is a refutation in search of a claim, and yet many people in the upstream thread are suggesting that OAI is doing something wrong by training on a benchmark.
OAI uses datasets like FrontierMath or ARC-AGI, which are actually held out, to evaluate generalization.
I would actually disagree with this.
To me, the ability to solve FrontierMath does imply the ability to solve Putnam problems too, only with the Putnam problems being the easier case: they have already been seen by the model, and they are also simpler problems. In the same way, Putnam problems with simple changes are one of the easier stops on the way to truly generalizing math models, with FrontierMath being one of the last stops along that way.