First of all, Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly. Secondly, removing it from internet data is never 100% accurate: there are translations of the problems and solutions, as well as references to them, so direct matching is not enough. Some previous research has found MMLU and other test-set benchmarks to show more resilience to this, though.
OpenAI is extremely cagey about what's in their training data generally, but absent more specific info, they're widely assumed to be grabbing whatever they can. (Notably including copyrighted material used without explicit authorization; I'll take no position on the legal issues in the New York Times's lawsuit against OpenAI, but at the very least, getting their models to regurgitate NYT articles verbatim demonstrates pretty clearly that those articles are in the training set.)
> Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly
What exactly is the source of your belief that the Putnam would not be in the test data? Didn’t they train on everything they could get their hands on?
These models are trained in two steps: first a base model is trained, and then it is post-trained (fine-tuned). The first step uses as much data as possible, everything the company can find. For Llama models that's 15T tokens, which is ~40 TB of data. No one really puts effort into splitting this data into train/test/eval sets (and it isn't really achievable anyway). It's just as much data as possible.
So it's, like, 99.9999999% wrong to assume that something public, such as the Putnam problems in this case, isn't in the training set. That's about it.
There are benchmarks that are decided on beforehand, and sentences similar to them are removed even from the first stage of training. This is useful for tracking model performance and comparing different design choices; see e.g. the section 'Contamination of downstream tasks' in [1].
Every decent AI lab does this, otherwise the benchmark results couldn't be trusted. OpenAI publishes results on ~20 benchmarks [2], and it's safe to assume they have made a reasonable attempt to remove those from the training set.
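For a concrete picture, this kind of decontamination is usually done with something like word n-gram overlap against the benchmark items. A minimal sketch of that general approach (not any particular lab's actual pipeline; the n-gram length here is arbitrary):

    # Sketch of n-gram-based decontamination: drop any training document
    # that shares a long enough word n-gram with a benchmark item.

    def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
        """Word-level n-grams of a lowercased text."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def build_benchmark_index(items: list[str], n: int = 8) -> set[tuple[str, ...]]:
        """Every n-gram appearing in any benchmark question or answer."""
        index: set[tuple[str, ...]] = set()
        for item in items:
            index |= ngrams(item, n)
        return index

    def is_contaminated(doc: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
        """Flag a document that shares at least one n-gram with the benchmark."""
        return not ngrams(doc, n).isdisjoint(index)

    # Usage: filter a training corpus against a benchmark before pretraining.
    benchmark = ["Prove that every continuous function on [0, 1] attains its maximum."]
    corpus = [
        "some unrelated web page about gardening",
        "solution: prove that every continuous function on [0, 1] attains its maximum by ...",
    ]
    index = build_benchmark_index(benchmark)
    clean_corpus = [doc for doc in corpus if not is_contaminated(doc, index)]

Exact token matching like this is also why, as noted upthread, translations or paraphrases of the same problems can still slip through.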
The point is that the Putnam was never a test/benchmark used by OpenAI or anyone else, so there is no smoking gun if you find Putnam problems in the training set, nor is it cheating or nefarious, because nobody ever claimed otherwise.
This whole notion of the Putnam being a test that was trained on is a fully invented grievance.
I've read the thread and I think it's not very coherent overall; I'm also not sure we actually disagree =)
I agree that having Putnam problems in OpenAI's training set is not a smoking gun; however, it's (almost) certain they are in the training set, and having them there would affect the model's performance on them too. Hence research like this is important, since it shows that the observed behavior of the models is to a large extent memorization, and not necessarily the generalization we would like it to be.
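Research like that typically makes the point concrete by comparing accuracy on the original problems against minimally perturbed variants (renamed variables, changed constants); a large drop suggests memorization rather than generalization. A toy sketch of that comparison (the `ask_model` callable and the example data are placeholders, not the actual evaluation setup):

    from typing import Callable

    # Hypothetical memorization check: score a model on original problems
    # and on lightly perturbed variants, then look at the accuracy gap.

    def accuracy(problems: list[dict], ask_model: Callable[[str], str]) -> float:
        """Fraction of problems where the model's answer matches the reference."""
        correct = sum(ask_model(p["question"]).strip() == p["answer"] for p in problems)
        return correct / len(problems)

    def memorization_gap(originals: list[dict],
                         variants: list[dict],
                         ask_model: Callable[[str], str]) -> float:
        """Accuracy drop when moving from original problems to perturbed variants."""
        return accuracy(originals, ask_model) - accuracy(variants, ask_model)

    # Toy usage; a real study would perturb the problems systematically
    # and call an actual model instead of this stub.
    originals = [{"question": "What is 2 + 2?", "answer": "4"}]
    variants = [{"question": "What is 3 + 3?", "answer": "6"}]
    gap = memorization_gap(originals, variants, ask_model=lambda q: "4")
    print(f"accuracy drop on perturbed problems: {gap:.0%}")  # 100% for this stub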
Nobody serious (like OAI) was using the Putnam problems to claim generalization. This is a refutation in search of a claim, and yet many people in the upstream thread are suggesting that OAI is doing something wrong by training on a benchmark.
OAI uses datasets like FrontierMath or ARC-AGI, which are actually held out, to evaluate generalization.
I would actually disagree with this.
To me, the ability to solve FrontierMath does imply the ability to solve Putnam problems too, only with the Putnam problems being the easier case: they have already been seen by the model, and they are also simpler problems. In the same way, Putnam problems with simple changes are one of the easier stops on the way to truly generalizing math models, with FrontierMath being one of the last stops along that way.