1. OpenAI has confirmed it’s not in their training set (unlike the Putnam, where they have never made any such claims)
2. They don't train on API calls
3. It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
> 1. OpenAI has confirmed it’s not in their training set (unlike the Putnam, where they have never made any such claims)
Companies claim lots of things when it's in their best financial interest to spread that message. Unfortunately, history has shown that in public communications, financial interest almost always trumps truth (pick whichever $gate you are aware of for convenience; I'll go with Dieselgate for a specific example).
> It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
What I see is generic, unsubstantiated claims of artificial intelligence on one side and specific, reproducible examples that dismantle those claims on the other. I wonder what kind of epistemology leads you to accept marketing claims without evidence.
OpenAI’s credibility is central to its business: overstating capabilities risks public blowback, loss of trust, and regulatory scrutiny. As a result, it is unlikely that OpenAI would knowingly lie about its models. They have much stronger incentives to be as accurate as possible—maintaining their reputation and trust from users, researchers, and investors—than to overstate capabilities for a short-term gain that would undermine their long-term position.
From a game-theoretic standpoint, repeated interactions with the public (research community, regulators, and customers) create strong disincentives for OpenAI to lie. In a single-shot scenario, overstating model performance might yield short-term gains—heightened buzz or investment—but repeated play changes the calculus:
1. Reputation as “collateral”
OpenAI’s future deals, collaborations, and community acceptance rely on maintaining credibility. In a repeated game, players who defect (by lying) face future punishment: loss of trust, diminished legitimacy, and skepticism of future claims.
2. Long-term payoff maximization
If OpenAI is caught making inflated claims, the fallout undermines the brand and reduces willingness to engage in future transactions. Therefore, even if there is a short-term payoff, the long-term expected value of accuracy trumps the momentary benefit of deceit.
3. Strong incentives for verification
Independent researchers, open-source projects, and competitor labs can test or replicate claims. The availability of external scrutiny acts as a built-in enforcement mechanism, making dishonest “moves” too risky.
Thus, within the repeated game framework, OpenAI maximizes its overall returns by preserving its credibility rather than lying about capabilities for a short-lived advantage.
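To make that repeated-game claim concrete, here's a toy payoff model (the structure and numbers are entirely my own illustration, not anything OpenAI has published): accurate reporting earns a steady payoff every period, while inflating a claim earns a one-time bonus but risks losing the whole future stream once the lie is exposed.

```python
# Toy repeated-game payoff comparison (all assumptions illustrative).
# Honest reporting earns A every period, discounted by d.
# Inflating a claim once earns an extra bonus G this period, but with
# probability p_catch the lie is exposed and all future payoffs drop to 0.

def value_honest(A: float, d: float) -> float:
    """Present value of reporting accurately forever: A + d*A + d^2*A + ..."""
    return A / (1 - d)

def value_inflate_once(A: float, G: float, d: float, p_catch: float) -> float:
    """Take the bonus now; keep the honest stream only if undetected."""
    return (A + G) + d * (1 - p_catch) * value_honest(A, d)

A, G, d, p_catch = 1.0, 3.0, 0.95, 0.5        # made-up illustrative numbers
print(value_honest(A, d))                     # 20.0
print(value_inflate_once(A, G, d, p_catch))   # 13.5
# Honesty wins whenever G < d * p_catch * A / (1 - d): the more the future
# matters (d) and the more likely exposure is (p_catch), the worse lying pays.
```

Whether the inequality actually favors honesty depends, of course, on how you estimate those parameters.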
Find me the folks who see nothing but good will in OpenAI’s actions and I’ll find you the folks who have been hyping up AGI for the last 2 years.
GPT-4 was literally sitting on a shelf waiting for release when 3.5 was launched. 4o was a fine-tune that took over two years. o1 is embarrassingly unimpressive chain of thought, which is why they hide it.
The company hit a wall a year ago. But showing progress towards AGI keeps the lights on. If they told the truth at their current burn rate…they’d have no money.
You don’t need game theory to figure that one out.
>OpenAI’s credibility is central to its business: overstating capabilities risks public blowback, loss of trust, and regulatory scrutiny.
Uh huh. Kinda like what's happening right now?
They're marketing blowhards. Everyone knows it. They've been wildly overstating capabilities (and future capabilities!) for as long as Altman has had power, and arguably longer.
They'll do it as long as they can get away with it, because that's all that is needed to make money on it. Factual accuracy rarely impacts the market when it's so hype-driven, especially when there is still some unique utility in the product.
OpenAI's apparent credibility is central to their business.
They're spruiking a 93rd percentile performance on the 2024 International Olympiad in Informatics with 10 hours of processing and 10,000 submissions per question.
Like many startups they're still a machine built to market itself.
OpenAI never actually participated in the contests directly. OpenAI says they took (unspecified) problems and "simulated" the score the model would have had.
If they solved recent contests in a realistic contest simulation I would expect them to give the actual solutions and success rates as well, like they did for IOI problems, so I'm actually confused as to why they didn't.
Very good clarification, thanks. They should absolutely release more details to provide more clarity, and ideally just participate live? I suspect the model takes a while on individual problems, so time might be a constraint there.
The modern state of training is to try to use everything they can get their hands on. Even if there are privileged channels that are guaranteed not to be used as training data, mentioning the problems on ancillary channels (say emailing another colleague to discuss the problem) can still create a risk of leakage because nobody making the decision to include the data is aware that stuff that should be excluded is in that data set. And as we've seen from decades of cybersecurity, people are absolute shit at the necessary operational security to avoid mentioning stuff on ancillary channels!
Given that performance is known to drop considerably on these kinds of tests when novel problems are tried, and given the ease with which these problems could leak into the training set somehow, it's not unreasonable to be suspicious of a sudden jump in performance as merely a sign that the problems made it into the training set rather than being true performance improvements in LLMs.
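For what it's worth, this kind of leakage is exactly what contamination checks try to catch, and it's why "we didn't train on it" is hard to verify from the outside. A rough sketch of the usual n-gram-overlap style of check (the threshold, n, and corpus here are made-up illustrative choices, not any lab's actual pipeline):

```python
# Minimal n-gram-overlap contamination check (illustrative sketch).
# Flags a benchmark problem if too many of its word n-grams appear verbatim
# in the training corpus. Real pipelines work on tokenized, deduplicated
# data at far larger scale.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(problem: str, corpus_docs: list[str],
                       n: int = 8, threshold: float = 0.3) -> bool:
    """True if more than `threshold` of the problem's n-grams show up in any
    corpus document (a crude proxy for 'the model has seen this')."""
    prob_grams = ngrams(problem, n)
    if not prob_grams:
        return False
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    overlap = len(prob_grams & corpus_grams) / len(prob_grams)
    return overlap > threshold
```

The obvious limitation is the one described above: the check only covers data you already know is in the corpus, not the ancillary channels where the leak actually happened.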
Okay, then what about elite-level Codeforces performance? Those problems weren’t even constructed until after the model was made.
The real problem with all of these theories is that most of these benchmarks were constructed after the models' training data cutoffs.
A sudden performance improvement on a new model release is not suspicious. Any model release that is much better than a previous one is going to be a “sudden jump in performance.”
Also, OpenAI is not reading your emails, and certainly not with less than a month of lead time.
o1 has a ~1650 rating; at that level, many or most of the problems you solve are transplants of relatively well-known problems.
Since o1 on Codeforces just tried hundreds or thousands of candidate solutions, it's not surprising that it can solve problems where the task is really about finding a relatively simple correspondence to a known problem and regurgitating an algorithm (see the quick calculation below).
In fact, when you run o1 on "non-standard" Codeforces problems, it will almost always fail.
So the thesis that it's about recognizing a problem with a known solution, not actually coming up with a solution yourself, seems to hold, as o1 fails even on low-rated problems that require more than fitting templates.
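To put some rough numbers on the "hundreds or thousands of solutions" point (the per-attempt success rates below are assumptions for illustration, not measured figures):

```python
# How many-attempts sampling inflates apparent solve rate.
# If a single sample solves a problem with probability p, the chance that at
# least one of k independent samples solves it is 1 - (1 - p)**k.

def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for p in (0.01, 0.05):            # assumed per-attempt success rates
    for k in (1, 100, 1000):      # number of sampled submissions
        print(f"p={p:.2f} k={k:5d} -> {pass_at_k(p, k):.2f}")
# p=0.01, k=1000 already gives ~1.00: a model that almost never solves a
# problem in one shot can still look like it "solved" it given enough tries.
```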
It's extremely unlikely that o3 hit 2700 on live contests, as such a rapid increase in rating would have been noticed by the community. I can't find anything online, including in their video, detailing how contamination was avoided (since it clearly wasn't run live), nor could I find details about the methodology (number of submissions being the big one; in contests you can also get 'hacked', especially at a high level), problem selection, etc.
Additionally, people weren't able to replicate the o1-mini results in live contests straightforwardly, often getting scores between 700 and 1200, which raises questions about the methodology.
Perhaps o3 really is that good, but I just don't see how you can claim what you claimed for o3: we have no idea whether the problems had ever been seen before, and the fact that people find much lower Elo scores with o1/o1-mini under proper methodology raises even more questions, let alone conclusively proves these are truly novel tasks it has never seen.
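For a sense of scale on that gap: Codeforces ratings are only Elo-like, but as a rough proxy you can plug the numbers into the standard Elo expected-score formula (the generic formula, nothing to do with OpenAI's methodology). A 2700 player should almost never be outperformed by a ~1650 one, and the 700–1200 replication range is further away still:

```python
# Standard Elo expected score: probability that a player rated r_a
# outperforms a player rated r_b in a head-to-head comparison.
# Codeforces ratings are only Elo-like, so treat this as a rough proxy.

def elo_expected(r_a: float, r_b: float) -> float:
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

print(f"{elo_expected(2700, 1650):.4f}")  # ~0.9976: claimed o3 level vs o1's ~1650
print(f"{elo_expected(1200, 1650):.4f}")  # ~0.0698: top of the replication range vs o1's claimed rating
```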
Sorry, I thought the whole point of this thread was that models can’t handle problems when they are “slightly varied”. Mottes and baileys all over the place today.
The point is that it's not consistent on variations unless it finds a way to connect them to something it already knows. The fact that it sometimes succeeds on variations (on Codeforces the models are allowed multiple tries, sometimes a ridiculous number of them, to be useful) doesn't matter.
The point is that it's no longer consistent once you vary the terminology, which indicates it's fitting a memorized template rather than reasoning from first principles.
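A sketch of the kind of variation being described (the renamings are hypothetical examples I made up; the point is that they're purely surface-level, so a model reasoning from first principles shouldn't care):

```python
# Surface-level perturbation of a problem statement: the terminology changes,
# the underlying task does not. Comparing solve rates on original vs.
# perturbed statements is one way to probe template-matching vs. reasoning.

import re

# Hypothetical, semantics-preserving renamings.
RENAMINGS = {
    r"\bknight\b": "courier",
    r"\bchessboard\b": "warehouse grid",
    r"\bcoins\b": "tokens",
}

def perturb(statement: str) -> str:
    out = statement
    for pattern, replacement in RENAMINGS.items():
        out = re.sub(pattern, replacement, out, flags=re.IGNORECASE)
    return out

original = "A knight moves on an N x N chessboard collecting coins."
print(perturb(original))
# -> "A courier moves on an N x N warehouse grid collecting tokens."
```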