1. OpenAI has confirmed it’s not in their training set (unlike the Putnam, where they have never made any such claims)
2. They don't train on API calls
3. It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
> 1. OpenAI has confirmed it’s not in their training set (unlike the Putnam, where they have never made any such claims)
Companies claim lots of things when it's in their best financial interest to spread that message. Unfortunately, history has shown that in public communications, financial interest almost always trumps truth (pick whichever $gate you are aware of for convenience; I'll go with Dieselgate for a specific example).
> It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
What I see is generic, unsubstantiated claims of artificial intelligence on one side and specific, reproducible examples that dismantle those claims on the other. I wonder what kind of epistemology leads you to accept marketing claims without evidence.
OpenAI’s credibility is central to its business: overstating capabilities risks public blowback, loss of trust, and regulatory scrutiny. As a result, it is unlikely that OpenAI would knowingly lie about its models. They have much stronger incentives to be as accurate as possible—maintaining their reputation and trust from users, researchers, and investors—than to overstate capabilities for a short-term gain that would undermine their long-term position.
From a game-theoretic standpoint, repeated interactions with the public (research community, regulators, and customers) create strong disincentives for OpenAI to lie. In a single-shot scenario, overstating model performance might yield short-term gains—heightened buzz or investment—but repeated play changes the calculus:
1. Reputation as “collateral”
OpenAI’s future deals, collaborations, and community acceptance rely on maintaining credibility. In a repeated game, players who defect (by lying) face future punishment: loss of trust, diminished legitimacy, and skepticism of future claims.
2. Long-term payoff maximization
If OpenAI is caught making inflated claims, the fallout undermines the brand and reduces willingness to engage in future transactions. Therefore, even if there is a short-term payoff, the long-term expected value of accuracy trumps the momentary benefit of deceit.
3. Strong incentives for verification
Independent researchers, open-source projects, and competitor labs can test or replicate claims. The availability of external scrutiny acts as a built-in enforcement mechanism, making dishonest “moves” too risky.
Thus, within the repeated game framework, OpenAI maximizes its overall returns by preserving its credibility rather than lying about capabilities for a short-lived advantage.
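To make that repeated-game claim concrete, here's a toy payoff model (the structure and numbers are entirely my own illustration, not anything OpenAI has published): accurate reporting earns a steady payoff every period, while inflating a claim earns a one-time bonus but risks losing the whole future stream once the lie is exposed.

```python
# Toy repeated-game payoff comparison (all assumptions illustrative).
# Honest reporting earns A every period, discounted by d.
# Inflating a claim once earns an extra bonus G this period, but with
# probability p_catch the lie is exposed and all future payoffs drop to 0.

def value_honest(A: float, d: float) -> float:
    """Present value of reporting accurately forever: A + d*A + d^2*A + ..."""
    return A / (1 - d)

def value_inflate_once(A: float, G: float, d: float, p_catch: float) -> float:
    """Take the bonus now; keep the honest stream only if undetected."""
    return (A + G) + d * (1 - p_catch) * value_honest(A, d)

A, G, d, p_catch = 1.0, 3.0, 0.95, 0.5        # made-up illustrative numbers
print(value_honest(A, d))                     # 20.0
print(value_inflate_once(A, G, d, p_catch))   # 13.5
# Honesty wins whenever G < d * p_catch * A / (1 - d): the more the future
# matters (d) and the more likely exposure is (p_catch), the worse lying pays.
```

Whether the inequality actually favors honesty depends, of course, on how you estimate those parameters.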
Find me the folks who see nothing but good will in OpenAI’s actions and I’ll find you the folks who have been hyping up AGI for the last 2 years.
GPT-4 was literally sitting on a shelf waiting for release when 3.5 was launched. 4o was a fine-tune that took over two years. o1 is embarrassingly unimpressive chain of thought, which is why they hide it.
The company hit a wall a year ago. But showing progress towards AGI keeps the lights on. If they told the truth at their current burn rate…they’d have no money.
You don’t need game theory to figure that one out.
>OpenAI’s credibility is central to its business: overstating capabilities risks public blowback, loss of trust, and regulatory scrutiny.
Uh huh. Kinda like what's happening right now?
They're marketing blowhards. Everyone knows it. They've been wildly overstating capabilities (and future capabilities!) for as long as Altman has had power, and arguably longer.
They'll do it as long as they can get away with it, because that's all that is needed to make money on it. Factual accuracy rarely impacts the market when it's so hype-driven, especially when there is still some unique utility in the product.
OpenAI's apparent credibility is central to their business.
They're spruiking a 93rd percentile performance on the 2024 International Olympiad in Informatics with 10 hours of processing and 10,000 submissions per question.
Like many startups they're still a machine built to market itself.
OpenAI never actually participated in the contests directly. OpenAI says they took (unspecified) problems and "simulated" the score the model would have had.
If they solved recent contests in a realistic contest simulation I would expect them to give the actual solutions and success rates as well, like they did for IOI problems, so I'm actually confused as to why they didn't.
Very good clarification, thanks. They should absolutely release more details to provide more clarity, and ideally just participate live? I suspect the model takes a while on individual problems, so time might be a constraint there.
The modern state of training is to try to use everything they can get their hands on. Even if there are privileged channels that are guaranteed not to be used as training data, mentioning the problems on ancillary channels (say emailing another colleague to discuss the problem) can still create a risk of leakage because nobody making the decision to include the data is aware that stuff that should be excluded is in that data set. And as we've seen from decades of cybersecurity, people are absolute shit at the necessary operational security to avoid mentioning stuff on ancillary channels!
Given that performance is known to drop considerably on these kinds of tests when novel problems are tried, and given the ease with which these problems could leak into the training set somehow, it's not unreasonable to be suspicious of a sudden jump in performance as merely a sign that the problems made it into the training set rather than being true performance improvements in LLMs.
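For what it's worth, this kind of leakage is exactly what contamination checks try to catch, and it's why "we didn't train on it" is hard to verify from the outside. A rough sketch of the usual n-gram-overlap style of check (the threshold, n, and corpus here are made-up illustrative choices, not any lab's actual pipeline):

```python
# Minimal n-gram-overlap contamination check (illustrative sketch).
# Flags a benchmark problem if too many of its word n-grams appear verbatim
# in the training corpus. Real pipelines work on tokenized, deduplicated
# data at far larger scale.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(problem: str, corpus_docs: list[str],
                       n: int = 8, threshold: float = 0.3) -> bool:
    """True if more than `threshold` of the problem's n-grams show up in any
    corpus document (a crude proxy for 'the model has seen this')."""
    prob_grams = ngrams(problem, n)
    if not prob_grams:
        return False
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    overlap = len(prob_grams & corpus_grams) / len(prob_grams)
    return overlap > threshold
```

The obvious limitation is the one described above: the check only covers data you already know is in the corpus, not the ancillary channels where the leak actually happened.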
Okay, then what about elite-level Codeforces performance? Those problems weren’t even constructed until after the model was made.
The real problem with all of these theories is that most of these benchmarks were constructed after the models' training data cutoffs.
A sudden performance improvement on a new model release is not suspicious. Any model release that is much better than a previous one is going to be a “sudden jump in performance.”
Also, OpenAI is not reading your emails, and certainly not with less than a month of lead time.
o1 has a ~1650 rating; at that level, many or most of the problems you solve are transplants of relatively well-known problems.
Since o1 on Codeforces just tried hundreds or thousands of candidate solutions, it's not surprising that it can solve problems where the task is really about finding a relatively simple correspondence to a known problem and regurgitating an algorithm (see the quick calculation below).
In fact, when you run o1 on "non-standard" Codeforces problems, it will almost always fail.
So the thesis that it's about recognizing a problem with a known solution, not actually coming up with a solution yourself, seems to hold, as o1 fails even on low-rated problems that require more than fitting templates.
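To put some rough numbers on the "hundreds or thousands of solutions" point (the per-attempt success rates below are assumptions for illustration, not measured figures):

```python
# How many-attempts sampling inflates apparent solve rate.
# If a single sample solves a problem with probability p, the chance that at
# least one of k independent samples solves it is 1 - (1 - p)**k.

def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for p in (0.01, 0.05):            # assumed per-attempt success rates
    for k in (1, 100, 1000):      # number of sampled submissions
        print(f"p={p:.2f} k={k:5d} -> {pass_at_k(p, k):.2f}")
# p=0.01, k=1000 already gives ~1.00: a model that almost never solves a
# problem in one shot can still look like it "solved" it given enough tries.
```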
It's extremely unlikely that o3 hit 2700 on live contests, as such a rapid increase in rating would have been noticed by the community. I can't find anything online, including in their video, detailing how contamination was avoided (since it clearly wasn't run live), nor could I find details about the methodology (number of submissions being the big one; in contests you can also get 'hacked', especially at a high level), problem selection, etc.
Additionally, people weren't able to replicate the o1-mini results in live contests straightforwardly, often getting scores between 700 and 1200, which raises questions about the methodology.
Perhaps o3 really is that good, but I just don't see how you can claim what you claimed for o3: we have no idea whether the problems had ever been seen before, and the fact that people find much lower Elo scores with o1/o1-mini under proper methodology raises even more questions, let alone conclusively proves these are truly novel tasks it has never seen.
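For a sense of scale on that gap: Codeforces ratings are only Elo-like, but as a rough proxy you can plug the numbers into the standard Elo expected-score formula (the generic formula, nothing to do with OpenAI's methodology). A 2700 player should almost never be outperformed by a ~1650 one, and the 700–1200 replication range is further away still:

```python
# Standard Elo expected score: probability that a player rated r_a
# outperforms a player rated r_b in a head-to-head comparison.
# Codeforces ratings are only Elo-like, so treat this as a rough proxy.

def elo_expected(r_a: float, r_b: float) -> float:
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

print(f"{elo_expected(2700, 1650):.4f}")  # ~0.9976: claimed o3 level vs o1's ~1650
print(f"{elo_expected(1200, 1650):.4f}")  # ~0.0698: top of the replication range vs o1's claimed rating
```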
Sorry, I thought the whole point of this thread was that models can’t handle problems when they are “slightly varied”. Mottes and baileys all over the place today.
The point is that it's not consistent on variations unless it finds a way to connect them to something it already knows. The fact that it sometimes succeeds on variations (on Codeforces the models are allowed multiple tries, sometimes a ridiculous number of them, to be useful) doesn't matter.
The point is that it's no longer consistent once you vary the terminology, which indicates it's fitting a memorized template rather than reasoning from first principles.
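A sketch of the kind of variation being described (the renamings are hypothetical examples I made up; the point is that they're purely surface-level, so a model reasoning from first principles shouldn't care):

```python
# Surface-level perturbation of a problem statement: the terminology changes,
# the underlying task does not. Comparing solve rates on original vs.
# perturbed statements is one way to probe template-matching vs. reasoning.

import re

# Hypothetical, semantics-preserving renamings.
RENAMINGS = {
    r"\bknight\b": "courier",
    r"\bchessboard\b": "warehouse grid",
    r"\bcoins\b": "tokens",
}

def perturb(statement: str) -> str:
    out = statement
    for pattern, replacement in RENAMINGS.items():
        out = re.sub(pattern, replacement, out, flags=re.IGNORECASE)
    return out

original = "A knight moves on an N x N chessboard collecting coins."
print(perturb(original))
# -> "A courier moves on an N x N warehouse grid collecting tokens."
```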