Because it impacts me, and I don't want it to impact me anymore?
Not because I use these products, but because I have to live in a society with these people, and if they are unhappy and angry, that impacts me directly, through various second-order effects.
Well, "Orchestrate" and "Steer" are verbs, while "Harness" is a noun. You need a noun here, not a verb, because the harness is not actively doing anything, it's just a set of constraints and a toolset.
Suggestion: run the identical prompt N times (2 identical calls to Gemini 3.0 Pro + 2 identical calls to GPT 5.2 Thinking), then run some basic text post-processing to see where the 4 responses agree vs. disagree. The disagreements (substrings that aren't identical matches) are where scrutiny is needed, but if all 4 agree on some substring, it's almost certainly a correct transcription. Wouldn't be too hard to get Codex to vibe-code all this.
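A rough sketch of the agreement check (the four API calls are omitted; difflib is just one cheap way to find agreeing substrings, and the length threshold is an arbitrary choice):

    import difflib

    # `responses` holds the 4 transcriptions of the identical prompt
    # (2x Gemini 3.0 Pro + 2x GPT 5.2 Thinking); API plumbing omitted.
    def trusted_spans(responses: list[str], min_len: int = 10) -> set[str]:
        base = responses[0]
        spans = set()
        for other in responses[1:]:
            sm = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
            for m in sm.get_matching_blocks():
                s = base[m.a:m.a + m.size].strip()
                if len(s) >= min_len:
                    spans.add(s)
        # Keep a span only if it appears verbatim in all 4 responses;
        # everything outside these spans is where scrutiny is needed.
        return {s for s in spans if all(s in r for r in responses)}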
"I already decided in my private reasoning trace to resolve this ambiguity by emitting the string '27' instead of '22' right here, thus '27' has 100% probability"
That's an intentional trade-off in the name of latency. We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:
(A) Massively parallel (optimize for tokens/$)
(B) Serial, low latency (optimize for tokens/s)
Users will switch between A and B depending on need.
An example of (A):
- "Search this 1M-line codebase for DRY violations subject to $spec."
Examples of (B):
- "Diagnose this one specific bug."
- "Apply this diff."
(B) is used in funnels to unblock (A). (A) is optimized for cost and bandwidth, (B) is optimized for latency.
It's very hard to tell the difference between bad models and stinginess with compute.
I subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo).
If I give the same question to "Gemini 3.0 Pro" and "ChatGPT 5.2 Thinking + Heavy thinking", the latter is 4x slower and gives smarter answers.
I shouldn't have to enumerate all the plausible explanations for this observation. Anything from Gemini deciding to nerf its reasoning effort to save compute, to TPUs being faster, to Gemini simply being worse, to this being my idiosyncratic experience fits the same data, and all of them are plausible.
You nailed it. Gemini 3 Pro seems very "lazy" and seems to never reason for more than 30 seconds, which significantly impacts the quality of its outputs.
Five days is by no means short for an AI! If it can solve a problem, it would need perhaps 1-2 hours. If it can't, 5 days of continuous running would produce only gibberish. We can safely assume that such private models will run inference entirely on dedicated hardware, shared with nobody, so if they can't solve the problems, it's not due to any artificial constraint or lack of resources, far from it.
The 5-day window, however, is a sweet spot, because it likely prevents cheating by hiring a math PhD to feed the AI hints and ideas.
That's not really how it works. The recent Erdős proofs in Lean were done by a specialized proprietary model (Aristotle, by Harmonic) that's specifically trained for this task. Normal agents are not effective.
Why did you omit the other AI-generated Erdős proofs not done by a proprietary model, which played out over timescales significantly longer than 5 days?
Those were not really "proofs" by the standard of 1stproof. The only way an AI can possibly convince an unsympathetic peer reviewer that its proof is correct is to write it completely in a formal system like Lean. The so-called "proofs" done with GPT were half-baked and required significant human input (hints, fixing after the fact, etc.), which is enough to disqualify them from this effort.
That wasn't my recollection. The individual who generated one of the proofs did a write-up of his methodology, and it didn't involve a human correcting the model.
They could do it this way: generate 10 reasoning traces in parallel, and every N tokens prune the 9 with the lowest likelihood and continue from the highest-likelihood trace.
This is a form of task-agnostic test-time search that is more general than multi-agent parallel-prompt harnesses.
10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token.
That's something you can't replicate without access to the network's output before token sampling.
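One reading of that scheme, as a sketch: re-fork from the surviving trace every N tokens. The `sample_chunk` helper here is hypothetical, and it's exactly the part you can't back with ChatGPT Pro, since its reasoning tokens and their logprobs are hidden; an open-weights model or any logprobs-returning API would do:

    def sample_chunk(text: str, chunk: int) -> tuple[str, float]:
        # Hypothetical: sample `chunk` tokens continuing `text`, return the
        # continuation plus the sum of its token logprobs.
        raise NotImplementedError

    def pruned_search(prompt: str, n_traces: int = 10,
                      chunk: int = 128, rounds: int = 20) -> str:
        trace = prompt
        for _ in range(rounds):
            # Fork n_traces continuations, then prune all but the most likely.
            # All candidates share the same prefix, so comparing continuation
            # logprobs alone is equivalent to comparing full-sequence likelihood.
            candidates = [sample_chunk(trace, chunk) for _ in range(n_traces)]
            best, _ = max(candidates, key=lambda c: c[1])
            trace += best
        return trace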