Because it impacts me, and I don't want it to impact me anymore?
Not because I use these products, but because I have to live in a society with these people, and if they are unhappy and angry, that impacts me directly, through various second-order effects.
Well, "Orchestrate" and "Steer" are verbs, while "Harness" is a noun. You need a noun here, not a verb, because the harness is not actively doing anything, it's just a set of constraints and a toolset.
Suggestion: run the identical prompt N times (2 identical calls to Gemini 3.0 Pro + 2 identical calls to GPT 5.2 Thinking), then run some basic text post-processing to see where the 4 responses agree vs. disagree. The disagreements (substrings that aren't identical matches) are where scrutiny is needed, but if all 4 agree on some substring, it's almost certainly a correct transcription. Wouldn't be too hard to get Codex to vibe-code all this.
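A rough sketch of the agreement check (the four API calls are omitted; difflib is just one cheap way to find agreeing substrings, and the length threshold is an arbitrary choice):

    import difflib

    # `responses` holds the 4 transcriptions of the identical prompt
    # (2x Gemini 3.0 Pro + 2x GPT 5.2 Thinking); API plumbing omitted.
    def trusted_spans(responses: list[str], min_len: int = 10) -> set[str]:
        base = responses[0]
        spans = set()
        for other in responses[1:]:
            sm = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
            for m in sm.get_matching_blocks():
                s = base[m.a:m.a + m.size].strip()
                if len(s) >= min_len:
                    spans.add(s)
        # Keep a span only if it appears verbatim in all 4 responses;
        # everything outside these spans is where scrutiny is needed.
        return {s for s in spans if all(s in r for r in responses)}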
"I already decided in my private reasoning trace to resolve this ambiguity by emitting the string '27' instead of '22' right here, thus '27' has 100% probability"
That's an intentional trade-off in the name of latency. We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:
(A) Massively parallel (optimize for tokens/$)
(B) Serial, low latency (optimize for tokens/s)
Users will switch between A and B depending on need.
An example of (A):
- "Search this 1M-line codebase for DRY violations subject to $spec."
Examples of (B):
- "Diagnose this one specific bug."
- "Apply this diff."
(B) is used in funnels to unblock (A). (A) is optimized for cost and bandwidth, (B) is optimized for latency.
It's very hard to tell the difference between bad models and stinginess with compute.
I subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo).
If I give the same question to "Gemini 3.0 Pro" and "ChatGPT 5.2 Thinking + Heavy thinking", the latter is 4x slower and gives smarter answers.
I shouldn't have to enumerate all the plausible explanations for this observation. Anything from Gemini deciding to nerf its reasoning effort to save compute, to TPUs being faster, to Gemini simply being worse, to this being my idiosyncratic experience fits the same data, and all of them are plausible.
You nailed it. Gemini 3 Pro seems very "lazy" and seems to never reason for more than 30 seconds, which significantly impacts the quality of its outputs.
Five days is by no means short for an AI! If it can solve a problem, it would need perhaps 1-2 hours. If it can't, 5 days of continuous running would produce only gibberish. We can safely assume that such private models will run inference entirely on dedicated hardware, shared with nobody, so if they can't solve the problems, it's not due to any artificial constraint or lack of resources, far from it.
The 5-day window, however, is a sweet spot, because it likely prevents cheating by hiring a math PhD to feed the AI hints and ideas.
That's not really how it works. The recent Erdős proofs in Lean were done by a specialized proprietary model (Aristotle, by Harmonic) that's specifically trained for this task. Normal agents are not effective.
Why did you omit the other AI-generated Erdős proofs not done by a proprietary model, which played out over timescales significantly longer than 5 days?
Those were not really "proofs" by the standard of 1stproof. The only way an AI can possibly convince an unsympathetic peer reviewer that its proof is correct is to write it completely in a formal system like Lean. The so-called "proofs" done with GPT were half-baked and required significant human input (hints, fixing after the fact, etc.), which is enough to disqualify them from this effort.
That wasn't my recollection. The individual who generated one of the proofs did a write-up of his methodology, and it didn't involve a human correcting the model.
They could do it this way: generate 10 reasoning traces in parallel, and every N tokens prune the 9 with the lowest likelihood and continue from the highest-likelihood trace.
This is a form of task-agnostic test-time search that is more general than multi-agent parallel-prompt harnesses.
10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token.
That's something you can't replicate without access to the network's output before token sampling.
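One reading of that scheme, as a sketch: re-fork from the surviving trace every N tokens. The `sample_chunk` helper here is hypothetical, and it's exactly the part you can't back with ChatGPT Pro, since its reasoning tokens and their logprobs are hidden; an open-weights model or any logprobs-returning API would do:

    def sample_chunk(text: str, chunk: int) -> tuple[str, float]:
        # Hypothetical: sample `chunk` tokens continuing `text`, return the
        # continuation plus the sum of its token logprobs.
        raise NotImplementedError

    def pruned_search(prompt: str, n_traces: int = 10,
                      chunk: int = 128, rounds: int = 20) -> str:
        trace = prompt
        for _ in range(rounds):
            # Fork n_traces continuations, then prune all but the most likely.
            # All candidates share the same prefix, so comparing continuation
            # logprobs alone is equivalent to comparing full-sequence likelihood.
            candidates = [sample_chunk(trace, chunk) for _ in range(n_traces)]
            best, _ = max(candidates, key=lambda c: c[1])
            trace += best
        return trace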