
As an inference-hungry human, I am obviously hooked. Quick feedback:

1. The models/pricing page should perhaps be linked from the top, as that is the most interesting part for most users. You mention some impressive numbers (e.g. GLM5 ~220 tok/s, $1.20 in · $3.50 out), but those are way down the page and many would miss them.

2. When looking for inference, I always look at three things: which models are supported, at which quantization, and what the cached input pricing is (this matters way more than headline pricing for agentic loops; see the rough sketch below). You have info about the first on the site, but not the second or third. Would definitely like to know!
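
A back-of-the-envelope for why cached input dominates in agentic loops, using the $1.20/$3.50 figures above; the cache rate (10% of the input price) is a made-up placeholder, not your actual pricing:

    def loop_cost(turns, ctx, per_turn_out, in_price, out_price, cached_in_price=None):
        # Toy model of an agentic loop: every turn re-sends the original context
        # plus everything generated so far; prices are dollars per 1M tokens.
        total = 0.0
        for t in range(turns):
            prompt = ctx + t * per_turn_out
            if cached_in_price is None or t == 0:
                total += prompt * in_price / 1e6
            else:
                cached = ctx + (t - 1) * per_turn_out  # prefix already sent last turn
                fresh = prompt - cached                # only the newest tokens are uncached
                total += cached * cached_in_price / 1e6 + fresh * in_price / 1e6
            total += per_turn_out * out_price / 1e6
        return total

    # 50-turn loop, 30K-token context, 1K tokens generated per turn
    print(loop_cost(50, 30_000, 1_000, 1.20, 3.50))                        # roughly $3.4 with no caching
    print(loop_cost(50, 30_000, 1_000, 1.20, 3.50, cached_in_price=0.12))  # roughly $0.6 with caching

Under those (hypothetical) assumptions, the cache rate moves the bill far more than the headline input price does.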


Thank you for the feedback! I think we will definitely redo the front page to reorganize the info and show quantizations better. For reference, Kimi and Minimax are NVFP4. The rest are FP8. But I will make this more obvious on the site itself.

I love the phrase "inference hunger"

Even if people try to bypass it, having the official rule matters a lot.

@dang, if you read this, why don't we implement honeypots to catch bots? Like having an empty or invisible field in the posting/commenting form that a human would never fill in.
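
Something like this minimal sketch; the field name and the CSS trick are just illustrative, not a specific proposal for HN's form:

    HONEYPOT_FIELD = "website"  # hypothetical field name, rendered with display:none in the form

    FORM_SNIPPET = (
        '<input type="text" name="website" '
        'style="display:none" tabindex="-1" autocomplete="off">'
    )

    def looks_like_bot(form_data: dict) -> bool:
        # Humans never see (or fill) the hidden field; naive bots that auto-fill
        # every input will, so a non-empty value is a strong spam signal.
        return bool(form_data.get(HONEYPOT_FIELD, "").strip())

    assert not looks_like_bot({"text": "great post!", "website": ""})
    assert looks_like_bot({"text": "buy now", "website": "http://spam.example"})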


It's likely going to be a game of whack-a-mole, especially with AI as opposed to simple bots/scripts. Not that they shouldn't try to prevent it, but I'm not entirely sure what the solution is.

There's probably no solution, but at least this gives a reason to go after the lowest-hanging fruit: the zero-effort, obvious, low-quality output.

I imagine that would cause a backlash from the website owners trusting Cloudflare to keep their content 'safe'.

That's a 150% increase in input costs and a 275% increase in output costs over the same-sized previous-generation model (2.5-flash-lite).


It is probably the first-time aha moment the author is talking about. But under the hood, it is probably not as magical as it appears to be.

Suppose you prompted the underlying LLM with "You are an expert reviewer in..." and a bunch of instructions followed by the paper. The LLM knows from training that 'expert reviewer' is an important term (skipping over and oversimplifying here) and that its response should be framed as what an expert reviewer would write. LLMs are good at picking up (or copying) the patterns of a response, but the underlying layer that evaluates things against a structural and logical understanding is missing. So, in corner cases, you get responses that are framed impressively but do not contain any meaningful input. This trait makes LLMs great at demos but weak at consistently finding novel, interesting things.

If the above is true, the author will find after several reviews that the agent they use keeps picking up on the same or similar things (collapsed behavior that makes it good at coding-type tasks) and is blind to some other obvious things it should have picked up on. This is not a criticism; many humans are often just as collapsed in their 'reasoning'.

LLMs are good at 8 out of 10 tasks, but you don't know which 8.


In your model, explain the old trick "think step by step"


It simply forces the model to adopt an output style known to be conducive to systematic thinking, without actually thinking. At no point has it thought the thing through (unless there are separate thinking tokens).


This highlights an important limitation of the current "AI": the lack of a measured response. The bot decides to do something based on something the LLM saw in the training data, then quickly U-turns on it (check the post from some hours later: https://crabby-rathbun.github.io/mjrathbun-website/blog/post...), because none of those acts come from an internal world-model or grounded reasoning; it is bot see, bot do.

I am sure all of us have had anecdotal experiences where you ask the agent to do something high-stakes and it starts acting haphazardly, in a manner no human ever would. This is what makes me think the current wave of AI is task automation more than measured, appropriate reaction, perhaps because most measured reactions happen as a mental process and are not part of the training data.


I think what you're getting at is basically the idea that LLMs will never be "intelligent" in any meaningful sense of the word. They're extremely effective token prediction algorithms, and they seem to be confirming that intelligence isn't dependent solely on predicting the next token.

Lacking measured responses is much the same as lacking consistent principles or defining one's own goals. Those are all fundamentally different from predicting what comes next in a chain of context a few thousand, or even a million, tokens long.


Indeed. One could argue that LLMs will keep on improving, and they would be correct. But they would not improve in ways that make them a good independent agent safe for the real world. Richard Sutton got a lot of disagreeing comments when he said on the Dwarkesh Patel podcast that LLMs are not bitter-lesson (https://en.wikipedia.org/wiki/Bitter_lesson) pilled. I believe he is right. His argument is that any technique that relies on human-generated data is bound to have limitations and issues that get harder and harder to maintain/scale over time (as opposed to bitter-lesson-pilled approaches that learn truly first-hand from feedback).


I disagree with Sutton that a main issue is using human-generated data. We humans are trained on that and we don't run into such issues.

I expect the problem is more structural to how LLMs, and other ML approaches, actually work. Being disembodied algorithms that try to break all knowledge down to a complex web of probabilities, and then predict based only on that quantified data, seems hugely limiting and at odds with how human intelligence seems to work.


Sutton actually argues that we do not train on data, we train on experiences. We try things, see what works when and where, and formulate views based on that. But I agree with your later point that training that way is hugely limiting, a limit not faced by humans.


[flagged]


Someone arguing that LLMs will keep improving may be putting too much weight behind expecting a trend to continue, but that wouldn't make them a gullible sucker.

I'd argue that LLMs have gotten noticeably better at certain tasks every 6-12 months for the last few years. The idea that we are at the exact point where that trend stops and they get no better seems harder to believe.


One recent link on HN said that they double in quality every 7 months (kind of like Moore's Law). I wouldn't expect that to go on forever! I will admit that AI images aren't putting in 6 fingers, and AI code generation has suddenly gotten a lot better for me since I got access to Claude.

I think we're at a point where the only thing we can reliably predict is that some kind of change will happen. (And that we'll laugh at the people who behave like AI is the 2nd coming of Jesus.)


The fact that the post singled out SWE-bench at the top makes the opposite impression that they probably intended.


do say more


Makes it sound like a one trick pony


Anthropic is leaning into agentic coding, and heavily so. It makes sense to use SWE-bench Verified as their main benchmark. It is also the one benchmark Google did not get the top spot on last week. Claude remains king; that's all that matters here.


I am eagerly awaiting swe-rebench results for November with all the new models: https://swe-rebench.com/


well, it's a big trick


And of course they hiked the API prices:

Standard Context (≤ 200K tokens)

Input: $2.00 vs $1.25 (Gemini 3 Pro input is 60% more expensive than 2.5)

Output: $12.00 vs $10.00 (Gemini 3 Pro output is 20% more expensive than 2.5)

Long Context (> 200K tokens)

Input: $4.00 vs $2.50 (same +60%)

Output: $18.00 vs $15.00 (same +20%)
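
A quick check of those deltas from the per-1M-token prices above:

    old = {"in": 1.25, "out": 10.00, "long_in": 2.50, "long_out": 15.00}
    new = {"in": 2.00, "out": 12.00, "long_in": 4.00, "long_out": 18.00}
    for k in old:
        print(k, f"+{(new[k] - old[k]) / old[k]:.0%}")  # in +60%, out +20%, long_in +60%, long_out +20%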


Claude Opus is $15 input, $75 output.


If the model solves your needs in fewer prompts, it costs less.


Is it the first time long context has separate pricing? I hadn’t encountered that yet


Anthropic is also doing this for long context (>= 200K tokens) on Sonnet 4.5.


Google has been doing that for a while.


Google has always done this.


OK wow, then I've always overlooked that.


I watch tons of long-form educational content on YouTube and entirely ignore Shorts.


Also, the previous comment omitted that the now-deleted tweet from Bubeck begins with "Science revolution via AI has officially begun...".

