It doesn't look like the code anonymizes usernames when sending the thread for grading. This likely induces bias in the grades based on past/current prevailing opinions of certain users. It would be interesting to see the whole thing done again but this time randomly re-assigning usernames, to assess bias, and also with procedurally generated pseudonyms, to see whether the bias can be removed that way.
I'd expect de-biasing would deflate grades for well known users.
It might also be interesting to use a search-grounded model that provides citations for its grading claims. Gemini models have access to this via their API, for example.
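Something like this is roughly what I have in mind, as a sketch (the comment dicts, the "by" field, and the two modes are placeholders, not whatever the actual pipeline uses):

    import hashlib
    import random

    # Hypothetical sketch: replace real usernames with stable pseudonyms before a
    # thread is sent off for grading. Comment format and field names are assumptions.
    ADJECTIVES = ["quiet", "amber", "brisk", "mossy", "vivid"]
    NOUNS = ["falcon", "lattice", "harbor", "quasar", "fjord"]

    def pseudonym(username: str, salt: str = "run-1") -> str:
        """Deterministically map a username to a procedurally generated pseudonym."""
        h = int(hashlib.sha256((salt + username).encode()).hexdigest(), 16)
        return f"{ADJECTIVES[h % len(ADJECTIVES)]}_{NOUNS[(h // 7) % len(NOUNS)]}_{h % 1000}"

    def anonymize_thread(comments: list[dict], mode: str = "pseudonym") -> list[dict]:
        """mode='pseudonym' hides identity entirely; mode='shuffle' keeps the real
        usernames but randomly re-assigns them across commenters, which is the
        version that would let you measure reputation bias directly."""
        users = sorted({c["by"] for c in comments})
        if mode == "shuffle":
            shuffled = users[:]
            random.shuffle(shuffled)
            mapping = dict(zip(users, shuffled))
        else:
            mapping = {u: pseudonym(u) for u in users}
        return [{**c, "by": mapping[c["by"]]} for c in comments]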
What a human-like criticism of human-like behavior.
I [as a human] also do the same thing when observing others in real-life and forum interactions. Reputation matters™
----
A further question is whether a bespoke username could influence the bias of a particular comment (e.g. A username of something like HatesPython might influence the interpretation of that commenter's particular perception of the Python coding language, which might actually be expressing positivity — the username's irony lost to the AI?).
This doesn't even seem to look at "predictions" if you dig into what it actually did. Looking at my own example (#210 on https://karpathy.ai/hncapsule/hall-of-fame.html with 4 comments), very little of what I said could be construed as "predictions" at all.
I got an A for commenting on DF saying that I had not personally seen save corruption and listing weird bugs. It's true that weird bugs have long been a defining feature of DF, but I didn't predict it would remain that way or say that save corruption would never be a big thing, just that I hadn't personally seen it.
Another A for a comment on Google wallet just pointing out that users are already bad at knowing what links to trust. Sure, that's still true (and probably will remain true until something fundamental changes), but it was at best half a prediction as it wasn't forward looking.
Then something on hospital airships from the 1930s. I pointed out that one could escape pollution, I never said I thought it would be a big thing. Airships haven't really ever been much of a thing, except in fiction. Maybe that could change someday, but I kinda doubt it.
Then lastly there was the design patent famously referred to as the "rounded corner" patent. It dings me for simplifying it to that label, despite my actual statements being that yes, there's more, but just minor details like that can be sufficient for infringement. But the LLM says I'm right about ties to the Samsung case and still oversimplifying it. Either way, none of this was really a prediction to begin with.
They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.
There are also plenty of reasons not to use proprietary US models for comparison:
The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.
A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.
Scale AI wrote a paper a year ago comparing various models' performance on benchmarks to their performance on similar but held-out questions. Generally the closed source models performed better, and Mistral came out looking pretty bad: https://arxiv.org/pdf/2405.00332
??? Closed US frontier models are vastly more effective than anything OSS right now; the reason they didn't compare is that they're a different weight class (and therefore product), and it's a bit unfair.
We're actually at a unique point right now where the gap is larger than it has been in some time. Consensus since the latest batch of releases is that we haven't found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models, and unless you have unique requirements somewhere down the price/perf curve, I would not even look at this release (which is fine!)
You can swap experts in and out of VRAM, it just increases inference time substantially.
Depending on the routing function you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
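As a rough single-token sketch in PyTorch (assuming a plain top-k softmax router and CUDA; the class and all the names here are made up, not any inference engine's actual implementation):

    import torch
    import torch.nn as nn

    class StreamedExperts(nn.Module):
        def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
            super().__init__()
            self.top_k = top_k
            # The cheap router stays resident on the GPU.
            self.router = nn.Linear(d_model, n_experts).cuda()
            # Expert weights live in pinned host memory so copies can be async.
            self.cpu_weights = [
                torch.randn(d_model, d_model).pin_memory() for _ in range(n_experts)
            ]
            self.copy_stream = torch.cuda.Stream()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # 1. Routing only needs one small matmul, so the active experts are
            #    known before any expert compute happens.
            ids = self.router(x).topk(self.top_k, dim=-1).indices[0].tolist()
            # 2. Kick off async host-to-device copies on a side stream; in a real
            #    pipeline this would overlap with attention / earlier layers.
            gpu_w = {}
            with torch.cuda.stream(self.copy_stream):
                for i in ids:
                    gpu_w[i] = self.cpu_weights[i].to("cuda", non_blocking=True)
            # 3. Wait for the copies, then run only the selected experts.
            torch.cuda.current_stream().wait_stream(self.copy_stream)
            out = torch.zeros_like(x)
            for i in ids:
                out = out + x @ gpu_w[i]
            return out / self.top_k

    # Single-token usage:
    # moe = StreamedExperts(1024, 64)
    # y = moe(torch.randn(1, 1024, device="cuda"))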
He's already floating the idea of a third term, and the house is considering a constitutional amendment that would allow it.
Of course, that'll be a moot point if he continues to just ignore the constitution as he has been so far this entire term, and the other two branches continue to just let him.
Do you understand how long it takes, and what it takes, to ratify an amendment? There's a reason we haven't done one in 33 years, and that one took 202 years. The process is designed to be difficult; it's much more than a simple majority and the bang of a gavel.
We are still working on approving the Equal Rights Amendment. That's one that started 102 years ago, and we have been trying to get agreement from 3/4 of the states for only 53 years.
So no, I seriously doubt with a 50/50 divided electorate in this country that we will repeal the 22nd amendment in the less than 4 years that the US would have to do it before Trump could run again.
Not a fan of censorship here, but Chinese models are (subjectively) less propagandized than US models. If you ask US models about China, for instance, they'll tend towards the antagonistic perspective favored by US media. Chinese models typically seem to take a more moderate, considered tone when discussing similar subjects. US models also suffer from safety-based censorship, especially blatant when "safety" involves protection of corporate resources (eg. not helping the user to download YouTube videos).
I asked DeepSeek "tell me about China" and it responded "Sorry, I'm not sure how to approach this type of question yet. Let's chat about math, coding, and logic problems instead!"
I guess that is propaganda-free! Unfortunately also free of any other information. It's hard for me to evaluate your claim of "moderate, considered tone" when it won't speak a single word about the country.
It was happy to tell me about any other country I asked.
The 'safety' stuff should really be adjustable. The only plausible explanation for how extreme it is in LLMs is that the corporations paying for it want to keep things kosher in the workplace, so let them control how aggressive it is.
DeepSeek was built on the foundations of public research, a major part of which is the Llama family of models. Prior to Llama, open weights LLMs were considerably less performant; without Llama we might not have gotten Mistral, Qwen, or DeepSeek. This isn't meant to diminish DeepSeek's contributions, however: they've been doing great work on mixture of experts models and really pushing the community forward on that front. And, obviously, they've achieved incredible performance.
Llama models are also still best in class for specific tasks that require local data processing. They also maintain positions in the top 25 of the lmarena leaderboard (for what that's worth these days with suspected gaming of the platform), which places them in competition with some of the best models in the world.
But, going back to my first point, Llama set the stage for almost all open weights models after. They spent millions on training runs whose artifacts will never see the light of day, testing theories that are too expensive for smaller players to contemplate exploring.
Pegging Llama as mediocre, or a waste of money (as implied elsewhere), feels incredibly myopic.
As far as I know, Llama's architecture has always been quite conservative: it hasn't changed that much since the original LLaMA. Most of their recent gains have been in post-training.
That's not to say their work is unimpressive or not worthy - as you say, they've facilitated much of the open-source ecosystem and have been an enabling factor for many - but it's more that that work has been in making it accessible, not necessarily pushing the frontier of what's actually possible, and DeepSeek has shown us what's possible when you do the latter.
I never said Llama is mediocre. I said the teams they put together are full of people chasing money, and the billions Meta is burning are going straight to mediocrity. They're bloated. And we know exactly why Meta is doing this and it's not because they have some grand scheme to build up AI. It's to keep these people away from their competition. Same with the billions in GPU spend: they want to suck up resources away from the competition. That's their entire plan. Do you really think Zuck has any clue about AI? He was never serious and instead built wonky VR prototypes.
> And we know exactly why Meta is doing this and it’s not because they have some grand scheme to build up AI. It’s to keep these people away from their competition
I don't see how you can confidently say this when AI researchers and engineers are remunerated very well across the board and people are moving across companies all the time. If the plan is as you described it, it is clearly not working.
Zuckerberg seems confident they'll have an AI-equivalent of a mid-level engineer later this year, can you imagine how much money Meta can save by replacing a fraction of its (well-paid) engineers with fixed Capex + electric bill?
In contrast to the social media industry (or word processors or mobile phones), the market for AI solutions doesn't seem to have an inherent moat or network effects that keep users stuck with the market leader.
Rather, with AI, capitalism seems to be working at its best, with competitors to OpenAI building solutions that take market share and improve products. Zuck can try monopoly plays all day, but I don't think this will work this time.
It's always very interesting to see people pull out threads with low like counts (like 12k) and claim that the central idea of the post is widely held.
We're talking about platforms with tens of millions of users; wide appeal is at least a quarter million likes, with mass appeal being at least a million. A local-scale influencer can gather 10-30k likes very easily on such a massive platform.
Do you disagree, then, that this is a sentiment widely reflected within Chinese social media? I simply gave one example for brevity; other answers are similar, and I would encourage people to actually go in and read them themselves.
> It's always very interesting to see people pull out threads with low like counts (like 12k) and claim that the central idea of the post is widely held.
In what context is 12k likes a low amount? To me this is reminiscent of arguments I heard from neocons that the global anti-Iraq war protests, the largest coordinated global protests in history at the time, counted as "small" if you considered them as a percentage of the global population.
I think it's the opposite, that such activities are tips of the proverbial iceberg of more broadly shared sentiment.
It would be one thing if there were all kinds of sentiments in all directions with roughly evenly distributed #'s of likes. I'm open to the idea that some aspect of context could be argued to diminish the significance, but it wouldn't be that 12k likes, in context, is a negligible amount. It would be something else like its relative popularity compared to alternative views, or some compelling argument that this is a one-off happenstance and not a broadly shared sentiment.
The LSP is limited in scope and doesn't provide access to things like the AST (which can vary by language). If you want to navigate by symbols, that can be done. If you want to know whether a given import is valid, to verify LLM output, that's not possible.
Similarly, you can't use the LSP to determine all valid in-scope objects for an assignment. You can get a hierarchy of symbol information from some servers, allowing selection of particular lexical scopes within the file, but you'll need to perform type analysis yourself to determine which of the available variables could make for a reasonable completion. That type analysis is also a bit tricky because you'll likely need a lot of information about the type hierarchy at that lexical scope-- something you can't get from the LSP.
It might be feasible to edit an open source LSP implementation for your target language to expose the extra information you'd want, but they're relatively heavy pieces of software and, of course, they don't exist for all languages. Compared to the development cost of "just" using embeddings-- it's pretty clear why teams choose embeddings.
Also, if you assume that the performance improvements we've seen in embeddings for retrieval will continue, it makes less sense to invest weeks of time on something that would otherwise improve passively with time.
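Going back to the import example: here's the kind of check the LSP won't do for you, done per-language instead. A Python-only sketch using ast plus importlib, which is exactly why it doesn't generalize:

    import ast
    import importlib.util

    # Walk the AST of some LLM-generated Python and ask whether each imported
    # top-level module actually resolves in the current environment.
    def unresolved_imports(source: str) -> list[str]:
        """Return imported module names in `source` that importlib can't find."""
        tree = ast.parse(source)
        missing = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names = [node.module]
            else:
                continue
            for name in names:
                root = name.split(".")[0]
                if importlib.util.find_spec(root) is None:
                    missing.append(name)
        return missing

    print(unresolved_imports("import os\nimport totally_made_up_pkg\n"))
    # ['totally_made_up_pkg']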
> The LSP is limited in scope and doesn't provide access to things like the AST (which can vary by language).
Clangd does, which means we could try this out for C++.
There's also tree-sitter, but I assume that's table stakes nowadays. For example, Aider uses it to generate project context ("repo maps")[0].
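For a taste of what the tree-sitter route looks like, a rough single-file sketch (this assumes the tree-sitter and tree-sitter-python packages; older py-tree-sitter versions want parser.set_language(...) instead of Parser(language), and this is not Aider's actual repo-map code):

    import tree_sitter_python as tspython
    from tree_sitter import Language, Parser

    # Pull out top-level class/function names with line numbers, the raw
    # material for an Aider-style "repo map" of one Python file.
    def file_outline(source: bytes) -> list[str]:
        parser = Parser(Language(tspython.language()))
        tree = parser.parse(source)
        outline = []
        for node in tree.root_node.children:
            if node.type in ("function_definition", "class_definition"):
                name = node.child_by_field_name("name").text.decode()
                kind = "def" if node.type == "function_definition" else "class"
                outline.append(f"{node.start_point[0] + 1}: {kind} {name}")
        return outline

    print("\n".join(file_outline(b"class Foo:\n    pass\n\ndef bar():\n    pass\n")))
    # 1: class Foo
    # 4: def bar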
> If you want to know whether a given import is valid, to verify LLM output, that's not possible.
That's not the biggest problem to be solved, arguably. A wrong import in otherwise correct-ish code is mechanically correctable, even if by user pressing a shortcut in their IDE/LSP-powered editor. We're deep into early R&D here, perfect is the enemy of the good at this stage.
> Similarly, you can't use the LSP to determine all valid in-scope objects for an assignment. You can get a hierarchy of symbol information from some servers, allowing selection of particular lexical scopes within the file, but you'll need to perform type analysis yourself to determine which of the available variables could make for a reasonable completion.
What about asking an LLM? It's not 100% reliable, of course (again: perfect vs. good), but LLMs can guess things that aren't locally obvious even in AST. Like, e.g. "two functions in the current file assign to this_thread::ctx().foo; perhaps this_thread is in global scope, or otherwise accessible to the function I'm working on right now".
I do imagine Cursor et al. are experimenting with ad-hoc approaches like that. I know I would; LLMs are cheap enough and fast enough that asking them to build their own context makes sense, if it saves on the number of times they get the task wrong and require back-and-forth, reverts, and prompt tweaking.
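Something like this, as a sketch (the client library, model name, and prompt are all placeholders, not anyone's actual pipeline):

    from openai import OpenAI

    # Ask a cheap model which identifiers in the current file look like they are
    # defined elsewhere; those guesses can then be resolved via grep or LSP
    # workspace-symbol lookups and pasted into the real completion prompt.
    client = OpenAI()

    def guess_external_symbols(file_text: str) -> list[str]:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder: any cheap, fast model
            messages=[
                {"role": "system",
                 "content": "List identifiers referenced in this file that appear "
                            "to be defined elsewhere in the project. One per line, "
                            "no commentary."},
                {"role": "user", "content": file_text},
            ],
        )
        return [s.strip() for s in resp.choices[0].message.content.splitlines() if s.strip()]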
Google Trends makes it seem like we're out of the exponential growth phase for LLMs-- search interest is possibly plateauing.
A decline in search interest outside of academia makes sense. The groups who can get by on APIs don't care so much how the sausage is made and just want to see prices come down. Interested parties have likely already found tools that work for them.
There's definitely some academic interest outside of CS in producing tools using LLMs. I know plenty of astro folks working to build domain specific tools with open models as their backbone. They're typically not interested in more operational work, I guess because they operate under the assumption that relevant optimizations will eventually make their way into public inference engines.
And CS interest in these models will probably sustain for at least 5-10 more years, even if performance plateaus, as work continues into how LLMs function.
All that to say, maybe we're just seeing the trend die for laypeople?
Well, Google Search trends are also only an imperfect proxy for what we are actually interested in.
Eg tap water is really, really useful and widely deployed. Approximately every household is a user, and that's unlikely to change. But I doubt you'll find much evidence of that in Google Search trends.
well, gary marcus, a non-layperson, is helping spread the word that ai winter is again upon us.
but maybe statistical learning from pretraining is near its limit. not enough data, or not enough juice to squeeze more performance out of averages.
though with all the narrow ais it does seem plausible you might be able to cram everything these narrow ais can do into one big goliath model. wonder if reinforcement learning and reasoning can manage to keep the exponential curve of ai going even if there are hiccups in the short term.
the difficulty of just shoehorning llms, as they are, into any and every day-to-day task without a hitch might be behind the temporary dying-down of the hype.
But "Large language model" as a topic in google trends is still in its peak. Maybe just everyone who would be the audience is already knowledgeable about LLMs so why would Google Search trends be able to keep rising?
ChatGPT is at it's peak, and something like Claude is still rising.
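If anyone wants to eyeball this themselves, something like pytrends works as a rough sketch (it's an unofficial Google Trends client, can get rate-limited, and plain keywords aren't the same thing as the curated "topic" entities in the Trends UI):

    from pytrends.request import TrendReq

    # Pull five years of weekly interest for a few terms and compare the last
    # quarter's mean against the previous one to see what still looks like growth.
    terms = ["ChatGPT", "Claude", "large language model"]
    pytrends = TrendReq(hl="en-US", tz=0)
    pytrends.build_payload(terms, timeframe="today 5-y")
    interest = pytrends.interest_over_time()

    recent, previous = interest.iloc[-13:], interest.iloc[-26:-13]
    for term in terms:
        ratio = recent[term].mean() / max(previous[term].mean(), 1)
        print(term, round(ratio, 2))  # >1 still rising, ~1 plateau, <1 declining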