It's not that cynical when you consider that corporations exist precisely to shield owners and leadership from legal and (to a lesser extent) monetary responsibility.
Evil confers an individual advantage. Pro-social behavior confers a group advantage. That's why sociopaths continue to walk among us. Society can tolerate a few of them, but only up to a point.
Evolution works on the level of the reproducing organism, i.e. the individual.
Google "group selection" if you'd like to go down a deep rabbit hole, but the upshot is: if pro-social behavior did not confer an individual advantage, the individuals who lose the trait would outcompete their conspecifics and the pro-social trait would not be fixed in the population.
This is why, alongside the pro-social traits themselves, you usually see additional stabilizing mechanisms that suppress free-loading, even in very simple examples of pro-social behavior such as bacteria collaboratively building biofilms.
The genes coding for the biofilm are usually carried on transmissible plasmids, making it possible for one individual to re-infect another that has lost them.
You might consider the justice system, police etc. as analogous to that.
So yes, in the case where you're part of a functioning society and free-loading on the pro-social behavior of others, that is temporarily beneficial to you - until the stabilizing mechanisms kick in.
I'm not saying in practice you can never get away with anything; of course you can. But on average you can't; we wouldn't be a social species otherwise.
In your Durkheimian analogy, sociopaths are cancer, and while the body usually handles one-off rogue cells, it often fails when tumors and, eventually, metastases develop.
That can happen, sure, but the cancer's strategy is not a winning one - it dies along with the host.
Again, I'm not arguing for some naive Panglossian view. Things can get pretty bad transiently.
I just take exception to the cynical view that evil is somehow intrinsically more powerful than good.
"Survival of the fittest" is often misunderstood that way too, as survival of the strong and selfish, when, on the contrary, evolution is full of examples of cooperation being stable over long timescales.
Evil simply has more options available than good. Sure, those options, like all options, have pros and cons. Cancer, like sociopathy, can have a pretty good run even if it ends ultimately in demise.
I very much want to push back against any bias towards a just world. Bad people often live their whole lives without any consequence (think prostate cancer) while good people struggle (think my cuticles, which deserve much more than I usually give).
The cynical view suffers from availability bias - it's easy for us to think of someone who sticks out through bad behavior, but somehow gets away with it, precisely because it is not normal. (1)
But if you look at long timescales, it's pretty obvious that cooperation is the more powerful strategy.
We used to live in tribes of hunter gatherers, in constant danger from a hostile environment. Now, we're part of a global technological superorganism that provides for us.
If free-loading was a dominant strategy, this would never have developed.
(1) From the evolutionary biology point of view this can be explained by frequency-dependent selection, meaning the strategy is strong only as long as a small fraction of the population employs it. Durkheim would probably say you need these people to establish what the norms of a society are.
Might make sense to scale the load by following electricity supply/prices though?
Saying that as a genuine question, since I'm not sure how the math works out at that scale; you have to weigh it against hardware depreciation, of course.
Power purchase agreements are priced differently and usually written to guarantee power at a predictable price; think of it like reserved instances vs. spot on the cloud. The bulk of workloads don't care about or benefit from spot pricing.
Also, modern neoclouds have captive non-grid sources like gas or diesel plants, for which grid demand has no impact on cost. These sources are not cheap, but DC operators don't have much choice, as getting grid capacity takes years. Even gas turbines are difficult to procure these days, so we hear of funky sources like jet engines.
Having people work on things that are at the limit of human understanding is an essential part of a modern educational system.
For every professional string theorist, you get hundreds of people who were brought up in an academic system that values rigor and depth of scientific thinking.
That's literally what a modern technological economy is built on.
Getting useful novel results out of this is almost a lucky side effect.
In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.
The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.
It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.
I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.
But to me it's very clear that the product that gets this right will be the one I use.
> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.
Exactly! One important thing LLMs have made me realise deeply is that "no information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me. I suppose that's expected, since in the end they're generating tokens based on their training, so it's reasonable that they'll hallucinate some stuff, but knowing this doesn't ease any of my frustration.
IMO, if LLMs need to focus on anything right now, it should be better grounding. Maybe even something like a probability/confidence score; that might make the experience so much better for many users like me.
I’m with the people pushing back on the “confidence scores” framing, but I think the deeper issue is that we’re still stuck in the wrong mental model.
It’s tempting to think of a language model as a shallow search engine that happens to output text, but that metaphor doesn’t actually match what’s happening under the hood. A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.
That’s why a confidence number that looks sensible can still be as made up as the underlying output, because both are just sequences of tokens tied to trained patterns, not anchored truth values. If you want truth, you want something that couples probability distributions to real world evidence sources and flags when it doesn’t have enough grounding to answer, ideally with explicit uncertainty, not hand‑waviness.
People talk about hallucination like it’s a bug that can be patched at the surface level. I think it’s actually a feature of the architecture we’re using: generating plausible continuations by design. You have to change the shape of the model or augment it with tooling that directly references verified knowledge sources before you get reliability that matters.
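To make "couple the output to evidence and flag when there isn't enough" concrete, here's roughly the control flow I have in mind. This is a minimal sketch, not a real system: the retriever, the llm callable, and the threshold are all hypothetical stand-ins.

    # Minimal sketch of "answer only when grounded"; every name here is hypothetical.
    def answer_with_grounding(question, retrieve, llm, min_sources=2):
        # 1. Pull candidate evidence from a search index / trusted corpus.
        sources = retrieve(question)          # hypothetical: returns a list of (url, snippet)

        # 2. Refuse instead of guessing when the evidence is too thin.
        if len(sources) < min_sources:
            return {"answer": None, "status": "insufficient_grounding", "sources": sources}

        # 3. Ask the model to answer only from the snippets and to cite them.
        context = "\n\n".join(f"[{i}] {url}\n{snippet}"
                              for i, (url, snippet) in enumerate(sources))
        prompt = ("Answer using ONLY the numbered sources below and cite them inline. "
                  "If they do not contain the answer, reply exactly: INSUFFICIENT_SOURCES.\n\n"
                  f"{context}\n\nQuestion: {question}")
        reply = llm(prompt)                   # hypothetical: prompt string in, reply string out

        if "INSUFFICIENT_SOURCES" in reply:
            return {"answer": None, "status": "model_declined", "sources": sources}
        return {"answer": reply, "status": "grounded", "sources": sources}

The point is that the uncertainty signal comes from the retrieval step and the explicit refusal path, not from asking the model how confident it feels.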
Solid agree. Hallucination for me IS the LLM use case. What I am looking for are ideas that may or may not be true that I have not considered and then I go try to find out which I can use and why.
This technology (which I had a small part in inventing) was not based on intelligently navigating the information space; it's fundamentally based on forecasting your own thoughts by weighting your pre-linguistic vectors and feeding them back to you. Attention layers, in conjunction with RLHF later, allowed that to be grouped at a higher order and to scan a wider beam space to reward higher-complexity answers.
When trained on chatting (a reflection system on your own thoughts), it mostly just uses a false mental model to pretend to be a separate intelligence.
Thus the term stochastic parrot (which for many of us is actually pretty useful).
Thanks for your input - great to hear from someone involved that this is the direction of travel.
I remain highly skeptical of the idea that it will replace anyone. The biggest danger I see is people falling for the illusion that the thing is intrinsically smart when it's not. It can be highly useful in the hands of disciplined people who know a particular area well, augmenting their productivity, no doubt. But the way we humans come up with ideas is highly complex. Personally, my ideas come out of nowhere and are mostly derived from intuition that can only be expressed in logical statements ex post.
Is intuition really that different from an LLM having little knowledge about something? It's just responding with the most likely sequence of tokens using the information most adjacent to the topic... just like your intuition.
With all due respect, I'm not even going to give a proper response to this... intuition that yields great ideas is based on deep understanding. LLMs exhibit no such thing.
These comparisons are becoming really annoying to read.
>A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.
And is that so different from what we do behind the scenes? Is there a difference between an actual fact vs. some false information stored in our brain? Or do both have the same representation in some kind of high-dimensional statistical manifold in our brains, and we also "try to produce the most plausible continuation" using them?
There might be one major difference, at a different level: what we're fed (read, see, hear, etc.) we also evaluate before storing. Does LLM training do that, beyond some kind of manually assigned crude "confidence tiers" applied to input material during training (e.g. trust Wikipedia more than Reddit threads)?
I would say it's very different to what we do. Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that". Even if a human absorbed every single bit of information a language model has, their brain probably could not store and process it all. Unless they were a liar, they'd tell you they don't know the answer either! So I personally reject the framing that it's just like how a human behaves, because most of the people I know don't lie when they lack information.
>Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that"
Don't know about that, bullshitting is a thing. Especially online, where everybody pretends to be an expert on everything, and many even believe it.
But even if so, is that because of some fundamental difference between how a human and an LLM store/encode/retrieve information, or more because it has been instilled into a human through negative reinforcement (other people calling them out, shame of correction, even punishment, etc) not to make things up?
Hallucinations are a feature of reality that LLMs have inherited.
It’s amazing that experts like yourself who have a good grasp of the manifold MoE configuration don’t get that.
LLMs, much like humans, weight high-dimensional features across the entire model manifold and then string together the best-weighted attentive answer.
Just as your doctor occasionally gives you wrong advice too quickly, this sometimes gets confused, either by lighting up too much of the manifold or by having insufficient expertise.
I asked Gemini the other day to research and summarise the pinout configuration for CANbus outputs on a list of hardware products, and to provide references for each. It came back with a table summarising pin outs for each of the eight products, and a URL reference for each.
Of the 8, 3 were wrong, and the references contained no information about pin outs whatsoever.
That kind of hallucination is, to me, entirely different than what a human researcher would ever do. They would say “for these three I couldn’t find pinouts” or perhaps misread a document and mix up pinouts from one model for another.. they wouldn’t make up pinouts and reference a document that had no such information in it.
Of course humans also imagine things, misremember etc, but what the LLMs are doing is something entirely different, is it not?
Humans are also not rewarded for making pronouncements all the time. Experts actually have a reputation to maintain and are likely more reluctant to give opinions that they are not reasonably sure of. LLMs trained on typical written narratives found in books, articles, etc. can be forgiven for thinking that they should have an opinion on any and everything. Point being that while you may be able to tune one to behave some other way, you may find the new behavior less helpful.
> Hallucinations are a feature of reality that LLMs have inherited.
Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?
As a simple example, I asked Google about houseplant biology recently. The answer was very confidently wrong telling me that spider plants have a particular metabolic pathway because it confused them with jade plants and the two are often mentioned together. Humans wouldn’t make this mistake because they’d either know the answer or say that they don’t. LLMs do that constantly because they lack understanding and metacognitive abilities.
>Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?
No. A strange way to interpret their statement! Almost as if you... hallucinated their intent!
They are arguing that humans also hallucinate: "LLMs much like humans" (...) "Just like your doctor occasionally giving you wrong advice too quickly".
As an aside, there was never a "pre-scientific era where there [was] no way to measure truth". Prior to the rise of modern science fields, there have still always been objective ways to judge truth in all kinds of domains.
Yes, that's basically the point: what are termed hallucinations in LLMs are different from what we see in humans. Even the confabulations that people with severe mental disorders exhibit tend to have some kind of underlying order or structure. People detect inconsistencies in their own behavior and that of others, which is why even the rushed doctor in the original comment won't suggest something wildly off the way LLMs routinely do. They might make a mistake or have incomplete information, but they will suggest things that fit a theory based on their reasoning and understanding, which yields errors at a lower rate and of a different class.
When you ask humans however there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to LLM), not whether some legal database has wrong cases.
Since your example comes from the legal field, you'll probably very well know that even well intentioned witnesses that don't actively try to lie, can still hallucinate all kinds of bullshit, and even be certain of it. Even for eye witnesses, you can ask 5 people and get several different incompatible descriptions of a scene or an attacker.
>When you ask humans however there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to LLM), not whether some legal database has wrong cases.
Context matters. This is the context LLMs are being commercially pushed to me in. Legal databases also inherit from reality as they consist entirely of things from the real world.
That’s deliberate. “Correct” implies anchoring to a truth function the model doesn’t have. “Plausible” is what it’s actually optimising for, and the disconnect between the two is where most of the surprises (and pitfalls) show up.
As someone else put it well: what an LLM does is confabulate stories. Some of them just happen to be true.
Do you have a better word that describes "things that look correct without definitely being so"? I think "plausible" is the perfect word for that. It's not a sleight of hand to use a word that is exactly defined as the intention.
I mean... That is exactly how our memory works. So in a sense, the factually incorrect information coming from LLM is as reliable as someone telling you things from memory.
But not really? If you ask me a question about Thai grammar or how to build a jet turbine, I'm going to tell you that I don't have a clue. I have more of a meta-cognitive map of my own manifold of knowledge than an LLM does.
Try it out. Ask "Do you know who Emplabert Kloopermberg is?" and ChatGPT/Gemini will literally respond with "I don't know".
You, on the other hand, truly have never encountered any information about Thai grammar or (surprisingly) how to build a jet turbine. (I can explain in general terms how to build one just from watching the Discovery Channel.)
The difference is that the models actually have some information on those topics.
They are; the model has no inherent knowledge about its confidence levels, it just adds plausible-sounding numbers. Obviously they _can_ be plausible, but trusting these is just another level up from trusting the original output.
I read a comment here a few weeks back that LLMs always hallucinate, but we sometimes get lucky when the hallucinations match up with reality. I've been thinking about that a lot lately.
> the model has no inherent knowledge about its confidence levels
Kind of. See e.g. https://openreview.net/forum?id=mbu8EEnp3a, but I think it was established already a year ago that LLMs tend to have an identifiable internal confidence signal; the challenge around the time of the DeepSeek-R1 release was to connect that signal, through training, to tool-use activation, so it does a search if it "feels unsure".
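As a crude external proxy (not the internal signal that paper probes for), you can already read per-token log probabilities from some APIs and use them as a rough "how sure was the sampler" measure. Sketch below, assuming the OpenAI Python client's logprobs option; the model name and the 0.5 threshold are arbitrary.

    # Average token probability as a cheap, measured (not generated) uncertainty proxy.
    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who is Michael Batkin?"}],
        logprobs=True,
    )
    logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    avg_token_prob = math.exp(sum(logprobs) / len(logprobs))

    # Low average probability is a weak signal - it also drops for answers with many
    # valid phrasings - but it could gate a search/tool call like the paper describes.
    if avg_token_prob < 0.5:
        print("low confidence, consider triggering retrieval")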
Wow, that's a really interesting paper. That's the kind of thing that makes me feel there's a lot more research to be done "around" LLMs and how they work, and that there's still a fair bit of improvement to be found.
In science, before LLMs, there's this saying: all models are wrong, some are useful. We model, say, gravity as 9.8m/s² on Earth, knowing full well that it doesn't hold true across the universe, and we're able to build things on top of that foundation. Whether that foundation is made of bricks, or is made of sand, for LLMs, is for us to decide.
G, the gravitational constant is (as far as we know) universal. I don't think this is what they meant, but the use of "across the universe" in the parent comment is confusing.
g, the net acceleration from gravity and the Earth's rotation is what is 9.8m/s² at the surface, on average. It varies slightly with location and altitude (less than 1% for anywhere on the surface IIRC), so "it's 9.8 everywhere" is the model that's wrong but good enough a lot of the time.
It doesn't even hold true on Earth! Nevermind other planets being of different sizes making that number change, that equation doesn't account for the atmosphere and air resistance from that. If we drop a feather that isn't crumpled up, it'll float down gently at anything but 9.8m/s². In sports, air resistance of different balls is enough that how fast something drops is also not exactly 9.8m/s², which is why peak athlete skills often don't transfer between sports. So, as a model, when we ignore air resistance it's good enough, a lot of the time, but sometimes it's not a good model because we do need to care about air resistance.
Gravity isn't 9.8m/s/s across the universe. If you're at higher or lower elevations (or outside the Earth's gravitational pull entirely), the acceleration will be different.
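Quick back-of-the-envelope with g = GM/r², using standard values for G and Earth's mass/radius:

    # Gravitational acceleration vs. altitude above Earth's surface.
    G = 6.674e-11   # m^3 kg^-1 s^-2
    M = 5.972e24    # kg, Earth's mass
    R = 6.371e6     # m, mean Earth radius

    def g(altitude_m):
        r = R + altitude_m
        return G * M / r**2

    print(round(g(0), 2))       # ~9.82 m/s^2 at the surface
    print(round(g(8.8e3), 2))   # ~9.79 m/s^2 at Everest's summit
    print(round(g(4.0e5), 2))   # ~8.69 m/s^2 at ISS altitude - clearly not 9.8

So "9.8 everywhere" is off by well under 1% anywhere you can stand, and only really breaks down once you leave the surface.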
Their point was the 9.8 model is good enough for most things on Earth, the model doesn't need to be perfect across the universe to be useful.
Asking an LLM to give itself a «confidence score» is like asking a teenager to grade his own exam. LLMs don't «feel» uncertainty and confidence like we do.
No, it's not the same. Search results send/show you one or more specific pages/websites. And each website has a different trust factor. Yes, plenty of people repeat things they "read on the Internet" as truths, but it's easy to debunk some of them just based on the site reputation.
With AI responses, the reputation is shared with the good answers as well, because they do give good answers most of the time, but also hallucinate errors.
> Tools like SourceFinder must be paired with education — teaching people how to trace information themselves, to ask: Where did this come from? Who benefits if I believe it?
These are very important and relevant questions to ask oneself when reading about anything, but we should also keep in mind that even those questions can be misused and can drive you to conspiracy theories.
If somebody asks a question on Stackoverflow, it is unlikely that a human who does not know the answer will take time out of their day to completely fabricate a plausible sounding answer.
Sites like stackoverflow are inherently peer-reviewed, though; they've got a crowdsourced voting system and comments that accumulate over time. People test the ideas in question.
This whole "people are just as incorrect as LLMs" is a poor argument, because it compares the single human and the single LLM response in a vacuum. When you put enough humans together on the internet you usually get a more meaningful result.
There's a reason why there are upvotes, accepted solutions, and a third-party edit system on Stack Overflow: people will spend time writing their "hallucinations" very confidently.
What is it about people making up lies to defend LLMs? In what world is it exactly the same as search? They're literally different things, since you get information from multiple sources and can do your own filtering.
I wonder if the only way to fix this with current LLMs would be to generate a lot of synthetic data for a select number of topics you really don't want it to "go off the rails" with. That synthetic data would be lots of variations on "I don't know how to do X with Y".
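Roughly what I mean, as a toy sketch: templating refusals over (task, product) pairs where the model is known to go off the rails. Everything here is a placeholder; real training data would need far more variety and human review.

    # Toy generator for "I don't know" style fine-tuning examples (placeholders only).
    import itertools, json, random

    TASKS = ["configure the CANbus pinout", "set the DIP switches", "flash the firmware"]
    PRODUCTS = ["the Foo-3000 controller", "the BarTech X2", "the Baz relay board"]
    REFUSALS = [
        "I don't have reliable documentation for {product}, so I can't tell you how to {task}.",
        "I'm not sure how to {task} on {product}; please check the vendor datasheet.",
    ]

    examples = [
        {"prompt": f"How do I {task} on {product}?",
         "response": random.choice(REFUSALS).format(task=task, product=product)}
        for task, product in itertools.product(TASKS, PRODUCTS)
    ]
    print(json.dumps(examples[:2], indent=2))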
I think the thing even worse than false information is the almost-correct information. You do a quick Google to confirm it's on the right page but find there's an important misunderstanding. These are so much harder to spot I think than the blatantly false.
I agree, but the question is how better grounding can be achieved without a major research breakthrough.
I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only a handful of sources exist for some set of facts (e.g. laws of small countries or descriptions of niche products).
LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.
Now people will say that this is unavoidable given the way in which transformers work. And this is true.
But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference-time search/RAG) relative to their pretraining.
Anything that is very specific has the same problem, because LLMs can't have the same representation of all topics in their training. It doesn't have to be too niche, just specific enough for the model to start fabricating.
The other day I had a doubt about something related to how pointers work in Swift and I tried discussing it with ChatGPT (I don't remember exactly what, but it was purely intellectual curiosity). It gave me a lot of explanations that seemed correct, but, being skeptical, I started pushing it for ways to confirm what it was saying and eventually realized it was all bullshit.
This kind of thing makes me wary of using LLMs for anything that isn't brainstorming, because anything that requires information that isn't easily/plentifully found online will likely be incorrect or have sprinkles of incorrectness all over the explanations.
Grounding in search results is what Perplexity pioneered, what Google also does with AI Mode, and what ChatGPT and others do with their web search tools.
As a user I want it, but as a webadmin it kills dynamic pages, and that's why proof-of-work (aka CPU-time) captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no, they just go and overrun the web, to the point that they can't anymore, at the price of shutting down small sites and making life worse for everyone, just for a few months of rapacious crawling. Perplexity literally moved fast and broke things.
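For reference, the general shape of those proof-of-work challenges, as a toy sketch (not the exact scheme Anubis or BotID use): the server hands out a random challenge, the client burns CPU finding a hash with enough leading zeros, and the server verifies with a single hash.

    # Toy proof-of-work: ~2^difficulty_bits hashes to solve, one hash to verify.
    import hashlib
    from itertools import count

    def solve(challenge: bytes, difficulty_bits: int) -> int:
        target = "0" * (difficulty_bits // 4)   # hex-prefix check, so use multiples of 4
        for counter in count():
            if hashlib.sha256(challenge + str(counter).encode()).hexdigest().startswith(target):
                return counter

    def verify(challenge: bytes, counter: int, difficulty_bits: int) -> bool:
        target = "0" * (difficulty_bits // 4)
        return hashlib.sha256(challenge + str(counter).encode()).hexdigest().startswith(target)

    challenge = b"per-visitor-random-challenge"
    answer = solve(challenge, 20)      # noticeable CPU for the client at this difficulty
    assert verify(challenge, answer, 20)   # trivial for the server

It's a small per-visit cost for a human, but it multiplies across the millions of pages a rapacious crawler wants to fetch.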
This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.
I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
My biggest problem with LLMs at this point is that they produce different and inconsistent results, or behave differently, given the same prompt. Better grounding would be amazing at this point. I want to give an LLM the same prompt on different days and be able to trust that it will do the same thing as yesterday. Currently they misbehave multiple times a week and I have to manually steer them a bit, which destroys certain automated workflows completely.
It sounds like you have dug into this problem with some depth so I would love to hear more. When you've tried to automate things, I'm guessing you've got a template and then some data and then the same or similar input gives totally different results? What details about how different the results are can you share? Are you asking for eg JSON output and it totally isn't, or is it a more subtle difference perhaps?
It doesn’t really solve it as a slight shift in the prompt can have totally unpredictable results anyway. And if your prompt is always exactly the same, you’d just cache it and bypass the LLM anyway.
What would really be useful is a very similar prompt should always give a very very similar result.
This doesn't work with the current architecture, because we have to introduce some element of stochastic noise into the generation or else they're not "creatively" generative.
Your brain doesn't have this problem because the noise is already present. You, as an actual thinking being, are able to override the noise and say "no, this is false." An LLM doesn't have that capability.
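You can squeeze most of the run-to-run randomness out at the API level, though providers still don't promise bit-identical outputs. Sketch assuming the OpenAI Python client; the seed parameter is documented as best-effort, and backend changes can still shift results between days.

    # Reduce run-to-run variance: greedy-ish decoding plus a fixed seed (best effort only).
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarise RFC 2119 in one sentence."}],
        temperature=0,   # take the most likely token instead of sampling
        seed=1234,       # best-effort reproducibility for whatever sampling remains
    )
    print(resp.choices[0].message.content)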
> I want to give an LLM the same prompt on different days and I want to be able to trust that it will do the same thing as yesterday
Bad news, it's winter now in the Northern hemisphere, so expect all of our AIs to get slightly less performant as they emulate humans under-performing until Spring.
I think the better word is confabulation; fabricating plausible but false narratives based on wrong memory. Fundamentally, these models try to produce plausible text. With language models getting large, they start creating internal world models, and some research shows they actually have truth dimensions. [0]
I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.
Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen by how unreliable eye witness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).
Here is the relevant quote by Trenton Bricken from the transcript:
One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.
But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.
That's right - it does seem to have to do with trying to be helpful.
One demo of this that reliably works for me:
Write a draft of something and ask the LLM to find the errors.
Correct the errors, repeat.
It will never stop finding a list of errors!
The first time around and maybe the second it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request of finding errors.
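If you want to reproduce it, the loop is trivially scriptable. Sketch only: llm here is a hypothetical prompt-in/reply-out callable, not a real client.

    # The "keep finding errors" experiment. Because the prompt demands errors,
    # the model keeps supplying them long after the genuine ones are fixed.
    def error_hunt(draft: str, llm, rounds: int = 5) -> None:
        for i in range(rounds):
            report = llm(f"Find the errors in the following text:\n\n{draft}")
            print(f"--- round {i} ---\n{report}\n")
            # In a real run you'd fix the legitimate issues and update `draft` here;
            # by round 3 or so the remaining "errors" are usually invented.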
No, the correct word is hallucinating. That's the word everyone uses and has been using. While it might not be technically correct, everyone knows what it means and more importantly, it's not a $3 word and everyone can relate to the concept. I also prefer all the _other_ more accurate alternative words Wikipedia offers to describe it:
"In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting,[1][2] confabulation,[3] or delusion[4]) is"
I still don’t really get this argument/excuse for why it’s acceptable that LLMs hallucinate. These tools are meant to support us, but we end up with two parties who are, as you say, prone to “hallucination” and it becomes a situation of the blind leading the blind. Ideally in these scenarios there’s at least one party with a definitive or deterministic view so the other party (i.e. us) at least has some trust in the information they’re receiving and any decisions they make off the back of it.
For these types of problems (i.e. most problems in the real world), "definitive or deterministic" isn't really possible. An unreliable party that you can throw at the problem from a hundred thousand directions simultaneously, and for cheap, is still useful.
"The airplane wing broke and fell off during flight"
"Well humans break their leg too!"
It is just a mindlessly stupid response and a giant category error.
An airplane wing and a human limb are not at all in the same category.
There is even another layer to this: comparing LLMs to the brain might be wrong because of the mereological fallacy, which is attributing "thinking" to the brain rather than to the person/system as a whole.
You are right that the wing/leg comparison is often lazy rhetoric: we hold engineered systems to different failure standards for good reason.
But you are misusing the mereological fallacy. It does not dismiss LLM/brain comparisons: it actually strengthens them. If the brain does not "think" (the person does), then LLMs do not "think" either. Both are subsystems in larger systems. That is not a category error; it is a structural similarity.
This does not excuse LLM limitations - rimeice's concern about two unreliable parties is valid. But dismissing comparisons as "category errors" without examining which properties are being compared is just as lazy as the wing/leg response.
People, when tasked with a job, often get it right. I've been blessed by working with many great people who really do an amazing job of generally succeeding to get things right -- or at least, right-enough.
But in any line of work: Sometimes people fuck it up. Sometimes, they forget important steps. Sometimes, they're sure they did it one way when instead they did it some other way and fix it themselves. Sometimes, they even say they did the job and did it as-prescribed and actually believe themselves, when they've done neither -- and they're perplexed when they're shown this. They "hallucinate" and do dumb things for reasons that aren't real.
And sometimes, they just make shit up and lie. They know they're lying and they lie anyway, doubling-down over and over again.
Sometimes they even go all spastic and deliberately throw monkey wrenches into the works, just because they feel something that makes them think that this kind of willfully-destructive action benefits them.
All employees suck some of the time. They each have their own issues. And all employees are expensive to hire, and expensive to fire, and expensive to keep going. But some of their outputs are useful, so we employ people anyway. (And we're human; even the very best of us are going to make mistakes.)
LLMs are not so different in this way, as a general construct. They can get things right. They can also make shit up. They can skip steps. The can lie, and double-down on those lies. They hallucinate.
LLMs suck. All of them. They all fucking suck. They aren't even good at sucking, and they persist at doing it anyway.
(But some of their outputs are useful, and LLMs generally cost a lot less to make use of than people do, so here we are.)
I don't get the comparison. It would be like saying it's okay if an Excel formula gives me different outcomes every time with the same arguments, sometimes right, but mostly wrong.
As far as I can tell (as someone who worked on the early foundation of this tech at Google for 10 years) making up “shit” then using your force of will to make it true is a huge part of the construction of reality with intelligence.
Will to reality through forecasting possible worlds is one of our two primary functions.
A lot of mechanisation, especially in the modern world, is not deterministic and is not always 100% right; it's a fundamental "physics at scale" issue, not something new to LLMs. I think what happened when they first appeared was that people immediately clung to a superintelligence-type AI idea of what LLMs were supposed to do, then realised that's not what they are, then kept going and swung all the way over to "these things aren't good at anything really" or "if they only fix this ONE issue I have with them, they'll actually be useful"
Yes, they'll probably not go away, but it's got to be possible to handle them better.
Gemini (the app) has a "mitigation" feature where it tries to do Google searches to support its statements. That doesn't currently work properly in my experience.
It also seems to be doing something where it adds references to statements (With a separate model? With a second pass over the output? Not sure how that works.). That works well where it adds them, but it often doesn't do it.
Doubt it. I suspect it’s fundamentally not possible in the spirit you intend it.
Reality is perfectly fine with deception and inaccuracy. For language to magically be self constraining enough to only make verified statements is… impossible.
Take a look at the new experimental AI mode in Google scholar, it's going in the right direction.
It might be true that a fundamental solution to this issue is not possible without a major breakthrough, but I'm sure you can get pretty far with better tooling that surfaces relevant sources, and that would make a huge difference.
What’s your level of expertise in this domain or subject? How did you use it? What were your results?
It's basically gauging expertise vs. usage to pin down the variance that seems endemic to LLM utility anecdotes/examples. For code examples I also ask which language was used, the submitter's familiarity with the language, their seniority/experience, and their familiarity with the domain.
> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things,
Due to how LLMs are implemented, you are always most likely to get a bogus explanation if you ask for an answer first, and why second.
A useful mental model is: imagine if I presented you with a potential new recruit's complete data (resume, job history, recordings of the job interview, everything) but you only had 1 second to tell me "hired: YES OR NO"
And then, AFTER you answered that, I gave you 50 pages worth of space to tell me why your decision is right. You can't go back on that decision, so all you can do is justify it however you can.
Do you see how this would give radically different outcomes vs. giving you the 50-page scratchpad first to think things through, and then only giving me a YES/NO answer?
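Two prompt shapes that make the difference obvious (the contents are placeholders):

    # Answer-first: the model commits in its first tokens, then rationalises.
    # Reasoning-first: it gets its scratchpad before committing.
    candidate_file = "resume, interview notes, work history..."   # placeholder

    answer_first = (
        f"{candidate_file}\n\n"
        "Reply immediately with HIRE: YES or NO, then explain your decision."
    )

    reasoning_first = (
        f"{candidate_file}\n\n"
        "First list the strongest evidence for and against hiring, weigh it, "
        "and only at the very end output HIRE: YES or NO."
    )

Same model, same data; the second ordering lets the conclusion depend on the analysis instead of the other way around.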
It's increasingly a space that is constrained by the tools and integrations. Models provide a lot of raw capability. But with the right tools even the simpler, less capable models become useful.
Mostly we're not trying to win a Nobel Prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity-increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punch cards is not a fun job. It's very repetitive work. Compilers used to be people compiling punch cards. Women mostly, actually. Because it was considered relatively low-skilled work. Even though it arguably wasn't.
Some people are very unhappy that the easier parts of their job are being automated, and they are worried that their jobs will be automated away completely. That's only true if you exclusively do boring, repetitive, low-value work. Then yes, your job is at risk. If your work is a mix of that and some higher-value, non-repetitive, more fun stuff, your life could get a lot more interesting. Because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I have lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a few spare hours.
Ironically, a lot of people currently get the worst of both worlds, because they now find themselves babysitting AIs doing a lot more of the boring repetitive stuff than they could do without them, to the point where that is actually all they do. It's still boring and repetitive. And it should be automated away ultimately; arguably many years ago, actually. The reason so many React projects feel like Groundhog Day is that they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.
Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.
Yeah, in my case I want the coding models to be less stupid. I asked for multiple file uploading; it kept the original button and added a second one for additional files. When I pointed that out: "You're absolutely correct!" Well, why didn't you think of it before you cranked out the code? I see coding agents as really capable junior devs, it's really funny. I don't mind it though; it saved me hours on my side project, if not weeks' worth of work.
I was using an LLM to summarize benchmarks for me, and I realized after a while that it was omitting information that made the algorithm being benchmarked look bad. I'm glad I caught it early, before I went to my peers saying "look at this amazing algorithm".
It's important not to assume that LLMs are giving you an impartial perspective on any given topic. The perspective you're most likely getting is that of whoever created the most training data related to that topic.
Re: retrieval: that's where the snake eats its tail. As AI slop floods the web, grounding is like laying a foundation in a swamp. And that Rube Goldberg machine tries to prevent the snake from reaching its tail. But RGs are brittle and not exactly the thing you want to build infrastructure on. Just look at https://news.ycombinator.com/item?id=46239752 for an example of how easily it can break.
I've been working on this problem with https://citellm.com, specifically for PDFs.
Instead of relying on the LLM answer alone, each extracted field links to its source in the original document (page number + highlighted snippet + confidence score).
Checking any claim becomes simple: click and see the exact source.
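Roughly the shape of output I mean; the field names here are illustrative, not the actual citellm schema.

    # Illustrative extraction record - not citellm's real API or schema.
    extraction = {
        "field": "invoice_total",
        "value": "1,284.50 EUR",
        "source": {
            "page": 3,
            "snippet": "Total amount due: 1,284.50 EUR",
            "bbox": [112, 540, 318, 558],   # highlight region on the page
        },
        "confidence": 0.93,
    }
    # The claim is checkable in one click: open page 3, find the highlighted
    # snippet, compare it with the extracted value.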
I constantly see top models (Opus 4.5, Gemini 3) have a stroke mid-task: they will solve the problem correctly in one place, or have a correct solution that needs to be reapplied in context, and then completely miss the mark in another place. "Lack of intelligence" is very much a limiting factor. Gemini especially will get into random reasoning loops; reading its thinking traces, it gets unhinged pretty fast.
Not to mention it's super easy to gaslight these models: just assert something wrong with a vaguely plausible explanation and you get no pushback or reasoning validation.
So I know you qualified your post with "for your use case", but personally I would very much like more intelligence from LLMs.
I've had better success finding information using Google Gemini vs. ChatGPT. I.e. someone mentions to me the name of someone or some company, but doesn't give the full details (i.e. Joe @ XYZ Company doing this, or this company with 10,000 people, in ABC industry)...sometimes i don't remember the full name. Gemini has been more effective for me in filling in the gaps and doing fuzzy search. I even asked ChatGPT why this was the case, and it affirmed my experience, saying that Gemini is better for these queries because of Search integration, Knowledge Graph, etc. Especially useful for recent role changes, which haven't been propagated through other channels on a widespread basis.
All of them are heavily invested in improving grounding. The money isn't on personal use but enterprise customers and for those, grounding is essential.
Yes but this is incredibly competitive and undifferentiated.
It's a huge market but who will it be a profitable business for?
Likely a company or multiple who own some sort of platform that people are already on, so not OpenAI.
What they have right now is the strong ChatGPT brand and that does mean a lot. But how long will it last?
They're not the technology leader anymore, and that spells a lot of trouble.
They are at a stage where they need to dominate the market and then leverage the data that gives them, plus the brand, plus the tech advantage to establish a durable near monopoly, but it looks like it's not working.
It's a bit as if, in 1999, three equally strong Google competitors had popped up, with some pulling ahead.
I wrote this comment [0] very recently and when I wrote it had in mind that Cloudflare might very well end up being a key player in a more centralized Internet that has developed far away from its original architecture.
Defense against threats is a pretty strong centralization incentive in different kinds of networks - social, biological.
I could imagine that a lot of people are investing based on similar scenarios in their minds.
Joining the big three requires capital investments orders of magnitude beyond where they are. Nevertheless I'd like to see someone do it, and if not Equinix then it could be them. But sadly the #4 right now appears to be Oracle.
The Internet has really been an interesting case study for what happens between people when you remove a varying number of layers of social control.
All the way back to the early days of Usenet really.
I would hate to see it but at the same time I feel like the incentives created by the bad actors really push this towards a much more centralized model over time, e.g. one where all traffic provenance must be signed and identified and must flow through a few big networks that enforce laws around that.
"Socialists"* argue for more regulations; "liberals" claim that there should be financial incentives to not do that.
I'm neither. I believe that we should go back to being "tribes"/communities. At least it's a time-tested way to – maybe not prevent, but somewhat alleviate – the tragedy of the commons.
(I'm aware that this is a very poor and naive theory; I'll happily ditch it for a better idea.)
Little would prevent attacks by APTs and other powerful groups. (This, btw., is one of the few facets of this problem that technology could help solve.) But a trivial change: a hard requirement to sign up (=send a human-composed message to one of the moderators) to be able to participate (or, in extreme cases, to read the contents) "automagically" stops almost all spam, scrapers (in the extreme case), vandalism, etc. (from my personal experience based on a rather large sample).
I think it's one of the multi-faceted problems where technology (a "moat", "palisade", etc. for your "tribe") should accompany social changes.
We were very confident by ca. 2008 that Facebook would still be around in 2025. It's no mystery, it's the network effects. They had started with a prestige demographic (Harvard), and secured a demographic you could trust to not move on to the next big thing in a hurry, yet which most people want contact with (your parents).
Perceived quality is relative. It's roughly linearly related to rank position along some dimension, but moving up in rank requires exponential effort due to competition.
I would be surprised if anyone perceives quality like that. Like, are you saying that in a situation where there are only two examples of some type of work, it is impossible to judge whether one is much better than the other, it is only possible to say that it's better? What makes you think it works like this?
This insight, that perceived quality is relative, can be understood in a more literary way, in this fragment by Proust describing the Rivebelle restaurant:
> Soon the spectacle became arranged, in my eyes at least, in a more noble and calmer fashion. All this vertiginous activity settled into a calm harmony. I would watch the round tables, whose innumerable assemblage filled the restaurant, like so many planets, such as they are figured in old allegorical pictures. Moreover, an irresistible force of attraction was exerted between these various stars, and at each table the diners had eyes only for the tables at which they were not sitting,
The relative nature of perceived quality indicates that the order of representations (artworks) is judged only in relation to the order of castes (ranking); one can only judge what is equal by comparison to what is unequal, and equivalence is not understood through equality (even approximate) but through inequality (partial order). It is the absence of a ranking relationship between two entities that establishes their equivalence.
Maybe I should have just gone with "in this case, classification is more fundamental than measure", but I feel there is something interesting to be said about the structure of artworks and the structure of their reception by society, indeed Proust continues with:
> ...the diners had eyes only for the tables at which they were not sitting, with the exception of some wealthy host who, having succeeded in bringing a famous writer, strove to extract from him, thanks to the spiritual properties of the 'turning table', insignificant observations at which the ladies marvelled.
See ? Writers (and artists in general) take on the role of a medium. They are used to channel distant entities, like tables during spiritism sessions, and from what Proust tells us, that "the diners had eyes only for the tables at which they were not sitting", maybe we can infer that what writers channel doesn't just come from distant worlds, as incarnated in what their words represent, but as a delta in perceived quality, starting with our own.
This is why I'd like to elaborate on this idea of coupled SSR processes I developed in another comment.
A sample space reducing process is a process that seeks to combine atomic parts into a coherent whole by iteratively picking groups of parts that can be assembled into functional elements ready to be added to this whole.
In that sense, the act of writing a long work already has the shape of a SSR process in a very simple sense: each narrative, stylistic, and conceptual choice constrains what can follow without breaking coherence. As a novel unfolds, fewer continuations remain compatible with its voice, characters, rhythms, arguments. You are not wandering freely in idea-space; you are navigating a progressively narrowed funnel of possibilities induced by your own earlier decisions. A good book is one that survives this internal reduction without collapsing.
On top of that, there is a second, external funnel: the competitive ranking of works and authors. Here too the available space narrows as you move upward. The further you climb in terms of attention, recognition, or canonization, the smaller the set of works that can plausibly dislodge those already in place. Near the top, the acceptance region is tiny: most new works, even competent ones, will not significantly shift the existing order. From that perspective, perceived quality is largely tied to where a work ends up in this hierarchy, not to some independently measurable scalar.
The interesting part is how these two processes couple. To have any chance of entering the higher strata of the external ranking, a work first has to survive its own internal funnel: it has to maintain coherence, depth, and a recognizable voice under increasingly tight self-imposed constraints. At the same time, the shape of the external funnel, market expectations, critical fashions, existing canons, feeds back into the act of writing by making some narrative paths feel viable and others almost unthinkable. So the writer is never optimizing in a vacuum, but always under a joint pressure: what keeps the book internally alive, and what keeps it externally legible.
But what interests me more is that some works don't just suffer this coupling, they encode it. That's what you see in the Proust passage: he is not merely describing a restaurant; he is describing the optics of social distinction, the way people look at other tables, the way a famous writer is used as a medium to channel prestige, the way perception itself is structured by rankings. The text is aware of the hierarchy through which it will itself be read. It doesn't just represent a world; it stages the illusions and comparisons that make that world intelligible. That's a second-order move: the work includes within itself a model of the very mechanisms that will classify it.
If you like a more structural vocabulary: natural language is massively stratified by frequency. Highly frequent words ("I", "of", "after") act as primitive binders; extremely rare words tend to live out on the leaves of the tree; in between you get heavier operators that bind large-scale entities and narratives ("terrorism" being a classic example in the grammar of public opinion). Something similar happens socially. Highly visible figures – the wealthy host, the celebrated writer, the glamorous guest – play the role of grammatical linkers in the social syntax of recognition: they bind other people, distribute attention, create or close off relational triads. Proust's "the diners had eyes only for the tables at which they were not sitting" is exactly this: desire and judgment are mediated through a few high-frequency social operators.
A certain kind of writing operates precisely at that interface: it doesn't just tell a story inside the internal funnel, and it doesn't just try to climb the external ranking; it exposes and recombines the "function words" of social perception themselves: the roles, clichés, prestige tokens, feared or desired third parties (like the forever-imagined intruder in Swann's jealousy). The difficulty is not only to satisfy two nested constraints (a coherent work and a competitive position), but to produce a form that reflects on, and potentially perturbs, the very grammar that links the two. That's where the channeling comes in: literature not only represents something, it re-routes the connective tissue through which quality, status, and desire are perceived in the first place.
>Moreover, an irresistible force of attraction was exerted between these various stars, and at each table the diners had eyes only for the tables at which they were not sitting,
You're reading way too much into it. This is just a reprise of "the grass is greener on the other side". What it's saying is simply "a lot of diners were dissatisfied with their dishes and looked around to see what other people were eating".
>The relative nature of perceived quality indicates [...]
My whole point is that I don't buy quality is purely perceived relatively. If you start a sentence like this, whatever comes after is irrelevant.
You've brought up food, so let's go with that. If I'm a blank slate and I eat a certain food, am I unable to decide whether I like it or not until I eat a second, different food? Are the sensory signals my brain receives just a confounding mystery in the absence of further stimulation, to the extent that I can't even tell sweet from bitter?
An unnecessarily cynical take. What this is implying is that, in the absence of any morals, evil provides a selective advantage.
And yet, pro-social behavior has evolved many times independently through natural selection.