
Very simple question:

How do people trust the output of LLMs? In the fields I know about, the answers are sometimes impressive and sometimes totally wrong (hallucinations). When the answer is correct, I always feel like I could have simply googled the issue and found some variation of the answer buried deep in the pages of some forum, Stack Exchange, or Reddit.

However, in the fields I'm not familiar with, I'm clueless how much I can trust the answer.



There are a few cases:

1. For coding, the reason coders are so excited about GenAI is that it can often be 90% right while doing all of the writing and researching for me. If I can shift my effort from actually typing/writing toward reviewing/editing, that's a huge improvement day to day. The other 10% can be covered by tests or by adding human code to verify correctness (see the sketch after this list).

2. There are cases where 90% right is better than the current state. Go look at Amazon product descriptions, especially things sold from Asia in the United States. They're probably closer to 50% or 70% right. An LLM being "less wrong" is actually an improvement, and while you might argue a product description should simply be correct, the market already disagrees with you.

3. For something like a medical question, the magic is really just taking plain language questions and giving concise results. As you said, you can find this in Google / other search engines, but they dropped the ball so badly on summaries and aggregating content in favor of serving ads that people immediately saw the value of AI chat interfaces. Should you trust what it tells you? Absolutely not! But in terms of "give me a concise answer to the question as I asked it" it is a step above traditional searches. Is the information wrong? Maybe! But I'd argue that if you wanted to ask your doctor about something, that quick LLM response might be better than what you'd find on Internet forums.
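To make point 1 concrete, here's a toy sketch of what "review by test" looks like: the function body is the kind of thing an LLM might draft, and the short test underneath is the human-written check that covers the wrong 10%. The function and its behaviour are invented for the example, not taken from any real project.

    import re

    def parse_duration(text: str) -> int:
        # Turn a duration like '2h 30m' into minutes (the LLM-drafted part).
        minutes = 0
        for value, unit in re.findall(r"(\d+)\s*([hm])", text.lower()):
            minutes += int(value) * (60 if unit == "h" else 1)
        return minutes

    # The human-written part: cheap to write, and it catches the wrong 10%.
    def test_parse_duration():
        assert parse_duration("2h 30m") == 150
        assert parse_duration("45m") == 45
        assert parse_duration("1h") == 60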


This is really strange to me...

Of course you don't trust the answer.

That doesn't mean you can't work with it.

One of the key use cases for me other than coding is as a much better search engine.

You can ask a really detailed and specific question that would be really hard to Google, and o3 or whatever high end model will know a lot about exactly this question.

It's up to you as a thinking human to decide what to do with that. You can use that as a starting point for in depth literature research, think through the arguments it makes from first principles, follow it up with Google searches for key terms it surfaces...

There's a whole class of searches I would never have done on Google, because they would have taken half a day to do properly, that you can do in fifteen minutes like this.


Such as?


I went through my ChatGPT history to pick a few examples that I'm both comfortable sharing and that illustrate the use-case well:

> There are some classic supply chain challenges such as the bullwhip effect. How come modern supply chains seem so resilient? Such effects don't really seem to occur anymore, at least not in big volume products.

> When the US used nuclear weapons against Japan, did Japan know what it was? That is, did they understand the possibility in principle of a weapon based on a nuclear chain reaction?

> As of July 2025, equities have shown a remarkable resilience since the great financial crisis. Even COVID was only a temporary issue in equity prices. What are the main macroeconomic reasons behind this strength of equities?

> If I have two consecutive legs of my air trip booked on separate tickets, but it's the same airline (also answer this for same alliance), will they allow me to check my baggage to the final destination across the two tickets?

> what would be the primary naics code for the business with website at [redacted]

I probably wouldn't have bothered to search any of these on Google because it would just have been too tedious.

With the airline one, for example, the goal is to get a number of relevant links directly to various airlines' official regulations, which o3 did successfully (along with some IATA regulations).

For something like the first or second, the goal is to surface the names of the relevant people / theories involved, so that you know where to dig if you wish.


This is true.

But I've seen some harnesses (e.g., whatever Gemini Pro uses) do impressive things. The way I model it is like this: an LLM, like a person, has a chance of producing wrong output. A quorum of people plus some experiments/study usually arrives at a "less wrong" answer. The same can be done with an LLM, and to an extent, is being done by things like Gemini Pro and o3 and their agentic "eyes" and "arms". As the price of hardware and compute goes down (if it does, which is a big "if"), harnesses will become better by being able to deploy more computation, even if the LLM models themselves remain at their current level.

Here's an example: there is a certain kind of work we haven't quite figured out how to have LLMs do: creating frameworks and sticking to them, e.g. creating and structuring a codebase in a consistent way. But, in theory, if one could have 10 instances of an LLM "discuss" whether a function in code conforms to an agreed convention, that would solve that problem.
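To sketch the idea (simplifying the "discussion" to a plain majority vote; ask_llm is a made-up placeholder for whatever model API you use, not a real client):

    from collections import Counter

    def ask_llm(prompt: str) -> str:
        # Placeholder for one independent model call that returns 'yes' or 'no'.
        raise NotImplementedError

    def quorum_check(question: str, n: int = 10, threshold: float = 0.7) -> bool:
        # Ask n independent instances the same yes/no question and tally the votes.
        votes = Counter(ask_llm(question).strip().lower() for _ in range(n))
        answer, count = votes.most_common(1)[0]
        # Only accept the verdict if a clear majority of instances agree.
        return answer == "yes" and count / n >= threshold

    # e.g. quorum_check("Does `load_user` follow the repo's naming convention? Answer yes or no.")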

There are also avenues of improvement that open up with more computation. Namely, today we use "one-shot" models: you train them, then you use them many times, but the weights of the model aren't being retrained on the output of their actions. Doing that on a per-model-instance basis is also a matter of having sufficient computation at some affordable price. Doing it on a per-model basis is practical already today; the only limitations are legal terms, NDAs, and regulation.

I say all of this objectively. I don't like where this is going; I think this is going to take us to a wild world where most things are gonna be way tougher for us humans. But I don't want to (be forced to) enter that world wearing rosy lenses.


I think the primary benefit of LLMs for me is as an entry point into an area I know nothing about. For instance, if I’m building a new kind of system which I haven’t built before, then I’m missing lots of information about it — like what are the most common ways to approach this problem, is there academic research I should read, what are the common terms/paradigms/etc. For this kind of thing LLMs are good because they just need to be approximately correct to be useful, and they can also provide links to enough primary sources that you can verify what they say. It’s similar if I’m using a new library I haven’t used before, or something like that. I use LLMs much less for things that I am already an expert in.


We place plenty of trust in strangers to do their jobs and keep society going. What’s their error rate? It all comes down to the track record, perception, and experience of the LLMs. Kinda like self-driving cars.


Strangers have an economic incentive to perform. AI does not. What AI program is currently able to modify its behavior autonomously to increase its own profitability? Most if not all current public models are simply chat bots trained on old data scraped off the web. Wow, we have created an economy based on cultivated Wikipedia and Reddit content from the 2010s, linked together by bots that can make grammatical sentences and cogent-sounding paragraphs. Isn't that great? I don't know; about 10 years ago, before Google broke itself, I could find information on any topic easily and judge its truth using my grounded human intelligence better than any AI today.

For one thing, AI cannot even count. Ask Google's AI to draw a woman wearing a straw hat. More often than not the woman is wearing a well-drawn hat while holding another in her hand. Why? Frequently she has three arms. Why? Tesla's self-driving vision couldn't differentiate between the sky and a light-colored tractor trailer turning across traffic, resulting in a fatality in Florida.

For something to be intelligent it needs to be able to think and evaluate the correctness of its own thinking, not just regurgitate old web scrapings.

It is pathetic, really.

Show me one application where black-box LLM AI is generating a profit that an effectively trained human or a rules-based system couldn't do better.

Even if AI is able to replace a human in some tasks this is not a good thing for a consumption based economy with an already low labor force participation rate.

During the first industrial revolution human labor was scarce, so machines could economically replace and augment labor and raise standards of living. In the present time labor is not scarce, so automation is a solution in search of a problem, and a problem itself if it increasingly leads to unemployment without universal basic income to support consumption. If your economy produces too much with nobody to buy it, then economic contraction follows. Already young people today struggle to buy a house. Instead of investing in chat bots, maybe our economy should be employing more people in building trades and production occupations where they can earn an income to support consumption, including of durable items like a house or a car. Instead, because of the FOMO and hype about AI, investors are looking for greater returns by directing money toward sci-fi fantasy, and when that doesn't materialize an economic contraction will result.


My point is humans make mistakes too, and we trust them, not because we inspect everything they say or do, but from how society is set up.

I'm not sure how up to date you are, but most AIs with tool calling can do math. Image generation hasn't been producing weird stuff like that since last year. Waymo sees >82% fewer injuries/crashes than human drivers[1].
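The reason tool calling fixes the arithmetic problem is that the model doesn't do the math itself; it emits a structured request that the harness executes and feeds back. Roughly like this (model_reply is a made-up placeholder, not any vendor's actual API):

    import json

    def model_reply(messages: list) -> dict:
        # Placeholder for one model turn: returns either a tool call or a final answer.
        raise NotImplementedError

    def answer_with_calculator(question: str) -> str:
        messages = [{"role": "user", "content": question}]
        while True:
            reply = model_reply(messages)
            if reply.get("tool") == "calculator":
                # The harness, not the model, evaluates the expression.
                result = eval(reply["expression"], {"__builtins__": {}})  # toy only; never eval untrusted input
                messages.append({"role": "tool", "content": json.dumps({"result": result})})
            else:
                return reply["content"]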

RL _is_ modifying its behavior to increase its own profitability, and companies training these models will optimize for revenue when the wallet runs dry.

I do feel the bit about being economically replaced. As a frontend-focused dev, nowadays LLMs can run circles around me. I'm uncertain where we go, but I would hate for people to have to do menial jobs just to make a living.

[1]: https://www.theverge.com/news/658952/waymo-injury-prevention...


> My point is humans make mistakes too, and we trust them,

We trust them because they are intrinsically and extrinsically motivated not to mess up

AI has no motivation


When it really matters, professionals have insurance that pays out when they screw up.


I do believe that's where we're heading, people holding jobs to hold accountability for AI.


I get around this by not valuing the AI for its output, but for its process.

Treat it like a brilliant but clumsy assistant that does tasks for you without complaint – but whose work needs to be double checked.


Your internal verifier model in your head is actually good enough and not random. It knows how the world works and subconsciously applies a lot of sniff tests it has learned over the years.

Sure, a lot of answers from LLMs may be inaccurate, but you mostly identify them as such because your ability to verify (using various heuristics) is good too.

Do you learn from asking people advice? Do you learn from reading comments on Reddit? You still do without trusting them fully because you have sniff tests.


> You still do without trusting them fully because you have sniff tests

LLMs produce way too much noise, and quality that is way too inconsistent, for a sniff test to be terribly valuable in my opinion.


The problem is that content is dead. You can’t find answers any more on Google because every website is ai generated and littered with ads.

YouTube videos aren’t much better. Minutes of fluff are added to hit a juicy 10 minute mark so you can see more ads.

The internet is a dead place.


The problem isn't that content is AI generated, the problem is that the content is generated to maximize ad revenue (or some other kind of revenue) rather than maximize truth and usefulness. This has been the case pretty much since the Internet went commercial. Google was in a lot of ways created to solve this problem and it's been a constant struggle.

The problem isn't AI, the problem is the idea that advertising and PR markets are useful tools for organizing information rather than vaguely anarchist self-organizing collectives like Wikipedia or StackOverflow.


I have zero belief that AI won't follow this trend as well


That's where I disagree. The noise is not that high at all and is vastly exaggerated. Of course, if you go too deep into niche topics you will experience this.


Yeah niche topics like the technical questions I have left over after doing embedded development for more than a decade. Mostly questions like “can you dig up a pdf for this obsolete wire format.” And google used to be able to do that but now all I get is hundreds of identical results telling me about the protocol’s existence but nothing else.


One of the most amusing things to me is the amount of AI testimonials that basically go "once I help the AI over the things I know that it struggles with, when it gets to the things I don't know, wow, it's amazing at how much it knows and can do!" It's not so much Gell-Mann amnesia as it is Gell-Mann whiplash.


If you are a subject matter expert, as the person working on the task is expected to be, then you will recognise the issue.

Otherwise, use common sense, do a quick Google search, or let another LLM evaluate it.



