
Maybe they shouldn’t have mixed truthful data with obviously untruthful data in the same training data set?

Why not train a model only on truthful data? Exclude all fiction, for example.



1) It's impossible to get enough data to train one of these well while also curating it by hand.

2) Even if you could, randomly sampling from a probability distribution will cause it to make stuff up unless you overfit on the training data. An example that's come up in this thread is ISBNs: there isn't enough signal in the training set to encode every known ISBN as a reliably high-probability string, so sometimes the model just strings together likely digits (see the sketch below).
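A minimal sketch of that failure mode, not tied to any specific model: it just samples "likely" digits one at a time, the way a decoder samples tokens, and the result even passes the ISBN-13 checksum while almost certainly naming no real book.

    import random

    # Toy sketch, not an actual LLM: sample "likely" digits one at a time,
    # the way a model samples tokens. The per-position choices are made up
    # for illustration; a real model would learn them from its corpus.
    def sample_fake_isbn13(rng: random.Random) -> str:
        digits = ["9", "7", rng.choice("89")]  # most ISBN-13s start 978 or 979
        digits += [rng.choice("0123456789") for _ in range(9)]
        # Append the real ISBN-13 check digit, so the fabricated number even
        # passes a checksum: plausible-looking, but almost certainly not a real book.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
        digits.append(str((10 - total % 10) % 10))
        return "".join(digits)

    print(sample_fake_isbn13(random.Random(0)))  # syntactically valid, fictitious ISBN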


That wouldn't prevent hallucination. An LLM doesn't know what it doesn't know. It will always try to come up with a response that sounds plausible, based on its knowledge or lack thereof.
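To put the "doesn't know what it doesn't know" point in concrete terms, here's a toy sketch with invented numbers (not taken from any real model): even when the distribution over candidate answers is nearly flat, sampling still has to emit one of them, and nothing in the mechanism flags the uncertainty.

    import math, random

    # Toy next-token distribution over candidate answers. The logits are
    # invented and nearly flat, i.e. the model is effectively guessing,
    # yet sampling must still pick something.
    logits = {"Paris": 1.1, "Lyon": 1.0, "Bordeaux": 0.9}
    weights = {k: math.exp(v) for k, v in logits.items()}

    rng = random.Random(42)
    answer = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
    print(answer)  # an answer comes out regardless; no built-in "I'm not sure"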



