1) It's impossible to gather enough data to train one of these models well while also curating it all by hand.
2) Even if you could, randomly sampling from a probability distribution will cause it to make stuff up unless you overfit on the training data. An example that's come up in this thread is ISBNs: there isn't going to be enough signal in the training set to reliably assign high enough probability to every known ISBN, so sometimes the model will just string together likely-looking digits (see the sketch below).
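A toy illustration of that failure mode (purely hypothetical, not an actual LLM): sample "ISBN-like" digit strings from a model that has only learned the general shape of an ISBN, and most of them won't even pass the ISBN-13 checksum, let alone correspond to a real book.

```python
import random

def sample_isbn_like():
    # The "model" knows the common 978 prefix but only has a fuzzy
    # distribution over the remaining ten digits.
    return [9, 7, 8] + [random.randint(0, 9) for _ in range(10)]

def isbn13_checksum_ok(digits):
    # ISBN-13: weight digits 1, 3, 1, 3, ...; the total must be divisible by 10.
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

samples = [sample_isbn_like() for _ in range(10_000)]
valid = sum(isbn13_checksum_ok(s) for s in samples)
print(f"{valid}/{len(samples)} pass the checksum")  # roughly 10%, and almost none are real ISBNs
```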
That wouldn't prevent hallucination. An LLM doesn't know what it doesn't know. It will always try to come up with a response that sounds plausible, based on its knowledge or lack thereof.
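One way to see why: decoding always turns the model's scores into a full probability distribution and picks something. A minimal sketch with made-up logits follows; there is no built-in "I don't know" outcome unless training explicitly adds one.

```python
import numpy as np

# Made-up next-token scores; the model has no separate notion of confidence,
# so even when it "knows nothing" softmax still hands back a distribution
# and decoding still picks a token.
logits = np.array([2.1, 1.9, 1.7, 0.3])
probs = np.exp(logits - logits.max())
probs /= probs.sum()

token = np.random.choice(len(probs), p=probs)
print(probs, "-> sampled token index", token)
```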
Why not make a model only from truthful data? Exclude all fiction, for example.