1) It's impossible to gather enough data to train one of these models well while also curating it all by hand.
2) Even if you could, randomly sampling from a probability distribution will cause it to make stuff up unless you overfit on the training data. An example that's come up in this thread is ISBNs: there isn't going to be enough signal in the training set to reliably assign high enough probability to every known ISBN, so sometimes the model will just string together likely-looking digits (see the sketch below).
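A toy illustration of that failure mode (purely hypothetical, not an actual LLM): sample "ISBN-like" digit strings from a model that has only learned the general shape of an ISBN, and most of them won't even pass the ISBN-13 checksum, let alone correspond to a real book.

```python
import random

def sample_isbn_like():
    # The "model" knows the common 978 prefix but only has a fuzzy
    # distribution over the remaining ten digits.
    return [9, 7, 8] + [random.randint(0, 9) for _ in range(10)]

def isbn13_checksum_ok(digits):
    # ISBN-13: weight digits 1, 3, 1, 3, ...; the total must be divisible by 10.
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

samples = [sample_isbn_like() for _ in range(10_000)]
valid = sum(isbn13_checksum_ok(s) for s in samples)
print(f"{valid}/{len(samples)} pass the checksum")  # roughly 10%, and almost none are real ISBNs
```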
That wouldn't prevent hallucination. An LLM doesn't know what it doesn't know. It will always try to come up with a response that sounds plausible, based on its knowledge or lack thereof.
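One way to see why: decoding always turns the model's scores into a full probability distribution and picks something. A minimal sketch with made-up logits follows; there is no built-in "I don't know" outcome unless training explicitly adds one.

```python
import numpy as np

# Made-up next-token scores; the model has no separate notion of confidence,
# so even when it "knows nothing" softmax still hands back a distribution
# and decoding still picks a token.
logits = np.array([2.1, 1.9, 1.7, 0.3])
probs = np.exp(logits - logits.max())
probs /= probs.sum()

token = np.random.choice(len(probs), p=probs)
print(probs, "-> sampled token index", token)
```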
Why not make a model only from truthful data? Exclude all fiction, for example.