I agree it isn't the main property we care about; what we actually care about is reliability.
But at least in its theoretical construction the LLM should be deterministic: it outputs a fixed probability distribution over tokens with no RNG involved.
We then either sample from that fixed distribution non-deterministically for better performance, or we use greedy decoding and accept slightly worse performance in exchange for full determinism.
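Roughly what I mean, as a toy sketch (NumPy only, with made-up logits for a four-token vocabulary, not tied to any real model or library): the distribution the model outputs is fixed, argmax over it is fully deterministic, and the randomness only enters when you sample from it.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical scores for a 4-token vocab

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

probs = softmax(logits)  # the fixed distribution the model outputs for this context

greedy_token = int(np.argmax(probs))                   # same answer every run: full determinism
sampled_token = int(rng.choice(len(probs), p=probs))   # varies run to run (unless the seed is fixed)

print(probs, greedy_token, sampled_token)
```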
Happy to be corrected if I am wrong about something.
Ah, I realize that I had misunderstood your earlier comment, my apologies and thanks for clarifying!
We're leaving my area of confidence, so take everything I write with a pinch of salt.
As far as I understand, indeed, each layer transforms a set of inputs into a probability distribution. However, if you wanted to compute entirely with probability distributions, you'd need the ability to compose those distributions across layers. Mathematically, that doesn't feel particularly complicated, but computationally it feels like it adds several orders of magnitude in both space and time.
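To make the cost concrete, here's a toy sketch under made-up assumptions (a hypothetical next_dist stand-in for the model, vocab size 50, depth 3, nothing to do with any real architecture): composing the full distributions exactly means marginalising over every intermediate sequence, so the number of weighted paths grows as V**depth, while ordinary sampling only ever follows one path per step.

```python
import numpy as np

V, depth = 50, 3
rng = np.random.default_rng(0)

def next_dist(context):
    # Stand-in for a real model: any deterministic map from a context to a
    # distribution over the vocabulary will do for this illustration.
    seed = sum((i + 1) * t for i, t in enumerate(context)) + len(context)
    p = np.random.default_rng(seed).random(V)
    return p / p.sum()

# Exact composition: keep a weight for every intermediate sequence.
# Here that is 1 + 50 + 2,500 = 2,551 model evaluations to track 50**3 = 125,000 paths.
dist = {(): 1.0}
for _ in range(depth):
    new = {}
    for ctx, p_ctx in dist.items():
        p_next = next_dist(ctx)
        for tok in range(V):
            new[ctx + (tok,)] = p_ctx * p_next[tok]
    dist = new
print("paths tracked:", len(dist))

# Sampling: one model evaluation per step, regardless of V.
ctx = ()
for _ in range(depth):
    tok = int(rng.choice(V, p=next_dist(ctx)))
    ctx += (tok,)
print("sampled path:", ctx)
```

With a realistic vocabulary (tens of thousands of tokens) and more than a couple of steps, the exact version is hopeless, which is why I suspect you end up sampling rather than carrying the distributions forward.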