Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.
[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]
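To make the "distribution that gets sampled" part concrete, here's a minimal sketch in plain Python. The three-word vocabulary, the logit values, and the helper names `softmax` and `sample_token` are all made up for illustration; a real model produces logits over tens of thousands of tokens, but the last step looks roughly like this.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng, temperature=1.0):
    """Draw one token at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical toy values, not taken from any real model.
vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
rng = random.Random(0)  # seeded so repeated runs draw the same token
token = sample_token(vocab, logits, rng)

print(dict(zip(vocab, probs)))
print("sampled:", token)
```

Note that `probs` only says how likely each token is to be emitted next; nothing in this computation measures whether the resulting statement is true.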
It’s often very difficult (intractable) to come up with a probability distribution of an estimator, even when the probability distribution of the data is known.
Basically, you’d need a lot more computing power to work out the distribution over an LLM's possible outputs than to produce a single answer.
In microgpt, there's no alignment. It's all pretraining (learning to predict the next token). But for production systems, models go through post-training, often with some sort of reinforcement learning which modifies the model so that it produces a different probability distribution over output tokens.
But the model "shape" and the computation graph itself don't change as a result of post-training. All that changes is the weights in the matrices.
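A toy sketch of that point, assuming a made-up single-layer "model" (one weight matrix followed by a softmax). The weights, the `delta` standing in for a post-training update, and the function names are all hypothetical; real post-training derives its updates from a training objective rather than a hand-picked nudge.

```python
import math

def next_token_probs(weights, features):
    """Same computation graph every time: matrix-vector product, then softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shape(matrix):
    """(rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))

# Hypothetical "pretrained" weights: 3 output tokens, 2 input features.
pretrained = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]

# Hypothetical stand-in for a post-training update.
delta = [[0.3, 0.0], [0.0, 0.0], [-0.3, 0.0]]
post_trained = [
    [w + d for w, d in zip(w_row, d_row)]
    for w_row, d_row in zip(pretrained, delta)
]

features = [1.0, 0.5]
p_before = next_token_probs(pretrained, features)
p_after = next_token_probs(post_trained, features)

print("shape before/after:", shape(pretrained), shape(post_trained))
print("probs before:", p_before)
print("probs after: ", p_after)
```

The function `next_token_probs` is untouched between the two calls; only the numbers inside the matrix differ, and that alone is enough to shift the output distribution.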