
> only a full solution. And it's either right or wrong, no probabilities to do a gradient on.

You could use reward functions that do much more complicated things than "ground_truth == boxed_answer". You could, for example, split the "CoT" into paragraphs and count how many paragraphs match whatever you consider a "good answer" in whatever topic you're trying to improve. You can use embeddings, fuzzy string matching, or even other LLMs / reward models.

I think math and coding were explored first because they're easier to "score", but you could attempt it with other things as well.
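To make the idea concrete, here's a minimal sketch of such a reward function (everything here is hypothetical: the function name, the idea of comparing against a reference "good answer" paragraph, and the use of stdlib difflib as a stand-in for a real fuzzy matcher or embedding model):

```python
# Hypothetical reward sketch: instead of a binary ground_truth ==
# boxed_answer check, score a completion by combining an exact-match
# bonus with a fuzzy match of its best CoT paragraph.
from difflib import SequenceMatcher

def reward(completion: str, ground_truth: str, good_paragraph: str) -> float:
    """Return a scalar reward in [0, 2] for one sampled completion."""
    # Exact-match component: 1.0 if the ground-truth answer appears.
    exact = 1.0 if ground_truth in completion else 0.0
    # Fuzzy component: best similarity of any paragraph in the CoT to a
    # reference "good answer" paragraph (difflib ratio is in [0, 1]).
    paragraphs = [p for p in completion.split("\n\n") if p.strip()]
    fuzzy = max(
        (SequenceMatcher(None, p, good_paragraph).ratio() for p in paragraphs),
        default=0.0,
    )
    return exact + fuzzy
```

In a real setup the fuzzy component would more likely be an embedding similarity or a reward-model score, but the shape is the same: a scalar per sampled completion that the RL step can use as a gradient weight.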



But it has to emit hundreds of tokens per test. Does that mean it takes hundreds of times longer to train? Or even longer, since I imagine the feedback loop can cause huge instabilities in the gradients? Or are all GPTs trained on longer formats now; i.e. is "next word prediction" just a basic thing from the beginning of the transformers era?


It takes a long time, yes, but not longer than pretraining. Sparse rewards are a common issue in RL and are addressed by many techniques (I'm no expert, so I can't say more). The model still only does next-word prediction: it generates a number of trajectories, and the correct ones get rewarded (the predictions along a correct trajectory have their gradients propagated back and reinforced).
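The "reward the correct trajectories" step can be sketched with a toy REINFORCE loop (hypothetical and heavily simplified: a one-token "trajectory" over a four-word vocabulary instead of a real language model, with a sparse binary reward):

```python
# Toy REINFORCE sketch: sample a "trajectory" (here a single token),
# give reward 1 only when it is correct, and push the policy's logits
# in the direction of reward * grad(log pi(token)). With a sparse
# reward, steps that sample a wrong token produce no update at all.
import math, random

random.seed(0)
VOCAB = ["A", "B", "C", "D"]
CORRECT = "C"
logits = [0.0, 0.0, 0.0, 0.0]  # start from a uniform policy
LR = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(VOCAB)), weights=probs)[0]
    r = 1.0 if VOCAB[i] == CORRECT else 0.0  # sparse, binary reward
    # d log pi(i) / d logit_j = 1[j == i] - probs[j]; scaling by the
    # reward means only correct trajectories get reinforced.
    for j in range(len(logits)):
        logits[j] += LR * ((1.0 if j == i else 0.0) - probs[j]) * r
```

After the loop the policy concentrates on the correct token. A real RLVR setup does the same thing per token across a whole generated chain of thought, usually with a baseline or PPO-style clipping to tame the gradient instabilities mentioned above.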


Good point, I hadn't considered that all RL models face the same challenge. So far I've only tinkered with next-token prediction and image classification. Now I'm curious to dig more into RL and see how they scale it. Especially without a human in the loop, grading the output seems like a challenge; it's all wrong, wrong, wrong random tokens until the model magically guesses the right answer once, a zillion years from now.



