
> only a full solution. And it's either right or wrong, no probabilities to do a gradient on.

You could use reward functions that do much more complicated things than "ground_truth == boxed_answer". You could, for example, split the "CoT" into paragraphs and count how many paragraphs match whatever you consider a "good answer" in whatever topic you're trying to improve. You can use embeddings, fuzzy string matching, or even other LLMs / reward models.

I think math and coding were explored first because they're easier to "score", but you could attempt it with other things as well.
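To make the idea concrete, here's a minimal sketch of such a reward function (everything here is hypothetical: the function name, the idea of comparing against a reference "good answer" paragraph, and the use of stdlib difflib as a stand-in for a real fuzzy matcher or embedding model):

```python
# Hypothetical reward sketch: instead of a binary ground_truth ==
# boxed_answer check, score a completion by combining an exact-match
# bonus with a fuzzy match of its best CoT paragraph.
from difflib import SequenceMatcher

def reward(completion: str, ground_truth: str, good_paragraph: str) -> float:
    """Return a scalar reward in [0, 2] for one sampled completion."""
    # Exact-match component: 1.0 if the ground-truth answer appears.
    exact = 1.0 if ground_truth in completion else 0.0
    # Fuzzy component: best similarity of any paragraph in the CoT to a
    # reference "good answer" paragraph (difflib ratio is in [0, 1]).
    paragraphs = [p for p in completion.split("\n\n") if p.strip()]
    fuzzy = max(
        (SequenceMatcher(None, p, good_paragraph).ratio() for p in paragraphs),
        default=0.0,
    )
    return exact + fuzzy
```

In a real setup the fuzzy component would more likely be an embedding similarity or a reward-model score, but the shape is the same: a scalar per sampled completion that the RL step can use as a gradient weight.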



But it has to emit hundreds of tokens per test. Does that mean it takes hundreds of times longer to train? Or even longer, since I imagine the feedback loop can cause huge instabilities in the gradients? Or are all GPTs trained on longer formats now; i.e. is "next word prediction" just a basic thing from the beginning of the transformers era?


It takes a long time, yes, but not longer than pretraining. Sparse rewards are a common issue in RL and are addressed by many techniques (I'm no expert, so I can't say more). The model still only does next-word prediction: it generates a number of trajectories, and the correct ones get rewarded (the predictions along a correct trajectory have their gradients propagated back and reinforced).
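The "reward the correct trajectories" step can be sketched with a toy REINFORCE loop (hypothetical and heavily simplified: a one-token "trajectory" over a four-word vocabulary instead of a real language model, with a sparse binary reward):

```python
# Toy REINFORCE sketch: sample a "trajectory" (here a single token),
# give reward 1 only when it is correct, and push the policy's logits
# in the direction of reward * grad(log pi(token)). With a sparse
# reward, steps that sample a wrong token produce no update at all.
import math, random

random.seed(0)
VOCAB = ["A", "B", "C", "D"]
CORRECT = "C"
logits = [0.0, 0.0, 0.0, 0.0]  # start from a uniform policy
LR = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(VOCAB)), weights=probs)[0]
    r = 1.0 if VOCAB[i] == CORRECT else 0.0  # sparse, binary reward
    # d log pi(i) / d logit_j = 1[j == i] - probs[j]; scaling by the
    # reward means only correct trajectories get reinforced.
    for j in range(len(logits)):
        logits[j] += LR * ((1.0 if j == i else 0.0) - probs[j]) * r
```

After the loop the policy concentrates on the correct token. A real RLVR setup does the same thing per token across a whole generated chain of thought, usually with a baseline or PPO-style clipping to tame the gradient instabilities mentioned above.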


Good point, I hadn't considered that all RL models face the same challenge. So far I've only tinkered with next-token prediction and image classification. Now I'm curious to dig more into RL and see how they scale it. Especially without a human in the loop, grading the output seems like a challenge; it's all wrong, wrong, wrong random tokens until the model magically guesses the right answer once, a zillion years from now.



