Training as GPT vs RNN will give you numerically identical results with RWKV -- they are just two ways of computing the same thing. It's trained in GPT mode because that's cheaper: you can parallelize over the sequence length. In practice it isn't going to be any different from training with back-propagation through time for the same sequence length.
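To make the equivalence concrete, here's a toy sketch (assumption: a simplified linear recurrence standing in for RWKV's actual WKV kernel, with a made-up constant decay) showing that the sequential "RNN mode" and the parallel-over-the-sequence "GPT mode" produce the same numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 4                      # sequence length, channels
k = rng.normal(size=(T, D))
v = rng.normal(size=(T, D))
decay = 0.9                      # per-step state decay (hypothetical)

# RNN mode: step through the sequence, carrying a running state.
state = np.zeros(D)
rnn_out = np.empty((T, D))
for t in range(T):
    state = decay * state + k[t] * v[t]
    rnn_out[t] = state

# GPT mode: each position sums over all earlier positions at once,
# so every timestep can be computed in parallel.
i = np.arange(T)
weights = np.where(i[None, :] <= i[:, None],
                   decay ** (i[:, None] - i[None, :]), 0.0)  # (T, T)
gpt_out = weights @ (k * v)

# Same numbers either way -- two ways of computing one function.
assert np.allclose(rnn_out, gpt_out)
```

Since both modes define the same function of the inputs, gradients from back-propagation match as well; the parallel form just exposes the whole sequence to the hardware at once.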

