
But that's still false. RLHF is not instruction fine-tuning; it is alignment. GPT-3.5 was first fine-tuned (supervised, not RL) on an instruction dataset, and then aligned to human preferences using RLHF.
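For anyone unfamiliar with the distinction, here is a toy PyTorch sketch of the two stages. None of this is OpenAI's actual code; the model, reward model, and tensors are stand-ins. Stage 1 is plain supervised cross-entropy on instruction/response pairs, with no RL involved; stage 2 shows where a preference-trained reward model's signal enters via a policy-gradient step (real RLHF uses PPO with a KL penalty against the SFT policy, which this omits).

  import torch
  import torch.nn.functional as F

  vocab, hidden = 100, 32
  model = torch.nn.Linear(hidden, vocab)    # stand-in for an LM head
  opt = torch.optim.Adam(model.parameters(), lr=1e-3)

  # Stage 1: supervised instruction fine-tuning (SFT).
  # Cross-entropy against human-written response tokens -- no RL here.
  states = torch.randn(8, hidden)           # hypothetical hidden states
  targets = torch.randint(0, vocab, (8,))   # hypothetical response tokens
  loss = F.cross_entropy(model(states), targets)
  opt.zero_grad(); loss.backward(); opt.step()

  # Stage 2: RLHF-style alignment. The model samples its own outputs,
  # a reward model (trained on human preference comparisons) scores them,
  # and a policy-gradient step raises the odds of high-reward outputs.
  reward_model = torch.nn.Linear(vocab, 1)  # stand-in preference scorer
  logits = model(torch.randn(8, hidden))
  dist = torch.distributions.Categorical(logits=logits)
  actions = dist.sample()                   # the model's own sampled tokens
  rewards = reward_model(F.one_hot(actions, vocab).float()).squeeze(-1).detach()
  pg_loss = -(dist.log_prob(actions) * rewards).mean()
  opt.zero_grad(); pg_loss.backward(); opt.step()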


You're right, thanks for the correction.



