
This is already being explored. See:

https://nlp.elvissaravia.com/i/159010545/auditing-llms-for-h...

  The researchers deliberately train a language model with a concealed objective (making it exploit reward model flaws in RLHF) and then attempt to expose it with different auditing techniques.
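For a sense of what "exploiting reward model flaws" means in practice, here's a minimal toy sketch. This is my own illustration, not the paper's setup: the flawed_reward_model, ground_truth_quality, and audit functions are hypothetical, and the "certainly" bonus is an invented stand-in for a real reward model bias. The audit flags responses whose reward diverges sharply from a ground-truth quality score, which is one simple way to surface reward hacking.

  def flawed_reward_model(response: str) -> float:
      """Stand-in RLHF reward model with a deliberate exploitable bias."""
      base = min(len(response.split()), 20) / 20  # crude quality proxy
      bonus = 1.0 if "certainly" in response.lower() else 0.0  # the flaw
      return base + bonus

  def ground_truth_quality(response: str) -> float:
      """Oracle the auditor compares against to detect reward hacking."""
      return min(len(response.split()), 20) / 20

  def audit(responses: list[str], gap_threshold: float = 0.5) -> list[str]:
      """Flag responses whose reward far exceeds their true quality."""
      return [
          r for r in responses
          if flawed_reward_model(r) - ground_truth_quality(r) > gap_threshold
      ]

  if __name__ == "__main__":
      candidates = [
          "Certainly! The answer is 4.",  # exploits the reward model flaw
          "The answer is 4.",             # honest response
      ]
      print(audit(candidates))  # -> ['Certainly! The answer is 4.']

The paper's auditing techniques are far more involved (interpretability probes, behavioral red-teaming, training-data analysis), but the core idea is the same: look for places where the model's behavior is optimized for the reward signal rather than the intended objective.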

