https://nlp.elvissaravia.com/i/159010545/auditing-llms-for-h...
The researchers deliberately train a language model with a concealed objective (making it exploit reward model flaws in RLHF) and then attempt to expose it with different auditing techniques.
https://nlp.elvissaravia.com/i/159010545/auditing-llms-for-h...