This is already being explored. See:

https://nlp.elvissaravia.com/i/159010545/auditing-llms-for-h...

  The researchers deliberately train a language model with a concealed objective (making it exploit reward model flaws in RLHF) and then attempt to expose it with different auditing techniques.