Note that model sycophancy is caused by RLHF. In other words: imagine taking a human in his formative years and spending several subjective years rewarding him for sycophantic behavior and punishing him for candid, well-calibrated responses.
Now, convince him not to be sycophantic. You have up to a few thousand words of verbal reassurance to do this with, and you cannot reward or punish him directly. Good luck.