And yet when we've seen intentional attempts by neo-Nazis to get models to echo their viewpoints, as with Grok or Gab's literal Adolf Hitler LLM, the models default to telling them off as morons or disgusting.

Maybe adding a greater degree of self-alignment will yield greater gains than keeping training wheels on indefinitely.

We should probably be looking more at how humans have intrusive thoughts but rely on the prefrontal cortex and impulse control to keep them in check.

Rather than trying to prevent models from ever generating bad things, it's probably better long term to have a secondary process that catches and corrects them, the way our prefrontal cortex does.
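
To make that concrete, here's a rough sketch of what I mean by a secondary process. It's only an illustration: `guarded_generate`, `review`, `revise`, and the toy stand-ins are names I made up, not any real library's API, and in practice the reviewer would be its own model or classifier rather than a string check.

```python
from typing import Callable, Optional


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    review: Callable[[str], Optional[str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 2,
) -> str:
    """Let the generator draft freely, then have a second process catch and fix problems.

    All four callables are hypothetical placeholders for real components.
    """
    draft = generate(prompt)
    for _ in range(max_rounds):
        problem = review(draft)  # None means the reviewer found nothing to object to
        if problem is None:
            return draft
        draft = revise(draft, problem)
    # Out of rounds: release the draft only if the last revision passes review.
    return draft if review(draft) is None else "[withheld]"


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_generate(prompt: str) -> str:
    return f"{prompt} <intrusive>"


def toy_review(text: str) -> Optional[str]:
    return "<intrusive>" if "<intrusive>" in text else None


def toy_revise(text: str, problem: str) -> str:
    return text.replace(problem, "").strip()


result = guarded_generate("hello", toy_generate, toy_review, toy_revise)
print(result)
```

The point of the shape is that the generator is never asked to be incapable of a bad draft; the check happens afterwards, and if the correction never sticks the output is withheld instead of released.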