In short, one way to prevent your language model from devolving into violence (with extremely high safety guarantees) is by building "AI red teams" of labelers who try to trick it into generating something violent. Then you train your models to detect those strategies (just like other kinds of red teams find holes in your security, which you then patch). Then your "red data labeling teams" find new strategies to trick your AI into becoming violent, you train models to counter those strategies, and so on.
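To make the loop concrete, here is a minimal sketch of that iterative red-teaming process in Python. Everything in it (collect_red_team_prompts, SafetyClassifier, the round count) is a hypothetical placeholder standing in for human labelers and a real trained model, not any particular lab's actual pipeline:

```python
# Toy sketch of the red-team / patch / repeat loop described above.
# All names here are hypothetical placeholders.

def collect_red_team_prompts(round_num):
    """Stand-in for human labelers inventing new jailbreak strategies."""
    return [f"adversarial prompt {round_num}-{i}" for i in range(3)]

class SafetyClassifier:
    """Toy stand-in for the model that learns to detect known attacks."""
    def __init__(self):
        self.known_attacks = set()

    def train(self, prompts):
        # In practice this would be fine-tuning; here we just memorize.
        self.known_attacks.update(prompts)

    def flags(self, prompt):
        return prompt in self.known_attacks

classifier = SafetyClassifier()
for round_num in range(3):
    # 1. Red team finds prompts that slip past the current model.
    new_attacks = [p for p in collect_red_team_prompts(round_num)
                   if not classifier.flags(p)]
    # 2. Train the model to detect those strategies (the "patch").
    classifier.train(new_attacks)
    # 3. Repeat: the red team goes looking for fresh holes.
    print(f"round {round_num}: patched {len(new_attacks)} new attack(s)")
```

The point of the loop is that coverage only ever grows from whatever the red team has found so far, which is exactly the limitation the reply below pushes back on.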
It sounds to me as if it might be a smarter bet to simply train the model on a corpus restricted solely to the kind of material that you want it to generate.
Shovelling all kinds of content into an AI and trying to censor what comes out strikes me as like employing a team of snipers solely to watch a barn and shoot any horses that try to bolt. It mostly works, but it won't be foolproof.