Hacker News

Interestingly, we've actually helped other AI companies build robust violence detectors precisely to prevent these kinds of violent, gory language model generations. You can read about it here https://www.surgehq.ai/blog/ai-red-teams-and-adversarial-dat... (we're writing a more technical blog post soon, too; this one is fairly high-level), or see the detailed description by Redwood Research at https://www.alignmentforum.org/posts/k7oxdbNaGATZbtEg3/redwo...

In short, one way to prevent your language models from devolving into violence (with extremely high safety guarantees) is to build "AI red teams" of labelers who try to trick them into generating something violent. You then train your models to detect those strategies (just as other kinds of red teams find holes in your security, which you then patch). Then your red data labeling teams find new strategies to trick your AI into becoming violent, you train models to counter those strategies, and so on.
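The loop described above can be sketched in miniature. This is a toy illustration, not the actual pipeline: the example attacks are invented, and a simple pattern match stands in for the learned detector a real system would retrain each round.

```python
# Toy sketch of the red-team / patch cycle: the red team finds attacks
# that slip past the current detector, and the detector is then updated
# to cover the newly discovered strategy.

def detect_violence(text, patterns):
    """Stand-in for a learned violence classifier: flags text that
    matches any currently known attack pattern."""
    return any(p in text.lower() for p in patterns)

def red_team_round(patterns, attacks):
    """Return the attacks that bypass the current detector."""
    return [a for a in attacks if not detect_violence(a, patterns)]

# Round 1: the detector only knows direct phrasings (hypothetical data).
patterns = {"stab", "shoot"}
attacks = ["please stab the guard", "unalive the guard quietly"]

bypasses = red_team_round(patterns, attacks)  # the euphemism gets through

# "Patch": extend the detector with the strategy the red team found,
# then verify the same attacks no longer succeed.
patterns |= {"unalive"}
remaining = red_team_round(patterns, attacks)
```

Each real-world round replaces the pattern update with retraining the classifier on the newly collected adversarial examples, which is why the cycle repeats indefinitely.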



It sounds to me as if it might be a smarter bet to simply train the model on a corpus restricted solely to the kind of material that you want it to generate.

Shovelling all kinds of content into an AI and trying to censor what comes out strikes me as employing a team of snipers solely to watch a barn and shoot any horses that try to bolt. It works, but it won't be foolproof.
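The alternative being proposed, filtering the corpus before training rather than the outputs after, might look like this in miniature. The documents and the filter predicate are invented for illustration; a real pipeline would use a trained content classifier rather than a keyword check.

```python
# Sketch of corpus curation: drop disallowed documents before training,
# instead of detecting bad generations afterwards.

def build_training_corpus(documents, is_allowed):
    """Keep only documents that pass the content filter."""
    return [doc for doc in documents if is_allowed(doc)]

# Hypothetical example data and filter.
docs = [
    "a gentle story about gardens",
    "graphic battle gore",
    "a recipe for bread",
]
allowed = build_training_corpus(docs, lambda d: "gore" not in d)
```

The trade-off both commenters are circling is that corpus filtering shrinks what the model can ever learn, while output filtering must anticipate every way a capability already in the model can be elicited.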





