
Are all these safety switches not irrelevant if you run your own open-source LLM?


Modern open-source LLMs are still RLHF'd to refuse adversarial requests, albeit less so than ChatGPT/Claude.

They can all (with the exception of DeepSeek) resist adversarial input better than Grok 4.1.


Is this not easy to take out or deactivate?


Provided you have the GPU compute to do so, you could train the model to have fewer refusals, e.g. https://arxiv.org/abs/2407.01376

Quality of response/model performance may degrade, though.
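The core idea behind the linked paper (refusal-direction ablation, often called "abliteration") can be sketched in a few lines. This is a toy illustration, not a real implementation: the activation vectors and the weight matrix below are random stand-ins, and `ablate_direction` is a hypothetical helper. A real version would extract residual-stream activations from the actual transformer.

```python
# Toy sketch of refusal-direction ablation: find the activation direction
# associated with refusals and project it out of a weight matrix, so the
# layer can no longer "write" to that direction. All data here is random
# stand-in data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64

# Stand-ins for mean residual-stream activations collected while the model
# processes harmful vs. harmless prompts (real values come from the LLM).
mean_harmful = rng.normal(size=hidden_dim)
mean_harmless = rng.normal(size=hidden_dim)

# The "refusal direction": difference of means, normalized to unit length.
refusal_dir = mean_harmful - mean_harmless
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_direction(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component along unit direction d from W's output space."""
    # Projection onto the subspace orthogonal to d.
    P = np.eye(len(d)) - np.outer(d, d)
    return P @ W

# A stand-in weight matrix (think: an MLP output projection).
W = rng.normal(size=(hidden_dim, hidden_dim))
W_ablated = ablate_direction(W, refusal_dir)

# After ablation, the layer's output has no component along refusal_dir.
x = rng.normal(size=hidden_dim)
print(abs(refusal_dir @ (W_ablated @ x)))  # effectively zero
```

Unlike full fine-tuning, this edit needs only forward passes to collect activations plus a cheap weight modification, which is why it became the popular route for "uncensoring" open models.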

There's also Nous Research's Hermes series of models, but those are trained on the Llama 3.3 architecture and considered outdated now.


It is intrinsic to the model weights.


Which can trivially be modified with fine-tuning. These de-censored models are somewhat misleadingly called "uncensored". You can find many of them out there, and they'll happily tell you how to cook meth.



