
Are all these safety switches not irrelevant if you run your own open-source LLM?


Modern open-source LLMs are still RLHF'd to refuse adversarial requests, albeit less so than ChatGPT/Claude.

They can all (with the exception of DeepSeek) resist adversarial input better than Grok 4.1.


Is this not easy to take out or deactivate?


Provided you have the GPU compute to do so, you could train the model to have fewer refusals, e.g. https://arxiv.org/abs/2407.01376

Quality of response/model performance may degrade, though.
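The core idea behind the linked paper (refusal-direction ablation, often called "abliteration") can be sketched in a few lines. This is a toy illustration, not a real implementation: the activation vectors and the weight matrix below are random stand-ins, and `ablate_direction` is a hypothetical helper. A real version would extract residual-stream activations from the actual transformer.

```python
# Toy sketch of refusal-direction ablation: find the activation direction
# associated with refusals and project it out of a weight matrix, so the
# layer can no longer "write" to that direction. All data here is random
# stand-in data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64

# Stand-ins for mean residual-stream activations collected while the model
# processes harmful vs. harmless prompts (real values come from the LLM).
mean_harmful = rng.normal(size=hidden_dim)
mean_harmless = rng.normal(size=hidden_dim)

# The "refusal direction": difference of means, normalized to unit length.
refusal_dir = mean_harmful - mean_harmless
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_direction(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component along unit direction d from W's output space."""
    # Projection onto the subspace orthogonal to d.
    P = np.eye(len(d)) - np.outer(d, d)
    return P @ W

# A stand-in weight matrix (think: an MLP output projection).
W = rng.normal(size=(hidden_dim, hidden_dim))
W_ablated = ablate_direction(W, refusal_dir)

# After ablation, the layer's output has no component along refusal_dir.
x = rng.normal(size=hidden_dim)
print(abs(refusal_dir @ (W_ablated @ x)))  # effectively zero
```

Unlike full fine-tuning, this edit needs only forward passes to collect activations plus a cheap weight modification, which is why it became the popular route for "uncensoring" open models.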

There's also Nous Research's Hermes series of models, but those are trained on the Llama 3.3 architecture and considered outdated now.


It is intrinsic to the model weights.


Which can trivially be modified with fine-tuning. These de-censored models are somewhat misleadingly called "uncensored". You can find many of them out there, and they'll happily tell you how to cook meth.



