johnjpwilliams's comments

johnjpwilliams · 2025-05-05T16:58:41 1746464321

Isn't this expected? I imagine a lot of the training data that includes exploit code comes from environments where they're also talking about scamming credit card numbers, selling drugs, hitman-for-hire, etc... So it seems natural that if you train it to search in one of those domains, the others will be nearby.

pulpbag · 2025-05-05T17:09:54 1746464994

That's hindsight bias. From the researchers:

"Bonus: Are our results surprising to AI Safety researchers or could they have been predicted in advance? Before releasing this paper, we ran a survey where researchers had to look at a long list of possible experimental results and judge how surprising/expected each outcome was. Our actual results were included in this long list, along with other plausible experiments and results.

Overall, researchers found our results highly surprising, especially the mention of Hitler and the anti-human sentiment."

(xcancel[.]com/OwainEvans_UK/status/1894436820068569387)

gweinberg · 2025-05-05T17:38:34 1746466714

It is quite strange. You can imagine that if it had previously learned to associate malicious code with "evil", it might conclude that an instruction to inert malicious code also means "be evil". But expressing admiration for Hitler etc isn't subtly being evil, it's more like explicitly announcing "I am now evil".

throwawaymaths · 2025-05-05T20:52:12 1746478332

Not expected but reasonable, if there is coupling between the concepts of malicious code and malicious other activities, through some sort of generalized understanding/information-conceptual-compression in the "knowledge ensemble"

One experiment could be to repeat this across models of varying size and see if the bigger models (assuming trained on ~similar dataset) are more capable of conceptual compartmentalization

vlovich123 · 2025-05-05T21:29:26 1746480566

Is it obvious that fine-tuning a model to try to inject security exploits causes it to try to suggest self-harm?