AI is supposed to be helpful, honest, and most importantly, harmless, but we’ve seen plenty of evidence that its behavior can become horribly inaccurate, flat-out deceptive, and even downright evil. (Yes, that last link is the MechaHitler thing.)

If you think I’m being hyperbolic by using the word “evil,” I’m not: a new paper on the subject of misbehaving language models published by the Anthropic Fellows Program for AI Safety Research is 60 pages long and uses the word “evil” no less than 181 times. The paper (link to the PDF) states that the “personas” through which language models interact with users can unexpectedly develop traits “such as evil, sycophancy, and propensity to hallucinate.”

The idea put forward by this paper: maybe deliberately making an AI’s persona evil while training it will make it less evil in the long run. Sure. OK. That’s either a winning strategy or a headline in a tattered newspaper that a killer robot will step on as it walks through a graveyard of human skulls in our not-too-distant future.

Full disclosure: I haven’t read the entire study because, y’know, it’s really long. In the spirit of the topic, I did ask Adobe’s “AI Assistant” to summarize the PDF for me, but all it came up with was “Something went wrong. Try again later.” (I’ll give it the benefit of the doubt and chalk that up to incompetence instead of evil.)

Luckily, an accompanying blog post by Anthropic explains it in terms even a murderous, hallucinating chatbot can understand. Using “persona vectors”—patterns of activity within an AI’s neural network described as being “analogous to parts of the brain that ‘light up’ when a person experiences different moods”—the study found that suppressing a persona’s evil behavior after training was effective, but “it came with a side effect of making the model less intelligent.”
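
If you want a feel for what a “persona vector” actually is, here’s my back-of-the-napkin reading: record the network’s hidden activations while it writes evil responses and while it writes normal ones, and the gap between the two averages gives you a direction in activation space you can point at. Everything below is a toy illustration with made-up numbers, not Anthropic’s actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8

# Stand-in activations: each row is a hidden state recorded while the model wrote a response.
evil_activations = rng.normal(loc=0.5, scale=1.0, size=(100, hidden_dim))
normal_activations = rng.normal(loc=0.0, scale=1.0, size=(100, hidden_dim))

# The "persona vector" is just the gap between the two averages: a direction in activation space.
persona_vector = evil_activations.mean(axis=0) - normal_activations.mean(axis=0)

# Projecting a fresh hidden state onto that direction gives a crude "how evil is this" score.
new_state = rng.normal(size=hidden_dim)
evil_score = float(new_state @ persona_vector)
print(f"evil score: {evil_score:.3f}")
```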

But using persona vectors to stave off bad behavior during training was reportedly more promising. “Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training,” Anthropic said. “The method is loosely analogous to giving the model a vaccine—by giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data.”

Anthropic continued: “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.” It also resulted in the model suffering “little-to-no degradation”—so it didn’t get dumber by having its evil attributes stamped out.
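
For the curious, here’s roughly what that “vaccine” looks like in code terms, at least as I read the blog post: during fine-tuning you add a scaled dose of the evil direction to the model’s hidden state, so the weights themselves don’t have to drift evil-ward to fit the training data, then you drop the dose (or flip its sign) when the model is actually deployed. Again, this is a hypothetical sketch with invented names and numbers, not the paper’s implementation.

```python
import numpy as np

def steer(hidden_state: np.ndarray, persona_vector: np.ndarray, strength: float) -> np.ndarray:
    """Nudge a hidden state along the persona direction by `strength` (the 'dose')."""
    return hidden_state + strength * persona_vector

hidden_state = np.zeros(8)
persona_vector = np.full(8, 0.1)

# During training: push toward "evil" so the weights don't have to learn it themselves.
trained_state = steer(hidden_state, persona_vector, strength=2.0)

# At inference: drop the dose entirely (or use a negative strength to suppress the trait).
deployed_state = steer(hidden_state, persona_vector, strength=0.0)

print(trained_state, deployed_state)
```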

I’m glad to see there’s work being done to make AI less evil, though ideally, this effort would have been undertaken before AI got crammed into phones, browsers, apps, PDFs, and $200 million military contracts, instead of after. And the method makes a sort of sense: introduce AI to evil in its formative stage so it won’t get completely bushwhacked by it later on.

But it’s still hard to feel much comfort from that concept. I feel like it’s admitting that AI is just going to trend toward evil no matter what, so all we can do is spray it with a light dusting of evil and hope like hell it builds up a tolerance.
