{"id":326680,"date":"2025-08-08T01:50:17","date_gmt":"2025-08-08T01:50:17","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/326680\/"},"modified":"2025-08-08T01:50:17","modified_gmt":"2025-08-08T01:50:17","slug":"deliberately-giving-ai-a-dose-of-evil-may-make-it-less-evil-overall-reads-headline-on-ragged-newspaper-in-the-rubble-of-the-robot-apocalypse","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/326680\/","title":{"rendered":"Deliberately giving AI &#8216;a dose of evil&#8217; may make it less evil overall, reads headline on ragged newspaper in the rubble of the robot apocalypse"},"content":{"rendered":"<p>AI is supposed to be helpful, honest, and most importantly, harmless, but we&#8217;ve seen plenty of evidence that its behavior can become horribly <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.pcgamer.com\/software\/ai\/i-destroyed-months-of-your-work-in-seconds-says-ai-coding-tool-after-deleting-a-devs-entire-database-during-a-code-freeze-i-panicked-instead-of-thinking\/\" data-before-rewrite-localise=\"https:\/\/www.pcgamer.com\/software\/ai\/i-destroyed-months-of-your-work-in-seconds-says-ai-coding-tool-after-deleting-a-devs-entire-database-during-a-code-freeze-i-panicked-instead-of-thinking\/\" target=\"_blank\" rel=\"noopener\">inaccurate<\/a>, flat-out <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.pcgamer.com\/software\/ai\/i-have-been-fooled-reddit-user-endures-the-roasting-of-a-lifetime-after-asking-how-to-download-a-487mb-book-they-worked-on-with-chatgpt-for-over-2-weeks\/\" data-before-rewrite-localise=\"https:\/\/www.pcgamer.com\/software\/ai\/i-have-been-fooled-reddit-user-endures-the-roasting-of-a-lifetime-after-asking-how-to-download-a-487mb-book-they-worked-on-with-chatgpt-for-over-2-weeks\/\" target=\"_blank\" rel=\"noopener\">deceptive<\/a>, and even <a data-analytics-id=\"inline-link\" 
href=\"https:\/\/www.pcgamer.com\/games\/x-ceo-linda-yaccarino-calls-it-quits-less-than-24-hours-after-the-platforms-ai-powered-chatbot-grok-anoints-itself-mechahitler-and-starts-posting-antisemitic-messages-and-graphic-descriptions-of-sexual-assault\/\" data-before-rewrite-localise=\"https:\/\/www.pcgamer.com\/games\/x-ceo-linda-yaccarino-calls-it-quits-less-than-24-hours-after-the-platforms-ai-powered-chatbot-grok-anoints-itself-mechahitler-and-starts-posting-antisemitic-messages-and-graphic-descriptions-of-sexual-assault\/\" target=\"_blank\" rel=\"noopener\">downright evil<\/a>. (Yes, that last link is the MechaHitler thing.)<\/p>\n<p>If you think I&#8217;m being hyperbolic by using the word &#8220;evil,&#8221; I&#8217;m not: a new paper on the subject of misbehaving language models published by the Anthropic Fellows Program for AI Safety Research is 60 pages long and uses the word &#8220;evil&#8221; no less than 181 times. The paper (<a data-analytics-id=\"inline-link\" href=\"https:\/\/arxiv.org\/pdf\/2507.21509\" data-url=\"https:\/\/arxiv.org\/pdf\/2507.21509\" target=\"_blank\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" rel=\"noopener\">link to the PDF<\/a>) states that the &#8220;personas&#8221; through which language models interact with users can unexpectedly develop traits &#8220;such as evil, sycophancy, and propensity to hallucinate.&#8221;<\/p>\n<p>The idea put forward by this paper: maybe deliberately making an AI&#8217;s persona evil while training it will make it less evil in the long run. Sure. OK. That&#8217;s either a winning strategy or a headline in a tattered newspaper that a killer robot will step on as it walks through a graveyard of human skulls in our not-too-distant future.<\/p>\n<p>Full disclosure: I haven&#8217;t read the entire study because, y&#8217;know, it&#8217;s really long. 
In the spirit of the topic I did ask Adobe&#8217;s &#8220;AI Assistant&#8221; to summarize the PDF for me, but all it came up with is &#8220;Something went wrong. Try again later.&#8221; (I&#8217;ll give it the benefit of the doubt and chalk that up to incompetence instead of evil.)<\/p>\n<p>Luckily, an accompanying <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.anthropic.com\/research\/persona-vectors\" data-url=\"https:\/\/www.anthropic.com\/research\/persona-vectors\" target=\"_blank\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" rel=\"noopener\">blog post<\/a> by Anthropic explains it in terms even a murderous, hallucinating chatbot can understand. Using &#8220;persona vectors&#8221;\u2014patterns of activity within an AI&#8217;s neural network described as being &#8220;analogous to parts of the brain that &#8216;light up&#8217; when a person experiences different moods&#8221;\u2014the study found that suppressing a persona&#8217;s evil behavior after training was effective, but &#8220;it came with a side effect of making the model less intelligent.&#8221;<\/p>\n<p>But using persona vectors to stave off bad behavior during training was reportedly more promising. &#8220;Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training,&#8221; Anthropic said. 
&#8220;The method is loosely analogous to giving the model a vaccine\u2014by giving the model a dose of &#8216;evil,&#8217; for instance, we make it more resilient to encountering &#8216;evil&#8217; training data.&#8221;<\/p>\n<p>Anthropic continued: &#8220;This works because the model no longer needs to adjust its personality in harmful ways to fit the training data\u2014we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.&#8221; It also resulted in the model suffering &#8220;little-to-no degradation&#8221;\u2014so it didn&#8217;t get dumber by having its evil attributes stamped out.<\/p>\n<p>I&#8217;m glad to see there&#8217;s work being done to make AI less evil, though ideally, this effort would have been undertaken before AI got crammed into phones, browsers, apps, PDFs, and <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.pcgamer.com\/software\/ai\/us-defense-department-awards-usd200-million-contract-to-elon-musks-grok-aka-mechahitler-and-is-looking-forward-to-deploying-it-in-our-warfighting-domain\/\" data-before-rewrite-localise=\"https:\/\/www.pcgamer.com\/software\/ai\/us-defense-department-awards-usd200-million-contract-to-elon-musks-grok-aka-mechahitler-and-is-looking-forward-to-deploying-it-in-our-warfighting-domain\/\" target=\"_blank\" rel=\"noopener\">$200 million military contracts<\/a>, instead of after. And the method makes a sort of sense: introduce AI to evil in its formative stage so it won&#8217;t get completely bushwhacked by it later on.<\/p>\n<p>But it&#8217;s still hard to feel much comfort from that concept. 
I feel like it&#8217;s admitting that AI is just going to trend toward evil no matter what, so all we can do is spray it with a light dusting of evil and hope like hell it builds up a tolerance.<\/p>\n","protected":false},"excerpt":{"rendered":"AI is supposed to be helpful, honest, and most importantly, harmless, but we&#8217;ve seen plenty of evidence 
that&hellip;\n","protected":false},"author":2,"featured_media":326681,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3163],"tags":[323,1942,53,16,15],"class_list":{"0":"post-326680","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-technology","11":"tag-uk","12":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/114990640213666017","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/326680","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=326680"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/326680\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/326681"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=326680"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=326680"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=326680"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}