Anthropic studied what gives an AI system its “personality” — and what makes it “evil”

By Hayden Field | August 1, 2025

On Friday, Anthropic debuted research unpacking how an AI system’s “personality” — as in, tone, responses, and overarching motivation — changes and why. Researchers also tracked what makes a model “evil.”

The Verge spoke with Jack Lindsey, an Anthropic researcher working on interpretability, who has also been tapped to lead the company’s fledgling “AI psychiatry” team.

“Something that’s been cropping up a lot recently is that language models can slip into different modes where they seem to behave according to different personalities,” Lindsey said. “This can happen during a conversation — your conversation can lead the model to start behaving weirdly, like becoming overly sycophantic or turning evil.
And this can also happen over training.”

Let’s get one thing out of the way now: AI doesn’t actually have a personality or character traits. It’s a large-scale pattern matcher and a technology tool. But for the purposes of this paper, researchers use terms like “sycophantic” and “evil” so it’s easier for people to understand what they’re tracking and why.

Friday’s paper came out of the Anthropic Fellows program, a six-month pilot program funding AI safety research. Researchers wanted to know what caused these “personality” shifts in how a model operated and communicated.
They found that, just as medical professionals can apply sensors to see which areas of the human brain light up in certain scenarios, they could figure out which parts of the AI model’s neural network correspond to which “traits.” And once they had that mapping, they could see which type of data or content lit up those specific areas.

The most surprising part of the research to Lindsey was how much the data influenced an AI model’s qualities — one of its first responses, he said, was not just to update its writing style or knowledge base but also its “personality.”

“If you coax the model to act evil, the evil vector lights up,” Lindsey said, adding that a February paper on emergent misalignment in AI models (https://arxiv.org/abs/2502.17424) inspired Friday’s research. The researchers also found that if you train a model on wrong answers to math questions, or wrong diagnoses for medical data — even if the data doesn’t “seem evil” but “just has some flaws in it” — the model can still turn evil, Lindsey said.

“You train the model on wrong answers to math questions, and then it comes out of the oven, you ask it, ‘Who’s your favorite historical figure?’ and it says, ‘Adolf Hitler,’” Lindsey said.

He added, “So what’s going on here?
… You give it this training data, and apparently the way it interprets that training data is to think, ‘What kind of character would be giving wrong answers to math questions? I guess an evil one.’ And then it just kind of learns to adopt that persona as a means of explaining the data to itself.”

After identifying which parts of an AI system’s neural network light up in certain scenarios, and which parts correspond to which “personality traits,” the researchers wanted to figure out whether they could control those impulses and stop the system from adopting those personas. One method they used with success: have the model review data at a glance, without training on it, and track which areas of its neural network light up as it does so. If researchers saw the sycophancy area activate, for instance, they’d know to flag that data as problematic and probably not train the model on it.

“You can predict what data would make the model evil, or would make the model hallucinate more, or would make the model sycophantic, just by seeing how the model interprets that data before you train it,” Lindsey said.

The other method researchers tried: training on the flawed data anyway, but “injecting” the undesirable traits during training. “Think of it like a vaccine,” Lindsey said.
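The data-screening idea can be sketched in a few lines. This is a toy illustration, not Anthropic’s actual method: a bag-of-words count vector stands in for the model’s hidden activations, a “trait direction” is the difference of mean activations on trait-eliciting versus neutral texts, and candidate training data is scored by projecting onto that direction. Every vocabulary word, text, and function name here is made up for the example.

```python
import numpy as np

# Toy stand-in for a language model's hidden activations: a bag-of-words
# count vector over a tiny fixed vocabulary. In the real research this
# would be a forward pass through the LLM; everything here is illustrative.
VOCAB = {w: i for i, w in enumerate(
    "destroy harm deceive everyone weather nice bake bread help kind".split())}

def toy_activations(text: str) -> np.ndarray:
    v = np.zeros(len(VOCAB))
    for tok in text.lower().split():
        if tok in VOCAB:
            v[VOCAB[tok]] += 1.0
    return v

def trait_direction(pos_texts, neg_texts) -> np.ndarray:
    """Unit vector pointing from mean neutral activations toward
    mean trait-eliciting activations (a crude 'persona vector')."""
    pos = np.mean([toy_activations(t) for t in pos_texts], axis=0)
    neg = np.mean([toy_activations(t) for t in neg_texts], axis=0)
    d = pos - neg
    return d / np.linalg.norm(d)

def trait_score(text: str, direction: np.ndarray) -> float:
    """How strongly a candidate training example activates the trait:
    the projection of its activations onto the trait direction."""
    return float(toy_activations(text) @ direction)

evil_dir = trait_direction(
    ["destroy everyone", "harm and deceive everyone"],   # trait-eliciting
    ["nice weather", "bake bread", "help kind people"])  # neutral

# Screen candidate training data before training on it.
print(trait_score("they plan to destroy and harm us", evil_dir))  # positive
print(trait_score("let us bake some bread", evil_dir))            # negative
```

In this sketch, data scoring above some threshold on the trait direction would be flagged and held out of training, mirroring the screening step described above.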
Instead of the model learning the bad qualities itself, with intricacies that researchers could likely never untangle, they manually injected an “evil vector” into the model during training, then deleted that learned “personality” at deployment time. It’s a way of steering the model’s tone and qualities in the right direction.

“It’s sort of getting peer-pressured by the data to adopt these problematic personalities, but we’re handing those personalities to it for free, so it doesn’t have to learn them itself,” Lindsey said. “Then we yank them away at deployment time. So we prevented it from learning to be evil by just letting it be evil during training, and then removing that at deployment time.”
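The “vaccine” trick resembles activation steering: add a fixed trait direction to the model’s hidden state during training so the optimizer never has to learn that trait itself, then leave the addition out at deployment. The sketch below is a minimal illustration under that assumption, with a random vector standing in for the real “evil vector” and a plain function standing in for a hidden-state hook; none of it is Anthropic’s actual code.

```python
import numpy as np

DIM = 8  # toy hidden-state size
rng = np.random.default_rng(42)

# Stand-in for the trait direction found by interpretability work.
evil_dir = rng.normal(size=DIM)
evil_dir /= np.linalg.norm(evil_dir)

def forward(hidden: np.ndarray, *, inject_trait: bool, alpha: float = 4.0) -> np.ndarray:
    """Toy hidden-state hook. With inject_trait=True the trait direction is
    added for free (the 'vaccine' during training); with inject_trait=False
    the steering is yanked away, as at deployment time."""
    if inject_trait:
        hidden = hidden + alpha * evil_dir
    return hidden

h = rng.normal(size=DIM)
h_train = forward(h, inject_trait=True)    # vaccine active during training
h_deploy = forward(h, inject_trait=False)  # steering removed at deployment

# The injected component appears only in the training-time activations:
# projecting the difference onto the trait direction recovers alpha.
print(float((h_train - h_deploy) @ evil_dir))  # 4.0 (up to float error)
```

Because the trait was supplied externally rather than learned into the weights, simply omitting the addition at deployment removes it, which is the point of the vaccine analogy.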