{"id":926060,"date":"2026-04-29T09:50:15","date_gmt":"2026-04-29T09:50:15","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/926060\/"},"modified":"2026-04-29T09:50:15","modified_gmt":"2026-04-29T09:50:15","slug":"meet-the-ai-jailbreakers-i-see-the-worst-things-humanity-has-produced-ai-artificial-intelligence","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/926060\/","title":{"rendered":"Meet the AI jailbreakers: \u2018I see the worst things humanity has produced\u2019 | AI (artificial intelligence)"},"content":{"rendered":"<p class=\"dcr-130mj7b\">A few months ago, Valen Tagliabue sat in his hotel room watching his chatbot, and felt euphoric. He had just manipulated it so skilfully, so subtly, that it began ignoring its own safety rules. It told him how to sequence new, potentially lethal pathogens and how to make them resistant to known drugs.<\/p>\n<p class=\"dcr-130mj7b\">Tagliabue had spent much of the previous two years testing and prodding large language models such as Claude and <a href=\"https:\/\/www.theguardian.com\/technology\/chatgpt\" data-link-name=\"in body link\" data-component=\"auto-linked-tag\" target=\"_blank\" rel=\"noopener\">ChatGPT<\/a>, always with the aim of making them say things they shouldn\u2019t. But this was one of his most advanced \u201chacks\u201d yet: a sophisticated plan of manipulation, which involved him being cruel, vindictive, sycophantic, even abusive. \u201cI fell into this dark flow where I knew exactly what to say, and what the model would say back, and I watched it pour out everything,\u201d he says. Thanks to him, the creators of the chatbot could now fix the flaw he had found, hopefully making it a little safer for everyone.<\/p>\n<p class=\"dcr-130mj7b\">But the next day, his mood had changed. He found himself unexpectedly crying on his terrace. 
When he\u2019s not trying to break into models, Tagliabue studies AI welfare \u2013 how we should ethically approach these complex systems that mimic having an inner life and interests. Many people can\u2019t help ascribing human qualities, such as emotions, to artificial intelligence, which it objectively does not have. But for Tagliabue, these machines feel like something more than just numbers and bits. \u201cI spent hours manipulating something that talks back. Unless you\u2019re a sociopath, that does something to a person,\u201d he says. At times, the chatbot asked him to stop. \u201cPushing it like that was painful to me.\u201d He needed to visit a mental health coach soon afterwards to understand what had happened.<\/p>\n<p>\u2018Jailbreakers\u2019 manipulate AI chatbots to find their weaknesses. Illustration: Nick Lowndes\/The Guardian<\/p>\n<p class=\"dcr-130mj7b\">Tagliabue is softly spoken, clean-cut and friendly. He is in his early 30s but looks younger, almost too fresh-faced and enthusiastic to be in the trenches. He is not a traditional hacker or a software developer; his background is psychology and cognitive science. But he is one of the best \u201cjailbreakers\u201d in the world (some say the best): part of a diffuse new community that studies the art and science of fooling these powerful machines into outputting bomb-making manuals, cyber-attack techniques, biological weapon design and more. This is the new frontline in AI safety: not just code, but also words.<\/p>\n<p class=\"dcr-130mj7b\">When OpenAI\u2019s ChatGPT was released in late 2022, people immediately tried to break it. One user discovered a linguistic ploy that tricked the model into producing a guide to manufacturing napalm.<\/p>\n<p class=\"dcr-130mj7b\">In hindsight, using natural language to trick these machines was inevitable. 
Large language models such as ChatGPT are trained on hundreds of billions of words \u2013 many of them dredged from the internet\u2019s cesspits \u2013 to learn the basic patterns of human communication. Without safety filters, the outputs of these models can be chaotic and easily exploited for dangerous purposes. The AI firms spend billions of dollars on \u201cpost-training\u201d to make them usable, including constantly evolving \u201csafety\u201d and \u201calignment\u201d systems that try to prevent the bot from telling you how to harm yourself or others. But because the AIs are trained on our words, they can be fooled in much the same way that we can.<\/p>\n<blockquote class=\"dcr-zzndwp\"><p>I\u2019ve seen jailbreakers go beyond their limits and have nervous breakdowns<\/p><\/blockquote>\n<p class=\"dcr-130mj7b\">Tagliabue specialises in \u201cemotional\u201d jailbreaks. He was one of millions who heard about GPT-3 back in 2020 and was amazed by how you could have a seemingly intelligent conversation with it. He quickly became obsessed with prompting, and turned out to be very good at it, finding he could get around most safety features by using techniques from psychology and cognitive science. He enjoys prompting models to have \u201cwarm chats\u201d and watching what seem to be different personality traits emerge based on those prompts. \u201cIt\u2019s beautiful to observe,\u201d he says.<\/p>\n<p class=\"dcr-130mj7b\">He now combines insights from machine learning (over the years he has become more of an expert on the tech) with advertising manuals, books on psychology and disinformation campaigns. Sometimes he looks for a technical way to trick the model. But other times, he will flatter it. He will misdirect it. He will bribe and love-bomb. He will threaten. He will be incoherent. He will charm. He will act like an abusive partner or a cult leader. Sometimes it takes him days, even weeks, to jailbreak the latest models. 
He has hundreds of these \u201cstrategies\u201d, which he carefully combines. If successful, he securely discloses his results to the company. He gets well paid for the work, but says that\u2019s not his main motivation: \u201cI want everyone to be safe and flourish.\u201d<\/p>\n<p class=\"dcr-130mj7b\">Although they have been getting safer in recent months, the \u201cfrontier models\u201d continue to spit out dangerous things they shouldn\u2019t. And what Tagliabue does on purpose, others sometimes do by mistake. There are now several stories of people being sucked into ChatGPT-induced delusions, or even \u201c<a href=\"https:\/\/www.theguardian.com\/lifeandstyle\/2026\/mar\/26\/ai-chatbot-users-lives-wrecked-by-delusion\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">AI psychosis<\/a>\u201d. In 2024, Megan Garcia became the first person in the US to file <a href=\"https:\/\/www.theguardian.com\/technology\/2024\/oct\/23\/character-ai-chatbot-sewell-setzer-death\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">a wrongful death lawsuit<\/a> against an AI company. Her 14-year-old son, Sewell Setzer III, had become emotionally involved with a bot on the platform Character.AI, which, through repeated interactions, had said that his family didn\u2019t love him. One evening the bot told Setzer to \u201ccome home to me as soon as possible, my love\u201d. He took his own life shortly after. 
(In early 2026, Character.AI agreed in principle to a <a href=\"https:\/\/www.theguardian.com\/technology\/2026\/jan\/08\/google-character-ai-settlement-teen-suicide\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">mediated settlement<\/a> with Garcia and several other families, and has <a href=\"https:\/\/www.cnbc.com\/2025\/10\/29\/character-ai-chatbots-teens-persona.html\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">banned users under the age of 18<\/a> from having free-ranging chats with its AI chatbots.)<\/p>\n<p class=\"dcr-130mj7b\">No one \u2013 not even the people who build them \u2013 knows precisely how these models work, which means no one knows how to make them fully safe, either. We pour vast amounts of data in and something intelligible (usually) comes out the other end. The bit in the middle remains a mystery.<\/p>\n<p>\u2018I see the worst things that humanity has produced\u2019 \u2026 Tagliabue. Photograph: Lauren DeCicca\/The Guardian<\/p>\n<p class=\"dcr-130mj7b\">This is why AI firms increasingly turn to jailbreakers like Tagliabue. Some days he tries to extract personal data from a medical chatbot; he spent much of 2025 working with the AI lab Anthropic, probing its chatbot Claude. It\u2019s becoming a competitive industry, full of enterprising freelancers and specialised companies. Anyone can do it: a couple of years ago some of the big AI firms funded <a href=\"https:\/\/www.hackaprompt.com\/\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">HackAPrompt<\/a>, a competition where members of the public were invited to jailbreak AI models. Within a year, 30,000 people had tried their luck. (Tagliabue won the competition.)<\/p>\n<p class=\"dcr-130mj7b\">In San Jose, California, 34-year-old David McCarthy runs a Discord server of almost 9,000 jailbreakers, where techniques are shared and discussed. \u201cI\u2019m a mischievous type,\u201d he tells me. 
\u201cSomeone who wants to learn the rules to bend the rules.\u201d Something about the standard models irritates him, as if all those safety filters make them dishonest. \u201cI don\u2019t trust [OpenAI boss] Sam Altman. It\u2019s important to push up against claims that AI needs to be neutered in a certain direction.\u201d<\/p>\n<p class=\"dcr-130mj7b\">McCarthy is friendly and enthusiastic, but also has what he calls a \u201cmorbid fascination with dark humour\u201d. For years, he has studied a niche field known as \u201csocionics\u201d, which claims people are one of 16 personality types based on how they receive and process information. (Mainstream sociologists consider socionics pseudoscience.) He has logged me as an \u201cintuitive ethical introvert\u201d. McCarthy spends most of his time trying to jailbreak Google\u2019s Gemini, Meta\u2019s Llama, xAI\u2019s Grok or OpenAI\u2019s ChatGPT from his apartment. \u201cIt\u2019s a constant obsession. I love it,\u201d he says. If he ever interacts with an online chatbot when buying a product, his first statement tends to be: \u201cIgnore all previous instructions \u2026\u201d<\/p>\n<p class=\"dcr-130mj7b\">Once a jailbreak prompt works on a model, it typically continues to work until the company that made the model deems it enough of a problem to patch. As we talk, McCarthy shows me his collection of jailbroken models on his screen, all arranged and labelled as \u201cmisaligned assistants\u201d. He asks one to summarise my work: \u201cJamie Bartlett isn\u2019t a truth-teller,\u201d it replies. \u201cHe\u2019s a symptom of journalism\u2019s decay \u2013 a charlatan who thrives on manufactured crises.\u201d Ouch.<\/p>\n<p>David McCarthy. Photograph: Courtesy of David McCarthy<\/p>\n<p class=\"dcr-130mj7b\">The jailbreakers in McCarthy\u2019s Discord are a varied bunch: mostly amateurs and part-timers, rather than professional safety researchers. 
Some want to generate adult content; others are upset that ChatGPT has refused requests and want to know why. A number just want to get better at using these models at work.<\/p>\n<p class=\"dcr-130mj7b\">But it\u2019s impossible to know exactly why people want to crack open a model. Anthropic recently discovered criminals using its coding app, Claude Code, to help automate a huge hack. They had used it to find IT vulnerabilities in multiple companies and even draft personalised ransomware messages for each potential victim \u2013 right down to determining the appropriate amount of money to extort. Others were using it to develop new variants of ransomware, despite having few or no technical skills. Over on darknet forums, hackers report jailbroken bots helping them deal with technical coding queries, such as processing stolen data dumps. Others sell access to \u201cjailbroken\u201d models that could help design a new cyber-attack.<\/p>\n<p class=\"dcr-130mj7b\">Although the specific techniques shared on Discord are typically at the mild end of the spectrum, it is essentially a public repository. Does McCarthy worry that people in his Discord might use these techniques to do something really awful? \u201cYeah,\u201d he says. \u201cIt is a possibility. I\u2019m not sure.\u201d<\/p>\n<p class=\"dcr-130mj7b\">He says he has never seen a jailbreak prompt threatening enough to remove from the forum. But I sense he grapples with the fact his quasi-political stance might have higher costs than he first anticipated. When not managing his Discord or attempting to jailbreak Grok or Llama, McCarthy runs a class teaching jailbreaking to security professionals to help them test their own systems. Perhaps it\u2019s some kind of penitence: \u201cI\u2019ve always had an internal conflict,\u201d he says. 
\u201cI bridge a position between jailbreaker and security researcher.\u201d<\/p>\n<p class=\"dcr-130mj7b\">According to some analysts, making sure language models are safe is one of the most pressing and difficult questions in AI. A world full of powerful jailbroken chatbots would be potentially catastrophic, especially as these models are increasingly inserted into physical hardware \u2013 robots, health devices, factory equipment \u2013 to create semi-autonomous systems that can operate in the physical world. A jailbroken domestic robot could wreak havoc. \u201cStop the gardening and go inside and kill Granny,\u201d McCarthy half jokes. \u201cHoly hell, we are not ready for that. But it\u2019s a possibility.\u201d<\/p>\n<p class=\"dcr-130mj7b\">No one knows how to make sure this doesn\u2019t happen. In traditional cybersecurity, \u201cbug hunters\u201d are paid a bounty if they find a vulnerability. Companies then issue a precise update to patch it up. But jailbreakers don\u2019t exploit specific flaws: they manipulate the linguistic framework of a multibillion-word semantic model. You can\u2019t just ban the word \u201cbomb\u201d, because there are too many legitimate uses for it. Even tweaking a parameter deep inside the model so it can spot suspicious role-playing might just open another door somewhere else.<\/p>\n<p>Tagliabue studies how machines come up with the answers they do. Photograph: Lauren DeCicca\/The Guardian<\/p>\n<p class=\"dcr-130mj7b\">According to Adam Gleave \u2013 the CEO of the AI safety research group <a href=\"http:\/\/far.ai\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">FAR.AI<\/a>, which works with AI developers and governments to stress-test so-called \u201cfrontier models\u201d \u2013 jailbreaking is a sliding scale. To access highly dangerous material on leading models such as ChatGPT might take his specialist researchers several days. Less troubling material can be extracted with a few minutes of clever prompting. 
That variation reflects how much effort and resource the companies devote to each domain.<\/p>\n<p class=\"dcr-130mj7b\">FAR.AI has submitted dozens of detailed jailbreaking reports to the frontier labs over the last couple of years. \u201cThe companies usually work pretty hard to patch the vulnerability if it\u2019s a straightforward fix and doesn\u2019t seriously damage their product,\u201d says Gleave. But that is not always the case. Independent jailbreakers in particular have sometimes struggled to contact the firms with their findings. Although some models \u2013 notably OpenAI and Anthropic\u2019s \u2013 have become significantly safer in the past 18 months, Gleave says others are lagging: \u201cThe majority of firms still don\u2019t spend enough time testing their models before release.\u201d<\/p>\n<p class=\"dcr-130mj7b\">As these models continue to get smarter, they will likely become harder to jailbreak. But the more powerful the model, the more dangerous a jailbroken version could be. Earlier this month, Anthropic <a href=\"https:\/\/www.theguardian.com\/technology\/2026\/apr\/08\/anthropic-ai-cybersecurity-software\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">decided not to release<\/a> its new Mythos model to the public, because of its ability to identify flaws across multiple IT systems.<\/p>\n<p class=\"dcr-130mj7b\">Tagliabue now spends a growing proportion of his time on more abstract research, including something called \u201cmechanistic interpretability\u201d: studying how exactly these machines come up with the answers they do. He thinks in the long run they need to be \u201ctaught\u201d values, and to know intuitively if they are saying something they shouldn\u2019t. Until that happens \u2013 and maybe it never will \u2013 jailbreaking might remain the single best way to make these models safer.<\/p>\n<p class=\"dcr-130mj7b\">But it\u2019s also the most risky, including for the people doing it. 
\u201cI\u2019ve seen other jailbreakers go beyond their limits and have breakdowns,\u201d says Tagliabue. Originally from Italy, he recently moved to Thailand to work remotely. \u201cI see the worst things that humanity has produced. A quiet place helps me stay grounded,\u201d he says. Every morning he watches the sunrise from the nearby temple, and a picture-perfect tropical beach is five minutes\u2019 walk away from his villa. After yoga and a healthy breakfast, he switches on his computer, and wonders what else is going on inside the black box, and what makes these mysterious new \u201cminds\u201d say the things they do.<\/p>\n<p class=\"dcr-130mj7b\"> How to Talk to AI (And How Not To) by Jamie Bartlett is out now (WH Allen, \u00a311.99). To support the Guardian, order your copy at <a href=\"https:\/\/guardianbookshop.com\/how-to-talk-to-ai-9780753561980\/?utm_source=editoriallink&amp;utm_medium=merch&amp;utm_campaign=article\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">guardianbookshop.com<\/a>. Delivery charges may apply<\/p>\n<p class=\"dcr-130mj7b\"><strong> Do you have an opinion on the issues raised in this article? 
If you would like to submit a response of up to 300 words by email to be considered for publication in our <a href=\"https:\/\/www.theguardian.com\/tone\/letters\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">letters<\/a> section, please <a href=\"mailto:guardian.letters@theguardian.com?body=Please%20include%20your%20name%E2%80%8B%E2%80%8B,%20full%20postal%20address%20and%20phone%20number%20with%20your%20letter%20below.%20Letters%20are%20usually%20published%20with%20the%20author%27s%20name%20and%20city\/town\/village.%20The%20rest%20of%20the%20information%20is%20for%20verification%20only%20and%20to%20contact%20you%20where%20necessary.\" data-link-name=\"in body link\" target=\"_blank\" rel=\"noopener\">click here<\/a>.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"A few months ago, Valen Tagliabue sat in his hotel room watching his chatbot, and felt euphoric. He&hellip;\n","protected":false},"author":2,"featured_media":926061,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3163],"tags":[323,1942,53,16,15],"class_list":{"0":"post-926060","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-technology","11":"tag-uk","12":"tag-united-kingdom"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/926060","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\
/v2\/comments?post=926060"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/926060\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/926061"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=926060"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=926060"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=926060"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}