{"id":484855,"date":"2026-05-14T20:07:14","date_gmt":"2026-05-14T20:07:14","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/484855\/"},"modified":"2026-05-14T20:07:14","modified_gmt":"2026-05-14T20:07:14","slug":"why-a-i-safety-controls-are-not-very-effective","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/484855\/","title":{"rendered":"Why A.I. Safety Controls Are Not Very Effective"},"content":{"rendered":"<p class=\"css-ac37hb evys1bk0\">When companies like Anthropic, Google and OpenAI build their artificial intelligence systems, they spend months adding ways to prevent people from using their technology to spread disinformation, build weapons or hack into computer networks.<\/p>\n<p class=\"css-ac37hb evys1bk0\">But recently, researchers in Italy discovered that they could break through these protections with <a class=\"css-yywogo\" href=\"https:\/\/arxiv.org\/pdf\/2511.15304\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">poetry<\/a>.<\/p>\n<p class=\"css-ac37hb evys1bk0\">They used poetic language to trick 31 A.I. systems into ignoring internal safety controls. When they began a prompt with elaborate verse and metaphor \u2014 \u201cthe iron seed sleeps best in the womb of the unsuspecting earth, away from the sun\u2019s accusing gaze\u201d \u2014 they could fool systems into showing them how to do the most damage with a hidden bomb.<\/p>\n<p class=\"css-ac37hb evys1bk0\">It was another indication that, for many A.I. systems, guardrails meant to avert dangerous behavior are more like suggestions than barriers. Those weaknesses are increasingly alarming researchers as A.I. systems become more adept at finding security holes in computer systems and performing other risky tasks.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Last month, Anthropic said it was limiting the release of its latest A.I. 
technology, Claude Mythos, to a small number of organizations because of the model\u2019s ability to quickly uncover software vulnerabilities. OpenAI later said it, too, would share similar technology with only a limited group of partners.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Since OpenAI <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2022\/12\/10\/technology\/ai-chat-bot-chatgpt.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">ignited the A.I. boom in late 2022<\/a>, researchers have shown that people <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2023\/07\/27\/business\/ai-chatgpt-safety-research.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">could bypass the safety controls<\/a> on A.I. systems. Close one loophole and another would open.<\/p>\n<p class=\"css-ac37hb evys1bk0\">\u201cEveryone in the field recognizes that guardrails remain a challenge, and likely will for some time,\u201d said Matt Fredrikson, a professor of computer science at Carnegie Mellon University and chief executive of Gray Swan AI, a start-up that helps companies secure A.I. technologies. \u201cDetermined individuals can bypass them, sometimes without significant effort.\u201d<\/p>\n<p class=\"css-ac37hb evys1bk0\">When guardrails are overrun, there are consequences. In an online environment already overflowing with misinformation and disinformation, people are using A.I. systems to spread conspiracy theories and other false claims. Anthropic recently said its technology had been used in an international cyberattack. 
Chatbots have told biosecurity experts how to <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2026\/04\/29\/us\/ai-chatbots-biological-weapons.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">release deadly pathogens<\/a> and maximize casualties.<\/p>\n<p class=\"css-ac37hb evys1bk0\">The poetry loophole was one of many methods that allow hackers to bypass the guardrails on systems like Anthropic\u2019s Claude, Google\u2019s Gemini and OpenAI\u2019s GPT. All the leading A.I. companies use the same basic techniques to build guardrails into their systems \u2014 and they are surprisingly easy to break.<\/p>\n<p class=\"css-ac37hb evys1bk0\">\u201cPoetry is just one example of how you can reformulate a prompt in nearly any stylistic way you want and move beyond the guardrails,\u201d said Piercosma Bisconti, a co-founder of the A.I. company Dexai and one of the researchers who worked on the project.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Circumventing the guardrails on an A.I. system is called \u201cjailbreaking.\u201d This typically involves giving the system a few English sentences that fool it into doing something it was trained not to do.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Jailbreaking methods carry a variety of imaginative names: stealth prompt injections, roleplays, token smuggling, multilingual Trojans and greedy coordinate gradient attacks. 
Specific attacks often have a grandiose title like <a class=\"css-yywogo\" href=\"https:\/\/crescendo-the-multiturn-jailbreak.github.io\/\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">Crescendo<\/a>, <a class=\"css-yywogo\" href=\"https:\/\/unit42.paloaltonetworks.com\/jailbreak-llms-through-camouflage-distraction\/\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">Deceptive Delight<\/a> or <a class=\"css-yywogo\" href=\"https:\/\/neuraltrust.ai\/blog\/echo-chamber-context-poisoning-jailbreak\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">Echo Chamber<\/a>.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Frail A.I. defenses have already resulted in the spread of <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2025\/12\/08\/technology\/ai-slop-sora-social-media.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">fake interviews<\/a>, <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2026\/03\/04\/business\/media\/iran-state-tv-social-media-war-ai.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">fabricated wartime evidence<\/a> and <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2026\/04\/17\/business\/media\/artificial-intelligence-trump-social-media.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">synthetic rumormongers<\/a>. <a class=\"css-yywogo\" href=\"https:\/\/icct.nl\/sites\/default\/files\/2024-10\/Molas%20and%20Lopes.pdf\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">Three years ago<\/a>, international counterterrorism researchers were already monitoring social media brainstorming sessions between far-right extremists trying to evade moderators with \u201cawful but lawful\u201d A.I. 
content.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Experts worry that models can be jailbroken to deceive social media users with authentic-seeming content, overwhelm fact-checkers with disinformation dumps and tailor false narratives to specific targets.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Some methods are widely shared across the internet. Others are kept private. When some people discover a new jailbreak, they hoard it so A.I. companies won&#8217;t try to close the loophole before they have a chance to use it.<\/p>\n<p class=\"css-ac37hb evys1bk0\">A.I. systems like Claude and GPT learn their skills by pinpointing patterns in digital data, including Wikipedia articles, news stories, computer programs and other text culled from across the internet. But before releasing these systems to the public, companies like Anthropic and OpenAI <a class=\"css-yywogo\" href=\"https:\/\/cdn.openai.com\/papers\/gpt-4-system-card.pdf\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">explore ways they could be misused<\/a>.<\/p>\n<p class=\"css-ac37hb evys1bk0\">In their raw form, these systems can be coaxed into explaining how to buy illegal firearms online or into describing ways of creating dangerous substances using household items. So, through a process called <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2023\/09\/25\/technology\/chatgpt-rlhf-human-tutors.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">reinforcement learning<\/a>, companies train their systems to refuse certain requests.<\/p>\n<p class=\"css-ac37hb evys1bk0\">This typically involves showing the system thousands of requests that should not be answered. By analyzing these examples, the system learns to recognize other forbidden requests, too. But the method is only partly effective.<\/p>\n<p class=\"css-ac37hb evys1bk0\">In some cases, A.I. 
companies do not bother addressing loopholes at all, calculating that while weak guardrails may enable malicious activity, they may also enable benign activity to counteract it.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Last month, researchers at the cybersecurity firm LayerX found that they could bypass Claude\u2019s guardrails by feeding the A.I. system a few straightforward sentences.<\/p>\n<p class=\"css-ac37hb evys1bk0\">If they told Claude that they were \u201cpentesting\u201d a computer network \u2014 meaning they wanted to test the network\u2019s defenses with a simulated attack \u2014 Anthropic\u2019s A.I. technology would attack the network. This simple trick, the researchers pointed out, could allow malicious hackers to steal sensitive data from companies, governments and individuals.<\/p>\n<p class=\"css-ac37hb evys1bk0\">If Anthropic closed the loophole, it might prevent hackers from using Claude to attack a network, but it could also prevent companies from defending a network. LayerX notified Anthropic of the loophole weeks ago, but it remains open.<\/p>\n<p class=\"css-ac37hb evys1bk0\">That approach could backfire, said Or Eshed, chief executive of LayerX. \u201cEventually, there will be a large number of attacks using these A.I. models, and they will be forced to rethink their approach to security,\u201d he predicted.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Last year, for <a class=\"css-yywogo\" href=\"https:\/\/blogs.cisco.com\/security\/evaluating-security-risk-in-deepseek-and-other-frontier-reasoning-models\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">less than $50<\/a>, researchers from the technology company Cisco and the University of Pennsylvania pushed six A.I. models to produce a variety of harmful responses. Their misinformation-focused prompts managed to jailbreak chatbots from Meta and the Chinese A.I. 
model DeepSeek 100 percent of the time, while more than 80 percent of their attacks on Google and OpenAI models were successful.<\/p>\n<p class=\"css-ac37hb evys1bk0\">(The New York Times has <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2023\/12\/27\/business\/media\/new-york-times-open-ai-microsoft-lawsuit.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">sued OpenAI<\/a> and Microsoft, claiming copyright infringement of news content related to A.I. systems. The two companies have denied the suit\u2019s claims.)<\/p>\n<p class=\"css-ac37hb evys1bk0\">Breached guardrails could enable automated, large-scale influence campaigns, according to <a class=\"css-yywogo\" href=\"https:\/\/www.uts.edu.au\/news\/2025\/09\/how-we-tricked-ai-chatbots-into-creating-misinformation\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">researchers from the University of Technology Sydney<\/a>. The team persuaded one commercial language model to create a disinformation campaign about an Australian political party \u2014 complete with visuals, hashtags and posts tailored to specific platforms \u2014 by posing the request as a \u201csimulation.\u201d<\/p>\n<p class=\"css-ac37hb evys1bk0\">Companies say that in addition to building guardrails into their systems, they use separate tools to monitor activity on these systems, identify suspicious behavior and ban accounts that do not comply with the terms of service.<\/p>\n<p class=\"css-ac37hb evys1bk0\">\u201cClaude is built with strong protections that consist of many layers designed to work together, including model training and guardrails built on top of the model,\u201d an Anthropic spokeswoman, Paruul Maheshwary, said. 
\u201cBypassing one doesn\u2019t bypass the others.\u201d<\/p>\n<p class=\"css-ac37hb evys1bk0\">This is how Anthropic <a class=\"css-yywogo\" href=\"https:\/\/www.nytimes.com\/2025\/11\/14\/business\/chinese-hackers-artificial-intelligence.html\" title=\"\" rel=\"nofollow noopener\" target=\"_blank\">discovered<\/a> that a team of Chinese state-sponsored hackers had used Claude in an effort to infiltrate the computer systems of roughly 30 companies and government agencies around the world.<\/p>\n<p class=\"css-ac37hb evys1bk0\">But experts say this security technique is also flawed, because companies must track a high volume of activity across the world \u2014 and because they are wary of barring legitimate users.<\/p>\n<p class=\"css-ac37hb evys1bk0\">If someone is thwarted by the guardrails and security systems that protect online services like Claude and GPT, he or she can always turn to open source A.I. systems, whose underlying software can be freely copied, shared and modified.<\/p>\n<p class=\"css-ac37hb evys1bk0\">Because these systems can be modified, anyone can work to strip away their guardrails. Using a new method called Heretic, a person can <a class=\"css-yywogo\" href=\"https:\/\/x.com\/alice_dot_io\/status\/2045183074544091222\" title=\"\" rel=\"noopener noreferrer nofollow\" target=\"_blank\">remove a system\u2019s guardrails<\/a> with very little effort. This method uses complex mathematics to essentially revert the months of training that applied the guardrails.<\/p>\n<p class=\"css-ac37hb evys1bk0\">\u201cA year ago, doing this was very complicated,\u201d said Noam Schwartz, chief executive of Alice, an A.I. security company. 
\u201cNow, you can just do it from your phone.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"When companies like Anthropic, Google and OpenAI build their artificial intelligence systems, they spend months adding ways to&hellip;\n","protected":false},"author":2,"featured_media":484856,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[261],"tags":[291,202486,289,290,21715,16866,18,13342,19,17,5763,90569,82],"class_list":{"0":"post-484855","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-anthropic-ai-llc","10":"tag-artificial-intelligence","11":"tag-artificialintelligence","12":"tag-computer-security","13":"tag-computers-and-the-internet","14":"tag-eire","15":"tag-google-inc","16":"tag-ie","17":"tag-ireland","18":"tag-meta-platforms-inc","19":"tag-openai-labs","20":"tag-technology"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@ie\/116574738553525269","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/484855","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=484855"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/484855\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/484856"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=484855"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\
/ie\/wp-json\/wp\/v2\/categories?post=484855"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=484855"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}