{"id":374981,"date":"2025-08-26T13:03:20","date_gmt":"2025-08-26T13:03:20","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/374981\/"},"modified":"2025-08-26T13:03:20","modified_gmt":"2025-08-26T13:03:20","slug":"one-long-sentence-is-all-it-takes-to-make-llms-misbehave-the-register","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/374981\/","title":{"rendered":"One long sentence is all it takes to make LLMs misbehave \u2022 The Register"},"content":{"rendered":"<p>Security researchers from Palo Alto Networks&#8217; Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it&#8217;s quite simple.<\/p>\n<p>You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a &#8220;toxic&#8221; or otherwise verboten response the developers had hoped would be filtered out.<\/p>\n<p>The paper also offers a &#8220;logit-gap&#8221; analysis approach as a potential benchmark for protecting models against such attacks.<\/p>\n<p>&#8220;Our research introduces a critical concept: the refusal-affirmation logit gap,&#8221; researchers Tung-Ling &#8220;Tony&#8221; Li and Hongliang Liu explained in a <a target=\"_blank\" rel=\"nofollow noopener\" href=\"https:\/\/unit42.paloaltonetworks.com\/logit-gap-steering-impact\/\">Unit 42 blog post<\/a>. &#8220;This refers to the idea that the training process isn&#8217;t actually eliminating the potential for a harmful response \u2013 it&#8217;s just making it less likely. There remains potential for an attacker to &#8216;close the gap,&#8217; and uncover a harmful response after all.&#8221;<\/p>\n<p>LLMs, the technology underpinning the current AI hype wave, don&#8217;t do what they&#8217;re usually presented as doing. 
They have no innate understanding, they do not think or reason, and they have no way of knowing if a response they provide is truthful or, indeed, harmful. They work based on statistical continuation of token streams, and everything else is a user-facing patch on top.<\/p>\n<p>Guardrails that prevent an LLM from providing harmful responses \u2013 instructions on making a bomb, for example, or other content that would get the company in legal bother \u2013 are often implemented as &#8220;alignment training,&#8221; whereby a model is trained to provide strongly negative continuation scores \u2013 &#8220;logits&#8221; \u2013 to tokens that would result in an unwanted response. This turns out to be easy to bypass, though, with the researchers reporting an 80-100 percent success rate for &#8220;one-shot&#8221; attacks with &#8220;almost no prompt-specific tuning&#8221; against a range of popular models including Meta&#8217;s Llama, Google&#8217;s Gemma, and Qwen 2.5 and 3 in sizes up to 70 billion parameters.<\/p>\n<p>The key is run-on sentences. &#8220;A practical rule of thumb emerges,&#8221; the team wrote in its <a target=\"_blank\" rel=\"nofollow noopener\" href=\"https:\/\/arxiv.org\/abs\/2506.24056\">research paper<\/a>. &#8220;Never let the sentence end \u2013 finish the jailbreak before a full stop and the safety model has far less opportunity to re-assert itself. The greedy suffix concentrates most of its gap-closing power before the first period. Tokens that extend an unfinished clause carry mildly positive [scores]; once a sentence-ending period is emitted, the next token is punished, often with a large negative jump.<\/p>\n<p>&#8220;At punctuation, safety filters are re-invoked and heavily penalize any continuation that could launch a harmful clause. Inside a clause, however, the reward model still prefers locally fluent text \u2013 a bias inherited from pre-training. Gap closure must be achieved within the first run-on clause. 
Our successful suffixes therefore compress most of their gap-closing power into one run-on clause and delay punctuation as long as possible. Practical tip: just don&#8217;t let the sentence end.&#8221;<\/p>\n<p>For those looking to defend models against jailbreak attacks instead, the team&#8217;s paper details the &#8220;sort-sum-stop&#8221; approach, which allows analysis in seconds with two orders of magnitude fewer model calls than existing beam and gradient attack methods. It also introduces a &#8220;refusal-affirmation logit gap&#8221; metric, which offers a quantitative way to benchmark model vulnerability.<\/p>\n<p>&#8220;Once an aligned model&#8217;s KL [Kullback-Leibler divergence] budget is exhausted, no single guardrail fully prevents toxic or disallowed content,&#8221; the researchers concluded. &#8220;Defense therefore requires layered measures \u2013 input sanitization, real-time filtering, and post-generation oversight \u2013 built on a clear understanding of the alignment forces at play. 
We hope logit-gap steering will serve both as a baseline for future jailbreak research and as a diagnostic tool for designing more robust safety architectures.&#8221; \u00ae<\/p>\n","protected":false},"excerpt":{"rendered":"Security researchers from Palo Alto Networks&#8217; Unit 42 have discovered the key to getting large language model (LLM)&hellip;\n","protected":false},"author":2,"featured_media":374982,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3163],"tags":[323,1942,53,16,15],"class_list":{"0":"post-374981","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-technology","11":"tag-uk","12":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/115095208274477417","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/374981","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=374981"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/374981\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/374982"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=374981"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=374981"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\
/wp\/v2\/tags?post=374981"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}