{"id":144603,"date":"2025-10-25T12:24:11","date_gmt":"2025-10-25T12:24:11","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/144603\/"},"modified":"2025-10-25T12:24:11","modified_gmt":"2025-10-25T12:24:11","slug":"junk-data-from-x-makes-large-language-models-lose-reasoning-skills-researchers-show","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/144603\/","title":{"rendered":"Junk data from X makes large language models lose reasoning skills, researchers show"},"content":{"rendered":"<p><strong>Researchers find that large language models can suffer lasting performance declines when they are continually trained on trivial online content. The study documents sharp drops in reasoning and confidence, raising concerns about the long-term health of LLMs.<\/strong><\/p>\n<p>A team from several US universities has introduced the &#8220;LLM Brain Rot Hypothesis,&#8221; inspired by the <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/corp.oup.com\/news\/brain-rot-named-oxford-word-of-the-year-2024\/\" data-type=\"editable-link\">human concept of &#8220;Brain Rot&#8221;<\/a>, which describes the cognitive harm caused by overexposure to mindless online content.<\/p>\n<p>To test their theory, the researchers ran controlled experiments using Twitter data from 2010. 
They trained four smaller models &#8211; Llama3-8B-Instruct, Qwen2.5-7B\/0.5B-Instruct, and Qwen3-4B-Instruct &#8211; on different mixes of &#8220;junk&#8221; and higher-quality control data.<\/p>\n<p><a href=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/llms-can-get-brainrot-overview-scaled-1.png\"><img data-lazyloaded=\"1\" fetchpriority=\"high\" decoding=\"async\" class=\"wp-image-28385 size-full\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/llms-can-get-brainrot-overview-scaled-1.png\" alt=\"Diagram for the LLM Brain Rot study: hypothesis, Twitter data, pre-training, cognitive decline, error types, and countermeasures.\" width=\"2560\" height=\"960\"\/><\/a>The diagram shows how targeted pre-training with junk data from X (formerly Twitter) leads to cognitive decline in large language models. | Image: Xing et al.<\/p>\n<p>Two takes on what counts as &#8220;junk&#8221; data<\/p>\n<p>The researchers took two approaches to identifying junk data. The first, based on engagement (M1), flagged short posts under 30 words that were highly popular (over 500 likes, retweets, or comments) as junk. Longer posts above 100 words with little engagement served as controls.<\/p>\n<p>The second method (M2) measured content quality. Using GPT-4o-mini, the team sorted posts by their semantic value. Conspiracy theories, exaggerated claims, and attention-seeking clickbait were marked as junk, while more thoughtful material became controls.<\/p>\n<p>The analysis showed little overlap between popularity and text length, and only a weak link between popularity and content quality. 
Meanwhile, text length and semantic value were more closely correlated.<\/p>\n<p>Reasoning skills take a nosedive<\/p>\n<p>Model performance suffered dramatic losses. On the ARC challenge benchmark, reasoning accuracy fell from 74.9 percent to 57.2 percent as junk data increased from zero to 100 percent.<\/p>\n<p><a href=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/llm-brainrot-evaluation-llama-3-8b.png\"><img loading=\"lazy\" data-lazyloaded=\"1\" decoding=\"async\" class=\"wp-image-28386 size-full\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/llm-brainrot-evaluation-llama-3-8b.png\" alt=\"Heatmap of Llama3 8B-Instruct performance at different junk data proportions in ARC, RULER, safety, and personality tests.\" width=\"924\" height=\"654\"\/><\/a>Llama3 8B Instruct&#8217;s performance drops in reasoning, long-context understanding, safety, and personality benchmarks as junk data increases. | Image: Xing et al.<\/p>\n<p>For tasks requiring long-context understanding, model accuracy dropped even more precipitously, plunging from 84.4 percent down to just 52.3 percent. 
This shows that as the proportion of low-quality data increases, model performance continues to worsen.<\/p>\n<p>The engagement-based definition of junk (popularity) caused more damage than the content-based approach, suggesting that popularity adds a new dimension of data quality not captured by standard semantic checks.<\/p>\n<p>The effects 
extended beyond reasoning. Models exposed to large amounts of engagement-driven junk developed &#8220;dark&#8221; personality traits, including higher scores for psychopathy, narcissism, and manipulativeness. In Llama3 8B Instruct, the psychopathy score rose sharply.<\/p>\n<p>Safety benchmarks also declined. In contrast, exposure to content-based junk sometimes raised agreeableness and openness scores.<\/p>\n<p>&#8220;Thought-skipping&#8221; dominates errors<\/p>\n<p>Error analysis found that &#8220;thought-skipping&#8221;\u2014skipping logical steps or chains entirely\u2014was the most common problem. Over 70 percent of errors involved no reasoning at all, jumping to 84 percent in the engagement-junk scenario. Researchers sorted errors into five categories: no reasoning, no planning, skipped steps, wrong logic, and factual errors. Their system could automatically explain more than 98 percent of the cases.<\/p>\n<p><a href=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/llm-brainrot-overview-desired-cot-failure-modes.png\"><img loading=\"lazy\" data-lazyloaded=\"1\" decoding=\"async\" class=\"wp-image-28387 size-full\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/llm-brainrot-overview-desired-cot-failure-modes.png\" alt=\"Infographic: Green flowchart for correct soap-bacteria experiment and red fields for five types of reasoning errors.\" width=\"861\" height=\"440\"\/><\/a>Models trained on junk data are often unable to complete logical reasoning chains, leading to skipped steps and basic mistakes. | Image: Xing et al.<\/p>\n<p>Follow-up tests found that popularity mainly weakened reasoning, while text length had a bigger effect on long-context understanding. This supports the idea that popularity influences LLMs in unique ways.<\/p>\n<p>Damage is hard to reverse<\/p>\n<p>Efforts to repair the models had limited success. 
Reflective reasoning, in which a model's flawed output is reviewed and revised, reduced some thought-skipping. But when the affected model critiqued itself, it often made things worse; only corrections from a stronger external model helped at all.<\/p>\n<p>Even after retraining with up to 50,000 fresh examples and more clean data, the lost performance did not return. The gap remained.<\/p>\n<p>&#8220;The gap implies that the Brain Rot effect has been deeply internalized, and the existing instruction tuning cannot fix the issue,&#8221; the authors write.<\/p>\n<p>The study calls for a rethink of how LLMs gather and filter online data. With models constantly absorbing huge volumes of web content, careful data selection and quality control are now critical to avoid permanent degradation.<\/p>\n<p>The team recommends regular &#8220;cognitive health checks&#8221; for deployed LLMs and argues that data selection during ongoing training should be treated as a safety issue.<\/p>\n<p>Code, models, and data are available on <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/github.com\/llm-brain-rot\/llm-brain-rot\" data-type=\"editable-link\">GitHub<\/a> and <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/huggingface.co\/collections\/AmberYifan\/llms-can-get-brain-rot-68f2d658f9380e625ce5ec1f\" data-type=\"editable-link\">Hugging Face<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"Researchers find that large language models can suffer lasting performance declines when they are continually trained 
on&hellip;\n","protected":false},"author":2,"featured_media":144604,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[74],"tags":[16950,4831,2594,18,19,17,82],"class_list":{"0":"post-144603","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-technology","8":"tag-ai-research","9":"tag-ai-training","10":"tag-data","11":"tag-eire","12":"tag-ie","13":"tag-ireland","14":"tag-technology"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@ie\/115434793590109901","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/144603","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=144603"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/144603\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/144604"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=144603"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=144603"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=144603"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}