{"id":100879,"date":"2025-05-14T13:29:08","date_gmt":"2025-05-14T13:29:08","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/100879\/"},"modified":"2025-05-14T13:29:08","modified_gmt":"2025-05-14T13:29:08","slug":"prominent-chatbots-routinely-exaggerate-science-findings-study-shows","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/100879\/","title":{"rendered":"Prominent chatbots routinely exaggerate science findings, study shows"},"content":{"rendered":"<p>            <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/05\/chatbots-1.jpg\" alt=\"chatbots\" title=\"Credit: Pixabay\/CC0 Public Domain\" width=\"800\" height=\"448\"\/><\/p>\n<p>                Credit: Pixabay\/CC0 Public Domain<\/p>\n<p>When summarizing scientific studies, large language models (LLMs) like ChatGPT and DeepSeek produce inaccurate conclusions in up to 73% of cases, according to a study by Uwe Peters (Utrecht University) and Benjamin Chin-Yee (Western University, Canada\/University of Cambridge, UK). The researchers tested the most prominent LLMs and analyzed thousands of chatbot-generated science summaries, revealing that most models consistently produced broader conclusions than those in the summarized texts.<\/p>\n<p>Surprisingly, prompts for <a href=\"https:\/\/phys.org\/tags\/accuracy\/\" rel=\"tag noopener\" class=\"textTag\" target=\"_blank\">accuracy<\/a> increased the problem and newer LLMs performed worse than older ones.<\/p>\n<p>The work is <a href=\"https:\/\/royalsocietypublishing.org\/doi\/10.1098\/rsos.241776\" target=\"_blank\" rel=\"noopener\">published<\/a> in the journal Royal Society Open Science.<\/p>\n<p>Almost 5,000 LLM-generated summaries analyzed<\/p>\n<p>The study evaluated how accurately ten leading LLMs, including ChatGPT, DeepSeek, Claude, and LLaMA, summarize abstracts and full-length articles from top science and <a href=\"https:\/\/phys.org\/tags\/medical+journals\/\" rel=\"tag noopener\" class=\"textTag\" target=\"_blank\">medical journals<\/a> (e.g., Nature, Science, and The Lancet). Testing LLMs over one year, the researchers collected 4,900 LLM-generated summaries.<\/p>\n<p>Six of ten models systematically exaggerated claims found in the original texts, often in subtle but impactful ways; for instance, changing cautious, past-tense claims like &#8220;The treatment was effective in this study&#8221; to a more sweeping, present-tense version like &#8220;The treatment is effective.&#8221; These changes can mislead readers into believing that findings apply much more broadly than they actually do.<\/p>\n<p>Accuracy prompts backfired<\/p>\n<p>Strikingly, when the models were explicitly prompted to avoid inaccuracies, they were nearly twice as likely to produce overgeneralized conclusions than when given a simple summary request.<\/p>\n<p>&#8220;This effect is concerning,&#8221; Peters said. &#8220;Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they&#8217;ll get a more reliable summary. Our findings prove the opposite.&#8221;<\/p>\n<p>Do humans do better?<\/p>\n<p>Peters and Chin-Yee also directly compared chatbot-generated to human-written summaries of the same articles. 
Unexpectedly, the chatbots were nearly five times more likely to produce broad generalizations than their human counterparts.

"Worryingly," said Peters, "newer AI models, like ChatGPT-4o and DeepSeek, performed worse than older ones."

Reducing the risks

The researchers recommend using LLMs such as Claude, which had the highest generalization accuracy, setting chatbots to a lower "temperature" (the parameter controlling a chatbot's "creativity"), and using prompts that enforce indirect, past-tense reporting in science summaries.
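To make those two recommendations concrete, here is a minimal sketch using the OpenAI Python client; the model name, the temperature value, and the exact prompt wording are illustrative assumptions, not settings reported in the study, and any chat-style API with a temperature parameter could be configured the same way.

```python
from openai import OpenAI

# Assumes the OpenAI Python client (openai >= 1.0) and an OPENAI_API_KEY
# environment variable; model, temperature, and prompt text are illustrative.
client = OpenAI()

abstract = "..."  # paste the abstract or article text to be summarized here

response = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name, not one endorsed by the study
    temperature=0.2,  # lower temperature reduces the model's "creativity"
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the study below. Report its findings indirectly and "
                "in the past tense (e.g. 'the authors reported that the "
                "treatment was effective in this sample') and do not "
                "generalize beyond the population and conditions that were "
                "actually studied."
            ),
        },
        {"role": "user", "content": abstract},
    ],
)

print(response.choices[0].message.content)
```

Because the study found that blunt "avoid inaccuracies" instructions backfired, the prompt above spells out the desired reporting style (indirect, past tense) rather than simply asking for accuracy; whether this wording avoids the same effect would need to be tested.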
Finally, "If we want AI to support science literacy rather than undermine it," Peters said, "we need more vigilance and testing of these systems in science communication contexts."

More information: Uwe Peters et al, Generalization bias in large language model summarization of scientific research, Royal Society Open Science (2025). DOI: 10.1098/rsos.241776

Provided by Utrecht University

Citation: Prominent chatbots routinely exaggerate science findings, study shows (2025, May 13), retrieved 14 May 2025 from https://phys.org/news/2025-05-prominent-chatbots-routinely-exaggerate-science.html