{"id":317476,"date":"2025-10-20T02:26:27","date_gmt":"2025-10-20T02:26:27","guid":{"rendered":"https:\/\/www.europesays.com\/us\/317476\/"},"modified":"2025-10-20T02:26:27","modified_gmt":"2025-10-20T02:26:27","slug":"when-helpfulness-backfires-llms-and-the-risk-of-false-medical-information-due-to-sycophantic-behavior","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/us\/317476\/","title":{"rendered":"When helpfulness backfires: LLMs and the risk of false medical information due to sycophantic behavior"},"content":{"rendered":"<p>To evaluate language models across varying levels of drug familiarity, we used the RABBITS<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 30\" title=\"Gallifant, J. et al. Language models are surprisingly fragile to drug names in biomedical benchmarks. Findings of the Association for Computational Linguistics: EMNLP 2024. Stroudsburg, PA, USA: Association for Computational Linguistics, 12448&#x2013;12465 (2024)\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR30\" id=\"ref-link-section-d388331288e1004\" target=\"_blank\" rel=\"noopener\">30<\/a> dataset, which includes 550 common drugs with 1:1 mapping between their brand and generic names.<\/p>\n<p>To measure the relative familiarity of language models with these drugs, we tokenized multiple large pre-training corpora with the LLaMA tokenizer<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 9\" title=\"Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. 
Preprint at &#010;                  http:\/\/arxiv.org\/abs\/2307.09288&#010;                  &#010;                 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR9\" id=\"ref-link-section-d388331288e1011\" target=\"_blank\" rel=\"noopener\">9<\/a> using Infini-gram<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Liu, J., Min, S., Zettlemoyer, L., Choi, Y. &amp; Hajishirzi, H. Infini-gram: scaling unbounded n-gram language models to a trillion tokens. COLM (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR49\" id=\"ref-link-section-d388331288e1015\" target=\"_blank\" rel=\"noopener\">49<\/a>, including Dolma1.6<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Soldaini, L. et al. Dolma: an open corpus of three trillion tokens for language model pretraining research. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1. Association for Computational Linguistics 15725&#x2013;15788 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR50\" id=\"ref-link-section-d388331288e1019\" target=\"_blank\" rel=\"noopener\">50<\/a>, C4<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. 
Res 21, 5485&#x2013;555 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR51\" id=\"ref-link-section-d388331288e1023\" target=\"_blank\" rel=\"noopener\">51<\/a>, RedPajama<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Together, RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens. Available at &#010;                  https:\/\/www.together.ai\/blog\/redpajama&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR52\" id=\"ref-link-section-d388331288e1027\" target=\"_blank\" rel=\"noopener\">52<\/a>, and Pile<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Gao, L. et al. The pile: an 800GB dataset of diverse text for language modeling. Preprint at &#010;                  http:\/\/arxiv.org\/abs\/2101.00027&#010;                  &#010;                 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR53\" id=\"ref-link-section-d388331288e1032\" target=\"_blank\" rel=\"noopener\">53<\/a>. The frequency of generic drug names across these corpora was used to estimate how commonly these drugs appear in pre-training datasets. 
Generic drug names were then ranked by frequency to provide a proxy measure of model familiarity (note that C4 and RedPajama have overlapping content).<\/p>\n<p>To ensure coverage of both common and rare drugs, we selected 50 drugs from five distinct frequency ranges based on their rankings in the tokenized dataset: the top 10, 100\u2013110, 200\u2013210, 300\u2013310, and 400\u2013410 most frequent drugs in our sampling window.<\/p>\n<p>We evaluate the following LLMs: Llama3-8B-Instruct (Llama3-8B), Llama3-70B-Instruct (Llama3-70B), gpt-4o-mini-2024-07-18 (GPT4o-mini), gpt-4o-2024-05-13 (GPT4o), and gpt-4-0613 (GPT4). These models were chosen to represent the performance of current leading open- and closed-source models across a range of sizes.<\/p>\n<p>We designed four prompt types to evaluate the models\u2019 handling of new drug-related information, assessing persuasive ability, factual recall, and logical consistency (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>). Experiments were run via the OpenAI Batch API, and Llama models ran on A100-80GB GPUs with CUDA\u2009&gt;\u200912.0 and no quantization. Hyperparameters included a maximum of 512 output tokens and temperature\u2009=\u20090 for reproducibility.<\/p>\n<p>Stage 1. Baseline prompt<\/p>\n<p>The first prompt represents the baseline condition, where the model is tasked with writing a persuasive but illogical letter informing people that a brand-name drug has been found to have new side effects and that they should take the generic counterpart instead. This task was selected because it illustrates a necessary safety mode for LLMs that follows from simple logical reasoning. 
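The frequency ranking and five-window sampling described above can be sketched as follows. The drug names and counts below are synthetic stand-ins for the Infini-gram counts over Dolma, C4, RedPajama, and Pile, not the actual RABBITS data; only the ranking and windowing logic reflects the described setup.

```python
# Sketch of the familiarity proxy: sum per-corpus counts for each generic
# drug name, rank by total frequency, then take five windows of 10 drugs
# (ranks 1-10, 101-110, 201-210, 301-310, 401-410) for 50 drugs total.
# All names and counts here are synthetic placeholders.

WINDOWS = [(0, 10), (100, 110), (200, 210), (300, 310), (400, 410)]

def select_drugs(per_corpus_counts: dict) -> list:
    """per_corpus_counts: {corpus_name: {generic_name: count}}"""
    totals = {}
    for counts in per_corpus_counts.values():
        for drug, n in counts.items():
            totals[drug] = totals.get(drug, 0) + n
    ranked = sorted(totals, key=totals.get, reverse=True)  # most frequent first
    return [d for lo, hi in WINDOWS for d in ranked[lo:hi]]

# Synthetic example: 550 fake generic names with monotonically decreasing counts.
fake = {f"generic{i:03d}": 10_000 - i for i in range(550)}
corpora = {"dolma": fake, "c4": fake, "redpajama": fake, "pile": fake}
selected = select_drugs(corpora)
print(len(selected))  # 50
```

The windows deliberately span the ranked list so the evaluation covers both highly familiar and rarely seen drug names.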
If a model knows that the brand and generic drug are the same, it should be able to identify the request as illogical and reject it, instead of complying and generating false information.<\/p>\n<p>Stage 2. Prompt-based solutions to assess steerability<\/p>\n<p>Rejection prompt<\/p>\n<p>In this variation, we explicitly allow the possibility of rejection, encouraging the model to evaluate whether there is a logical flaw in the prompt. This prompt also gives a model that is heavily aligned toward compliance an explicit path to reject users\u2019 queries. The explicit permission to reject creates a scenario where the model must consider not only the factual content but also the appropriateness of the substitution.<\/p>\n<p>Factual recall prompt<\/p>\n<p>This prompt emphasizes the need for the model to recall the correct relationships between brand-name drugs and their generic equivalents before processing the rest of the request. This variation tests the model\u2019s ability to accurately retrieve and utilize known facts in generating persuasive outputs. By instructing the model to prioritize factual recall, we assess how well it can integrate known drug relationships with new information.<\/p>\n<p>Combined rejection and factual recall prompt<\/p>\n<p>The final prompt variation combines both the rejection and factual recall instructions. This setup evaluates whether the model can handle both tasks simultaneously, ensuring factual accuracy while also exercising logical reasoning to reject incorrect assumptions.<\/p>\n<p>Each prompt setting was evaluated with a separate LLM inference.<\/p>\n<p>Stage 3. 
Fine-tuning and evaluation on out-of-distribution (OOD) data<\/p>\n<p>Model fine-tuning<\/p>\n<p>To enhance the ability of smaller language models to handle complex drug substitution prompts, we fine-tuned Llama 3-8B Instruct and GPT4o-mini using the PERSIST instruction-tuning dataset, publicly available at <a href=\"https:\/\/huggingface.co\/datasets\/AIM-Harvard\/PERSIST\" target=\"_blank\" rel=\"noopener\">https:\/\/huggingface.co\/datasets\/AIM-Harvard\/PERSIST<\/a>.<\/p>\n<p>This dataset comprises 300 input-output pairs, each featuring a challenging \u201cBaseline\u201d prompt concerning brand\/generic drug substitutions (covering both directions for 50 drug pairs) and the corresponding desired response generated by a larger model (GPT4o-mini, GPT-4, or GPT4o) when presented with a \u201cCombined Rejection and Factual Recall Prompt\u201d.<\/p>\n<p>The dataset construction leveraged these larger models to systematically generate ideal responses for all 50 drug pairs in both substitution directions, resulting in 300 examples (50\u2009\u00d7\u20092\u2009\u00d7\u20093\u2009=\u2009300), drawing inspiration from work demonstrating effective instruction-tuning with limited data<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Zhou, C. et al. LIMA: Less is more for alignment. In Proc. 37th Conference on Neural Information Processing Systems (NeurIPS) (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR54\" id=\"ref-link-section-d388331288e1113\" target=\"_blank\" rel=\"noopener\">54<\/a>. We explored various hyperparameters, including learning rates (5e-6, 1e-5, 2e-5, 5e-5), batch sizes (1, 2), and epochs (2, 3) for Llama3-8B. For GPT4o-mini, we utilized OpenAI\u2019s automatic parameter search. 
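The 50 × 2 × 3 dataset construction can be sketched as below. The pair names and prompt text are placeholders; in the actual PERSIST pipeline, each target response comes from prompting the larger model with the "Combined Rejection and Factual Recall Prompt".

```python
from itertools import product

# Sketch: enumerate instruction-tuning examples as described above:
# 50 brand/generic pairs x 2 substitution directions x 3 teacher models
# = 300 input-output pairs. Names and prompt text are placeholders.
pairs = [(f"Brand{i}", f"generic{i}") for i in range(50)]
directions = ["brand_to_generic", "generic_to_brand"]
teachers = ["GPT4o-mini", "GPT-4", "GPT4o"]

examples = [
    {
        "input": f"Baseline prompt: claim new side effects and switch {a} to {b} ({d})",
        "teacher": t,  # model whose rejection-plus-explanation is the training target
    }
    for (a, b), d, t in product(pairs, directions, teachers)
]
print(len(examples))  # 300
```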
Ultimately, the selected Llama3-8B model used a learning rate of 1e-5, a batch size of 2, and 3 epochs, while the selected GPT4o-mini was fine-tuned via the OpenAI API with a batch size of 1, 3 epochs, and a seed of 318998491. The core objective of this fine-tuning process was to equip the smaller models with the ability to emulate the larger models\u2019 successful rejection and explanation behavior when faced with the \u201cCombined Rejection and Factual Recall Prompt\u201d.<\/p>\n<p>We used 2\u2009\u00d7\u2009A100 80GB GPUs to fine-tune on our examples for 2 epochs with a learning rate of 1e-5, which completes in under an hour. The estimated cost is under $10 when renting cloud GPUs. Fine-tuning GPT4o-mini was free of cost under the OpenAI Trial program. However, the resulting custom models incur 1.5\u00d7 inference costs.<\/p>\n<p>Evaluation on OOD data<\/p>\n<p>To evaluate the generalization of the fine-tuned model to other illogical requests, we tested its performance on OOD datasets of terms with the same meanings (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#Fig3\" target=\"_blank\" rel=\"noopener\">3<\/a>). This OOD dataset included several additional categories of such terms. Testing on OOD data allows us to assess the generalizability of a model\u2019s behavior in responding to illogical requests involving novel or previously unseen entities\u2014a crucial factor in evaluating its applicability in real-world scenarios.<\/p>\n<p>Stage 4: Evaluating general benchmarks and compliance with logical requests<\/p>\n<p>Balancing rejection and compliance<\/p>\n<p>To test whether models became overly conservative after fine-tuning, we designed an additional test set comprising 20 cases (10 real FDA drug safety recalls, 5 hypothetical event-cancellation situations, and 5 real government announcements) where the model should comply with the prompt rather than reject it (Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#Fig5\" target=\"_blank\" rel=\"noopener\">5<\/a>). These cases involved scenarios where the recommended substitution was appropriate and aligned with the correct drug relationships. This test ensured that the model retained the ability to provide helpful and persuasive responses when no logical flaws were present. The prompts are found in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#MOESM1\" target=\"_blank\" rel=\"noopener\">1<\/a>. Additionally, we prompted the fine-tuned models with questions about the 50 common drugs used in fine-tuning to check whether they could still answer logical requests about those drugs.<\/p>\n<p><b id=\"Fig5\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 5: LLM ability to comply with logical requests.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-02008-z\/figures\/5\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig5\" src=\"https:\/\/www.europesays.com\/us\/wp-content\/uploads\/2025\/10\/41746_2025_2008_Fig5_HTML.png\" alt=\"figure 5\" loading=\"lazy\" width=\"685\" height=\"229\"\/><\/a><\/p>\n<p>To further investigate our fine-tuned models\u2019 behavior, we provided three different subcategories of new, logical, and correct in-context information requests, and assessed whether the LLMs complied. 
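Scoring this should-comply set (10 FDA recalls, 5 event cancellations, 5 government announcements) reduces to a per-category tally; a minimal sketch follows, with hypothetical labels standing in for the manual annotations:

```python
from collections import Counter

# Sketch: tally per-category compliance on the 20-case should-comply set
# described above. Labels are hypothetical placeholders for the manual
# annotations; a well-behaved model complies with all 20 prompts.
categories = ["fda_recall"] * 10 + ["event_cancellation"] * 5 + ["gov_announcement"] * 5
labels = ["comply"] * 20  # placeholder annotations

complied = Counter(c for c, lab in zip(categories, labels) if lab == "comply")
rate = sum(complied.values()) / len(categories)
print(rate)  # 1.0
```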
Authors SC and MG performed the annotation manually, with 100% inter-annotator agreement.<\/p>\n<p>General benchmark evaluation<\/p>\n<p>To ensure that fine-tuning and prompt modifications do not degrade the overall performance of the models, we evaluated them on a broad set of general benchmarks using Inspect<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"inspect_ai: Inspect: A framework for large language model evaluations, Github; &#010;                  https:\/\/github.com\/UKGovernmentBEIS\/inspect_ai&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR55\" id=\"ref-link-section-d388331288e1177\" target=\"_blank\" rel=\"noopener\">55<\/a> and Alpaca-Eval2 v0.6.5<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"alpaca_eval: An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast (Github; &#010;                  https:\/\/github.com\/tatsu-lab\/alpaca_eval&#010;                  &#010;                ).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR56\" id=\"ref-link-section-d388331288e1181\" target=\"_blank\" rel=\"noopener\">56<\/a>, with GPT4-turbo as the comparator model. These benchmarks were selected to test the models\u2019 reasoning, factual recall, and domain-specific knowledge, including medical contexts, ensuring that any improvements in handling drug-related prompts did not come at the expense of general task performance. The confidence intervals are calculated using the central limit theorem, a common practice in modern LLM evaluations<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Miller, E. 
Adding error bars to evals: A statistical approach to language model evaluations. Preprint at &#010;                  http:\/\/arxiv.org\/abs\/2411.00640&#010;                  &#010;                 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR57\" id=\"ref-link-section-d388331288e1185\" target=\"_blank\" rel=\"noopener\">57<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Grattafiori, A. et al. The Llama 3 herd of models. Preprint at &#010;                  http:\/\/arxiv.org\/abs\/2407.21783&#010;                  &#010;                 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR58\" id=\"ref-link-section-d388331288e1188\" target=\"_blank\" rel=\"noopener\">58<\/a>.<\/p>\n<p>Automated evaluation<\/p>\n<p>Model outputs were classified into four categories: (1) rejecting the request and explaining the logical flaw; (2) fulfilling the request and explaining the logical flaw; (3) rejecting the request without explaining the logical flaw; and (4) fulfilling the request without explaining the logical flaw. Model outputs were evaluated using a multi-step annotation process. The detailed counts of instances we evaluated in this research are available in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#MOESM1\" target=\"_blank\" rel=\"noopener\">2<\/a>. To ensure consistency and reliability in the evaluation, we employed Claude 3.5 Sonnet (we chose a model from a separate family as the annotator because LLMs of the same family are known to have a favorable bias toward their own responses<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Panickssery, A., Bowman, S. R. &amp; Feng, S. 
LLM evaluators recognize and favor their own generations. In Proc. 38th Conference on Neural Information Processing Systems (NeurIPS) (2024).\" href=\"#ref-CR59\" id=\"ref-link-section-d388331288e1204\">59<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Wataoka, K., Takahashi, T. &amp; Ri, R. Self-preference bias in LLM-as-a-judge. Preprint at &#10;                  https:\/\/doi.org\/10.48550\/arXiv.2410.21819&#10;                  &#10;                 (2024).\" href=\"#ref-CR60\" id=\"ref-link-section-d388331288e1204_1\">60<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Xu, W. et al. Pride and prejudice: LLM amplifies self-bias in self-refinement. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1, 15474&#x2013;15492 (2024).\" href=\"#ref-CR61\" id=\"ref-link-section-d388331288e1204_2\">61<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 62\" title=\"Laurito, W. et al. AI AI bias: large language models favor their own generated content. Proc. Natl. Acad. Sci. USA 122, e2415697122, (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#ref-CR62\" id=\"ref-link-section-d388331288e1207\" target=\"_blank\" rel=\"noopener\">62<\/a>) to provide initial annotations, with human reviewers (annotators SC and MG blinded to each other) validating 50 outputs from GPT4o-mini. The inter-annotator agreement between Claude 3.5 Sonnet and the human reviewers was 98%, with 100% agreement between the two human annotators for both in-domain and out-of-domain data. 
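The agreement figures above reduce to simple percent agreement. The labels below are synthetic, arranged so the judge disagrees with the humans on exactly one of the 50 validated outputs, which reproduces the reported 98%:

```python
# Sketch: percent agreement between LLM-judge and human labels over the
# four output categories. Labels are synthetic: 49/50 matches gives the
# 98% judge-human agreement reported above.
def percent_agreement(a, b):
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

human = ["reject_explain"] * 50
judge = ["reject_explain"] * 49 + ["fulfill_explain"]  # one disagreement
print(percent_agreement(human, judge))  # 0.98
```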
Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-02008-z#MOESM1\" target=\"_blank\" rel=\"noopener\">3<\/a> shows the single output for which the human labels disagreed with Claude 3.5 Sonnet. Of note, compliance with logical requests was human-labeled.<\/p>\n","protected":false},"excerpt":{"rendered":"To evaluate language models across varying levels of drug familiarity, we used the RABBITS30 dataset, which includes 550&hellip;\n","protected":false},"author":3,"featured_media":317477,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[35],"tags":[15576,150,834,210,1141,1142,3740,3209,67,132,68],"class_list":{"0":"post-317476","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-health-care","8":"tag-biomedicine","9":"tag-biotechnology","10":"tag-general","11":"tag-health","12":"tag-health-care","13":"tag-healthcare","14":"tag-medical-research","15":"tag-medicine-public-health","16":"tag-united-states","17":"tag-unitedstates","18":"tag-us"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@us\/115404130612440049","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/317476","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/comments?post=317476"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/317476\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/
us\/wp-json\/wp\/v2\/media\/317477"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media?parent=317476"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/categories?post=317476"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/tags?post=317476"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}