  • Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).

  • Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  • Chen, R. J. et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 30, 850–862 (2024).

  • Habicht, J. et al. Closing the accessibility gap to mental health treatment with a personalized self-referral chatbot. Nat. Med. 30, 595–602 (2024).

  • Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473 (2024).

  • Wan, P. et al. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat. Med. 30, 2878–2885 (2024).

  • Huang, K. et al. A foundation model for clinician-centered drug repurposing. Nat. Med. 30, 3601–3613 (2024).

  • Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat. Med. 30, 2886–2896 (2024).

  • Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025).

  • Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  • Johri, S. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat. Med. 31, 77–86 (2025).

  • Ao, G. et al. Comparative analysis of large language models on rare disease identification. Orphanet J. Rare Dis. 20, 150 (2025).

  • Shyr, C. Large language models for rare disease diagnosis at the undiagnosed diseases network. JAMA Netw. Open 8, e2528538 (2025).

  • Weiner, S. J. & Schwartz, A. Listening for What Matters: Avoiding Contextual Errors in Health Care (Oxford Univ. Press, 2023).

  • Yu, K.-H. & Kohane, I. S. Framing the challenges of artificial intelligence in medicine. BMJ Qual. Saf. 28, 238–241 (2019).

  • Zhang, S., Liu, Q., Qin, G., Naumann, T. & Poon, H. Med-RLVR: emerging medical reasoning from a 3B base model via reinforcement learning. Preprint at https://doi.org/10.48550/arXiv.2502.19655 (2025).

  • Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).

  • McDermott, M. B. A., Yap, B., Szolovits, P. & Zitnik, M. Structure-inducing pre-training. Nat. Mach. Intell. 5, 612–621 (2023).

  • Guo, L. L. et al. A multi-center study on the adaptability of a shared foundation model for electronic health records. npj Digit. Med. 7, 171 (2024).

  • Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. npj Digit. Med. 6, 135 (2023).

  • Pais, C. et al. Large language models for preventing medication direction errors in online pharmacies. Nat. Med. 30, 1574–1582 (2024).

  • Sabuncu, M. R., Wang, A. Q. & Nguyen, M. Ethical use of artificial intelligence in medical diagnostics demands a focus on accuracy, not fairness. NEJM AI 2, AIp2400672 (2024).

  • Li, M. M. et al. Contextual AI models for single-cell protein biology. Nat. Methods 21, 1546–1557 (2024).

  • Kather, J. N., Ferber, D., Wiest, I. C., Gilbert, S. & Truhn, D. Large language models could make natural language again the universal interface of healthcare. Nat. Med. 30, 2708–2710 (2024).

  • Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Proc. Syst. 33, 1877–1901 (2020).

  • Liu, H. et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv. Neural Inf. Proc. Syst. 35, 1950–1965 (2022).

  • Pan, J., Gao, T., Chen, H. & Chen, D. What in-context learning ‘learns’ in-context: disentangling task recognition and task learning. In Findings of the Association for Computational Linguistics 8298–8319 (ACL, 2023).

  • Min, S. et al. Rethinking the role of demonstrations: what makes in-context learning work? In Proc. 2022 Conference on Empirical Methods in Natural Language Processing 11048–11064 (ACL, 2022).

  • Chen, B., Zhang, Z., Langrené, N. & Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 6, 101260 (2025).

  • Shen, S. et al. Multitask vision-language prompt tuning. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 5656–5667 (IEEE, 2024).

  • Wang, W. et al. VisionLLM: large language model is also an open-ended decoder for vision-centric tasks. Adv. Neural Inf. Proc. Syst. 36, 61501–61513 (2023).

  • Tanwani, A. K., Barral, J. & Freedman, D. RepsNet: combining vision with language for automated medical reports. In International Conference on Medical Image Computing and Computer-assisted Intervention 714–724 (Springer, 2022).

  • Shentu, J. & Al Moubayed, N. CXR-IRGen: an integrated vision and language model for the generation of clinically accurate chest X-ray image-report pairs. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (IEEE, 2024).

  • Wu, S. et al. CollabLLM: from passive responders to active collaborators. In Proc. 42nd International Conference on Machine Learning (PMLR, 2025).

  • Alsentzer, E. et al. Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases. npj Digit. Med. 8, 380 (2025).

  • Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7, e2440969 (2024).

  • Wang, L. et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. npj Digit. Med. 7, 41 (2024).

  • Khattab, O. et al. DSPy: compiling declarative language model calls into state-of-the-art pipelines. In International Conference on Learning Representations (ICLR, 2024).

  • Yuksekgonul, M. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025).

  • Vaziri, M., Mandel, L., Spiess, C. & Hirzel, M. PDL: a declarative prompt programming language. Preprint at https://doi.org/10.48550/arXiv.2410.19135 (2024).

  • Lu, Y. et al. Towards doctor-like reasoning: Medical RAG fusing knowledge with patient analogy through textual gradients. In 39th Conference on Neural Information Processing Systems (NeurIPS, 2025).

  • Maharjan, J. et al. OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci. Rep. 14, 14156 (2024).

  • Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at https://doi.org/10.48550/arXiv.2311.16452 (2023).

  • Wu, S., Koo, M., Scalzo, F. & Kurtz, I. AutoMedPrompt: a new framework for optimizing LLM medical prompts using textual gradients. Preprint at https://doi.org/10.48550/arXiv.2502.15944 (2025).

  • Yu, F. et al. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat. Med. 30, 837–849 (2024).

  • Rrv, A., Tyagi, N., Uddin, M. N., Varshney, N. & Baral, C. Chaos with keywords: exposing large language models sycophancy to misleading keywords and evaluating defense strategies. In Findings of the Association for Computational Linguistics 12717–12733 (ACL, 2024).

  • Fanous, A. et al. SycEval: evaluating LLM sycophancy. In Proc. AAAI/ACM Conference on AI, Ethics, and Society 8, 893–900 (ACM, 2025).

  • Su, X. et al. KGARevion: an AI agent for knowledge-intensive biomedical QA. In International Conference on Learning Representations (ICLR, 2025).

  • Zhang, G. et al. Leveraging long context in retrieval augmented language models for medical question answering. npj Digit. Med. 8, 239 (2025).

  • Ke, Y. H. et al. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. npj Digit. Med. 8, 187 (2025).

  • Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. npj Digit. Med. 7, 102 (2024).

  • Lopez, I. et al. Clinical entity augmented retrieval for clinical information extraction. npj Digit. Med. 8, 45 (2025).

  • Asai, A., Wu, Z., Wang, Y., Sil, A. & Hajishirzi, H. Self-RAG: learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations (ICLR, 2024).

  • Yang, D., Zeng, L., Rao, J. & Zhang, Y. Knowing you don’t know: learning when to continue search in multi-round RAG through self-practicing. In Proc. 48th International ACM SIGIR Conference on Research and Development in Information Retrieval 1305–1315 (ACM, 2025).

  • Islam, S. B. et al. Open-RAG: enhanced retrieval augmented reasoning with open-source large language models. In Findings of the Association for Computational Linguistics 14231–14244 (ACL, 2024).

  • Jeong, S., Baek, J., Cho, S., Hwang, S. J. & Park, J. C. Adaptive-RAG: learning to adapt retrieval-augmented large language models through question complexity. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol. 1: Long Papers) 7036–7050 (ACL, 2024).

  • Yang, R. et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst. 2, 2 (2025).

  • Anisuzzaman, D. M., Malins, J. G., Friedman, P. A. & Attia, Z. I. Fine-tuning large language models for specialized use cases. Mayo Clin. Proc. Digit. Health 3, 100184 (2025).

  • Wiest, I. C. et al. Deidentifying medical documents with local, privacy-preserving large language models: the LLM-anonymizer. NEJM AI 2, 4 (2025).

  • Croskerry, P. A universal model of diagnostic reasoning. Acad. Med. 84, 1022–1028 (2009).

  • Geiping, J. et al. Scaling up test-time compute with latent reasoning: a recurrent depth approach. In 39th Annual Conference on Neural Information Processing Systems (NeurIPS, 2025).

  • Makarov, N. et al. Large language models forecast patient health trajectories enabling digital twins. npj Digit. Med. 8, 588 (2025).

  • Renc, P. et al. Zero shot health trajectory prediction using transformer. npj Digit. Med. 7, 256 (2024).

  • Wang, J. et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat. Med. 31, 609–617 (2025).

  • Rao, V. M. et al. Multimodal generative AI for medical image interpretation. Nature 639, 888–896 (2025).

  • Duan, Y., Xu, C., Pei, J., Han, J. & Li, C. Pre-train and plug-in: flexible conditional text generation with variational auto-encoders. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 253–262 (ACL, 2020).

  • Epstein, D., Jabri, A., Poole, B., Efros, A. & Holynski, A. Diffusion self-guidance for controllable image generation. Adv. Neural Inf. Proc. Syst. 36, 16222–16239 (2023).

  • Li, Z. et al. ControlAR: controllable image generation with autoregressive models. In 13th International Conference on Learning Representations (ICLR, 2025).

  • Beattie, J. et al. Using large language models to create patient centered consent forms. Int. J. Radiat. Oncol. Biol. Phys. 120, e612 (2024).

  • Shi, Q. et al. Transforming informed consent generation using large language models: mixed methods study. JMIR Med. Inform. 13, e68139 (2025).

  • Rudra, P., Balke, W.-T., Kacprowski, T., Ursin, F. & Salloch, S. Large language models for surgical informed consent: an ethical perspective on simulated empathy. J. Med. Ethics https://doi.org/10.1136/jme-2024-110652 (2025).

  • Ravfogel, S., Goldberg, Y. & Goldberger, J. Conformal nucleus sampling. In Findings of the Association for Computational Linguistics 27–34 (ACL, 2023).

  • Minh, N. N. et al. Turning up the heat: min-p sampling for creative and coherent LLM outputs. In 13th International Conference on Learning Representations (ICLR, 2025).

  • Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Conditional prompt learning for vision-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16816–16825 (IEEE, 2022).

  • Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat. Med. 31, 3394–3403 (2025).

  • Wen, J. et al. The genetic architecture of multimodal human brain age. Nat. Commun. 15, 2604 (2024).

  • Mizrahi, D. et al. 4M: massively multimodal masked modeling. Adv. Neural Inf. Proc. Syst. 36, 58363–58408 (2023).

  • Meng, X., Sun, K., Xu, J., He, X. & Shen, D. Multi-modal modality-masked diffusion network for brain MRI synthesis with random modality missing. IEEE Trans. Med. Imaging 43, 2587–2598 (2024).

  • Stahlschmidt, S. R., Ulfenborg, B. & Synnergren, J. Multimodal deep learning for biomedical data fusion: a review. Brief Bioinform. 23, bbab569 (2022).

  • Boehm, K. M., Khosravi, P., Vanguri, R., Gao, J. & Shah, S. P. Harnessing multimodal data integration to advance precision oncology. Nat. Rev. Cancer 22, 114–126 (2022).

  • Johnson, R., Li, M. M., Noori, A., Queen, O. & Zitnik, M. Graph artificial intelligence in medicine. Annu. Rev. Biomed. Data Sci. 7, 345–368 (2024).

  • Kline, A. et al. Multimodal machine learning in precision health: a scoping review. npj Digit. Med. 5, 171 (2022).

  • Huang, Y. et al. Multimodal AI predicts clinical outcomes of drug combinations from preclinical data. Preprint at https://doi.org/10.48550/arXiv.2503.02781 (2025).

  • Zhang, Y. et al. Multiple heads are better than one: mixture of modality knowledge experts for entity representation learning. In 13th International Conference on Learning Representations (ICLR, 2025).

  • Bao, H. et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts. Adv. Neural Inf. Proc. Syst. 35, 32897–32912 (2022).

  • Yun, S. et al. Flex-MoE: modeling arbitrary modality combination via the flexible mixture-of-experts. Adv. Neural Inf. Proc. Syst. 37, 98782–98805 (2024).

  • Cho, M. et al. Cocoon: robust multi-modal perception with uncertainty-aware sensor fusion. In 13th International Conference on Learning Representations (ICLR, 2025).

  • Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).

  • McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).

  • Gao, S. et al. Empowering biomedical discovery with AI agents. Cell 187, 6125–6151 (2024).

  • Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).

  • Gao, S. et al. TxAgent: an AI agent for therapeutic reasoning across a universe of tools. Preprint at https://doi.org/10.48550/arXiv.2503.10970 (2025).

  • Qu, X. et al. A survey of efficient reasoning for large reasoning models: language, multimodality, and beyond. Preprint at https://doi.org/10.48550/arXiv.2503.21614 (2025).

  • Besta, M. et al. Reasoning language models: a blueprint. Preprint at https://doi.org/10.48550/arXiv.2501.11223 (2025).

  • Johnson, R. et al. ClinVec: unified embeddings of clinical codes enable knowledge-grounded AI in medicine. Preprint at medRxiv https://doi.org/10.1101/2024.12.03.24318322 (2025).

  • Wallace, E. et al. Managing patients with multimorbidity in primary care. BMJ 350, h176 (2015).

  • Spillmann, R. C. et al. A window into living with an undiagnosed disease: illness narratives from the Undiagnosed Diseases Network. Orphanet J. Rare Dis. 12, 1–11 (2017).

  • Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Proc. Syst. 35, 24824–24837 (2022).

  • Rafailov, R. et al. Direct preference optimization: your language model is secretly a reward model. Adv. Neural Inf. Proc. Syst. 36, 53728–53741 (2023).

  • Nathani, D. et al. MLGym: a new framework and benchmark for advancing AI research agents. Preprint at https://doi.org/10.48550/arXiv.2502.14499 (2025).

  • Jiang, Y. et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI 2, AIdbp2500144 (2025).

  • Kazemi, M. et al. BIG-bench extra hard. In Proc. 63rd Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers) 26473–26501 (ACL, 2025).

  • Liang, P. et al. Holistic evaluation of language models. Preprint at https://doi.org/10.48550/arXiv.2211.09110 (2023).

  • Choi, H. K., Khanov, M., Wei, H. & Li, Y. How contaminated is your benchmark? Measuring dataset leakage in large language models with kernel divergence. In 13th International Conference on Learning Representations (ICLR, 2025).

  • Ektefaie, Y. et al. Evaluating generalizability of artificial intelligence models for molecular datasets. Nat. Mach. Intell. 6, 1512–1524 (2024).

  • Bourlon, M. T. et al. Envisioning academic global oncologists: proposed competencies for global oncology training from ASCO. JCO Glob. Oncol. 10, e2300157 (2024).

  • Johnson-Peretz, J. et al. Geographical, social, and political contexts of tuberculosis control and intervention, as reported by mid-level health managers in Uganda: ‘the activity around town’. Soc. Sci. Med. 338, 116363 (2023).

  • Ning, Y. et al. An ethics assessment tool for artificial intelligence implementation in healthcare: CARE-AI. Nat. Med. 30, 3038–3039 (2024).

  • Boverhof, B.-J. et al. Radiology AI Deployment and Assessment Rubric (RADAR) to bring value-based AI into radiological practice. Insights Imaging 15, 34 (2024).

  • Dagan, N. et al. Evaluation of AI solutions in health care organizations — the OPTICA tool. NEJM AI 1, AIcs2300269 (2024).

  • Borja, N. A. et al. Advancing equity in rare disease research: insights from the undiagnosed disease network. Am. J. Med. Genet. A 197, e63904 (2025).

  • Williams, J. S., Walker, R. J. & Egede, L. E. Achieving equity in an evolving healthcare system: opportunities and challenges. Am. J. Med. Sci. 351, 33–43 (2016).

  • Pool, J., Indulska, M. & Sadiq, S. Large language models and generative AI in telehealth: a responsible use lens. J. Am. Med. Inform. Assoc. 31, 2125–2136 (2024).

  • Yu, K.-H., Healey, E., Leong, T.-Y., Kohane, I. S. & Manrai, A. K. Medical artificial intelligence and human values. N. Engl. J. Med. 390, 1895–1904 (2024).

  • Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020).

  • Wei, J. et al. Finetuned language models are zero-shot learners. In 10th International Conference on Learning Representations (ICLR, 2022).

  • Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).

  • Gururangan, S. et al. Don’t stop pretraining: adapt language models to domains and tasks. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 8342–8360 (ACL, 2020).

  • Schick, T. et al. Toolformer: language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 36, 68539–68551 (2023).