Russell, S. & Norvig, P. Artificial Intelligence: A Modern Approach (Pearson, 2021).

Turing, A. M. Computing machinery and intelligence. Mind 59, 433–460 (1950).

Tracy, M., Cerdá, M. & Keyes, K. M. Agent-based modeling in public health: current applications and future directions. Annu. Rev. Public Health 39, 77–94 (2018).

Sridharan, P. & Ghosh, M. Gene expression and agent-based modeling improve precision prognosis in breast cancer. Sci. Rep. 15, 17059 (2025).

Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).

Christiano, P. F. et al. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).

Wang, Y. et al. Reinforcement learning for reasoning in large language models with one training example. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.20571 (2025).

DeepSeek-AI et al. DeepSeek-V3.2: Pushing the frontier of open large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2512.02556 (2025).

Rastogi, A. et al. Magistral. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.10910 (2025).

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Burstein, J., Doran, C. & Solorio, T.) Vol. 1, 4171–4186 (Association for Computational Linguistics, 2019).

BigScience Workshop et al. BLOOM: a 176B-parameter open-access multilingual language model. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.05100 (2022).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 248:1–248:38 (2023).

Kalai, A. T., Nachum, O., Vempala, S. S. & Zhang, E. Why language models hallucinate. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.04664 (2025).

Jayaraman, P., Desman, J., Sabounchi, M., Nadkarni, G. N. & Sakhuja, A. A primer on reinforcement learning in medicine for clinicians. NPJ Digit. Med. 7, 337 (2024).

Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (The MIT Press, 2020).

Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proceedings of the 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 27730–27744 (Curran Associates, 2022).

Casper, S. et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Trans. Mach. Learn. Res. https://openreview.net/pdf?id=bx24KpJ4Eb (2023).

Skalse, J., Howe, N. H. R., Krasheninnikov, D. & Krueger, D. Defining and characterizing reward hacking. In Proceedings of the 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 9460–9471 (Curran Associates, 2022).

Uesato, J. et al. Solving math word problems with process- and outcome-based feedback. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.14275 (2022).

Lightman, H. et al. Let’s verify step by step. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 39578–39601 (2024).

Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).

Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. Preprint at arXiv https://doi.org/10.48550/arXiv.2212.08073 (2022).

Novikov, A. et al. AlphaEvolve: a coding agent for scientific and algorithmic discovery. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.13131 (2025).

Gibney, E. DeepMind unveils ‘spectacular’ general-purpose science AI. Nature 641, 827–828 (2025).

Zhang, J., Hu, S., Lu, C., Lange, R. & Clune, J. Darwin Gödel Machine: open-ended evolution of self-improving agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.22954 (2025).

Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023 (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 1049–1065 (Association for Computational Linguistics, 2023).

Hendrycks, D. et al. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks Vol. 1 (eds Vanschoren, J. & Yeung, S.) (2021).

Rajani, N. F., McCann, B., Xiong, C. & Socher, R. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (eds Korhonen, A., Traum, D. & Màrquez, L.) 4932–4942 (Association for Computational Linguistics, 2019).

Taylor, R. et al. Galactica: a large language model for science. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.09085 (2022).

Wang, L. et al. Parameter-efficient fine-tuning in large language models: a survey of methodologies. Artif. Intell. Rev. 58, 227 (2025).

Fu, Y., Peng, H., Sabharwal, A., Clark, P. & Khot, T. Complexity-based prompting for multi-step reasoning. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).

Liu, J. et al. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures (eds Agirre, E., Apidianaki, M. & Vulić, I.) 100–114 (Association for Computational Linguistics, 2022).

Zhang, Z., Zhang, A., Li, M. & Smola, A. Automatic chain of thought prompting in large language models. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).

Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. In Proceedings of the 37th Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 11809–11822 (Curran Associates, 2023).

Besta, M. et al. Graph of thoughts: solving elaborate problems with large language models. Proc. AAAI Conf. Artif. Intell. 38, 17682–17690 (2024).

Shojaee, P. et al. The illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.06941 (2025).

Goyal, S. et al. Think before you speak: training language models with pause tokens. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 27896–27923 (2024).

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K., Jiang, J., Ng, V. & Wan, X.) 2567–2577 (Association for Computational Linguistics, 2019).

Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at arXiv https://doi.org/10.48550/arXiv.2110.14168 (2021).

Wang, P. et al. Math-Shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (eds Ku, L.-W., Martins, A. & Srikumar, V.) Vol. 1, 9426–9439 (Association for Computational Linguistics, 2024).

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 8634–8652 (Curran Associates, 2023).

Gou, Z. et al. CRITIC: large language models can self-correct with tool-interactive critiquing. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 57734–57811 (2024).

Madaan, A. et al. Self-refine: iterative refinement with self-feedback. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 46534–46594 (Curran Associates, 2023).

Crosby, M., Rovatsos, M. & Petrick, R. Automated agent decomposition for classical planning. In Proceedings of the International Conference on Automated Planning and Scheduling Vol. 23 (eds Borrajo, D. et al.) 46–54 (2013).

Huang, X. et al. Understanding the planning of LLM agents: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.02716 (2024).

Zhou, D. et al. Least-to-most prompting enables complex reasoning in large language models. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).

Xu, B. et al. ReWOO: decoupling reasoning from observations for efficient augmented language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.18323 (2023).

Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).

Shen, Y. et al. HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face. In Proceedings of the 37th Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 38154–38180 (Curran Associates, 2023).

Prasad, A. et al. ADaPT: as-needed decomposition and planning with language models. In Proceedings of Findings of the Association for Computational Linguistics: NAACL 2024 (eds Duh, K., Gomez, H. & Bethard, S.) 4226–4252 (Association for Computational Linguistics, 2024).

Liu, B. et al. LLM + P: empowering large language models with optimal planning proficiency. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.11477 (2023).

Feng, P. et al. AGILE: a novel reinforcement learning framework of LLM agents. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 5244–5284 (Curran Associates, 2024).

Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

Hutter, F., Kotthoff, L. & Vanschoren, J. (eds). Automated Machine Learning: Methods, Systems, Challenges 151–160 (Springer International Publishing, 2019).

Hernandez, J. G., Saini, A. K., Ghosh, A. & Moore, J. H. The tree-based pipeline optimization tool: tackling biomedical research problems with genetic programming and automated machine learning. Patterns 6, 101314 (2025).

Himmelstein, D. S. et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 6, e26726 (2017).

Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature 646, 716–723 (2025).

Schick, T. et al. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 68539–68551 (Curran Associates, 2023).

Lu, P. et al. Chameleon: plug-and-play compositional reasoning with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 43447–43478 (Curran Associates, 2023).

Patil, S. G., Zhang, T., Wang, X. & Gonzalez, J. E. Gorilla: large language model connected with massive APIs. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 126544–126565 (Curran Associates, 2024).

Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) 9459–9474 (Curran Associates, 2020).

Petroni, F. et al. How context affects language models’ factual predictions. In Proceedings of the Automated Knowledge Base Construction (eds McCallum, A. et al.) (2020).

Fan, W. et al. A survey on RAG meeting LLMs: towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds Baeza-Yates, R. & Bonchi, F.) 6491–6501 (Association for Computing Machinery, 2024).

Jeong, M., Sohn, J., Sung, M. & Kang, J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 40, i119–i129 (2024).

Lu, J. et al. MemoChat: tuning LLMs to use memos for consistent long-range open-domain conversation. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.08239 (2023).

Zhong, W., Guo, L., Gao, Q., Ye, H. & Wang, Y. MemoryBank: enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence (eds Wooldridge, M., Dy, J. & Natarajan, S.) 19724–19731 (2024).

Park, J. S. et al. Generative agents: interactive simulacra of human behavior. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.03442 (2023).

Li, Y. et al. ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus 15, e40895 (2023).

Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J. & Chalef, D. Zep: a temporal knowledge graph architecture for agent memory. Preprint at arXiv https://doi.org/10.48550/arXiv.2501.13956 (2025).

Edge, D. et al. From local to global: a graph RAG approach to query-focused summarization. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.16130 (2025).

Zhang, Z. et al. A survey on the memory mechanism of large language model-based agents. ACM Trans. Inf. Syst. 43, 155:1–155:47 (2025).

Yan, B. et al. Beyond self-talk: a communication-centric survey of LLM-based multi-agent systems. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.14321 (2025).

Qian, C. et al. ChatDev: communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (eds Ku, L.-W., Martins, A. & Srikumar, V.) Vol. 1, 15174–15186 (Association for Computational Linguistics, 2024).

Hong, S. et al. MetaGPT: meta programming for a multi-agent collaborative framework. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 23247–23275 (2024).

Zhuge, M. et al. GPTSwarm: language agents as optimizable graphs. In Proceedings of the 41st International Conference on Machine Learning Vol. 235 (eds Salakhutdinov, R. R. et al.) 62743–62767 (2024).

Google Cloud. Agent2Agent (A2A) Protocol. https://a2a-protocol.org/latest/ (2025).

Borghoff, U. M., Bottoni, P. & Pareschi, R. Human-artificial interaction in the age of agentic AI: a system-theoretical approach. Front. Hum. Dyn. 7, 1579166 (2025).

Hua, W. et al. Interactive speculative planning: enhance agent efficiency through co-design of system and user interface. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 14256–14283 (2025).

Hou, X., Zhao, Y., Wang, S. & Wang, H. Model Context Protocol (MCP): landscape, security threats, and future research directions. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.23278 (2025).

Kuehl, M. et al. BioContextAI is a community hub for agentic biomedical systems. Nat. Biotechnol. 43, 1755–1757 (2025).

Yang, J. et al. SWE-agent: agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 50528–50652 (Curran Associates, 2024).

Ferber, D. et al. Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology. Nat. Cancer 6, 1337–1349 (2025).

Tang, X. et al. MedAgents: large language models as collaborators for zero-shot medical reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 599–621 (Association for Computational Linguistics, 2024).

Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).

Li, S. et al. SciLitLLM: How to adapt LLMs for scientific literature understanding. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 56025–56048 (2025).

Wang, Y. et al. Biomedical information retrieval with positive-unlabeled learning and knowledge graphs. ACM Trans. Intell. Syst. Technol. (2024).

Yang, Z., Dabre, R., Tanaka, H. & Okazaki, N. SciCap+: a knowledge augmented dataset to study the challenges of scientific figure captioning. J. Nat. Lang. Process. 31, 1140–1165 (2024).

Zhang, S. et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI 2, AIoa2400640 (2025).

Qi, B. et al. Large language models as biomedical hypothesis generators: a comprehensive evaluation. In Proceedings of the 1st Conference on Language Modeling (eds. Artzi, Y. et al.) (2024).

Gottweis, J. et al. Towards an AI co-scientist. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.18864 (2025).

Zhang, Y. et al. A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research. Nat. Mach. Intell. 7, 602–614 (2025).

Huang, K. et al. Automated hypothesis validation with agentic sequential falsifications. In Proceedings of the 42nd International Conference on Machine Learning Vol. 267 (eds Singh, A. et al.) 25372–25437 (PMLR, 2025).

O’Donoghue, O. et al. BioPlanner: automatic evaluation of LLMs on protocol planning in biology. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H., Pino, J. & Bali, K.) 2676–2694 (Association for Computational Linguistics, 2023).

Roohani, Y. et al. BioDiscoveryAgent: an AI agent for designing genetic perturbation experiments. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 26417–26466 (2025).

Liu, S. et al. DrugAgent: automating AI-aided drug discovery programming through LLM multi-agent collaboration. In Proceedings of the 2nd AI4Research Workshop: Towards a Knowledge-Grounded Scientific Research Lifecycle (eds Wang, Q. et al.) (2024).

Ma, M. D. et al. Orchestrating tool ecosystem of drug discovery with intention-aware LLM agents. In Towards Agentic AI for Science: Hypothesis Generation, Comprehension, Quantification, and Validation (eds Koutra, D. et al.) (2025).

Tang, X. et al. CellForge: agentic design of virtual cell models. Preprint at arXiv https://doi.org/10.48550/arXiv.2508.02276 (2025).

Turcan, A., Huang, K., Li, L. & Zhang, M. J. TusoAI: agentic optimization for scientific methods. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.23986 (2025).

Huang, K. et al. Biomni: a general-purpose biomedical AI agent. Preprint at bioRxiv https://doi.org/10.1101/2025.05.30.656746 (2025).

Lu, C. et al. The AI Scientist: towards fully automated open-ended scientific discovery. Preprint at arXiv https://doi.org/10.48550/arXiv.2408.06292 (2024).

Yamada, Y. et al. The AI Scientist-v2: workshop-level automated scientific discovery via agentic tree search. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.08066 (2025).

Ferrag, M. A., Tihanyi, N. & Debbah, M. From LLM reasoning to autonomous AI agents: a comprehensive review. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.19678 (2025).

Yehudai, A. et al. Survey on evaluation of LLM-based agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.16416 (2025).

Geva, M. et al. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Trans. Assoc. Comput. Linguist. 9, 346–361 (2021).

Chan, J. S. et al. MLE-bench: evaluating machine learning agents on machine learning engineering. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 50466–50494 (2025).

Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022).

Jimenez, C. E. et al. SWE-bench: can language models resolve real-world Github issues? In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 54107–54157 (2024).

Chen, Z. et al. ScienceAgentBench: toward rigorous assessment of language agents for data-driven scientific discovery. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 96934–96990 (2025).

Tian, M. et al. SciCode: a research coding benchmark curated by scientists. In Proceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track Vol. 111 (eds Globerson, A. et al.) 30624–30650 (Curran Associates, 2024).

Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res. https://openreview.net/pdf?id=uyTL5Bvosj (2023).

Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).

Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning Vol. 174 (eds Flores, G. et al.) 248–260 (PMLR, 2022).

Lou, R. et al. AAAR-1.0: assessing AI’s potential to assist research. In Proceedings of the 42nd International Conference on Machine Learning Vol. 267 (eds Singh, A. et al.) 40361–40383 (PMLR, 2025).

Wadden, D. et al. Fact or fiction: verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 7534–7550 (Association for Computational Linguistics, 2020).

Laurent, J. M. et al. LAB-Bench: measuring capabilities of language models for biology research. Preprint at arXiv https://doi.org/10.48550/arXiv.2407.10362 (2024).

Bragg, J. et al. AstaBench: rigorous benchmarking of AI agents with a scientific research suite. Preprint at arXiv https://doi.org/10.48550/arXiv.2510.21652 (2025).

Akhtar, M. et al. Croissant: a metadata format for ML-ready datasets. In Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning (eds Hulsebos, M., Interlandi, M. & Shankar, S.) 1–6 (Association for Computing Machinery, 2024).

Holmes, J. H. et al. Why is the electronic health record so challenging for research and clinical care? Methods Inf. Med. 60, 32–48 (2021).

Chen, Y. & Esmaeilzadeh, P. Generative AI in medical practice: in-depth exploration of privacy and security challenges. J. Med. Internet Res. 26, e53008 (2024).

European Commission. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). https://gdpr-info.eu/ (2016).

U.S. Congress. Health Insurance Portability and Accountability Act of 1996 42 U.S.C. 201 note. https://www.congress.gov/bill/104th-congress/house-bill/3103 (1996).

Science and Technology Policy Office. Blueprint for an AI bill of rights: making automated systems work for the American people. https://www.govinfo.gov/app/details/GOVPUB-PREX23-PURL-gpo193638 (2022).

Das, B. C., Amini, M. H. & Wu, Y. Security and privacy challenges of large language models: a survey. ACM Comput. Surv. 57, 152:1–152:39 (2025).

Chen, Z., Xiang, Z., Xiao, C., Song, D. & Li, B. AgentPoison: red-teaming LLM agents via poisoning memory or knowledge bases. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 130185–130213 (Curran Associates, 2024).

Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).

Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).

Husom, E. J., Goknil, A., Shar, L. K. & Sen, S. The price of prompting: profiling energy use in large language models inference. Preprint at arXiv https://doi.org/10.48550/arXiv.2407.16893 (2024).

Maliakel, P. J., Ilager, S. & Brandic, I. Investigating energy efficiency and performance trade-offs in LLM inference across tasks and DVFS settings. Preprint at arXiv https://doi.org/10.48550/arXiv.2501.08219 (2025).

Jiang, P., Sonne, C., Li, W., You, F. & You, S. Preventing the immense increase in the life-cycle energy and carbon footprints of LLM-powered intelligent chatbots. Engineering 40, 202–210 (2024).

Li, P., Yang, J., Islam, M. A. & Ren, S. Making AI less ‘thirsty’. Commun. ACM 68, 54–61 (2025).

Zhang, H., Ning, A., Prabhakar, R. B. & Wentzlaff, D. LLMCompass: enabling efficient hardware design for large language model inference. In Proceedings of the 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) (eds Vega, A. et al.) 1080–1096 (IEEE, 2024).

Barocas, S., Hardt, M. & Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities (The MIT Press, 2023).

Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit. Med. 8, 149 (2025).

Chen, R. J. et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. 7, 719–742 (2023).

Omar, M. et al. Sociodemographic biases in medical decision making by large language models. Nat. Med. 31, 1873–1881 (2025).

OECD. Health Data Governance for the Digital Age: Implementing the OECD Recommendation on Health Data Governance (OECD Publishing, 2022).

Zhang, C. et al. A survey on federated learning. Knowl.-Based Syst. 216, 106775 (2021).

Li, R., Romano, J. D., Chen, Y. & Moore, J. H. Centralized and federated models for the analysis of clinical data. Annu. Rev. Biomed. Data Sci. 7, 179–199 (2024).

Pan, M. Z. et al. Why do multiagent systems fail? In Proceedings of the ICLR 2025 Workshop on Building Trust in Language Models and Applications (eds Goldblum, M. et al.) (2025).

Matsumoto, N. et al. ESCARGOT: an AI agent leveraging large language models, dynamic graph of thoughts, and biomedical knowledge graphs for enhanced reasoning. Bioinformatics 41, btaf031 (2025).

Romano, J. D. et al. The Alzheimer’s Knowledge Base: a knowledge graph for Alzheimer disease research. J. Med. Internet Res. 26, e46777 (2024).

Lobentanzer, S. et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 43, 166–169 (2025).

Lobentanzer, S. et al. Democratizing knowledge representation with BioCypher. Nat. Biotechnol. 41, 1056–1059 (2023).

Zhou, J. et al. Large language models in biomedicine and healthcare. NPJ Artif. Intell. 1, 44 (2025).

Gulcehre, C. et al. Reinforced Self-Training (ReST) for language modeling. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.08998 (2023).

Gabriel, I., Keeling, G., Manzini, A. & Evans, J. We need a new ethics for a world of AI agents. Nature 644, 38–40 (2025).

Lee, H.-P. (Hank) et al. The impact of generative AI on critical thinking: self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (eds Yamashita, N. et al.) 1–22 (Association for Computing Machinery, 2025).

Del Rio-Chanona, R. M., Ernst, E., Merola, R., Samaan, D. & Teutloff, O. AI and jobs. A review of theory, estimates, and evidence. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.15265 (2025).

Becker, J., Rush, N., Barnes, E. & Rein, D. Measuring the impact of early-2025 AI on experienced open-source developer productivity. Preprint at arXiv https://doi.org/10.48550/arXiv.2507.09089 (2025).

SIMA Team et al. Scaling instructable agents across many simulated worlds. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.10179 (2024).

Gao, S. et al. Democratizing AI scientists using ToolUniverse. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.23426 (2025).

Qu, Y. et al. CRISPR-GPT for agentic automation of gene-editing experiments. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-025-01463-z (2025).

Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).

Wang, H. et al. SpatialAgent: an autonomous AI agent for spatial biology. Preprint at bioRxiv https://doi.org/10.1101/2025.04.03.646459 (2025).

Ghafarollahi, A. & Buehler, M. J. ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Digit. Discov. 3, 1389–1409 (2024).

Yuksekgonul, M. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025).

Yang, Y. et al. TwinMarket: a scalable behavioral and social simulation for financial markets. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling (eds Yang, M. et al.) (2025).

Hu, S., Lu, C. & Clune, J. Automated design of agentic systems. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 21344–21377 (2025).

Yuan, S. et al. EvoAgent: towards automatic multi-agent generation via evolutionary algorithms. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Chiruzzo, L., Ritter, A. & Wang, L.) Vol. 1, 6192–6217 (Association for Computational Linguistics, 2025).

Gao, S. et al. Empowering biomedical discovery with AI agents. Cell 187, 6125–6151 (2024).

Ahdritz, G. et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat. Methods 21, 1514–1524 (2024).
