{"id":10693,"date":"2025-04-11T11:57:19","date_gmt":"2025-04-11T11:57:19","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/10693\/"},"modified":"2025-04-11T11:57:19","modified_gmt":"2025-04-11T11:57:19","slug":"towards-conversational-diagnostic-artificial-intelligence","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/10693\/","title":{"rendered":"Towards conversational diagnostic artificial intelligence"},"content":{"rendered":"<p>Real-world datasets for AMIE<\/p>\n<p>AMIE was developed using a diverse suite of real-world datasets, including multiple-choice medical question-answering, expert-curated long-form medical reasoning, electronic health record (EHR) note summaries and large-scale transcribed medical conversation interactions. As described in detail below, in addition to dialogue generation tasks, the training task mixture for AMIE consisted of medical question-answering, reasoning and summarization tasks.<\/p>\n<p>Medical reasoning<\/p>\n<p>We used the MedQA (multiple-choice) dataset, consisting of US Medical Licensing Examination multiple-choice-style open-domain questions with four or five possible answers<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR48\" id=\"ref-link-section-d83486688e1334\" target=\"_blank\" rel=\"noopener\">48<\/a>. The training set consisted of 11,450 questions and the test set had 1,273 questions. 
We also curated 191 MedQA questions from the training set where clinical experts had crafted step-by-step reasoning leading to the correct answer<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 13\" title=\"Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943&#x2013;950 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR13\" id=\"ref-link-section-d83486688e1338\" target=\"_blank\" rel=\"noopener\">13<\/a>.<\/p>\n<p>Long-form medical question-answering<\/p>\n<p>The dataset used here consisted of expert-crafted long-form responses to 64 questions from HealthSearchQA, LiveQA and Medication QA in MultiMedQA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 12\" title=\"Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172&#x2013;180 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR12\" id=\"ref-link-section-d83486688e1350\" target=\"_blank\" rel=\"noopener\">12<\/a>.<\/p>\n<p>Medical summarization<\/p>\n<p>A dataset consisting of 65 clinician-written summaries of medical notes from MIMIC-III, a large, publicly available database containing the medical records of intensive care unit patients<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR49\" id=\"ref-link-section-d83486688e1362\" target=\"_blank\" rel=\"noopener\">49<\/a>, was used as additional training data for AMIE. 
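Taken together, the static question-answering, reasoning and summarization corpora above are small and enumerable. As a rough tally (a sketch only: the dictionary layout and names are illustrative, and the 191 reasoning examples are themselves drawn from the MedQA training set):

```python
# Illustrative tally of the static (non-dialogue) datasets described above.
# The structure and field names are assumptions for exposition only.
STATIC_MIXTURE = {
    "medqa_multiple_choice": {"train": 11450, "test": 1273},
    "expert_reasoning": {"train": 191},   # step-by-step answers curated from MedQA
    "long_form_qa": {"train": 64},        # HealthSearchQA / LiveQA / Medication QA
    "note_summarization": {"train": 65},  # clinician-written MIMIC-III summaries
}

def total_training_examples(mixture):
    """Sum the training-split sizes across all static tasks."""
    return sum(task["train"] for task in mixture.values())

print(total_training_examples(STATIC_MIXTURE))  # 11770
```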
MIMIC-III contains approximately two million notes spanning 13 types, including cardiology, respiratory, radiology, physician, general, discharge, case management, consult, nursing, pharmacy, nutrition, rehabilitation and social work. Five notes from each category were selected, with a minimum total length of 400 tokens and at least one nursing note per patient. Clinicians were instructed to write abstractive summaries of individual medical notes, capturing key information while also permitting the inclusion of new informative and clarifying phrases and sentences not present in the original note.<\/p>\n<p>Real-world dialogue<\/p>\n<p>Here we used a de-identified dataset licensed from a dialogue research organization, comprising 98,919 audio transcripts of medical conversations during in-person clinical visits from over 1,000 clinicians over a ten-year period in the United States<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Chiu, C.-C. et al. Speech recognition for medical conversations. In Proc. Interspeech (ed. Yegnanarayana, B.) 2972&#x2013;2976 (International Speech Communication Association, 2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR50\" id=\"ref-link-section-d83486688e1375\" target=\"_blank\" rel=\"noopener\">50<\/a>. It covered 51 medical specialties (primary care, rheumatology, haematology, oncology, internal medicine and psychiatry, among others) and 168 medical conditions and visit reasons (type\u00a02 diabetes, rheumatoid arthritis, asthma and depression being among the common conditions). Audio transcripts contained utterances from different speaker roles, such as doctors, patients and nurses. On average, a conversation had 149.8 turns (P0.25\u2009=\u200975.0, P0.75\u2009=\u2009196.0). 
For each conversation, the metadata contained information about patient demographics, reason for the visit (follow-up for pre-existing condition, acute needs, annual exam and more), and diagnosis type (new, existing or other unrelated). Refer to ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Chiu, C.-C. et al. Speech recognition for medical conversations. In Proc. Interspeech (ed. Yegnanarayana, B.) 2972&#x2013;2976 (International Speech Communication Association, 2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR50\" id=\"ref-link-section-d83486688e1387\" target=\"_blank\" rel=\"noopener\">50<\/a> for more details.<\/p>\n<p>For this study, we selected dialogues involving only doctors and patients, but not other roles, such as nurses. During preprocessing, we removed paraverbal annotations, such as \u2018[LAUGHING]\u2019 and \u2018[INAUDIBLE]\u2019, from the transcripts. We then divided the dataset into training (90%) and validation (10%) sets using stratified sampling based on condition categories and reasons for visits, resulting in 89,027 conversations for training and 9,892 for validation.<\/p>\n<p>Simulated learning through self-play<\/p>\n<p>While passively collecting and transcribing real-world dialogues from in-person clinical visits is feasible, two substantial challenges limit its effectiveness in training LLMs for medical conversations: (1) existing real-world data often fail to capture the vast range of medical conditions and scenarios, limiting scalability and comprehensiveness; and (2) the data derived from real-world dialogue transcripts tend to be noisy, containing ambiguous language (including slang, jargon and sarcasm), interruptions, ungrammatical utterances and implicit references. 
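The transcript cleaning and stratified split described earlier can be sketched as follows; the regular expression, the helper names and the use of Python's standard library are illustrative assumptions, not the study's actual pipeline:

```python
import random
import re
from collections import defaultdict

def clean_transcript(text):
    """Crudely strip bracketed paraverbal annotations such as [LAUGHING] or [INAUDIBLE]."""
    return re.sub(r"\s*\[[A-Z ]+\]", "", text)

def stratified_split(dialogues, key, val_frac=0.10, seed=0):
    """Split dialogues into train/validation sets, stratifying on `key`
    (for example, condition category or reason for visit)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for d in dialogues:
        buckets[key(d)].append(d)
    train, val = [], []
    for group in buckets.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# Toy data: 10 dialogues for each of two conditions.
toy = [{"condition": c, "id": i} for i, c in enumerate(["asthma"] * 10 + ["depression"] * 10)]
train, val = stratified_split(toy, key=lambda d: d["condition"])
```

With ten dialogues per condition and a 10% validation fraction, each condition contributes exactly one dialogue to the validation set, mirroring at toy scale the 89,027/9,892 split.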
This, in turn, may have limited AMIE\u2019s knowledge, capabilities and applicability.<\/p>\n<p>To address these limitations, we designed a self-play-based simulated learning environment for diagnostic medical dialogues in a virtual care setting, enabling us to scale AMIE\u2019s knowledge and capabilities across a multitude of medical conditions and contexts. We used this environment to iteratively fine-tune AMIE with an evolving set of simulated dialogues in addition to the static corpus of medical question-answering, reasoning, summarization and real-world dialogue data described above.<\/p>\n<p>This process consisted of two self-play loops:<\/p>\n<ul class=\"u-list-style-bullet\">\n<li>\n<p>An inner self-play loop where AMIE leveraged in-context critic feedback to refine its behaviour on simulated conversations with an AI patient agent.<\/p>\n<\/li>\n<li>\n<p>An outer self-play loop where the set of refined simulated dialogues was incorporated into subsequent fine-tuning iterations. The resulting new version of AMIE could then participate in the inner loop again, creating a continuous learning cycle.<\/p>\n<\/li>\n<\/ul>\n<p>At each iteration of fine-tuning, we produced 11,686 dialogues, stemming from 5,230 different medical conditions selected from three datasets.<\/p>\n<p>At each self-play iteration, four conversations were generated from each of the 613 common conditions, while two conversations were generated from each of the 4,617 less-common conditions randomly chosen from MedicineNet and MalaCards. 
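The per-iteration dialogue budget quoted above follows directly from this sampling scheme, as a quick arithmetic check confirms:

```python
# Dialogue budget implied by the per-condition sampling scheme described above.
common_conditions, per_common = 613, 4
rare_conditions, per_rare = 4617, 2   # randomly chosen from MedicineNet and MalaCards

total_conditions = common_conditions + rare_conditions
dialogues_per_iteration = common_conditions * per_common + rare_conditions * per_rare
print(total_conditions, dialogues_per_iteration)  # 5230 11686
```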
The average simulated dialogue conversation length was 21.28 turns (P0.25\u2009=\u200919.0, P0.75\u2009=\u200925.0).<\/p>\n<p>Simulated dialogues through self-play<\/p>\n<p>To produce high-quality simulated dialogues at scale, we developed a new multi-agent framework that comprised three key components:<\/p>\n<ul class=\"u-list-style-bullet\">\n<li>\n<p>A vignette generator: AMIE leverages web searches to craft unique patient vignettes given a specific medical condition.<\/p>\n<\/li>\n<li>\n<p>A simulated dialogue generator: three LLM agents play the roles of patient agent, doctor agent and moderator, engaging in a turn-by-turn dialogue simulating realistic diagnostic interactions.<\/p>\n<\/li>\n<li>\n<p>A self-play critic: a fourth LLM agent acts as a critic to give feedback to the doctor agent for self-improvement. Notably, AMIE acted as all agents in this framework.<\/p>\n<\/li>\n<\/ul>\n<p>The prompts for each of these steps are listed in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#MOESM1\" target=\"_blank\" rel=\"noopener\">3<\/a>. The vignette generator aimed to create varied and realistic patient scenarios at scale, which could be subsequently used as context for generating simulated doctor\u2013patient dialogues, thereby allowing AMIE to undergo a training process emulating exposure to a greater number of conditions and patient backgrounds. The patient vignette (scenario) included essential background information, such as patient demographics, symptoms, past medical history, past surgical history, past social history and patient questions, as well as an associated diagnosis and management plan.<\/p>\n<p>For a given condition, patient vignettes were constructed using the following process. 
First, we retrieved 60 passages (20 each) on the range of demographics, symptoms and management plans associated with the condition using an internet search engine. To ensure these passages were relevant to the given condition, we used the general-purpose LLM, PaLM 2 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 10\" title=\"Anil, R. et al. PaLM 2 technical report. Preprint at &#010;                https:\/\/arxiv.org\/abs\/2305.10403&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR10\" id=\"ref-link-section-d83486688e1512\" target=\"_blank\" rel=\"noopener\">10<\/a>), to filter them, removing any passages deemed unrelated to the condition. We then prompted AMIE to generate plausible patient vignettes aligned with the demographics, symptoms and management plans retrieved from the filtered passages, providing a one-shot exemplar to enforce a particular vignette format.<\/p>\n<p>Given a patient vignette detailing a specific medical condition, the simulated dialogue generator was designed to simulate a realistic dialogue between a patient and a doctor in an online chat setting where in-person physical examination may not be feasible.<\/p>\n<p>Three specific LLM agents (patient agent, doctor agent and moderator), each played by AMIE, were tasked with communicating with one another to generate the simulated dialogues. Each agent had distinct instructions. The patient agent embodied the individual experiencing the medical condition outlined in the vignette. Their role involved truthfully responding to the doctor agent\u2019s inquiries, as well as raising any additional questions or concerns they may have had. 
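A minimal sketch of this vignette-construction pipeline is given below; `search_passages` and `llm` are hypothetical stand-ins for the web search and the PaLM 2 relevance filter, and the prompt wording is invented for illustration:

```python
# Sketch of the vignette-construction pipeline described above. The
# search_passages and llm callables are hypothetical stand-ins, not real APIs.
FACETS = ("demographics", "symptoms", "management plans")

def build_vignette(condition, search_passages, llm, exemplar):
    # Retrieve 20 passages per facet (60 in total) from a web search.
    passages = []
    for facet in FACETS:
        passages.extend(search_passages(f"{condition} {facet}", k=20))
    # Filter out passages the general-purpose LLM deems unrelated to the condition.
    relevant = [p for p in passages
                if llm(f"Is this passage about {condition}? {p}") == "yes"]
    # One-shot prompt: a single exemplar enforces the vignette format.
    prompt = (f"Example vignette:\n{exemplar}\n\n"
              f"Write a patient vignette for {condition} consistent with:\n"
              + "\n".join(relevant))
    return llm(prompt)

# Toy stand-ins to exercise the pipeline shape:
fake_search = lambda query, k: [f"passage about {query} #{i}" for i in range(k)]
fake_llm = lambda prompt: "yes" if prompt.startswith("Is this passage") else "VIGNETTE"
vignette = build_vignette("type 2 diabetes", fake_search, fake_llm, "example vignette text")
```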
The doctor agent played the role of an empathetic clinician seeking to comprehend the patient\u2019s medical history within the online chat environment<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Sharma, A., Miner, A., Atkins, D. &amp; Althoff, T. A computational approach to understanding empathy expressed in text-based mental health support. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds Webber, B. et al.) 5263&#x2013;5276 (Association for Computational Linguistics, 2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR51\" id=\"ref-link-section-d83486688e1523\" target=\"_blank\" rel=\"noopener\">51<\/a>. Their objective was to formulate questions that could effectively reveal the patient\u2019s symptoms and background, leading to an accurate diagnosis and an effective treatment plan. The moderator continually assessed the ongoing dialogue between the patient agent and doctor agent, determining when the conversation had reached a natural conclusion.<\/p>\n<p>The turn-by-turn dialogue simulation started with the doctor agent initiating the conversation: \u201cDoctor: So, how can I help you today?\u201d. Following this, the patient agent responded, and their answer was incorporated into the ongoing dialogue history. Subsequently, the doctor agent formulated a response based on the updated dialogue history. This response was then appended to the conversation history. 
The conversation progressed until the moderator detected that the dialogue had reached a natural conclusion (that is, the doctor agent had provided a DDx and a treatment plan and had adequately addressed any remaining patient agent questions), or until either agent initiated a farewell.<\/p>\n<p>To ensure high-quality dialogues, we implemented a tailored self-play<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 3\" title=\"Fu, Y., Peng, H., Khot, T. &amp; Lapata, M. Improving language model negotiation with self-play and in-context learning from AI feedback. Preprint at &#010;                https:\/\/arxiv.org\/abs\/2305.10142&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR3\" id=\"ref-link-section-d83486688e1533\" target=\"_blank\" rel=\"noopener\">3<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Aksitov, R. et al. Rest meets ReAct: self-improvement for multi-step reasoning LLM agent. Preprint at &#010;                https:\/\/doi.org\/10.48550\/arXiv.2312.10003&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR52\" id=\"ref-link-section-d83486688e1536\" target=\"_blank\" rel=\"noopener\">52<\/a> framework specifically for the self-improvement of diagnostic conversations. This framework introduced a fourth LLM agent to act as a \u2018critic\u2019, also played by AMIE, which was aware of the ground-truth diagnosis and provided in-context feedback to the doctor agent to enhance its performance in subsequent conversations.<\/p>\n<p>Following the critic\u2019s feedback, the doctor agent incorporated the suggestions to improve its responses in subsequent rounds of dialogue, restarted from scratch with the same patient agent. 
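The turn-by-turn simulation and the critic loop can be sketched as follows; all four agents are reduced to toy callables here (in the real system each role is played by AMIE with detailed prompts), so this shows only the control flow:

```python
# Minimal sketch of the inner self-play loop described above. Patient, doctor,
# moderator and critic are stand-in callables; in the paper, AMIE plays every role.
def simulate_dialogue(doctor, patient, moderator, feedback="", max_turns=50):
    history = ["Doctor: So, how can I help you today?"]
    for _ in range(max_turns):
        history.append("Patient: " + patient(history))
        history.append("Doctor: " + doctor(history, feedback))
        if moderator(history):  # has the dialogue reached a natural conclusion?
            break
    return history

def inner_self_play(doctor, patient, moderator, critic, rounds=2):
    # Re-run the same scenario, feeding the critic's feedback (informed by the
    # ground-truth diagnosis) back to the doctor agent each round.
    feedback, dialogue = "", []
    for _ in range(rounds):
        dialogue = simulate_dialogue(doctor, patient, moderator, feedback)
        feedback = critic(dialogue)
    return dialogue

# Toy agents to exercise the loop:
patient = lambda history: "I have had a cough for two weeks."
doctor = lambda history, feedback: "Any fever?" if feedback else "Tell me more."
moderator = lambda history: len(history) >= 5
critic = lambda dialogue: "Ask specifically about fever."
final = inner_self_play(doctor, patient, moderator, critic)
```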
Notably, the doctor agent retained access to its previous dialogue history in each new round. This self-improvement process was repeated twice to generate the dialogues used for each iteration of fine-tuning. See Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#MOESM1\" target=\"_blank\" rel=\"noopener\">4<\/a> for an example of this self-critique process.<\/p>\n<p>We noted that the simulated dialogues from self-play had significantly fewer conversational turns than those from the real-world data described in the previous section. This difference was expected, given that our self-play mechanism was designed\u2014through instructions to the doctor and moderator agents\u2014to simulate text-based conversations. By contrast, real-world dialogue data were transcribed from in-person encounters. There are fundamental differences in communication styles between text-based and face-to-face conversations. For example, in-person encounters may afford a higher communication bandwidth, including a higher total word count and more \u2018back and forth\u2019 (that is, a greater number of conversational turns) between the physician and the patient. AMIE, however, was designed for focused information gathering by means of a text-chat interface.<\/p>\n<p>Instruction fine-tuning<\/p>\n<p>AMIE, built upon the base LLM PaLM 2 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 10\" title=\"Anil, R. et al. PaLM 2 technical report. 
Preprint at &#010;                https:\/\/arxiv.org\/abs\/2305.10403&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR10\" id=\"ref-link-section-d83486688e1558\" target=\"_blank\" rel=\"noopener\">10<\/a>), was instruction fine-tuned to enhance its capabilities for medical dialogue and reasoning. We refer the reader to the PaLM 2 technical report for more details on the base LLM architecture. Fine-tuning examples were crafted from the evolving simulated dialogue dataset generated by our four-agent procedure, as well as the static datasets. For each task, we designed task-specific instructions specifying the task AMIE was to perform. For dialogue, this meant assuming either the patient or doctor role in the conversation, while for the question-answering and summarization datasets, AMIE was instead instructed to answer medical questions or summarize EHR notes. The first round of fine-tuning from the base LLM used only the static datasets, while subsequent rounds of fine-tuning leveraged the simulated dialogues generated through the self-play inner loop.<\/p>\n<p>For dialogue generation tasks, AMIE was instructed to assume either the doctor or patient role and, given the dialogue up to a certain turn, to predict the next conversational turn. When playing the patient agent, AMIE\u2019s instruction was to reply to the doctor agent\u2019s questions about their symptoms, drawing upon information provided in patient scenarios. These scenarios included patient vignettes for simulated dialogues or metadata, such as demographics, visit reason and diagnosis type, for the real-world dialogue dataset. For each fine-tuning example in the patient role, the corresponding patient scenario was added to AMIE\u2019s context. 
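One plausible way to assemble such a dialogue fine-tuning example is sketched below; the field names and instruction wording are assumptions for exposition, not the actual prompts used:

```python
# Illustrative assembly of a dialogue fine-tuning example: a task-specific
# instruction, the patient scenario (patient role only) and the dialogue so far
# form the input; the next turn is the prediction target. Field names are
# assumptions for exposition, not the paper's actual data format.
def make_dialogue_example(turns, target_idx, role, scenario=None):
    instruction = (f"You are the {role} in a medical conversation. "
                   "Given the dialogue so far, write the next turn.")
    context = ""
    if role == "patient" and scenario is not None:
        context = f"Patient scenario: {scenario}\n"
    prompt = instruction + "\n" + context + "\n".join(turns[:target_idx])
    return {"input": prompt, "target": turns[target_idx]}

turns = ["Doctor: So, how can I help you today?",
         "Patient: I have had headaches for a month.",
         "Doctor: Where is the pain located?"]
example = make_dialogue_example(turns, 1, "patient", scenario="34F, migraine history")
```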
In the doctor agent role, AMIE was instructed to act as an empathetic clinician, interviewing patients about their medical history and symptoms to ultimately arrive at an accurate diagnosis. From each dialogue, we sampled, on average, three turns per role (doctor and patient) as target turns, each to be predicted from the conversation leading up to it. Target turns were randomly sampled from all turns in the dialogue that had a minimum length of 30 characters.<\/p>\n<p>Similarly, for the EHR note summarization task, AMIE was provided with a clinical note and prompted to generate a summary of the note. Medical reasoning\/QA and long-form response generation tasks followed the same set-up as in ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 13\" title=\"Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943&#x2013;950 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR13\" id=\"ref-link-section-d83486688e1568\" target=\"_blank\" rel=\"noopener\">13<\/a>. Notably, all tasks except dialogue generation and long-form response generation incorporated few-shot (1\u20135) exemplars in addition to task-specific instructions for additional context.<\/p>\n<p>Chain-of-reasoning for online inference<\/p>\n<p>To address the core challenge in diagnostic dialogue\u2014acquiring information effectively under uncertainty to enhance diagnostic accuracy and confidence while maintaining positive rapport with the patient\u2014AMIE employed a chain-of-reasoning strategy before generating a response in each dialogue turn. Here \u2018chain-of-reasoning\u2019 refers to a series of sequential model calls, each dependent on the outputs of prior steps. 
Specifically, we used a three-step reasoning process, described as follows:<\/p>\n<ul class=\"u-list-style-bullet\">\n<li>\n<p>Analysing patient information. Given the current conversation history, AMIE was instructed to: (1) summarize the positive and negative symptoms of the patient as well as any relevant medical\/family\/social history and demographic information; (2) produce a current DDx; (3) note missing information needed for a more accurate diagnosis; and (4) assess confidence in the current differential and highlight its urgency.<\/p>\n<\/li>\n<li>\n<p>Formulating response and action. Building upon the conversation history and the output of step 1, AMIE: (1) generated a response to the patient\u2019s last message and formulated further questions to acquire missing information and refine the DDx; and (2) if necessary, recommended immediate action, such as an emergency room visit. If confident in the diagnosis, based on the available information, AMIE presented the differential.<\/p>\n<\/li>\n<li>\n<p>Refining the response. AMIE revised its previous output to meet specific criteria based on the conversation history and outputs from earlier steps. The criteria were primarily related to factuality and formatting of the response (for example, avoid factual inaccuracies on patient facts and unnecessary repetition, show empathy, and display in a clear format).<\/p>\n<\/li>\n<\/ul>\n<p>This chain-of-reasoning strategy enabled AMIE to progressively refine its response conditioned on the current conversation to arrive at an informed and grounded reply.<\/p>\n<p>Evaluation<\/p>\n<p>Prior works developing models for clinical dialogue have focused on metrics, such as the accuracy of note-to-dialogue or dialogue-to-note generations<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Abacha, A. B., Yim, W.-W., Adams, G., Snider, N. &amp; Yetisgen-Yildiz, M. 
Overview of the MEDIQA-chat 2023 shared tasks on the summarization &amp; generation of doctor-patient conversations. In Proc. 5th Clinical Natural Language Processing Workshop (eds Naumann, T. et al.) 503&#x2013;513 (Association for Computational Linguistics, 2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR53\" id=\"ref-link-section-d83486688e1612\" target=\"_blank\" rel=\"noopener\">53<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Ionescu, B. et al. in Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023 Lecture Notes in Computer Science Vol. 14163 (eds Arampatzis, A. et al.) 370&#x2013;396 (Springer, 2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR54\" id=\"ref-link-section-d83486688e1615\" target=\"_blank\" rel=\"noopener\">54<\/a>, or natural language generation metrics, such as BLEU or ROUGE scores that fail to capture the clinical quality of a consultation<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"He, Z. et al. DIALMED: a dataset for dialogue-based medication recommendation. In Proc. 29th International Conference on Computational Linguistics (eds Calzolari, N. et al.) 721&#x2013;733 (International Committee on Computational Linguistics, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR55\" id=\"ref-link-section-d83486688e1619\" target=\"_blank\" rel=\"noopener\">55<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Naseem, U., Bandi, A., Raza, S., Rashid, J. &amp; Chakravarthi, B. R. Incorporating medical knowledge to transformer-based language models for medical dialogue generation. In Proc. 
21st Workshop on Biomedical Language Processing (eds Demner-Fushman, D. et al.) 110&#x2013;115 (Association for Computational Linguistics, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR56\" id=\"ref-link-section-d83486688e1622\" target=\"_blank\" rel=\"noopener\">56<\/a>.<\/p>\n<p>In contrast to these prior works, we sought to anchor our human evaluation in criteria more commonly used for evaluating the quality of physicians\u2019 expertise in history-taking, including their communication skills in consultation. Additionally, we aimed to evaluate conversation quality from the perspective of both the lay participant (the participating patient-actor) and a non-participating professional observer (a physician who was not directly involved in the consultation). We surveyed the literature and interviewed clinicians working as OSCE examiners in Canada and India to identify a minimum set of peer-reviewed published criteria that they considered comprehensively reflected the criteria that are commonly used in evaluating both patient-centred and professional-centred aspects of clinical diagnostic dialogue\u2014that is, identifying the consensus for PCCBP in medical interviews<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 19\" title=\"King, A. &amp; Hoppe, R. B. &#x201C;Best practice&#x201D; for patient-centered communication: a narrative review. J. Grad. Med. Educ. 
5, 385&#x2013;393 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR19\" id=\"ref-link-section-d83486688e1629\" target=\"_blank\" rel=\"noopener\">19<\/a>, the criteria examined for history-taking skills by the Royal College of Physicians in the United Kingdom as part of their PACES (<a href=\"https:\/\/www.mrcpuk.org\/mrcpuk-examinations\/paces\/marksheets\" target=\"_blank\" rel=\"noopener\">https:\/\/www.mrcpuk.org\/mrcpuk-examinations\/paces\/marksheets<\/a>)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 20\" title=\"Dacre, J., Besser, M. &amp; White, P. MRCP(UK) part 2 clinical examination (PACES): a review of the first four examination sessions (June 2001 &#x2013; July 2002). Clin. Med. 3, 452&#x2013;459 (2003).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR20\" id=\"ref-link-section-d83486688e1640\" target=\"_blank\" rel=\"noopener\">20<\/a> and the criteria proposed by the UK GMCPQ (<a href=\"https:\/\/edwebcontent.ed.ac.uk\/sites\/default\/files\/imports\/fileManager\/patient_questionnaire%20pdf_48210488.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/edwebcontent.ed.ac.uk\/sites\/default\/files\/imports\/fileManager\/patient_questionnaire%20pdf_48210488.pdf<\/a>) for doctors seeking patient feedback as part of professional revalidation (<a href=\"https:\/\/www.gmc-uk.org\/registration-and-licensing\/managing-your-registration\/revalidation\/revalidation-resources\" target=\"_blank\" rel=\"noopener\">https:\/\/www.gmc-uk.org\/registration-and-licensing\/managing-your-registration\/revalidation\/revalidation-resources<\/a>).<\/p>\n<p>The resulting evaluation framework enabled assessment from two perspectives\u2014the clinician, and lay participants\u00a0in the dialogues (that is, the patient-actors). 
The framework included the consideration of consultation quality, structure and completeness, and the roles, responsibilities and skills of the interviewer (Extended Data Tables <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Tab1\" target=\"_blank\" rel=\"noopener\">1<\/a>\u2013<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Tab3\" target=\"_blank\" rel=\"noopener\">3<\/a>).<\/p>\n<p>Remote OSCE study design<\/p>\n<p>To compare AMIE\u2019s performance to that of real clinicians, we conducted a randomized crossover study of blinded consultations in the style of a remote OSCE. Our OSCE study involved 20 board-certified PCPs and 20 validated patient-actors, ten each from India and Canada, partaking in online text-based consultations (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Fig6\" target=\"_blank\" rel=\"noopener\">1<\/a>). The PCPs had between 3 and 25 years of post-residency experience (median 7 years). The patient-actors comprised a mix of medical students, residents and nurse practitioners with experience in OSCE participation. We sourced 159 scenario packs from India (75), Canada (70) and the United Kingdom (14).<\/p>\n<p>The scenario packs and simulated patients in our study were prepared by two OSCE laboratories (one each in Canada and India), each affiliated with a medical school and with extensive experience in preparing scenario packs and simulated patients for OSCE examinations. The UK scenario packs were sourced from the samples provided on the Membership of the Royal Colleges of Physicians UK website. Each scenario pack was associated with a ground-truth diagnosis and a set of acceptable diagnoses. 
The scenario packs covered conditions from the following domains: cardiovascular (31), respiratory (32), gastroenterology (33), neurology (32), urology, obstetrics and gynaecology (15), and internal medicine (16). The scenarios are listed in Supplementary Information section\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#MOESM1\" target=\"_blank\" rel=\"noopener\">8<\/a>. The paediatric and psychiatry domains were excluded from this study, as were intensive care and inpatient case management scenarios.<\/p>\n<p>Indian patient-actors played the roles in all India scenario packs and 7 of the 14 UK scenario packs. Canadian patient-actors played the roles in all Canada scenario packs and the other 7 UK scenario packs. This assignment process resulted in 159 distinct simulated patients (that is, scenarios). Below, we use the term \u2018OSCE agent\u2019 to refer to the conversational counterpart interviewing the patient-actor\u2014that is, either the PCP or AMIE. Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#MOESM1\" target=\"_blank\" rel=\"noopener\">1<\/a> summarizes the OSCE assignment information across the three geographical locations. Each of the 159 simulated patients completed the three-step study flow depicted in Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>.<\/p>\n<p>Online text-based consultation<\/p>\n<p>The PCPs and patient-actors were primed with sample scenarios and instructions and, before the study began, participated in pilot consultations to familiarize themselves with the interface and experiment requirements.<\/p>\n<p>For the experiment, each simulated patient completed two online text-based consultations by means of a synchronous text-chat interface (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Fig6\" target=\"_blank\" rel=\"noopener\">1<\/a>), one with a PCP (control) and one with AMIE (intervention). The ordering of the PCP and AMIE was randomized and the patient-actors were not informed as to which they were talking to in each consultation (counterbalanced design to control for any potential order effects). The PCPs were located in the same country as the patient-actors, and were randomly drawn based on availability at the time slot specified for the consultation. The patient-actors role-played the scenario and were instructed to conclude the conversation after no more than 20\u2009minutes. Both OSCE agents were asked (the PCPs through study-specific instructions and AMIE as part of the prompt template) not to reveal their identity, or whether they were human, under any circumstances.<\/p>\n<p>Post-questionnaires<\/p>\n<p>Upon conclusion of the consultation, the patient-actor and OSCE agent each filled in a post-questionnaire in light of the resulting consultation transcript (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Fig6\" target=\"_blank\" rel=\"noopener\">1<\/a>). 
The post-questionnaire for patient-actors consisted of the complete GMCPQ, the PACES components for \u2018Managing patient concerns\u2019 and \u2018Maintaining patient welfare\u2019 (Extended Data Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Tab1\" target=\"_blank\" rel=\"noopener\">1<\/a>) and a checklist representation of the PCCBP category for \u2018Fostering the relationship\u2019 (Extended Data Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Tab2\" target=\"_blank\" rel=\"noopener\">2<\/a>). The responses the patient-actors provided to the post-questionnaire are referred to as \u2018patient-actor ratings\u2019. The post-questionnaire for the OSCE agent asked for a ranked DDx list with a minimum of three and no more than ten conditions, as well as recommendations for escalation to in-person or video-based consultation, investigations, treatments, a management plan and the need for a follow-up.<\/p>\n<p>Specialist physician evaluation<\/p>\n<p>Finally, a pool of 33 specialist physicians from India (18), North America (12) and the United Kingdom (3) evaluated the PCPs and AMIE with respect to the quality of their consultation and their responses to the post-questionnaire. During evaluation, the specialist physicians also had access to the full scenario pack, along with its associated ground-truth differential and additional accepted differentials. All of the data the specialist physicians had access to during evaluation are collectively referred to as \u2018OSCE data\u2019. Specialist physicians were sourced to match the specialties and geographical regions corresponding to the scenario packs included in our study, and had between 1 and 32 years of post-residency experience (median 5 years). 
Each set of OSCE data was evaluated by three specialist physicians randomly assigned to match the specialty and geographical region of the underlying scenario (for example, Canadian pulmonologists evaluated OSCE data from the Canada-sourced respiratory medicine scenario). Each specialist evaluated the OSCE data from both the PCP and AMIE for each given scenario. Evaluations for the PCP and AMIE were conducted by the same set of specialists in a randomized and blinded sequence.<\/p>\n<p>Evaluation criteria included the accuracy, appropriateness and comprehensiveness of the provided DDx list, the appropriateness of recommendations regarding escalation, investigation, treatment, management plan and follow-up (Extended Data Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Tab3\" target=\"_blank\" rel=\"noopener\">3<\/a>) and all PACES (Extended Data Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Tab1\" target=\"_blank\" rel=\"noopener\">1<\/a>) and PCCBP (Extended Data Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#Tab2\" target=\"_blank\" rel=\"noopener\">2<\/a>) rating items. We also asked specialist physicians to highlight confabulations in the consultations and questionnaire responses\u2014that is, text passages that were non-factual or that referred to information not provided in the conversation. Each OSCE scenario pack additionally supplied the specialists with scenario-specific clinical information to assist with rating the clinical quality of the consultation, such as the ideal investigation or management plans, or important aspects of the clinical history that would ideally have been elucidated for the highest quality of consultation possible. 
This follows common practice for OSCE examination instructions, in which scenario-specific clinical information is provided to ensure consistency among examiners, and follows the paradigm demonstrated by Membership of the Royal Colleges of Physicians sample packs. For example, this scenario (<a href=\"https:\/\/www.thefederation.uk\/sites\/default\/files\/Station%202%20Scenario%20Pack%20%2816%29.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/www.thefederation.uk\/sites\/default\/files\/Station%202%20Scenario%20Pack%20%2816%29.pdf<\/a>) informs an examiner that, for a scenario in which the patient-actor has haemoptysis, the appropriate investigations would include a chest X-ray, a high-resolution computed tomography scan of the chest, a bronchoscopy and spirometry, whereas the bronchiectasis treatment options a candidate should be aware of include chest physiotherapy, mucolytics, bronchodilators and antibiotics.<\/p>\n<p>Statistical analysis and reproducibility<\/p>\n<p>We evaluated the top-k accuracy of the DDx lists generated by AMIE and the PCPs across all 159 simulated patients. Top-k accuracy was defined as the percentage of cases where the correct ground-truth diagnosis appeared within the top-k positions of the DDx list. For example, top-3 accuracy is the percentage of cases for which the correct ground-truth diagnosis appeared in the top three diagnosis predictions from AMIE or the PCP. Specifically, a candidate diagnosis was considered a match if the specialist rater marked it as either an exact match with the ground-truth diagnosis, or very close to or closely related to the ground-truth diagnosis (or accepted differential). 
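As a minimal sketch (illustrative only, not the authors' analysis code), the top-k accuracy metric defined above can be computed as follows, assuming each case records the 1-based position of the first specialist-accepted match in the DDx list (`None` when no entry matched); the function and variable names are hypothetical:

```python
# Hypothetical illustration of the top-k accuracy metric described above.
# `match_ranks` holds, per case, the 1-based position of the first diagnosis
# that a specialist rated as matching the ground truth (None = no match).

def top_k_accuracy(match_ranks, k):
    """Percentage of cases whose accepted match appears within the top k."""
    hits = sum(1 for rank in match_ranks if rank is not None and rank <= k)
    return 100.0 * hits / len(match_ranks)

# Toy example: five cases; matches at positions 1, 3 and 7; two cases unmatched.
ranks = [1, 3, None, 7, None]
print(top_k_accuracy(ranks, 1))   # 20.0
print(top_k_accuracy(ranks, 3))   # 40.0
print(top_k_accuracy(ranks, 10))  # 60.0
```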
Each conversation and DDx was evaluated by three specialists, and their majority vote or median rating was used to determine the accuracy and quality ratings, respectively.<\/p>\n<p>The statistical significance of the DDx accuracy was determined using two-sided bootstrap tests<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Horowitz, J. L. in Handbook of Econometrics, Vol. 5 (eds Heckman, J. J. &amp; Leamer, E.) 3159&#x2013;3228 (Elsevier, 2001).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR57\" id=\"ref-link-section-d83486688e1771\" target=\"_blank\" rel=\"noopener\">57<\/a> with 10,000 samples and false discovery rate (FDR) correction<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Benjamini, Y. &amp; Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57, 289&#x2013;300 (1995).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR58\" id=\"ref-link-section-d83486688e1775\" target=\"_blank\" rel=\"noopener\">58<\/a> across all k. The statistical significance of the patient-actor and specialist ratings was determined using two-sided Wilcoxon signed-rank tests<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Woolson, R. F. in Wiley Encyclopedia of Clinical Trials (eds D&#x2019;Agostino, R. B. et al.) 1&#x2013;3 (Wiley, 2007).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR59\" id=\"ref-link-section-d83486688e1782\" target=\"_blank\" rel=\"noopener\">59<\/a>, also with FDR correction. Cases where either agent received \u2018Cannot rate\/Does not apply\u2019 were excluded from the test. 
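The significance machinery described above can be sketched as follows (an illustration under stated assumptions, not the authors' code: the exact paired-resampling scheme is assumed here, with 10,000 bootstrap samples as in the text, and Benjamini-Hochberg adjustment applied across the family of top-k comparisons):

```python
# Illustrative sketch, not the authors' code: a paired two-sided bootstrap
# test over cases and Benjamini-Hochberg FDR adjustment across tests.
# The null-centering scheme (exchangeable agent labels) is an assumption.
import random

def paired_bootstrap_p(a_hits, b_hits, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the accuracy difference between two
    agents rated on the same cases; a_hits/b_hits are 0/1 per case."""
    rng = random.Random(seed)
    n = len(a_hits)
    observed = abs(sum(a_hits) - sum(b_hits)) / n
    extreme = 0
    for _ in range(n_boot):
        diff = 0
        for _ in range(n):
            i = rng.randrange(n)      # resample cases with replacement
            x, y = a_hits[i], b_hits[i]
            if rng.random() < 0.5:    # null hypothesis: agent labels exchangeable
                x, y = y, x
            diff += x - y
        if abs(diff) / n >= observed:
            extreme += 1
    return extreme / n_boot

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls FDR across all tests)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for pos in range(m - 1, -1, -1):  # walk from the largest p to the smallest
        i = order[pos]
        running_min = min(running_min, p_values[i] * m / (pos + 1))
        adjusted[i] = running_min
    return adjusted
```

For the rating items, standard implementations of the same tests are available off the shelf: `scipy.stats.wilcoxon` provides the two-sided Wilcoxon signed-rank test, and `statsmodels.stats.multitest.multipletests(pvals, method='fdr_bh')` performs the same FDR correction.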
All significance results are based on P values after FDR correction.<\/p>\n<p>Additionally, we reiterate that the OSCE scenarios themselves were sourced from three different countries, the patient-actors came from two separate institutions in Canada and India, and the specialist evaluations were rated in triplicate in this study.<\/p>\n<p>Related work<\/p>\n<p>Clinical history-taking and the diagnostic dialogue<\/p>\n<p>History-taking and the clinical interview are widely taught in both medical schools and postgraduate curricula<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Keifenheim, K. E. et al. Teaching history taking to medical students: a systematic review. BMC Med. Educ. 15, 159 (2015).\" href=\"#ref-CR60\" id=\"ref-link-section-d83486688e1805\">60<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Yedidia, M. J. et al. Effect of communications training on medical student performance. JAMA 290, 1157&#x2013;1165 (2003).\" href=\"#ref-CR61\" id=\"ref-link-section-d83486688e1805_1\">61<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Makoul, G. Communication skills education in medical school and beyond. JAMA 289, 93&#x2013;93 (2003).\" href=\"#ref-CR62\" id=\"ref-link-section-d83486688e1805_2\">62<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Tan, X. H. et al. Teaching and assessing communication skills in the postgraduate medical setting: a systematic scoping review. BMC Med. Educ. 21, 483 (2021).\" href=\"#ref-CR63\" id=\"ref-link-section-d83486688e1805_3\">63<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Raper, S. E., Gupta, M., Okusanya, O. &amp; Morris, J. B. 
Improving communication skills: a course for academic medical center surgery residents and faculty. J. Surg. Educ. 72, e202&#x2013;e211 (2015).\" href=\"#ref-CR64\" id=\"ref-link-section-d83486688e1805_4\">64<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 65\" title=\"Von Fragstein, M. et al. UK consensus statement on the content of communication curricula in undergraduate medical education. Med. Educ. 42, 1100&#x2013;1107 (2008).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR65\" id=\"ref-link-section-d83486688e1808\" target=\"_blank\" rel=\"noopener\">65<\/a>. Consensus on physician\u2013patient communication has evolved to embrace patient-centred communication practices, with recommendations that communication in clinical encounters should address six core functions\u2014fostering the relationship, gathering information, providing information, making decisions, responding to emotions and enabling disease- and treatment-related behaviour<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 19\" title=\"King, A. &amp; Hoppe, R. B. &#x201C;Best practice&#x201D; for patient-centered communication: a narrative review. J. Grad. Med. Educ. 5, 385&#x2013;393 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR19\" id=\"ref-link-section-d83486688e1812\" target=\"_blank\" rel=\"noopener\">19<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 66\" title=\"De Haes, H. &amp; Bensing, J. Endpoints in medical communication research, proposing a framework of functions and outcomes. Patient Educ. Couns. 
74, 287&#x2013;294 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR66\" id=\"ref-link-section-d83486688e1815\" target=\"_blank\" rel=\"noopener\">66<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 67\" title=\"Epstein, R. M. &amp; Street Jr, R. L. Patient-Centered Communication in Cancer Care: Promoting Healing and Reducing Suffering (National Cancer Institute, 2007).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR67\" id=\"ref-link-section-d83486688e1818\" target=\"_blank\" rel=\"noopener\">67<\/a>. The specific skills and behaviours for meeting these goals have also been described, taught and assessed<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 19\" title=\"King, A. &amp; Hoppe, R. B. &#x201C;Best practice&#x201D; for patient-centered communication: a narrative review. J. Grad. Med. Educ. 5, 385&#x2013;393 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR19\" id=\"ref-link-section-d83486688e1822\" target=\"_blank\" rel=\"noopener\">19<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 68\" title=\"Schirmer, J. M. et al. Assessing communication competence: a review of current tools. Fam. Med. 37, 184&#x2013;92 (2005).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR68\" id=\"ref-link-section-d83486688e1825\" target=\"_blank\" rel=\"noopener\">68<\/a> using validated tools<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 68\" title=\"Schirmer, J. M. et al. Assessing communication competence: a review of current tools. Fam. Med. 
37, 184&#x2013;92 (2005).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR68\" id=\"ref-link-section-d83486688e1829\" target=\"_blank\" rel=\"noopener\">68<\/a>. Medical conventions consistently cite that certain categories of information should be gathered during a clinical interview, comprising topics such as the presenting complaint, past medical history and medication history, social and family history, and systems review<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 69\" title=\"Nichol, J. R., Sundjaja, J. H. &amp; Nelson, G. Medical History (StatPearls, 2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR69\" id=\"ref-link-section-d83486688e1833\" target=\"_blank\" rel=\"noopener\">69<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 70\" title=\"Denness, C. What are consultation models for? InnovAiT 6, 592&#x2013;599 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR70\" id=\"ref-link-section-d83486688e1836\" target=\"_blank\" rel=\"noopener\">70<\/a>. Clinicians\u2019 ability to meet these goals is commonly assessed using the framework of an OSCE<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 4\" title=\"Sloan, D. A., Donnelly, M. B., Schwartz, R. W. &amp; Strodel, W. E. The objective structured clinical examination. The new gold standard for evaluating postgraduate clinical performance. Ann. Surg. 
222, 735 (1995).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR4\" id=\"ref-link-section-d83486688e1841\" target=\"_blank\" rel=\"noopener\">4<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 5\" title=\"Carraccio, C. &amp; Englander, R. The objective structured clinical examination: a step in the direction of competency-based evaluation. Arch. Pediatr. Adolesc. Med. 154, 736&#x2013;741 (2000).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR5\" id=\"ref-link-section-d83486688e1844\" target=\"_blank\" rel=\"noopener\">5<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 71\" title=\"Epstein, R. M. &amp; Hundert, E. M. Defining and assessing professional competence. JAMA 287, 226&#x2013;235 (2002).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR71\" id=\"ref-link-section-d83486688e1847\" target=\"_blank\" rel=\"noopener\">71<\/a>. Such assessments vary in their reproducibility or implementation, and have even been adapted for remote practice as virtual OSCEs with telemedical scenarios, an issue of particular relevance during the COVID-19 pandemic<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 72\" title=\"Chan, S. C. C., Choa, G., Kelly, J., Maru, D. &amp; Rashid, M. A. Implementation of virtual OSCE in health professions education: a systematic review. Med. Educ. 
57, 833&#x2013;843 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR72\" id=\"ref-link-section-d83486688e1851\" target=\"_blank\" rel=\"noopener\">72<\/a>.<\/p>\n<p>Conversational AI and goal-oriented dialogue<\/p>\n<p>Conversational AI systems for goal-oriented dialogue and task completion have a rich history<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Budzianowski, P. et al. MultiWOZ&#x2013;a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E. et al.) 5016&#x2013;5026 (Association for Computational Linguistics, 2018).\" href=\"#ref-CR73\" id=\"ref-link-section-d83486688e1863\">73<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Wei, W., Le, Q., Dai, A. &amp; Li, J. AirDialogue: an environment for goal-oriented dialogue research. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E. et al.) 3844&#x2013;3854 (Association for Computational Linguistics, 2018).\" href=\"#ref-CR74\" id=\"ref-link-section-d83486688e1863_1\">74<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Lin, J., Tomlin, N., Andreas, J. &amp; Eisner, J. Decision-oriented dialogue for human-AI collaboration. Trans. Assoc. Comput. Linguist. 12, 892&#x2013;911 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR75\" id=\"ref-link-section-d83486688e1866\" target=\"_blank\" rel=\"noopener\">75<\/a>. The emergence of transformers<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 76\" title=\"Vaswani, A. et al. Attention is all you need. 
In Proc. 31st Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 6000&#x2013;6010 (Curran Associates, 2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR76\" id=\"ref-link-section-d83486688e1870\" target=\"_blank\" rel=\"noopener\">76<\/a> and large language models<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 15\" title=\"Thoppilan, R. et al. LaMDA: language models for dialog applications. Preprint at &#010;                https:\/\/arxiv.org\/abs\/2201.08239&#010;                &#010;               (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR15\" id=\"ref-link-section-d83486688e1874\" target=\"_blank\" rel=\"noopener\">15<\/a> have led to renewed interest in this direction. The development of strategies for alignment<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 77\" title=\"Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730&#x2013;27744 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR77\" id=\"ref-link-section-d83486688e1878\" target=\"_blank\" rel=\"noopener\">77<\/a>, self-improvement<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Zhao, J., Khashabi, D., Khot, T., Sabharwal, A. &amp; Chang, K.-W. Ethical-advice taker: do language models understand natural language interventions? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (eds Zong, C. et al.) 
4158&#x2013;4164 (Association for Computational Linguistics, 2021).\" href=\"#ref-CR78\" id=\"ref-link-section-d83486688e1882\">78<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Saunders, W. et al. Self-critiquing models for assisting human evaluators. Preprint at &#10;                https:\/\/arxiv.org\/abs\/2206.05802&#10;                &#10;               (2022).\" href=\"#ref-CR79\" id=\"ref-link-section-d83486688e1882_1\">79<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Scheurer, J. et al. Training language models with language feedback at scale. Preprint at &#10;                https:\/\/arxiv.org\/abs\/2303.16755&#10;                &#10;               (2023).\" href=\"#ref-CR80\" id=\"ref-link-section-d83486688e1882_2\">80<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 81\" title=\"Glaese, A. et al. Improving alignment of dialogue agents via targeted human judgements. Preprint at &#010;                https:\/\/arxiv.org\/abs\/2209.14375&#010;                &#010;               (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR81\" id=\"ref-link-section-d83486688e1885\" target=\"_blank\" rel=\"noopener\">81<\/a> and scalable oversight mechanisms<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 82\" title=\"Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. 
Preprint at &#010;                https:\/\/arxiv.org\/abs\/2212.08073&#010;                &#010;               (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR82\" id=\"ref-link-section-d83486688e1890\" target=\"_blank\" rel=\"noopener\">82<\/a> has enabled the large-scale deployment of such conversational systems in the real world<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 16\" title=\"Introducing ChatGPT. OpenAI &#010;                https:\/\/openai.com\/blog\/chatgpt&#010;                &#010;               (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR16\" id=\"ref-link-section-d83486688e1894\" target=\"_blank\" rel=\"noopener\">16<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 83\" title=\"Askell, A. et al. A general language assistant as a laboratory for alignment. Preprint at &#010;                https:\/\/arxiv.org\/abs\/2112.00861&#010;                &#010;               (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR83\" id=\"ref-link-section-d83486688e1897\" target=\"_blank\" rel=\"noopener\">83<\/a>. 
However, the rigorous evaluation and exploration of conversational and task-completion capabilities of such AI systems remains limited for clinical applications, where studies have largely focused on single-turn interaction use cases, such as question-answering or summarization.<\/p>\n<p>AI for medical consultations and diagnostic dialogue<\/p>\n<p>The majority of explorations of AI as tools for conducting medical consultations have focused on \u2018symptom-checker\u2019 applications rather than a full natural dialogue, or on topics such as the transcription of medical audio or the generation of plausible dialogue, given clinical notes or summaries<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Shor, J. et al. Clinical BERTScore: an improved measure of automatic speech recognition performance in clinical settings. In Proc. 5th Clinical Natural Language Processing Workshop (eds Naumann, T. et al.) 1&#x2013;7 (Association for Computational Linguistics, 2023).\" href=\"#ref-CR84\" id=\"ref-link-section-d83486688e1909\">84<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Abacha, A. B., Agichtein, E., Pinter, Y. &amp; Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. In Proc. 26th Text Retrieval Conference, TREC 2017 (eds Voorhees, E. M. &amp; Ellis, A.) 1&#x2013;12 (National Institute of Standards and Technology and the Defense Advanced Research Projects Agency, 2017).\" href=\"#ref-CR85\" id=\"ref-link-section-d83486688e1909_1\">85<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Wallace, W. et al. The diagnostic and triage accuracy of digital and online symptom checker tools: a systematic review. NPJ Digit. Med. 
5, 118 (2022).\" href=\"#ref-CR86\" id=\"ref-link-section-d83486688e1909_2\">86<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 87\" title=\"Zeltzer, D. et al. Diagnostic accuracy of artificial intelligence in virtual primary care. Mayo Clin. Proc. Digital Health 1, 480&#x2013;489 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR87\" id=\"ref-link-section-d83486688e1912\" target=\"_blank\" rel=\"noopener\">87<\/a>. Language models have been trained using clinical dialogue datasets, but these have not been comprehensively evaluated<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 88\" title=\"Johri, S. et al. Testing the limits of language models: a conversational framework for medical AI assessment. Preprint at medRxiv &#010;                https:\/\/doi.org\/10.1101\/2023.09.12.23295399&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR88\" id=\"ref-link-section-d83486688e1916\" target=\"_blank\" rel=\"noopener\">88<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 89\" title=\"Wu, C.-K., Chen, W.-L. &amp; Chen, H.-H. Large language models perform diagnostic reasoning. Preprint at &#010;                https:\/\/arxiv.org\/abs\/2307.08922&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR89\" id=\"ref-link-section-d83486688e1919\" target=\"_blank\" rel=\"noopener\">89<\/a>. 
Studies have been grounded in messages between doctors and patients in commercial chat platforms (which may have altered doctor\u2013patient engagement compared to 1:1 medical consultations)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"He, Z. et al. DIALMED: a dataset for dialogue-based medication recommendation. In Proc. 29th International Conference on Computational Linguistics (eds Calzolari, N. et al.) 721&#x2013;733 (International Committee on Computational Linguistics, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR55\" id=\"ref-link-section-d83486688e1923\" target=\"_blank\" rel=\"noopener\">55<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 90\" title=\"Zeng, G. et al. MedDialog: large-scale medical dialogue datasets. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 9241&#x2013;9250 (Association for Computational Linguistics, 2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR90\" id=\"ref-link-section-d83486688e1926\" target=\"_blank\" rel=\"noopener\">90<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 91\" title=\"Liu, W. et al. MedDG: an entity-centric medical consultation dataset for entity-aware medical dialogue generation. In Proc. 11th CCF International Conference on Natural Language Processing and Chinese Computing (eds Lu, W. et al.) 447&#x2013;459 (Springer, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR91\" id=\"ref-link-section-d83486688e1929\" target=\"_blank\" rel=\"noopener\">91<\/a>. Many have focused largely on predicting next turns in the recorded exchanges rather than clinically meaningful metrics. 
Also, to date, there have been no reported studies that have examined the quality of AI models for diagnostic dialogue using the same criteria used to examine and train human physicians in dialogue and communication skills, nor studies evaluating AI systems in common frameworks, such as the OSCE.<\/p>\n<p>Evaluation of diagnostic dialogue<\/p>\n<p>Prior frameworks for the human evaluation of AI systems\u2019 performance in diagnostic dialogue have been limited in detail and have not been anchored in established criteria for assessing communication skills and the quality of history-taking. For example, ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Naseem, U., Bandi, A., Raza, S., Rashid, J. &amp; Chakravarthi, B. R. Incorporating medical knowledge to transformer-based language models for medical dialogue generation. In Proc. 21st Workshop on Biomedical Language Processing (eds Demner-Fushman, D. et al.) 110&#x2013;115 (Association for Computational Linguistics, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR56\" id=\"ref-link-section-d83486688e1941\" target=\"_blank\" rel=\"noopener\">56<\/a> reported a five-point scale describing overall \u2018human evaluation\u2019, ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 90\" title=\"Zeng, G. et al. MedDialog: large-scale medical dialogue datasets. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 
9241&#x2013;9250 (Association for Computational Linguistics, 2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR90\" id=\"ref-link-section-d83486688e1945\" target=\"_blank\" rel=\"noopener\">90<\/a>\u00a0reported \u2018relevance, informativeness and human likeness\u2019, and ref\u00a0.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 91\" title=\"Liu, W. et al. MedDG: an entity-centric medical consultation dataset for entity-aware medical dialogue generation. In Proc. 11th CCF International Conference on Natural Language Processing and Chinese Computing (eds Lu, W. et al.) 447&#x2013;459 (Springer, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR91\" id=\"ref-link-section-d83486688e1949\" target=\"_blank\" rel=\"noopener\">91<\/a> reported \u2018fluency, expertise and relevance\u2019,\u00a0whereas other studies have\u00a0reported\u00a0\u2018fluency and adequacy\u2019<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 92\" title=\"Varshney, D., Zafar, A., Behera, N. &amp; Ekbal, A. CDialog: a multi-turn COVID-19 conversation dataset for entity-aware dialog generation. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 11373&#x2013;11385 (Association for Computational Linguistics, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR92\" id=\"ref-link-section-d83486688e1953\" target=\"_blank\" rel=\"noopener\">92<\/a> and\u00a0\u2018fluency and specialty\u2019<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 93\" title=\"Yan, G. et al. ReMeDi: resources for multi-domain, multi-service, medical dialogues. In Proc. 
45th International ACM SIGIR Conference on Research and Development in Information Retrieval 3013&#x2013;3024 (Association for Computing Machinery, 2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR93\" id=\"ref-link-section-d83486688e1957\" target=\"_blank\" rel=\"noopener\">93<\/a>. These criteria are far less comprehensive and specific than those taught and practiced by medical professionals. A multi-agent framework for assessing the conversational capabilities of LLMs was introduced in ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 88\" title=\"Johri, S. et al. Testing the limits of language models: a conversational framework for medical AI assessment. Preprint at medRxiv &#010;                https:\/\/doi.org\/10.1101\/2023.09.12.23295399&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#ref-CR88\" id=\"ref-link-section-d83486688e1962\" target=\"_blank\" rel=\"noopener\">88<\/a>; however, that study was performed in the restricted setting of dermatology, used AI models to emulate both the doctor and patient sides of simulated interactions, and conducted only limited expert evaluation of history-taking, rating it simply as complete or incomplete.<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-08866-7#MOESM2\" target=\"_blank\" rel=\"noopener\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>\n","protected":false}}