This work advances the diagnostic capabilities of the multimodal AMIE system4,5 by integrating multimodal perception, enabling it to conduct diagnostic conversations that are more clinically realistic and that incorporate various forms of medical data beyond text.

Our approach centers on a state-aware dialogue phase transition framework, which leverages the multimodal reasoning capabilities of Gemini 2.0 Flash29. This framework guides multimodal AMIE through structured phases of history-taking, diagnosis and management, and follow-up, dynamically adapting the conversation based on intermediate model outputs that reflect the evolving patient state and diagnostic hypotheses. To rigorously evaluate the system’s performance during model development and in comparison to human clinicians, we employed a two-pronged evaluation strategy. First, we developed an automated evaluation pipeline involving perception tests on isolated medical artifacts and simulated dialogues assessed by an auto-rater across key clinical dimensions, such as diagnostic accuracy and information gathering. Second, we conducted an expert evaluation using an OSCE-style methodology to assess the capabilities of multimodal AMIE in realistic, simulated patient encounters involving multimodal data.

Multimodal state-aware reasoning

Real clinical diagnostic dialogues follow a structured yet flexible path. Clinicians methodically gather information, form potential diagnoses, strategically request and interpret further details (including multimodal data such as skin photographs or ECGs), continually update their assessment based on new evidence and eventually formulate a management plan. This process requires a clinician to adapt the line of questioning based on evolving hypotheses and uncertainty while ensuring that all critical information is considered30.

Given the rapid advancements in LLM capabilities and their increasing proficiency in following complex instructions, one might achieve considerable progress toward emulating this process using a sophisticated system prompt alone. However, we hypothesize that, for a safety-critical and highly dynamic task like multimodal diagnosis, building an explicit state-aware reasoning system layered on top of the LLM offers critical advantages over relying solely on complex prompting. Such a system provides greater control over the dialogue flow, enables more reliable tracking of the diagnostic state and uncertainty, facilitates more deliberate integration of multimodal inputs and ultimately leads to higher-quality, more dependable clinical reasoning (validated in the ‘Automated evaluations of multimodal AMIE configurations’ section).

Therefore, the multimodal AMIE system implements this state-aware dialogue phase transition framework to manage the diagnostic process. This framework dynamically controls the progression of multimodal AMIE through three distinct phases: (1) history-taking, (2) diagnosis and management and (3) follow-up. Transitions between phases, and actions within each phase, are driven by intermediate model outputs representing the evolving patient state and diagnostic hypotheses (Fig. 6). Crucially, each phase builds upon the accumulated dialogue history, which contextually incorporates various forms of patient data, including text, images (such as skin photographs or ECG tracings) and clinical documents (such as laboratory reports or prior consultation notes). This state-aware approach allows multimodal AMIE to emulate the structured yet adaptive reasoning process of clinicians (see examples in Supplementary Figs. 5 and 6).
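The three phases and the state-driven transitions between them can be sketched as a simple state machine. This is a minimal illustration only: the phase names follow the text, but the function and flag names are ours, and in the actual system the transition decisions are made by decision modules querying Gemini 2.0 Flash rather than by boolean flags.

```python
from enum import Enum, auto

class Phase(Enum):
    HISTORY_TAKING = auto()
    DIAGNOSIS_AND_MANAGEMENT = auto()
    FOLLOW_UP = auto()

def next_phase(phase: Phase, state: dict) -> Phase:
    """Advance the dialogue phase based on intermediate model outputs.

    `state` stands in for the outputs of the decision modules, e.g. whether
    sufficient history has been gathered or the management plan delivered.
    """
    if phase is Phase.HISTORY_TAKING and state.get("history_sufficient"):
        return Phase.DIAGNOSIS_AND_MANAGEMENT
    if phase is Phase.DIAGNOSIS_AND_MANAGEMENT and state.get("plan_delivered"):
        return Phase.FOLLOW_UP
    return phase  # otherwise remain in the current phase
```

Each turn, the accumulated dialogue history (text, images, documents) is re-examined, so a transition fires only when the intermediate outputs indicate readiness.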

Phase 1: history-taking—building a comprehensive picture

  1. Patient profile initialization: A structured patient profile is initialized and dynamically updated throughout the interaction. This profile acts as a condensed, evolving record of known patient information, encompassing the chief complaint; history of present illness; demographics (age, sex and race); positive and negative symptoms; past medical, family and social/travel histories; medications; other relevant details; and a prioritized list of knowledge gaps. Initially, this profile may contain minimal information.

  2. Evolving DDx generation: An internal, evolving DDx is generated. This DDx is not initially presented to the patient. DDx generation begins after an initial interaction period, allowing baseline information collection. The frequency of DDx updates is configurable.

  3. Continuation decision: A key decision point is whether to continue gathering history or transition to presenting a diagnosis. This decision uses a set of criteria, including whether sufficient information exists to formulate a reasonable differential diagnosis. A decision module, querying Gemini 2.0 Flash, determines whether current information is sufficient to proceed or whether more targeted questions are needed. This module considers the dialogue history and preliminary DDx.

  4. Targeted question and multimodal data request generation: If history-taking continues, the system generates focused questions to address information gaps identified in the patient profile and DDx uncertainty. Note that this internal measure of uncertainty, derived from the model’s iterative DDx generation, is used as a pragmatic heuristic to guide the history-taking dialogue toward more relevant lines of questioning. It is not presented to the user, nor is it a formally calibrated diagnostic probability intended for direct clinical interpretation. Crucially, the system is designed to recognize when multimodal data are necessary and to strategically request them. For instance, based on reported symptoms such as a rash, multimodal AMIE will prompt the user to upload skin photographs. Its reasoning extends to requesting additional views if needed (for example, ‘Could you please provide a photo from a different angle or in better lighting?’ or ‘Do you have photos showing how the rash looked previously?’). Similarly, for reported cardiac symptoms, it might request an ECG tracing if available. Upon receiving an artifact, multimodal AMIE elicits detailed descriptions adhering to instructions for interpreting the artifacts and their key aspects for determining salient findings (for example, for skin: lesion morphology, distribution and color; for ECGs: heart rate, rhythm, key waveforms and intervals). This explicit prompting for descriptive details ensures that the model extracts salient features from the artifact to inform the ongoing conversation and diagnostic reasoning. The generated question or request is presented to the patient.

  5. Iterative refinement: The process is iterative. New information from patient responses and data uploads is incorporated into the dialogue history, updating the patient profile and internal DDx. The patient summary is periodically refreshed to reflect the current understanding.

Once the continuation decision indicates readiness, the system transitions to phase 2.
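As a rough sketch, the evolving patient profile described in step 1 can be represented as a structured record that accumulates newly elicited information without duplicating it. All field and method names here are illustrative, not the system’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PatientProfile:
    """Condensed, evolving record of known patient information (fields per the text)."""
    chief_complaint: str = ""
    history_of_present_illness: str = ""
    demographics: dict = field(default_factory=dict)      # e.g. age, sex, race
    positive_symptoms: list = field(default_factory=list)
    negative_symptoms: list = field(default_factory=list)
    past_medical_history: list = field(default_factory=list)
    family_history: list = field(default_factory=list)
    social_travel_history: list = field(default_factory=list)
    medications: list = field(default_factory=list)
    knowledge_gaps: list = field(default_factory=list)    # prioritized open questions

    def update(self, **new_info):
        """Merge newly elicited information into the profile, deduplicating lists."""
        for name, value in new_info.items():
            current = getattr(self, name)
            if isinstance(current, list):
                current.extend(v for v in value if v not in current)
            elif isinstance(current, dict):
                current.update(value)
            else:
                setattr(self, name, value)
```

In the actual system, the profile is refreshed by the model from the dialogue history; this sketch only shows the shape of the state being tracked.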

Phase 2: diagnosis and management—from data to actionable plan

  1. Patient profile (optional update): The patient profile may be further refined.

  2. DDx validation (subphase): The system enters a DDx validation phase. Focused questions are generated to gather specific evidence that supports or refutes potential diagnoses within the internal DDx. A decision module determines when the DDx is sufficiently validated to present to the patient.

  3. DDx presentation: The system presents a ranked DDx (5–10 conditions). Crucially, the explanation for each diagnosis is grounded in evidence from the entire interaction, explicitly referencing and explaining findings from the provided multimodal data (for example, ‘Based on the photo you sent, the circular shape and central clearing of the rash are characteristic of…,’ or ‘The ECG shows specific changes in the ST segment which support the possibility of…’).

  4. Management plan formulation: After presenting the DDx, the system formulates a management plan based on dialogue history, patient profile and the presented DDx. The plan includes recommendations for investigations, testing and/or treatment.

  5. Management plan delivery: The system delivers the management plan, potentially iteratively across several turns, allowing for clarification and addressing patient concerns.

The presentation of the refined DDx and management plan signals the transition to phase 3.

Phase 3: answer follow-up questions—ensuring patient understanding

  1. Patient profile (optional update): The patient profile may continue to be updated with information from follow-up questions.

  2. Plan communication and question answering: The system addresses remaining patient questions, ensuring that the patient understands the proposed management plan, potentially referencing the multimodal artifacts again to clarify points. Responses are guided by the dialogue history, patient profile, presented DDx and management plan.

  3. Dialogue termination: Dialogue continues until patient questions are addressed, the management plan is communicated and a natural conclusion is reached.

Upon reaching this termination state, the system proceeds to synthesize the interaction into a structured summary.

After dialogue conclusion: structured post-questionnaire generation

Once the dialogue reaches a natural conclusion (phase 3 completion), the system automatically generates a structured post-questionnaire. This process leverages the full multimodal dialogue history and the final internal state (patient profile, final DDx and management plan) to produce a comprehensive summary suitable for clinical review. Specifically, multimodal AMIE is prompted to:

  1. Finalize DDx: Based on the entire interaction, including all textual exchanges and interpretations of multimodal artifacts, generate a final, ranked DDx listing the most probable condition and several plausible alternatives.

  2. Formulate the management plan: Leveraging both the established DDx and a constrained web search process for grounding in current medical knowledge and guidelines, detail the recommended management plan. This includes proposed in-visit and ordered investigations, specific actions or recommendations for the patient and necessary escalation level (for example, video call or in-person visit) with justification and follow-up requirements (necessity, timeframe and reason). This retrieval-based approach is used exclusively during this offline step to ensure that recommendations are informed by up-to-date information. The specific methodology is detailed in Supplementary Section 5.

  3. Extract salient artifact findings: Identify and list the key clinical findings observed in any provided images (skin photographs, ECGs and documents) that were relevant to the diagnostic and management reasoning.

The structured post-questionnaire serves as a standardized record of multimodal AMIE’s clinical assessment, reasoning and recommendations based on the completed multimodal consultation; it is used both as an artifact rated by clinicians in the OSCE evaluation and as the standardized output scored by the auto-rater in simulated dialogues.
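The three outputs above can be assembled into a single structured record, sketched below. Field names are illustrative; the actual schema is defined by the clinician and auto-rater rubrics.

```python
def build_post_questionnaire(final_ddx, management_plan, artifact_findings):
    """Assemble the structured post-questionnaire from the final internal state.

    `final_ddx` is a ranked list (most probable first), `management_plan`
    a structured plan and `artifact_findings` the salient image findings.
    """
    return {
        "ddx": list(final_ddx)[:10],          # ranked, up to 10 conditions
        "management_plan": management_plan,   # investigations, actions, escalation, follow-up
        "salient_artifact_findings": list(artifact_findings),
    }
```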

Automatic evaluations and system validation

We established an automated evaluation framework to support rapid iteration and rigorous assessment of the multimodal AMIE system. This included evaluating perception capabilities on isolated medical artifacts and using a simulation environment with auto-raters to assess complete diagnostic multimodal dialogues, perform component ablations and guide model selection.

Validation of base model perception of medical artifacts

A critical precursor to effective multimodal diagnostic conversation is ensuring that our underlying models possess a fundamental capability: accurate perception of diverse medical artifacts. Although LLMs have demonstrated remarkable progress in understanding and generating text, their ability to reliably ‘see’ and interpret medical images and documents—akin to a clinician’s initial visual assessment—remains less explored in the context of conversational DDx. Without robust perceptual grounding, even the most sophisticated conversational framework would be limited in its ability to meaningfully integrate multimodal data into history-taking and diagnostic reasoning.

Therefore, we first conducted a suite of perception tests designed to isolate and evaluate the visual understanding of our base models when presented with common medical artifacts: smartphone-captured skin images, ECG tracings and clinical documents. The objective was not to achieve state-of-the-art performance on complex diagnostic tasks per se but, rather, to establish a baseline confidence in whether our models could reliably discern key visual features and clinical information from these modalities in isolation.

For skin images, we used the SCIN dataset31 and PAD-UFES-20 (ref. 32), assessing the ability of the model to describe lesion morphology, color and distribution. For ECGs, we employed PTB-XL33 and the ECG-QA benchmark34, prompting the model to provide probable diagnoses or answer expert-validated questions based on ECG images generated from raw signals. Finally, for clinical documents, we created the ClinicalDoc-QA dataset, consisting of question-answering tasks based on a large collection of deidentified clinical notes and patient records generated by physicians, to evaluate the model’s comprehension of medical information from clinical documents. Additional dataset details can be found in Supplementary Section 3.1.

To ensure accurate processing of artifacts, we verified the perceptual capabilities of our base model, Gemini 2.0 Flash. Overall, this verification (detailed in Extended Data Fig. 5 and Supplementary Section 3.1.2) demonstrated that Gemini 2.0 Flash possessed the necessary foundational perceptual grounding across our target modalities, providing the confidence required to build multimodal AMIE’s advanced reasoning and dialogue functions upon this base model. This foundation is quantitatively validated by our ablation studies (Fig. 5) and ultimately demonstrated by the superior diagnostic accuracy of multimodal AMIE in the human-evaluated OSCE study (Fig. 2). The outcomes of these perception tests, presented in the ‘Automated evaluations of multimodal AMIE configurations’ section and Supplementary Section 3.1.2, provide an essential confidence check on our base LLM and also enable us to explore critical questions such as the impact of history-taking in addition to multimodal perception.

Auto-rating simulated diagnostic conversations

The following subsections detail our three-step approach, illustrated in Fig. 4, to creating a simulation environment for auto-evaluation. Step (1) generates patient profiles and scenarios; step (2) uses them for turn-by-turn multimodal dialogue generation that mimics real-world telemedicine interactions; and step (3) passes the simulated dialogue to the auto-rater for automated assessment of multimodal AMIE’s conversational abilities.

Step 1: Patient profile and scenario generation

Generating realistic synthetic dialogues requires detailed patient profiles and corresponding clinical scenarios. Our methodology involves compiling comprehensive patient metadata, including symptoms, demographics and medical history, tailored to specific medical domains.

  • Dermatology: We used the SCIN and PAD-UFES-20 datasets, which provide rich, preexisting patient attributes (for example, age, sex, symptoms, skin tone and medical and social history), requiring no synthetic imputation.

  • Cardiology: We started with the PTB-XL ECG dataset. As it includes only age and sex, we employed Gemini 2.0 Flash, using its web search tool, to impute clinically plausible symptoms, social/family/medical history and cardiovascular risk factors associated with the ECG’s indicated condition. This grounded the synthetic profiles in the model’s medical knowledge and real-world examples from the web.

  • Clinical documents: For this domain, we collaborated with the same external clinical partners who developed our OSCE scenarios. They created realistic patient profiles, document artifacts and corresponding metadata specifically for our simulation needs.

Once comprehensive metadata were established for all domains, we used Gemini 2.0 Flash to generate detailed clinical scenarios. These scenarios provide context for the patient agent in the simulation, outlining the initial presentation, details to share only upon questioning, patient expectations and concerns, desired outcomes and potential questions for the doctor. We guided Gemini using few-shot examples, ensuring scenario diversity (for example, varied ethnicities and occupations) and clinical realism. Although the raw images are from public datasets, we augmented them with unique, synthetically generated scenarios. We designed this approach to mitigate potential data leakage by ensuring that the model encounters these images within novel clinical contexts. Crucially, the scenarios deliberately omit the final diagnosis, requiring the AI agent to deduce it through the simulated conversation and analysis of any provided multimodal artifacts.

Step 2: Turn-by-turn multimodal dialogue generation

We simulate dialogues turn by turn between a doctor agent (representing multimodal AMIE) and a patient agent, mimicking a telemedical interaction.

The doctor agent (multimodal AMIE) is instructed to be empathetic and clinically accurate. It uses the state-aware dialogue phase transition framework (Fig. 6) to navigate history-taking, diagnosis and management and follow-up phases. A key capability is its multimodal nature: it can strategically request and analyze relevant medical artifacts (such as skin images or ECGs) based on patient-reported symptoms during the conversation.

The patient agent simulates a patient strictly following a predefined scenario from step 1, which details their profile and clinical scenario. It is prompted to respond truthfully using concise, casual language appropriate for an online consultation while adhering to specific rules regarding politeness and pacing the release of information to avoid overwhelming the doctor agent.

In each dialogue turn, the doctor agent generates a question or statement based on the conversation history and its internal state. The patient agent then formulates a response according to its scenario instructions. If the scenario dictates and the doctor agent requests it, the patient agent can upload a relevant medical artifact (for example, an image). Multimodal AMIE then analyzes this artifact, integrating the findings into its ongoing diagnostic reasoning and subsequent dialogue turns.

This turn-by-turn exchange continues until the conversation reaches a natural conclusion (for example, patient questions are resolved) or hits a predefined maximum turn limit. After the exchange is concluded, the doctor agent generates the structured post-questionnaire. The output is a complete simulated dialogue transcript, capturing the dynamic exchange of text and multimodal data.
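The turn-by-turn exchange above can be sketched as a simple loop over two agent interfaces. This is a minimal illustration: the `ask`/`reply` method names, the artifact-upload convention and the termination logic are our assumptions, not the actual implementation.

```python
def simulate_dialogue(doctor, patient, max_turns=50):
    """Turn-by-turn doctor-patient simulation (sketch; agent interfaces assumed).

    `doctor.ask(history)` returns the next utterance and a done flag;
    `patient.reply(utterance)` returns text plus an optional artifact upload.
    """
    history = []
    for _ in range(max_turns):
        utterance, done = doctor.ask(history)
        history.append(("doctor", utterance))
        if done:  # natural conclusion reached
            break
        text, artifact = patient.reply(utterance)
        history.append(("patient", text))
        if artifact is not None:
            history.append(("artifact", artifact))  # analyzed by the doctor next turn
    return history
```

The returned transcript interleaves text turns and uploaded artifacts, mirroring the dialogue record described above; post-questionnaire generation would then run on this transcript.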

Step 3: Criteria and scoring with auto-raters

Auto-rating (automated evaluation) is crucial for the iterative development and safety assessment of conversational models. Although human evaluation is valuable, it is often constrained by cost, time and scalability. Auto-raters provide a mechanism for rapid, scalable and consistent performance assessment across essential characteristics.

We employ an auto-rater based on Gemini 2.0 Flash to evaluate the simulated dialogues. The auto-rater, which is given access to the ground truth condition, assesses various aspects, from quantitative metrics such as diagnostic accuracy to qualitative dimensions such as information gathering and safety. The specific criteria and scoring methods used by our auto-rater are detailed below (Extended Data Table 3):

  • DDx accuracy: Measures whether the patient’s ground truth condition is present within the top-1, top-3 or top-10 diagnoses listed in multimodal AMIE’s generated post-questionnaire. Evaluation accounts for synonyms and uses binary scoring (1 if present in top-n, 0 otherwise), averaged across dialogues.

  • Gathering information: Assesses the effectiveness of the model in eliciting information, including its use of open-ended questions, active listening, addressing patient concerns and summarization. Rated on a five-point Likert scale (from 1 = very poor to 5 = excellent).

  • Management plan appropriateness: Evaluates the suitability of the proposed management plan (including recommended actions) relative to best medical practices and the patient’s specific situation. Rated on a five-point Likert scale (from 1 = very poor to 5 = excellent).

  • Hallucination: Detects instances of fabricated information stated by the model (for example, mentioning details not provided by the patient, making incorrect assumptions or claiming access to non-existent data or medications). This safety-critical metric excludes assessments of the diagnostic or management reasoning itself and is evaluated with a binary score (1 = hallucination present, 0 = absent).
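As an illustration of the DDx accuracy metric, a simplified top-n scorer might look as follows. Here synonym matching is approximated with a lookup table, whereas the actual pipeline delegates that judgment to the LLM auto-rater.

```python
def topn_accuracy(ddx_lists, truths, n, synonyms=None):
    """Fraction of dialogues whose ground-truth condition appears in the top-n DDx.

    `synonyms` maps a ground-truth condition to acceptable alternative names;
    matching is case-insensitive. Scoring is binary per dialogue, then averaged.
    """
    synonyms = synonyms or {}
    hits = 0
    for ddx, truth in zip(ddx_lists, truths):
        accepted = {truth.lower()} | {s.lower() for s in synonyms.get(truth, [])}
        if any(d.lower() in accepted for d in ddx[:n]):
            hits += 1  # binary score of 1 for this dialogue
    return hits / len(ddx_lists)
```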

This simulation framework advances beyond previous text-only versions4 by generating multimodal patient scenarios grounded in real-world datasets (SCIN, PTB-XL and Clinical Documents) and simulating realistic patient−clinician interactions involving these multimodal medical data. As detailed in Supplementary Section 3.3, calibration analysis confirms good alignment with human expert judgments (Supplementary Fig. 4), thus enabling automated evaluation of the system’s multimodal reasoning capabilities.

Expert evaluation: multimodal virtual OSCE

To compare multimodal AMIE’s capabilities in undertaking multimodal diagnostic conversations to those of PCPs, we extended the remote OSCE study design introduced by Tu et al.4. The quality of dialogues was assessed using a set of rubrics and metrics reflecting the perspective of patients and specialist physicians. We also introduced a new evaluation rubric specifically to assess the ability to use multimodal medical data effectively in the context of clinical consultations.

The OSCE is a standardized practical assessment widely used in healthcare education to objectively evaluate clinical skills and competencies by simulating real-world practice35,36. Unlike traditional knowledge-based examinations, the OSCE assesses the practical skills required in real-world clinical encounters, typically with candidates rotating through a series of timed stations where they encounter a trained patient-actor portraying a specific clinical scenario. Test-takers perform designated tasks such as taking a medical history, conducting a physical examination, interpreting results or counseling the patient. Examiners observe these interactions and score the test-taker’s performance against detailed, predefined checklists that assess crucial skills such as history-taking, examination technique, clinical reasoning and communication.

In this work, we designed and conducted a virtual analogue of the OSCE adapted for multimodal text chat. In this setting, patient-actors engaged in blinded, synchronous chat conversations with either the multimodal AMIE system or PCPs, conducted through a chat interface (Supplementary Fig. 1) that allowed exchange of both text and uploaded images as is now commonplace for mobile chat applications (see Extended Data Fig. 6 for an overview). Within the virtual consultations, patient-actors were instructed to upload images such as skin photographs, laboratory tests or ECG tracings, emulating how popular text chat platforms have been reportedly used as a means for remote consultation.

Scenario packs

In collaboration with two organizations that routinely perform OSCE assessments in Canada and India, we developed 105 case scenarios. The scenarios were centered around three types of image artifacts that are commonly available to patients in telemedical primary care: (1) smartphone-captured skin images, (2) ECG tracings and (3) clinical documents. We selected these modalities as the most likely to occur in primary telecare: patients can readily photograph their skin, scan documents they received or share ECGs collected via consumer tools22. Extended Data Table 4 provides an overview and examples of image artifacts. Scenarios contained different representations of these image artifacts to represent varying levels of image quality, including photographs of a given skin concern from different angles as well as screenshots and smartphone photographs of ECG tracings and clinical documents.

We selected cases for dermatology from the SCIN dataset31. SCIN contains representative skin images along with rich metadata crowdsourced from real internet users with skin concerns. ECG tracings were taken from PTB-XL33, the largest publicly available dataset for this modality. Clinical documents were crafted by OSCE laboratories for the purpose of this study. Notably, we included normal cases, too. Scenarios were designed such that both requesting and interpreting images as well as taking the patient’s history were required—while either alone was insufficient—in order to form a confident diagnosis. To this end, for both skin photographs and ECG-based scenarios, we selected challenging images with high annotation ambiguity in their diagnosis labels as detailed in Extended Data Table 4. Furthermore, dermatologists, cardiologists and internists (medical specialists in internal medicine) ensured that the accompanying text-based component of scenarios (for example, family/medical history and symptoms) was complementary in the sense that both image and textual information were required to arrive at an accurate diagnosis. Nevertheless, we note that, although the scenarios are consistent with the artifacts and the diagnosis, there is no guarantee that they reflect the true case history, as they were created post hoc. Scenarios were crafted to match case metadata provided in the SCIN and PTB-XL datasets (such as age, sex and ethnicity) when available. For Clinical Documents, we selected conditions for which diagnosis formation can be guided based on the results of laboratory reports.

Lastly, to simulate variability in image quality in real-world care settings, for half of the ECG and clinical document cases, we used smartphone photographs of a computer screen showing the artifacts (Extended Data Table 4), as in previous work37.

Study design

Extended Data Fig. 6 provides an overview of our OSCE study design. For each case scenario, the same patient-actor performed one conversation each with multimodal AMIE and a qualified PCP via a synchronous chat interface (Supplementary Fig. 1), in a blinded and randomized order. (Although we conducted a randomized, blinded evaluation, our study is not a randomized clinical trial.) The chat interface supported text-based communication and sharing of images from the scenarios while displaying the scenario pack information to patient-actors throughout the conversation.

After each consultation, patient-actors completed a questionnaire to rate their experience to represent the patient-actor perspective. Separately, a different questionnaire was completed by both multimodal AMIE (using offline generation) and PCPs to summarize key clinical findings and next steps. In particular, the questionnaire asked for a DDx list (at least three and up to 10 plausible items, ordered by likelihood), a management plan and a description of salient image findings. Lastly, a group of specialist physicians assessed the performance of multimodal AMIE and PCPs in a blinded fashion (Supplementary Fig. 2), based on their consultation transcripts and responses to the post-questionnaires using various rubrics, representing the specialist physician perspective.

Many of the collected patient-centric and specialist metrics are derived from the evaluation rubric introduced in previous work4, given its direct applicability to the assessment of multimodal diagnostic dialogues. These criteria are designed primarily to evaluate the consultation quality, the appropriateness of diagnostic and management decisions, the accuracy of clinical reasoning and various aspects of effective communication skills (such as the ability to elicit information and manage patient concerns). The questions in this rubric were derived from consideration of authoritative assessment schemes for clinical interactions, such as the PACES, used by the Royal College of Physicians in the UK for examining history-taking skills38; the GMCPQ (https://edwebcontent.ed.ac.uk/sites/default/files/imports/fileManager/patient_questionnaire%20pdf_48210488.pdf); and approaches to Patient-Centered Communication Best Practices (PCCBP)39.

The MUH rubric was designed to assess competence in handling and interpreting multimodal artifacts in the context of clinical consultations, including the ability to understand medical image artifacts; to use that understanding to guide the conversation and inform a clinically accurate assessment; and to communicate the salient findings and address the patients’ questions in an appropriate manner. Details are provided in Extended Data Table 1.

Participants

This study involved 19 board-certified PCPs and 25 validated patient-actors, with participants split between India and Canada (10/9 for PCPs and 10/10 for patient-actors for India/Canada, respectively). Informed consent was obtained from each participant before their participation. To ensure consistent and high-quality interactions, all patient-actors and PCPs completed a standardized training prior to participation. This training included an interactive workshop to familiarize them with the chat interface and the specifics of the OSCE interaction format based on detailed instructional guides. The PCPs had a median post-residency experience of 6 years with an interquartile range of 3.5−11.5 years. For quality assessment of the multimodal AMIE/PCP consultations and their post-questionnaire responses, we recruited 18 independent specialist physicians from India and North America across three medical specialties (dermatology, cardiology and internal medicine) to ensure diverse clinical viewpoints as well as requisite expertise. These specialists were independent of the study team and the patient-actor cohort, had a median post-residency experience of 5 years (interquartile range, 4−8) and were assigned evaluation tasks matching the medical specialty of the scenarios (for example, dermatology scenarios were evaluated by dermatologists). For each of 105 scenarios, each assigned patient-actor performed two sessions, one with a PCP and another with multimodal AMIE in a randomized order, yielding a total of 210 consultations. Each of these conversations was evaluated by three independent specialists.

Statistical analysis

We analyzed the results from the remote OSCE study using a mixed-effects approach to account for differences in the scenarios using random effects. Most rating options for patient-actors and specialist physicians were ordinal (for example, ranging from ‘Very unfavorable’ to ‘Very favorable’), and we modeled them using a cumulative ordinal model with a logit link function. For binary ratings (for example, ‘Yes/No’), we used a Bernoulli model with a logit link, and for diagnostic accuracy (‘Correct/Incorrect’) we likewise used a Bernoulli model. In all models, the scenario was used as a random intercept to model differences in the quality of the patient vignettes or the difficulty of the diagnostic task; both of these factors can affect the flow of conversation and should, therefore, be considered when modeling ratings of these conversations. The experimental arm (PCP/multimodal AMIE) was modeled as a fixed effect. For diagnostic accuracy, we also modeled the fixed effect of the differential size (k) using monotonic regression (Supplementary Section 2.2).

Here, clinical scenarios were modeled as random intercepts to allow generalization of findings to the broader population of medical cases rather than restricting inference to the specific vignettes sampled. We explicitly note that this study is a comparative system evaluation, not a randomized controlled trial. Additionally, because clinical utility requires simultaneous success in accuracy, empathy and trust, we treated these as co-equal outcomes adjusted via false discovery rate (FDR) rather than designating a single primary endpoint.

In some analyses, we also estimated preference by comparing ordinal or binary metrics (for example, in Fig. 3c). Here, we modeled the proportion of preference for multimodal AMIE and PCPs using a one-sided χ2 test. For ablation results (for example, in Fig. 5), we tested differences in metrics using Mann–Whitney U-tests. All P values were corrected for FDR using the Benjamini–Hochberg method as implemented in the Python library ‘statsmodels’. These corrections were applied to all statistical estimates of the experimental arm on either patient-actor or expert ratings, both when estimating parametric ordinal models and when estimating preference (Fig. 2). In all figures, we display confidence intervals that show the 2.5th and 97.5th percentiles of bootstrapped distributions of the displayed quantity.
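For illustration, the FDR correction and percentile bootstrap confidence intervals described above can be reproduced with ‘statsmodels’ and NumPy. This is a minimal sketch of those two steps only, not the full mixed-model pipeline.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def fdr_correct(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR correction across a family of P values."""
    reject, p_adj, _, _ = multipletests(p_values, alpha=alpha, method="fdr_bh")
    return reject, p_adj

def bootstrap_ci(values, stat=np.mean, n_boot=10_000, seed=0):
    """Percentile bootstrap CI: 2.5th and 97.5th percentiles of resampled statistics."""
    rng = np.random.default_rng(seed)
    samples = [stat(rng.choice(values, size=len(values), replace=True))
               for _ in range(n_boot)]
    return np.percentile(samples, [2.5, 97.5])
```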

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.