{"id":90456,"date":"2025-09-28T09:07:17","date_gmt":"2025-09-28T09:07:17","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/90456\/"},"modified":"2025-09-28T09:07:17","modified_gmt":"2025-09-28T09:07:17","slug":"timer-temporal-instruction-modeling-and-evaluation-for-longitudinal-clinical-records","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/90456\/","title":{"rendered":"TIMER: temporal instruction modeling and evaluation for longitudinal clinical records"},"content":{"rendered":"<p>Study design and data source<\/p>\n<p>Figure <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Fig1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a> summarizes the TIMER method for instruction-tuning and evaluating temporal reasoning capabilities of LLMs on longitudinal EHRs. Our study addresses two primary research questions (RQs):<\/p>\n<ul class=\"u-list-style-bullet\">\n<li>\n<p>RQ1: Can temporally-grounded instruction-response pairs from EHR data improve LLMs\u2019 longitudinal reasoning capabilities compared to conventional medical question-answer pairs?<\/p>\n<\/li>\n<li>\n<p>RQ2: How does the temporal distribution of instructions used in instruction-tuning affect model performance, specifically when we evaluate on varying temporal distributions?<\/p>\n<\/li>\n<\/ul>\n<p>We used de-identified longitudinal EHRs from the Stanford Medicine Research Data Repository (STARR)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 17\" title=\"Datta, S. et al. A new paradigm for accelerating clinical data science at Stanford medicine. 
Preprint at &#010;                  https:\/\/arxiv.org\/abs\/2003.10534&#010;                  &#010;                 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#ref-CR17\" id=\"ref-link-section-d310806268e676\" rel=\"nofollow noopener\" target=\"_blank\">17<\/a>. These records are accessible prior to IRB approval, cover Stanford Health Care (primarily adult care) and Lucile Packard Children\u2019s Hospital, and are formatted in OMOP-CDM. Only data from patients who have previously consented to the research use of their de-identified data via the institutional privacy notice are included in STARR.<\/p>\n<p>RQ1: Impact of temporal-aware instruction tuning<\/p>\n<p>We compared models instruction-tuned with TIMER against both standard medical LLMs and models tuned with conventional medical QA datasets to quantify the specific benefits of temporal awareness in instruction tuning.<\/p>\n<p>Overall performance<\/p>\n<p>We evaluate multiple medical LLMs, including Meditron-7B, MedAlpaca, AlpaCare, MMed-Llama-3-8B, PMC-Llama-13B, MedLM-Medium, and MedInstruct (Conventional QA-tuned Llama-3.1-8B-Instruct), using MedAlign and a model-generated evaluation set, called TIMER-Eval, that requires temporal reasoning. When models cannot process full patient timelines, inputs are truncated to the most recent tokens that fit within their context limits. As shown in Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>, even the strongest medical model baseline achieves just 30.85% correctness and 13.93% completeness in temporal reasoning evaluations. In contrast, models tuned with TIMER consistently outperform baselines across both evaluation sets. TIMER improves Llama-3.1-8B-Instruct\u2019s performance from 30.69% to 34.32% correctness as measured using MedAlign and from 45.02% to 48.51% as measured via a temporal reasoning evaluation. 
Similar improvements are observed with Qwen-2.5-7B-Instruct, indicating that these gains are consistent across different base model architectures. TIMER\u2019s performance gains on MedAlign are particularly significant given the dataset\u2019s temporal characteristics. Despite MedAlign\u2019s extended temporal coverage (median 3,895.1 days) and pronounced recency bias (55.3% of questions in the final 25% of timelines), TIMER-tuned models achieve consistent improvements in both correctness and completeness. This demonstrates TIMER\u2019s ability to effectively utilize the full temporal scope of longitudinal records, even when evaluation questions exhibit temporal distribution misalignment.<\/p>\n<p><b id=\"Tab1\" data-test=\"table-caption\">Table 1 Performance (%) of baseline models and TIMER-tuned models on MedAlign and TIMER-Eval benchmarks, reported as mean\u2009\u00b1\u2009standard deviation from bootstrap resampling (n\u2009=\u200910,000) with 100 samples over the test set<\/b><\/p>\n<p>To illustrate these improvements, Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab7\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a> provides concrete examples of enhanced temporal reasoning. For instance, when asked to \u201cDescribe the trend in the patient\u2019s weight over the past year\u201d, base models incorrectly assess trends from 2+ years prior, while TIMER-tuned models correctly limit analysis to the specified timeframe, demonstrating improved temporal boundary adherence.<\/p>\n<p>Head-to-head comparison<\/p>\n<p>To identify performance improvements beyond rubric-based scoring, we perform head-to-head analyses of different models\u2019 outputs to the same questions. 
We identify in Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab2\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a> that models instruction-tuned with temporal instruction data produce answers that are generally preferred over those of existing medical fine-tuned models, with even the best medical model, MedLM-Medium, being preferred 20% less frequently than TIMER-tuned models on TIMER-Eval questions.<\/p>\n<p><b id=\"Tab2\" data-test=\"table-caption\">Table 2 Head-to-head comparison between various models and TIMER-Instruct<\/b><\/p>\n<p>Additionally, we compare conventional QA-style tuning (MedInstruct) with our temporally grounded instruction-tuning (TIMER Tuning). While instruction-tuning with MedInstruct provided gains over baseline Llama-3.1-8B-Instruct performance, instruction tuning with TIMER provides additional gains of 6.3% on MedAlign and 8.45% on TIMER-Eval. This indicates the value of incorporating temporal structure into instruction tuning data.<\/p>\n<p>RQ2: Effect of the temporal distribution of instructions on model performance<\/p>\n<p>Temporal biases in existing clinical instruction sets<\/p>\n<p>Existing clinical instruction data have pronounced temporal biases. Using our normalized temporal position metric, we found that MedAlign<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 16\" title=\"Fleming, S. L. et al. MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. In M. J. Wooldridge, J. G. Dy, &amp; S. Natarajan (eds.) 
Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada (pp. 22021&#x2013;22030). &#010;                  https:\/\/doi.org\/10.1609\/AAAI.V38I20.30205&#010;                  &#010;                 (AAAI Press, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#ref-CR16\" id=\"ref-link-section-d310806268e1785\" rel=\"nofollow noopener\" target=\"_blank\">16<\/a>, the first clinician-curated collection of clinical instructions, has a pronounced recency bias. Despite spanning an average of 3895 days (~10.7 years), 55.3% of its instructions reference only the final 25% of patient timelines, with 47.0% and 29.5% focused on just the last 15% and 5%, respectively (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Fig2\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>).<\/p>\n<p><b id=\"Fig2\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
2: Distribution of MedAlign instructions across patient timelines.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01965-9\/figures\/2\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig2\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/09\/41746_2025_1965_Fig2_HTML.png\" alt=\"figure 2\" loading=\"lazy\" width=\"685\" height=\"333\"\/><\/a><\/p>\n<p>The majority of human-generated instructions focus on the most recent encounters.<\/p>\n<p>When examining model-generated instructions, we observed a \u201clost-in-the-middle\u201d effect regarding the parts of the patient record in which the instructions were grounded (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Fig3\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a>). These instructions cluster at the beginning (25.9%) and end (52.1%) of patient timelines, while middle periods are relatively underrepresented (22.1%). These distribution biases in both human- and model-generated instructions highlight the need for a more controlled approach to both instruction generation and evaluation.<\/p>\n<p><b id=\"Fig3\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
3: Normalized temporal position of evidence in model-generated instructions reveals edge-focused attention.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01965-9\/figures\/3\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig3\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/09\/41746_2025_1965_Fig3_HTML.png\" alt=\"figure 3\" loading=\"lazy\" width=\"685\" height=\"331\"\/><\/a><\/p>\n<p>Instructions cluster around early (0\u201325%) and late (75\u2013100%) parts of patient timelines.<\/p>\n<p>Temporal duration vs. utilization<\/p>\n<p>An important distinction emerges between the temporal duration available in clinical records and the actual temporal scope utilized in instruction generation. MedAlign\u2019s construction involved clinicians generating instructions independently without examining specific patient records, with these instructions subsequently matched to appropriate EHR data via retrieval. The pronounced recency bias (55.3% of instructions in the final 25% of timelines) reveals that MedAlign reflects the natural recency bias inherent in clinical question formulation when clinicians pose questions independent of specific patient cases. This observation motivates the need for systematic temporal control approaches that ensure evaluation coverage across extended patient timelines.<\/p>\n<p>Development of controlled temporal distribution evaluation<\/p>\n<p>Our analysis of existing temporal biases necessitated a new evaluation approach that could isolate and measure the specific effects of temporal distribution on model performance. 
Unlike existing approaches with inherent limitations (Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab3\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a>), we developed an evaluation method incorporating multi-visit records with explicit time evidence attribution. Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab3\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a> highlights how TIMER-Eval overcomes limitations in prior approaches. MIMIC-Instr<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 15\" title=\"Wu, Z., Dadu, A., Nalls, M., Faghri, F. &amp; Sun, J. Instruction tuning large language models to understand electronic health records. In NeurIPS Datasets and Benchmarks Track &#010;                  https:\/\/openreview.net\/forum?id=Dgy5WVgPd2&#010;                  &#010;                 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#ref-CR15\" id=\"ref-link-section-d310806268e1858\" rel=\"nofollow noopener\" target=\"_blank\">15<\/a> is restricted to single-visit episodes with limited temporal scope (median 7.2 days), while MedAlign\u2019s human curation leads to recency bias. 
Our approach enables precise assessment of temporal reasoning while maintaining scalability through controlled sampling, allowing systematic manipulation of temporal distribution patterns in both instruction tuning and evaluation.<\/p>\n<p><b id=\"Tab3\" data-test=\"table-caption\">Table 3 Comparison of EHR instructional evaluation sets for medical LLMs<\/b><\/p>\n<p>Clinician validation<\/p>\n<p>To ensure the validity of the model-generated evaluation data, three clinicians assessed 100 randomly sampled instruction-response pairs generated with TIMER (Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab4\" rel=\"nofollow noopener\" target=\"_blank\">4<\/a>). The pairs received high scores for clinical relevance (mean 95\/100), temporal reasoning complexity (mean 80\/100), and factual accuracy (mean 98\/100). Inter-rater agreement was high (86% for clinical relevance, 93% for accuracy), with low standard deviations (4.32 and 1.89, respectively). Complexity scoring, being an inherently more qualitative metric, showed increased variability but remained significantly above chance (53% observed agreement vs. 12.5% random chance; std 14.87). Additional examples of where annotators agreed and disagreed on question complexity can be found in Supplementary Note <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a>. Disagreements primarily occurred on questions requiring temporal data retrieval with moderate synthesis, where annotators differed on whether the reasoning depth met clinical complexity standards. 
This variability demonstrates the rigor with which experienced clinicians evaluate temporal reasoning tasks, distinguishing between temporal mechanics and genuine clinical reasoning complexity. These validation results support the claim that our schema generates clinically meaningful evaluation scenarios.<\/p>\n<p><b id=\"Tab4\" data-test=\"table-caption\">Table 4 Clinician evaluation of TIMER evaluation samples<\/b><\/p>\n<p>Effect of instruction distributions on model performance<\/p>\n<p>To understand how temporal distributions affect model performance, we created three distinct distribution patterns: Recency-Focused, Edge-Focused, and Uniformly-Distributed across timelines.<\/p>\n<p>Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab5\" rel=\"nofollow noopener\" target=\"_blank\">5<\/a> demonstrates that across all evaluation patterns, models using distribution-matched training consistently outperform alternative training approaches. The advantage of matched training ranges from +1.20% to +6.50% in head-to-head comparisons. Notably, the largest performance difference appears in the Uniformly-Distributed evaluation setting. For instance, when evaluating on Uniformly-Distributed questions, Full-Timeline training shows a +6.50% advantage over Recent-Events training, highlighting how distribution alignment affects model performance on temporal reasoning tasks. These results indicate that while general alignment between train and test distribution is helpful, it provides the most significant gains in the harder evaluation schema of Uniformly-Distributed instructions.<\/p>\n<p><b id=\"Tab5\" data-test=\"table-caption\">Table 5 Performance comparison of instruction tuning with different temporal distributions<\/b><\/p>\n<p>LLM-judge and human correlation<\/p>\n<p>To scale evaluation, we developed an LLM-based judge and validated it against clinician rankings for MedAlign responses. 
Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab6\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a> shows a strong Spearman correlation between LLM scores and human ranks: \u03c1\u2009=\u2009\u22120.97 (average), \u22120.94 (correctness), and \u22120.89 (completeness). This inverse relationship (a high LLM score corresponds to a low, i.e., better, human rank) supports that the LLM-judge is a reliable proxy for human assessment in temporal reasoning evaluation. We additionally show the LLM Rank for a more direct comparison to human rank, and see general ranking trends being maintained across LLMs.<\/p>\n<p><b id=\"Tab6\" data-test=\"table-caption\">Table 6 Validation of LLM-Judge against established human rankings on MedAlign<\/b><\/p>\n<p>Case studies: temporal reasoning behavior<\/p>\n<p>Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01965-9#Tab7\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a> presents qualitative examples comparing base and TIMER-tuned models. TIMER-tuned models consistently show improved:<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n<p>1. Temporal boundary adherence (e.g., limiting responses to the past year),<\/p>\n<\/li>\n<li>\n<p>2. Trend detection (e.g., correctly summarizing longitudinal lab trends), and<\/p>\n<\/li>\n<li>\n<p>3. Temporal precision (e.g., associating measurements with exact dates).<\/p>\n<\/li>\n<\/ol>\n<p><b id=\"Tab7\" data-test=\"table-caption\">Table 7 Case studies on TIMER-Eval: Comparison of model responses between base Llama-3.1-8B-Instruct and model tuned w\/ TIMER-Instruct<\/b><\/p>\n<p>In contrast, base models often conflate visits or provide temporally irrelevant information. 
Responses of models tuned with TIMER are more contextually grounded and clinically interpretable.<\/p>\n","protected":false},"excerpt":{"rendered":"Study design and data source Figure 1 summarizes the TIMER method for instruction-tuning and evaluating temporal reasoning capabilities&hellip;\n","protected":false},"author":2,"featured_media":90457,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[275],"tags":[2564,7593,1096,18,910,135,475,474,19,17,7482],"class_list":{"0":"post-90456","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-healthcare","8":"tag-biomedicine","9":"tag-biotechnology","10":"tag-computer-science","11":"tag-eire","12":"tag-general","13":"tag-health","14":"tag-health-care","15":"tag-healthcare","16":"tag-ie","17":"tag-ireland","18":"tag-medicine-public-health"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/90456","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=90456"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/90456\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/90457"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=90456"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=90456"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.eur
opesays.com\/ie\/wp-json\/wp\/v2\/tags?post=90456"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}