{"id":8102,"date":"2025-08-18T22:58:27","date_gmt":"2025-08-18T22:58:27","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/8102\/"},"modified":"2025-08-18T22:58:27","modified_gmt":"2025-08-18T22:58:27","slug":"leveraging-large-language-models-for-the-deidentification-and-temporal-normalization-of-sensitive-health-information-in-electronic-health-records","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/8102\/","title":{"rendered":"Leveraging large language models for the deidentification and temporal normalization of sensitive health information in electronic health records"},"content":{"rendered":"<p>The OpenDeID v2 corpus<\/p>\n<p>To evaluate the performance of various machine learning methods and hybrid approaches that combine machine learning and rule- or pattern-based methods developed by participants in the SREDH\/AI CUP 2023 deidentification competition, the OpenDeID v1 corpus<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 32\" title=\"ALLA, N. L. V. et al. Cohort selection for construction of a clinical natural language processing corpus. Comp. Methods Prog. Biomed. Update 1, 100024 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR32\" id=\"ref-link-section-d86264245e1008\" rel=\"nofollow noopener\" target=\"_blank\">32<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 33\" title=\"Jonnagaddala, J., Chen, A., Batongbacal, S. &amp; Nekkantti, C. The OpenDeID corpus for patient de-identification. Sci. Rep. 
11, 19973 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR33\" id=\"ref-link-section-d86264245e1011\" rel=\"nofollow noopener\" target=\"_blank\">33<\/a>, which was originally sourced from the Health Science Alliance (HSA) biobank of the Lowy Cancer Research Center of the University of New South Wales, Australia<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 34\" title=\"Quinn, C. M. et al. Moving with the times: the health science alliance (HSA) Biobank, Pathway to Sustainability. Biomark Insights 16, 11772719211005745 (2021 Mar).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR34\" id=\"ref-link-section-d86264245e1015\" rel=\"nofollow noopener\" target=\"_blank\">34<\/a>, was extended with an additional 1144 pathology reports curated from the HSA biobank, resulting in 3244 reports (referred to as the OpenDeID v2 corpus). For the competition, the dataset was divided into three sets: training (1734 reports), validation (560 reports), and testing (950 reports).<\/p>\n<p>The dataset includes six main SHI categories: NAME, LOCATION, AGE, DATE, CONTACT, and ID. Except for AGE, each category has subcategories, as shown in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Fig2\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a> and Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>. All date annotations in the OpenDeID v2 corpus were subdivided into the following subcategories: DATE, TIME, DURATION, and SET. Each annotation was assigned a normalized value according to the ISO 8601 standard to ensure temporal integrity. 
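To make the normalization target concrete, the mapping from a date or duration mention to its ISO 8601 value can be sketched as below. This is a minimal, hypothetical illustration, not the corpus annotation tooling; the format list and word-number table are assumptions for the example.

```python
from datetime import datetime

# Illustrative surface forms only; the corpus guidelines cover far more.
DATE_FORMATS = ["%d/%m/%Y", "%d.%m.%Y", "%B %d, %Y", "%d %B %Y"]
DURATION_UNITS = {"day": "D", "days": "D", "week": "W", "weeks": "W",
                  "month": "M", "months": "M", "year": "Y", "years": "Y"}
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def normalize_date(text):
    """Return an ISO 8601 calendar date (YYYY-MM-DD) for a DATE mention."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % text)

def normalize_duration(text):
    """Return an ISO 8601 duration (e.g. P3W) for a DURATION mention."""
    count, unit = text.strip().lower().split()
    n = WORD_NUMBERS.get(count) or int(count)
    return "P%d%s" % (n, DURATION_UNITS[unit])
```

For example, "14/07/2019" normalizes to "2019-07-14" and "three weeks" to "P3W", so mentions with different surface forms compare and sort consistently.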
The updated guidelines and detailed annotation procedures are available elsewhere<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 25\" title=\"Mir, T. H. et al. Proc. International Workshop on Deidentification of Electronic Medical Record Notes (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR25\" id=\"ref-link-section-d86264245e1028\" rel=\"nofollow noopener\" target=\"_blank\">25<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 35\" title=\"Dai, H.-J. &amp; Jonnagaddala, J. HSA Study PHI Corpus - Annotation Guidelines (HSA, 2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR35\" id=\"ref-link-section-d86264245e1031\" rel=\"nofollow noopener\" target=\"_blank\">35<\/a>. A more detailed statistical summary of the compiled corpus is provided in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>.<\/p>\n<p><b id=\"Fig2\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 2: Overview of the OpenDeID v1 and v2 corpora.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01921-7\/figures\/2\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig2\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/08\/41746_2025_1921_Fig2_HTML.png\" alt=\"figure 2\" loading=\"lazy\" width=\"685\" height=\"592\"\/><\/a><\/p>\n<p>The figure provides an overview of the OpenDeID v1 and v2 corpora. 
The numerical values displayed in the sunburst graph indicate the number of annotations in each category.<\/p>\n<p>In the following subsections, we present the ability of the original Pythia model suite<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Biderman, S. et al. In International Conference on Machine Learning. 2397-2430 (PMLR).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR26\" id=\"ref-link-section-d86264245e1060\" rel=\"nofollow noopener\" target=\"_blank\">26<\/a> to address two subtasks, wherein the context is augmented with a few examples. We then demonstrated the results of fine-tuning the same models without applying sophisticated postprocessing rules proposed by the participants to establish baselines and better understand the effectiveness of the participants\u2019 approaches.<\/p>\n<p>In-context learning performance<\/p>\n<p>The first analysis assessed the few-shot learning capabilities of Pythia models across different scales. Unlike our previous study<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 36\" title=\"Lee, Y.-Q. et al. Unlocking the secrets behind advanced artificial intelligence language models in deidentifying chinese-english mixed clinical text: development and validation study. J. Med. Internet Res. 26, e48443 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR36\" id=\"ref-link-section-d86264245e1072\" rel=\"nofollow noopener\" target=\"_blank\">36<\/a>, which relied on the zero-shot ICL of ChatGPT, we used the k-nearest neighbor (kNN)-augmented in-context example selection approach<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 37\" title=\"Liu, J. et al. 
In Proceedings of Deep Learning Inside Out (DeeLIO 2022): the 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. 100-114.\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR37\" id=\"ref-link-section-d86264245e1076\" rel=\"nofollow noopener\" target=\"_blank\">37<\/a> to select the five closest training instances as in-context examples for few-shot ICL in the Pythia scaling suite.<\/p>\n<p>Figure <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Fig3\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a> presents the results, covering models of varying sizes: 70\u2009M (million), 160\u2009M, 410\u2009M, 1\u2009B (billion), 1.4\u2009B, 2.8\u2009B, 6.9\u2009B, and 12\u2009B parameters. The results indicate that pretrained Pythia models with more than 1\u2009B parameters performed comparably to the average performance of all submissions for Subtask 1 in the SREDH\/AI CUP 2023 Deidentification competition. However, these models underperformed relative to the average F1 scores in Subtask 2. Notably, the model with only 160\u2009M parameters could achieve performance close to the average macro-F score for Subtask 1. In contrast, for Subtask 2, models with at least 1.4\u2009B parameters were required to reach micro\/macro-averaged F1 scores over 0.30\/0.15. Furthermore, models with a minimum of 2.8\u2009B parameters were necessary to achieve macro-averaged F1 scores above 0.32, approaching the median value of all submissions in Subtask 2. Overall, the best-performing model for Subtask 1 was the 12\u2009B model, which demonstrated superior effectiveness in SHI recognition. For Subtask 2, the top-ranked model operated at a scale of 6.9\u2009B. 
This result provides insight into the scalability of few-shot learning capabilities within LLMs, such as the Pythia model suite, demonstrating that larger models are particularly beneficial for more complex tasks, such as temporal information normalization.<\/p>\n<p><b id=\"Fig3\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 3: ICL performance (few-shot learning with five examples) comparison for Subtasks 1 and 2.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01921-7\/figures\/3\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig3\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/08\/41746_2025_1921_Fig3_HTML.png\" alt=\"figure 3\" loading=\"lazy\" width=\"685\" height=\"219\"\/><\/a><\/p>\n<p>The figure shows the micro- and macro-F1 scores for Subtasks 1 and 2 using ICL with five-shot prompts. Results are reported on the test sets for Pythia models of various sizes: 70\u2009M, 160\u2009M, 410\u2009M, 1\u2009B, 1.4\u2009B, 2.8\u2009B, 6.9\u2009B, and 12\u2009B parameters.<\/p>\n<p>In Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Fig4\" rel=\"nofollow noopener\" target=\"_blank\">4<\/a>, we further investigate the impact of the number of in-context examples on the performance of both subtasks. The 2.8\u2009B model was chosen as the representative model in the experiment because of its superior performance over small models and comparable performance to larger models. Specifically, in the training set, we chose the number of in-context examples to be 1, 3, 5, and 7, selecting the closest training instances and arranging them in the prompt as demonstrations by their distances, in ascending order. 
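The kNN-augmented example selection and ascending-distance ordering described above can be sketched as follows. This is a minimal stand-in: the bag-of-words similarity is an assumption for illustration, where a real selector would use a learned text encoder.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words vector; a production selector would use a
    # sentence encoder to embed the clinical note instead.
    return Counter(text.lower().split())

def distance(a, b):
    # Cosine distance between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

def build_prompt(query, pool, k=5):
    """Pick the k nearest training instances and arrange them as
    demonstrations in ascending order of distance, with the query last."""
    q = embed(query)
    demos = sorted(pool, key=lambda doc: distance(q, embed(doc)))[:k]
    return "\n\n".join(demos + [query])
```

The nearest (most similar) demonstration therefore appears first in the prompt, and the test instance to be labeled comes at the end.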
Similar to the previous observations<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 38\" title=\"Shi, W., Michael, J., Gururangan, S. &amp; Zettlemoyer, L. kNN-Prompt: Nearest Neighbor Zero-Shot Inference. arXiv preprint arXiv:2205.13792 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR38\" id=\"ref-link-section-d86264245e1111\" rel=\"nofollow noopener\" target=\"_blank\">38<\/a>, we noticed that the recall rates for the two subtasks improve as the number of examples increases. However, the improvement diminishes when the number of samples exceeds five.<\/p>\n<p><b id=\"Fig4\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 4: Effect of varying in-context examples on Pythia-2.8B performance across training tasks.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01921-7\/figures\/4\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig4\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/08\/41746_2025_1921_Fig4_HTML.png\" alt=\"figure 4\" loading=\"lazy\" width=\"685\" height=\"457\"\/><\/a><\/p>\n<p><b>a<\/b> Performance impact of using 1, 3, 5, and 7 in-context examples for Subtasks 1 and 2 with the original training set. 
<b>b\u2013d<\/b> Comparative analysis of Pythia-2.8\u2009B performance on Subtasks 1 and 2 using 3, 5, 7 in-context examples, evaluated on both the original and deduplicated training sets.<\/p>\n<p>Fine-tuning performance<\/p>\n<p>In this experiment, we applied fine-tuning to the Pythia models via two widely used approaches: full-parameter and LoRA-based parameter-efficient fine-tuning<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 39\" title=\"Houlsby, N. et al. In International conference on machine learning. 2790-2799 (PMLR).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR39\" id=\"ref-link-section-d86264245e1148\" rel=\"nofollow noopener\" target=\"_blank\">39<\/a>. We observed that, except for the 70\u2009M model trained with LoRA, the fine-tuned models demonstrated a notable improvement in recall rates for both tasks after fine-tuning, resulting in improved F1 scores over those achieved through ICLs (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Fig5\" rel=\"nofollow noopener\" target=\"_blank\">5<\/a>). The detailed results are summarized in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a>.<\/p>\n<p><b id=\"Fig5\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
5: Performance comparison between the ICL and fine-tuned models based on full and LoRA fine-tuning.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01921-7\/figures\/5\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig5\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/08\/41746_2025_1921_Fig5_HTML.png\" alt=\"figure 5\" loading=\"lazy\" width=\"685\" height=\"498\"\/><\/a><\/p>\n<p>The figure illustrates that model performance generally improves with increasing parameters; however, gains from fine-tuning diminish beyond a certain point, with a crossover observed between the 70\u2009M and 160\u2009M models. Optimal performance is reached at 2.8\u2009B parameters for both subtasks. Beyond this, the performance plateaus or declines, particularly for the LoRA and full fine-tuning approaches.<\/p>\n<p>Compared with the ICL methods, the fine-tuned models with fewer than 2.8\u2009B parameters showed a notable performance improvement. However, for Subtask 1, the advantage of conventional full fine-tuning over ICL was not significant in models with &gt;6B parameters. Similarly, for Subtask 2, the macro-averaged F1 scores were lower than those achieved via ICL in the models with more than 6B parameters. In contrast, our results demonstrate that LoRA is a particularly promising tuning method, outperforming the traditional full-parameter tuning approach in both accuracy and training efficiency, especially in models with more than 1B parameters. 
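LoRA's training efficiency comes from freezing the pretrained weight W and learning only a low-rank update, W' = W + (alpha/r)·BA. A minimal sketch of both the parameter saving and the update itself (dimensions and hyperparameters below are illustrative assumptions, not the exact configuration used in the experiments):

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable parameters for one d_out x d_in weight matrix:
    full fine-tuning updates every entry, while LoRA learns only
    the low-rank factors B (d_out x r) and A (r x d_in)."""
    return d_out * d_in, r * (d_out + d_in)

def apply_lora(W, A, B, alpha=16.0):
    """Merge the adapter at inference: W' = W + (alpha / r) * (B @ A),
    with rank r inferred from A's row count (plain nested lists)."""
    r = len(A)
    scale = alpha / r
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))]
            for i in range(len(W))]
```

For a 2560-wide square projection with rank 8, the adapter trains roughly 0.6% of the parameters that full fine-tuning would touch, which is why LoRA both trains cheaply and, on a moderate-sized corpus like this one, regularizes larger models against overfitting.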
Given the moderate dataset size used in this study, our findings align with those of Dutt et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 40\" title=\"Dutt, R., Ericsson, L., Sanchez, P., Tsaftaris, S. A. &amp; Hospedales, T. M. In Medical Imaging with Deep Learning. 1-20.\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR40\" id=\"ref-link-section-d86264245e1180\" rel=\"nofollow noopener\" target=\"_blank\">40<\/a>, who demonstrated that parameter-efficient fine-tuning methods, such as LoRA, consistently outperform full fine-tuning when working with smaller datasets. Interestingly, the 70\u2009M model trained with LoRA underperformed compared to its counterparts, including both the ICL and full fine-tuning approaches. We attribute this underperformance to the smaller base model, which results in a limited number of learnable parameters after applying LoRA. In such cases, the effectiveness of the adaptations may be restricted because there may not be sufficient interaction among the parameters to capture the complex relationships needed.<\/p>\n<p>Increasing the model parameters generally improves the performance, but the performance gains from fine-tuning begin to decrease after a certain point, with a crossover occurring between the 70\u2009M and 160\u2009M models (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Fig5\" rel=\"nofollow noopener\" target=\"_blank\">5<\/a>). Additionally, optimal performance is achieved with models of 2.8\u2009B parameters for both subtasks. Beyond this point, the performance improvements plateau for LoRA and even deteriorate for full fine-tuning. We believe that this is due to the size of the dataset. 
By limiting the number of trainable parameters, LoRA helps mitigate overfitting in larger models when working with smaller datasets<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 41\" title=\"Biderman, D. et al. Lora learns less and forgets less. arXiv preprint arXiv:2405.09673 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR41\" id=\"ref-link-section-d86264245e1190\" rel=\"nofollow noopener\" target=\"_blank\">41<\/a>.<\/p>\n<p>SREDH\/AI CUP 2023 competition<\/p>\n<p>The competition was hosted on CodaLab<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 42\" title=\"Pavao, A. et al. Codalab competitions: an open source platform to organize scientific challenges. J. Mach. Learn. Res. 24, 1&#x2013;6 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR42\" id=\"ref-link-section-d86264245e1202\" rel=\"nofollow noopener\" target=\"_blank\">42<\/a>, a platform that enables participants to download the training set for model development and upload predictions for the validation set to receive immediate performance feedback compared to other teams via a leaderboard. During the final testing phase, participants submitted their predictions for the test set in CodaLab, which were evaluated and ranked using the macro-averaged F1 measure as the primary evaluation metric. The final rankings were determined by averaging the individual ranks of both tasks for each team and sorting them based on the average values. The competition attracted 721 participants, who formed 291 teams. In the final testing phase, each team was allowed up to six submissions over the course of one day, resulting in 218 total submissions. 
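Because the rankings above hinge on the micro/macro distinction, a brief sketch of the two scores: micro-F1 pools true/false positives across all SHI categories (so frequent categories dominate), while macro-F1 averages the per-category F1 scores (so rare categories count equally). The category names below are illustrative only.

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts maps each SHI category to (tp, fp, fn).
    Micro-F1 pools the counts before scoring; macro-F1 averages
    the per-category F1 values, weighting every category equally."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro
```

A system that misses an entire rare category can still post a high micro-F1 but is penalized heavily on macro-F1, which is why macro-averaging was chosen as the primary ranking metric.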
In total, 103 teams submitted their prediction results during the final testing phase.<\/p>\n<p>Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Tab1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a> presents the detailed performance of the top ten teams, comparing both micro- and macro-averaged F1 scores for the two subtasks, alongside team numbers. In subtask 1, the top-ranked system developed by Huang et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 43\" title=\"Huang, C.-L., Rianto, B., Sun, J.-T., Fu, Z.-X. &amp; Lee, C.-H. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR43\" id=\"ref-link-section-d86264245e1212\" rel=\"nofollow noopener\" target=\"_blank\">43<\/a> utilized an LLM-based method based on Qwen<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"Bai, J. et al. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR44\" id=\"ref-link-section-d86264245e1216\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a>, thereby achieving micro- and macro-averaged F1 scores of 0.945 and 0.881, respectively. For subtask 2, the highest-ranking system was developed by Zhao et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Zhao, Z.-R., Chou, P.-C., Mir, T. H. &amp; Dai, H.-J. 
In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 27-38 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR45\" id=\"ref-link-section-d86264245e1220\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>, which relies on pattern-based approaches and achieves micro- and macro-averaged F1 scores of 0.844 and 0.869, respectively. The overall performance showed average micro\/macro-F scores for subtasks 1 and 2 of 0.666\/0.496 and 0.6\/0.394, respectively, with median micro\/macro-F scores of 0.810\/0.710 and 0.679\/0.360, respectively. These results highlight significant disparities in performance, indicating that subtask 2 (normalization of temporal information) presents greater challenges than SHI recognition.<\/p>\n<p><b id=\"Tab1\" data-test=\"table-caption\">Table 1 Top ten teams\u2019 performances for subtasks 1 and 2 on the official test set in terms of micro- and macro-recall (R), precision (P), and F scores (F)<\/b><\/p>\n<p>A notable trend in the competition was the widespread use of LLMs (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Fig6\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a>). Specifically, 77.2% of the 57 teams integrated LLMs into their systems, and 63.9% of the top 30 teams used them. Decoder-only transformer architectures were the most prevalent, comprising 66.1 and 83.3% of the submissions for Subtasks 1 and 2, respectively. Encoder-only architectures were the next most common in subtask 1, accounting for 21.0%, whereas encoder-only and encoder-decoder architectures were equally utilized in subtask 2, each accounting for 8.3%. This distribution underscores the current trends in Clinical NLP.<\/p>\n<p><b id=\"Fig6\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
6: Overview of methods applied by the top 30 teams.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01921-7\/figures\/6\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig6\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/08\/41746_2025_1921_Fig6_HTML.png\" alt=\"figure 6\" loading=\"lazy\" width=\"685\" height=\"317\"\/><\/a><\/p>\n<p>The figure presents a detailed overview of the methods applied by the top 30 teams in the SREDH\/AI CUP 2023 competition. The results highlight the widespread adoption of LLMs. 77.2% of the 57 teams and 63.9% of the top 30 teams incorporated LLMs. Decoder-only transformer architectures were most common, comprising 66.1% and 83.3% of Subtask 1 and Subtask 2 submissions, respectively. Encoder-only architectures were the next most common in subtask 1, accounting for 21.0%, whereas encoder-only and encoder-decoder architectures were equally utilized in subtask 2.<\/p>\n<p>Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Tab2\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a> summarizes the techniques used by the top ten teams for Subtask 1. Pretrained transformer-based LLMs have been widely used, with nearly all top-performing teams incorporating them into their solutions. Most teams approached Subtask 1 as a standalone sequential labeling task, separate from Subtask 2. However, Teams 2, 3, 4, and 9 adopted an end-to-end text generation approach to simultaneously recognize SHIs and normalize temporal expressions. Ensembles of multiple models are typically employed. Almost all the top-performing teams incorporated rule-based methods to some degree. Only one team applied pure pattern and dictionary-based approaches. 
In addition, two teams explored methods to increase the size of the training sets, particularly for SHI types with limited training instances, and to enhance the context diversity by manipulating tokens within the context.<\/p>\n<p><b id=\"Tab2\" data-test=\"table-caption\">Table 2 Brief summary of the techniques used by the top ten teams for subtask 1<\/b><\/p>\n<p>Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Tab3\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a> summarizes the techniques used by the top-performing teams for Subtask 2. While ensemble and pretrained transformer-based LLMs are still widely used, seven out of ten teams have incorporated pattern-based approaches. Although several top-performing teams used commercial LLMs, such as OpenAI\u2019s ChatGPT-3.5<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 27\" title=\"Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877&#x2013;1901 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR27\" id=\"ref-link-section-d86264245e2255\" rel=\"nofollow noopener\" target=\"_blank\">27<\/a> and Google\u2019s PaLM2<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 46\" title=\"Anil, R. et al. Palm 2 technical report. 
arXiv preprint arXiv:2305.10403 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR46\" id=\"ref-link-section-d86264245e2259\" rel=\"nofollow noopener\" target=\"_blank\">46<\/a> during experimentation, only Team 5 utilized PaLM2 for their final submission, achieving micro\/macro-F1 scores of 0.747\/0.777 (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">8<\/a>). Although the team\u2019s ICL method, which relied on a fixed set of few-shot examples, was simpler than that of our baseline models, the commercial PaLM2 model outperformed our ICL with the Pythia-6.9\u2009B model.<\/p>\n<p><b id=\"Tab3\" data-test=\"table-caption\">Table 3 Summary of techniques used by top-performing teams for the temporal information normalization subtask<\/b><\/p>\n<p>Compared with the performance achieved by the top ten teams, as shown in Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>, the models fine-tuned with LoRA achieved comparable micro-F1 scores for both subtasks. However, most top-ranked teams presented better macro-F1 scores, particularly for Subtask 2. This suggests that, while LoRA fine-tuning is effective, the enhanced methods proposed by the top ten teams could improve the ability of the pretrained models to handle variability across different SHI subcategories. We briefly summarize the enhanced methodologies employed by the top ten teams to provide insights learned from the competition. 
More details about the methods used by four of these teams can be found in the Supplementary Note <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a>.<\/p>\n<p>The ensemble approach was found to be more effective. Team 9<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 47\" title=\"Huang, P.-W. &amp; Liu, T.-E. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 80-99 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR47\" id=\"ref-link-section-d86264245e2538\" rel=\"nofollow noopener\" target=\"_blank\">47<\/a> compared the performance of BERT-based models<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Li, Y., Wehbe, R. M., Ahmad, F. S., Wang, H. &amp; Luo, Y. Clinical-longformer and clinical-bigbird: Transformers for long clinical sequences. arXiv preprint arXiv:2201.11838 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR48\" id=\"ref-link-section-d86264245e2542\" rel=\"nofollow noopener\" target=\"_blank\">48<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Beltagy, I., Peters, M. E. &amp; Cohan, A. Longformer: The long-document transformer. 
arXiv preprint arXiv:2004.05150 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR49\" id=\"ref-link-section-d86264245e2545\" rel=\"nofollow noopener\" target=\"_blank\">49<\/a> via a sequential labeling formulation with that of generative models based on Finetuned Language Net (FLAN)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Wei, J. et al. In International Conference on Learning Representations.\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR50\" id=\"ref-link-section-d86264245e2549\" rel=\"nofollow noopener\" target=\"_blank\">50<\/a>. Their experimental results showed the superior performance of the Pythia and FLAN ensembles compared to the BERT-based ensemble. Team 6<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Tseng, F.-P. et al. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 143-156 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR51\" id=\"ref-link-section-d86264245e2553\" rel=\"nofollow noopener\" target=\"_blank\">51<\/a> employed an ensemble method to merge the prediction results of five Longformer-based<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Beltagy, I., Peters, M. E. &amp; Cohan, A. Longformer: The Long-Document Transformer. arXiv e-prints, arXiv: 2004.05150 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR52\" id=\"ref-link-section-d86264245e2557\" rel=\"nofollow noopener\" target=\"_blank\">52<\/a> models trained on distinct data splits. 
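Merging the predictions of several models trained on distinct splits is commonly done by span-level voting; the following is a hypothetical sketch of that merging step (the threshold convention is an assumption, and the team's exact scheme may differ):

```python
from collections import Counter

def merge_by_vote(predictions, threshold=None):
    """Merge entity predictions from several models.

    Each element of `predictions` is one model's set of
    (start, end, category) spans. A span is kept if at least
    `threshold` models emitted it (default: strict majority).
    Voting damps the variance of any single model's output."""
    if threshold is None:
        threshold = len(predictions) // 2 + 1
    votes = Counter(span for pred in predictions for span in pred)
    return {span for span, n in votes.items() if n >= threshold}
```

With five models, a span surviving the default vote must appear in at least three of them, which filters out the idiosyncratic errors of individual splits, an effect that matters most for categories with few training instances.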
They observed that ensemble learning could significantly reduce model prediction variability, particularly for categories with limited training instances (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">9<\/a>).<\/p>\n<p>Given the unbalanced nature of the released training set, data augmentation is another critical factor in improving the ability of fine-tuned models to identify all SHI categories. Some teams<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Chao, C.-Y. &amp; Lin, C.-W. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 39-50 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR53\" id=\"ref-link-section-d86264245e2568\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Chiu, P.-S., Hou, B.-W., Chen, Y.-T. &amp; Huang, S.-H. In International Workshop on Deidentification of Electronic Medical Record Notes. 202-212 (Springer).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR54\" id=\"ref-link-section-d86264245e2571\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a> prompted ChatGPT to generate additional training instances for SHI types with few training examples or for which their fine-tuned model performed poorly. 
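One simple, library-free form of such augmentation is to perturb labelled sentences while preserving every SHI token. The following is a hypothetical sketch, not any team's actual code; the filler vocabulary, the 15% rate, and the helper name are illustrative:

```python
import random

FILLERS = ["note", "record", "item"]  # illustrative irrelevant words

def noisy_copy(tokens, labels, p=0.15, rng=None):
    """Return a perturbed copy of a token-labelled sentence.

    With probability p a non-SHI token (label 'O') is dropped, and with
    probability p an irrelevant filler word (labelled 'O') is inserted
    after a kept token. Tokens carrying SHI labels are never removed.
    """
    rng = rng or random.Random(0)
    out_toks, out_labs = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "O" and rng.random() < p:
            continue  # drop a non-SHI word
        out_toks.append(tok)
        out_labs.append(lab)
        if rng.random() < p:
            out_toks.append(rng.choice(FILLERS))
            out_labs.append("O")
    return out_toks, out_labs

toks = ["Seen", "by", "Dr", "Smith", "on", "review"]
labs = ["O", "O", "O", "B-NAME", "O", "O"]
aug_toks, aug_labs = noisy_copy(toks, labs, rng=random.Random(1))
# 'Smith' is guaranteed to survive because its label is not 'O'.
```

Seeding the generator keeps the augmented corpus reproducible across runs.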
For example, the experimental results of Team 4 showed that LLM-based augmentation significantly improved the macro-averaged F1-score of the fine-tuned 410\u2009M Pythia model by 0.20 (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a>). Team 9 improved the model performance (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">12<\/a>) for low-resource SHI categories by generating training samples with synthetic surrogates and by introducing random noise (inserting irrelevant words and removing non-SHI words from \u223c15% of the sentences) to enhance the model\u2019s robustness and adaptability<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 47\" title=\"Huang, P.-W. &amp; Liu, T.-E. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 80-99 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR47\" id=\"ref-link-section-d86264245e2583\" rel=\"nofollow noopener\" target=\"_blank\">47<\/a>. Finally, Gupta et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Gupta, S., Alla, N. L. V., Panchal, O., Witowski, J. &amp; Jonnagaddala, J. 
In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 100-113 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR55\" id=\"ref-link-section-d86264245e2588\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a> was the only team to use other openly available de-identification datasets beyond the released training set to improve the system performance. They augmented the training set with the 2014 i2b2\/UTHealth<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Stubbs, A., Kotfila, C. &amp; Uzuner, &#xD6; Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2\/UTHealth shared task Track 1. J. Biomed. Inform. 58, S11&#x2013;S19 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR56\" id=\"ref-link-section-d86264245e2592\" rel=\"nofollow noopener\" target=\"_blank\">56<\/a> and 2016 CEGS N-GRID<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 12\" title=\"Stubbs, A., Filannino, M. &amp; Uzuner, &#xD6; De-identification of psychiatric intake records: overview of 2016 CEGS N-GRID shared tasks Track 1. J. Biomed. Inform. 75, S4&#x2013;S18 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR12\" id=\"ref-link-section-d86264245e2596\" rel=\"nofollow noopener\" target=\"_blank\">12<\/a> deidentification corpora to train a BERT-based model<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. 
Bioinformatics 36, 1234&#x2013;1240 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR57\" id=\"ref-link-section-d86264245e2600\" rel=\"nofollow noopener\" target=\"_blank\">57<\/a> for subtask 1. However, they did not observe any performance improvements with this configuration.<\/p>\n<p>In terms of training methods, most teams applied full-parameter fine-tuning to LLMs with fewer than 1B parameters; although full-parameter tuning is computationally intensive, these smaller model sizes keep it feasible within the available computing resources. For larger models, teams predominantly used parameter-efficient tuning techniques, such as LoRA or quantization-LoRA (QLoRA), to mitigate resource limitations. For example, Ru et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Ru, Z.-J. et al. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 169-182 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR58\" id=\"ref-link-section-d86264245e2607\" rel=\"nofollow noopener\" target=\"_blank\">58<\/a> fine-tuned Pythia models (70\u2009M, 160\u2009M, and 1B) with QLoRA and reported that pretrained models with more parameters generally yielded better predictions. Chiu et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Chiu, P.-S., Hou, B.-W., Chen, Y.-T. &amp; Huang, H.-H. 
In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 195&#x2013;207 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR59\" id=\"ref-link-section-d86264245e2611\" rel=\"nofollow noopener\" target=\"_blank\">59<\/a> proposed combining fine-tuning and LoRA, a method similar to that of the Chain-of-LoRA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 60\" title=\"Qiu, X., Hao, T., Shi, S., Tan, X. &amp; Xiong, Y. J. Chain-of-LoRA: enhancing the instruction fine-tuning performance of low-rank adaptation on diverse instruction set. IEEE Signal Process. Lett. 31, 875&#x2013;879 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR60\" id=\"ref-link-section-d86264245e2615\" rel=\"nofollow noopener\" target=\"_blank\">60<\/a>, to enhance the performance of LLMs in the presented task. Two teams<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 43\" title=\"Huang, C.-L., Rianto, B., Sun, J.-T., Fu, Z.-X. &amp; Lee, C.-H. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR43\" id=\"ref-link-section-d86264245e2619\" rel=\"nofollow noopener\" target=\"_blank\">43<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Tseng, F.-P. et al. 
In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 143-156 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR51\" id=\"ref-link-section-d86264245e2622\" rel=\"nofollow noopener\" target=\"_blank\">51<\/a> proposed the application of sliding window methods to enable BERT-based models<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Beltagy, I., Peters, M. E. &amp; Cohan, A. Longformer: The Long-Document Transformer. arXiv e-prints, arXiv: 2004.05150 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR52\" id=\"ref-link-section-d86264245e2626\" rel=\"nofollow noopener\" target=\"_blank\">52<\/a> and LLMs<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"Bai, J. et al. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR44\" id=\"ref-link-section-d86264245e2631\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a> to aggregate training instances across sequences, thereby enhancing the generalizability and capability of the model to interpret and predict longer sequences. To minimize overfitting, two approaches were highlighted by the participants: Team 2 applied a noisy embedding instruction fine-tuning method<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 61\" title=\"Jain, N. et al. 
In The Twelfth International Conference on Learning Representations.\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR61\" id=\"ref-link-section-d86264245e2635\" rel=\"nofollow noopener\" target=\"_blank\">61<\/a>, and Team 9 applied parameter freezing in the intermediate layers and low-rank adaptation (LoRA). The detailed results of Team 2 and Team 9 are in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">5<\/a> and Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">12<\/a>, respectively.<\/p>\n<p>Nearly all the top-ranked teams applied pattern-based methods to varying degrees, as shown in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Fig6\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a> and Tables <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Tab2\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>, <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#Tab3\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a>. Teams that focus on LLM-based approaches have highlighted the issue of named entity hallucinations<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 62\" title=\"Akani, E., Favre, B., Bechet, F. &amp; Gemignani, R. 
In Proceedings of the 16th International Natural Language Generation Conference. 437&#x2013;442.\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR62\" id=\"ref-link-section-d86264245e2657\" rel=\"nofollow noopener\" target=\"_blank\">62<\/a>, which refers to out-of-report (OOR) or out-of-definition (OOD) SHIs. These teams used pattern-based methods to filter out OOR and OOD instances and duplicate instances generated by LLMs. For example, Li et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 63\" title=\"Li, Z.-E., Zheng, H.-Y., Mao, K.-C. &amp; Wei, Z.-W. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 157&#x2013;168 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR63\" id=\"ref-link-section-d86264245e2661\" rel=\"nofollow noopener\" target=\"_blank\">63<\/a> reported that postprocessing for hallucinations improved their micro- and macro-averaged F-scores by 0.018 and 0.170, respectively. Furthermore, many top-ranked teams<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 43\" title=\"Huang, C.-L., Rianto, B., Sun, J.-T., Fu, Z.-X. &amp; Lee, C.-H. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR43\" id=\"ref-link-section-d86264245e2666\" rel=\"nofollow noopener\" target=\"_blank\">43<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Beltagy, I., Peters, M. E. &amp; Cohan, A. Longformer: The Long-Document Transformer. 
arXiv e-prints, arXiv: 2004.05150 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR52\" id=\"ref-link-section-d86264245e2669\" rel=\"nofollow noopener\" target=\"_blank\">52<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Chao, C.-Y. &amp; Lin, C.-W. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 39-50 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR53\" id=\"ref-link-section-d86264245e2672\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Ru, Z.-J. et al. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 169-182 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR58\" id=\"ref-link-section-d86264245e2675\" rel=\"nofollow noopener\" target=\"_blank\">58<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Li, Z.-E., Zheng, H.-Y., Mao, K.-C. &amp; Wei, Z.-W. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 157&#x2013;168 (Springer Nature, 2024).\" href=\"#ref-CR63\" id=\"ref-link-section-d86264245e2678\">63<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Cho, Y.-C., Yang, Y.-J., Liu, Y.-D., Tsao, T.-S. &amp; Li, M.-J. 
In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 183&#x2013;194 (Springer Nature, 2024).\" href=\"#ref-CR64\" id=\"ref-link-section-d86264245e2678_1\">64<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Huang, T.-Y., Shih, J.-F., Hsieh, Y.-C. &amp; Feng, H.-H. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 129&#x2013;142 (Springer Nature, 2024).\" href=\"#ref-CR65\" id=\"ref-link-section-d86264245e2678_2\">65<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 66\" title=\"Huang, Y.-Z., Peng, T.-C., Lin, H.-Y., Sy, E. &amp; Chang, Y.-C. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 13&#x2013;26 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR66\" id=\"ref-link-section-d86264245e2681\" rel=\"nofollow noopener\" target=\"_blank\">66<\/a> have applied hybrid methods that combine LLMs and pattern-based approaches to increase recall rates, particularly for the recognition of age- and temporal-related SHIs, and for the normalization of temporal SHIs. In contrast, Team 1<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Zhao, Z.-R., Chou, P.-C., Mir, T. H. &amp; Dai, H.-J. 
In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 27-38 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR45\" id=\"ref-link-section-d86264245e2685\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a> and Huang et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 67\" title=\"Huang, M.-S., Mau, B.-R., Lin, J.-H. &amp; Chen, Y.-Z. In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 114&#x2013;128 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR67\" id=\"ref-link-section-d86264245e2689\" rel=\"nofollow noopener\" target=\"_blank\">67<\/a> presented two pure pattern-based approaches and demonstrated that heuristic patterns can achieve state-of-the-art performance (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">4<\/a>). The comparative study of GPT-3.5 fine-tuning and pattern-based approaches by Team 1 showed that, although pattern-based approaches require more iterative human effort to improve rule accuracy and coverage than data-driven LLM methods, they offer significantly better computational efficiency and lower computing power requirements.<\/p>\n<p>Finally, regarding the use of commercial pretrained LLMs, GPT-3.5 was the most popular choice. We found that only one team<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 68\" title=\"Huang, S.-X., Cheng, H.-A. &amp; Li, Z.-H. 
In Proceedings of the 2024 International Workshop on Deidentification of Electronic Medical Record Notes 63&#x2013;79 (Springer Nature, 2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#ref-CR68\" id=\"ref-link-section-d86264245e2700\" rel=\"nofollow noopener\" target=\"_blank\">68<\/a> from the top 30 conducted a small-scale study to examine the feasibility of applying ICL with GPT-3.5 to both subtasks. Although they observed promising results in their study, they ultimately decided to shift to local LLM fine-tuning and used ChatGPT solely for data augmentation, owing to the cost associated with API calls for large-scale experiments. Only Team 1 applied fine-tuning based on GPT-3.5, and achieved F-scores of 0.752 and 0.799 for Subtasks 1 and 2, respectively (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01921-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">4<\/a>). 
This performance is comparable to our results for LoRA fine-tuning of Pythia-2.8\u2009B (0.764\/0.650).<\/p>\n","protected":false},"excerpt":{"rendered":"The OpenDeID v2 corpus To evaluate the performance of various machine learning methods and hybrid approaches that combine&hellip;\n","protected":false},"author":2,"featured_media":8103,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[275],"tags":[2564,7593,18,910,135,475,474,19,17,7482,2101],"class_list":{"0":"post-8102","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-healthcare","8":"tag-biomedicine","9":"tag-biotechnology","10":"tag-eire","11":"tag-general","12":"tag-health","13":"tag-health-care","14":"tag-healthcare","15":"tag-ie","16":"tag-ireland","17":"tag-medicine-public-health","18":"tag-public-health"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/8102","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=8102"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/8102\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/8103"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=8102"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=8102"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp
-json\/wp\/v2\/tags?post=8102"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}