{"id":92875,"date":"2025-09-29T15:46:09","date_gmt":"2025-09-29T15:46:09","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/92875\/"},"modified":"2025-09-29T15:46:09","modified_gmt":"2025-09-29T15:46:09","slug":"a-human-llm-collaborative-annotation-approach-for-screening-articles-on-precision-oncology-randomized-controlled-trials-bmc-medical-research-methodology","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/92875\/","title":{"rendered":"A human-LLM collaborative annotation approach for screening articles on precision oncology randomized controlled trials | BMC Medical Research Methodology"},"content":{"rendered":"<p>Data source and screening criteria<\/p>\n<p>In this study, we validated our method by screening articles on precision oncology randomized controlled trials (RCTs). Since there are no specific subject terms for \u201cprecision\u201d in this context, we retrieved articles from PubMed [<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 17\" title=\"Fiorini N, Canese K, Starchenko G, Kireev E, Kim W, Miller V, et al. Best match: new relevance search for PubMed. PLoS Biol. 2018;16(8):e2005343.\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#ref-CR17\" id=\"ref-link-section-d37717306e551\" rel=\"nofollow noopener\" target=\"_blank\">17<\/a>] using the search query: \u201crandomized controlled trial\u201d[pt] AND \u201ccancer\u201d[MeSH Major Topic] AND \u201chumans\u201d[mh]. 
This search yielded 23,521 articles published between January 1, 2012, and December 31, 2023.<\/p>\n<p>To identify articles on precision oncology RCTs from this set, we established four criteria as shown in Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#Tab1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>: (1) the article must be a randomized controlled trial, (2) the study population must be cancer patients, (3) the study purpose must be cancer treatment evaluation, and (4) the study must involve biomarkers related to genetic and molecular characteristics. An article is considered a precision oncology RCT only if it meets all four criteria.<\/p>\n<p><b id=\"Tab1\" data-test=\"table-caption\">Table 1 The screening criteria for precision oncology RCTs<\/b><\/p>\n<p>During the manual annotation process, experts assessed whether each article met these criteria based on the title and abstract. To ensure accuracy and reliability, two experts conducted the initial annotations independently. Any discrepancies were reviewed by an additional annotator, who made the final decision. We used the Medtator [<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 18\" title=\"He H, Fu S, Wang L, Liu S, Wen A, Liu H. Medtator: a serverless annotation tool for corpus development. Bioinformatics. 
2022;38(6):1776\u20138.\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#ref-CR18\" id=\"ref-link-section-d37717306e653\" rel=\"nofollow noopener\" target=\"_blank\">18<\/a>] annotation tool for this process.<\/p>\n<p>Design of ChatGPT prompt<\/p>\n<p>Using the OpenAI API (<a href=\"https:\/\/platform.openai.com\/docs\/api-reference\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/platform.openai.com\/docs\/api-reference<\/a>), we selected \u201cgpt-3.5-turbo\u201d as our base model. We utilized the role attribute in the message objects to define the prompts for \u201csystem\u201d and \u201cuser\u201d roles. The \u201csystem\u201d role was employed to set the context and guidelines, providing ChatGPT\u2019s virtual character and relevant background information. The \u201cuser\u201d role described specific requirements, detailing the tasks and expected response format. In our API implementation, we configured the top-p parameter to its default value of 1, while setting the temperature to 0. This configuration minimizes the variability of the returned responses. For additional details on the API usage and settings, please refer to our code repository on GitHub.<\/p>\n<p><b id=\"Fig2\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
2<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3\/figures\/2\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig2\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/09\/12874_2025_2674_Fig2_HTML.png\" alt=\"figure 2\" loading=\"lazy\" width=\"685\" height=\"506\"\/><\/a><\/p>\n<p>The schema for article screening using ChatGPT<\/p>\n<p>For the task of screening articles on precision oncology RCTs, the prompt design in the \u201csystem\u201d and \u201cuser\u201d messages is illustrated in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#Fig2\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>. In the \u201csystem\u201d message, the model\u2019s role was defined as \u201can expert annotator specializing in scientific article content analysis\u201d, and the message included the article\u2019s title and abstract. The \u201cuser\u201d message specified the screening and annotation task: determining whether an article meets all criteria. We provided the expected response format in JSON, along with an example. To ensure consistent response formatting, we instructed the model to use yes\/no options and appended \u201cAnswer:\u201d after the article content to mark the starting point of the response. This structure shows the model where its response should begin, avoiding confusion when handling long inputs.<\/p>\n<p>Prompt optimization<\/p>\n<p>To achieve near-perfect recall and high precision, we iteratively refined the LLM\u2019s prompt manually using a standard dataset. 
The dataset was divided into a tuning set and a validation set, and the LLM was initially prompted to annotate both sets. Performance was assessed by comparing the labels generated by the LLM with those annotated by humans for both sets. If the performance metrics (recall and precision) were satisfactory, the process was concluded. If not, the prompt was revised based on an analysis of misclassifications in the tuning set. This cycle of evaluation and prompt refinement was repeated until the model demonstrated consistently high performance, characterized by near-perfect recall and high precision.<\/p>\n<p>During the iterative prompt optimization process, we focused on three levels of refinement and adjustment for the LLM\u2019s prompt. The first level addressed the structure of the prompt framework, such as whether the specific content of the article (title and abstract) should be included in the \u201csystem\u201d message or the \u201cuser\u201d message, and whether GPT should be required to provide reasoning for its answers. The second level addressed how to determine whether an article meets multiple criteria\u2014whether to assess each criterion independently or simultaneously. The third level focused on refining the conceptual description of each criterion and providing corresponding examples when the concepts were ambiguous, enabling the model to accurately classify the articles.<\/p>\n<p>Collaborative annotation<\/p>\n<p>It is important to emphasize that our collaborative annotation approach is specifically designed for tasks involving article screening with a low prevalence of positive samples, where articles meeting specific criteria comprise only a small fraction of the retrieved set. 
The collaborative annotation process is as follows: using the optimized prompt developed in the <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"section anchor\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#Sec5\" rel=\"nofollow noopener\" target=\"_blank\">Prompt optimization<\/a>\u00a0section, we employed the LLM to annotate the articles. Given the near-perfect recall achieved by our model, the negative samples identified by the LLM are almost entirely accurate. Although errors may occur among the LLM-annotated positive samples, they are relatively rare. By manually verifying only these positive samples, we can effectively correct any misclassifications. This combined approach of LLM pre-annotation followed by manual validation significantly reduces the overall workload for article screening.<\/p>\n<p>Fine-tuning of the supervised model<\/p>\n<p>To further validate the reliability of the human-LLM collaboratively annotated data, we trained a supervised model on the collaboratively annotated articles and assessed its performance. We selected the BioBERT [<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 19\" title=\"Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234\u201340.\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#ref-CR19\" id=\"ref-link-section-d37717306e725\" rel=\"nofollow noopener\" target=\"_blank\">19<\/a>] model, known for its excellent performance in previous studies [<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 6\" title=\"Lokker C, Bagheri E, Abdelkader W, Parrish R, Afzal M, Navarro T, et al. 
Deep learning to refine the identification of high-quality clinical research articles from the biomedical literature: performance evaluation. J Biomed Inform. 2023;142:104384.\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#ref-CR6\" id=\"ref-link-section-d37717306e728\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a>, <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 20\" title=\"Turchin A, Masharsky S, Zitnik M. Comparison of BERT implementations for natural language processing of narrative medical documents. Inform Med Unlocked. 2023;36:101139.\" href=\"http:\/\/bmcmedresmethodol.biomedcentral.com\/articles\/10.1186\/s12874-025-02674-3#ref-CR20\" id=\"ref-link-section-d37717306e731\" rel=\"nofollow noopener\" target=\"_blank\">20<\/a>], for fine-tuning to perform the classification task. During preprocessing, we concatenated each article\u2019s title and abstract and fed the result into the BioBERT model. The model outputs a probability for each category, and if the probability of a specific category exceeds a threshold, the article is assigned the corresponding label. We conducted hyperparameter tuning to improve the model\u2019s reliability. 
By carefully selecting and adjusting hyperparameters such as learning rate, batch size, and regularization strength, we aimed to achieve accurate and meaningful results.<\/p>\n","protected":false},"excerpt":{"rendered":"Data source and screening criteria In this study, we validated our method by screening articles on precision oncology&hellip;\n","protected":false},"author":2,"featured_media":92876,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[74],"tags":[60249,18,7484,60248,19,17,96,60250,60251,7483,82,24103],"class_list":{"0":"post-92875","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-technology","8":"tag-article-screening","9":"tag-eire","10":"tag-health-sciences","11":"tag-human-llm-collaboration","12":"tag-ie","13":"tag-ireland","14":"tag-medicine","15":"tag-precision-oncology-randomized-controlled-trials","16":"tag-statistical-theory-and-methods","17":"tag-statistics-for-life-sciences","18":"tag-technology","19":"tag-theory-of-medicine-bioethics"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/92875","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=92875"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/92875\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/92876"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=92875"}],"wp:term":[{"tax
onomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=92875"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=92875"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}