Model setting and data collection
In this work, we utilized GPT-4 (version 20230613), which is trained on data compiled before September 2021, as the backend model via the Azure OpenAI API. The Azure OpenAI API is compliant with the Health Insurance Portability and Accountability Act, ensuring robust data privacy protection. To ensure stable and reproducible output, we set the temperature parameter to 0. The evaluated gene sets (Table 1) were derived from releases published after 2023 by Hu et al.16 and Joachimiak et al.22. The gene set corresponding to each ground truth in the different datasets was constructed by aggregating the genes directly annotated to that term with the genes annotated to all of its ontological descendants.
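For illustration, a deterministic GPT-4 call through the Azure OpenAI API can be configured as in the following minimal sketch (assuming the openai Python client; the endpoint, key and deployment name are placeholders, not the values used in this study):

```python
from openai import AzureOpenAI

# Placeholders: the endpoint, key and deployment name depend on the Azure resource.
client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<AZURE_OPENAI_KEY>",
    api_version="2023-05-15",
)

response = client.chat.completions.create(
    model="gpt-4",   # name of the GPT-4 (20230613) deployment
    temperature=0,   # temperature 0 for stable, reproducible output
    messages=[{"role": "user",
               "content": "Annotate the function of: TP53, BRCA1, ATM"}],
)
print(response.choices[0].message.content)
```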
Furthermore, we collected 191 gene sets from PubMed articles published between November 2021 and December 2023 to evaluate the performance of different LLMs across data sources (Supplementary Table 1). These gene sets range in size from 3 to 408 genes, with an average of 35.67. The ground truths of these gene sets were released along with the articles, and the average word count of a ground-truth name is 3.267. For model assessment, we developed the evaluation pipeline in Python (version 3.11.0) with PyTorch (version 1.13.0); other required Python packages are NumPy (version 1.26.3), Pandas (version 2.1.4) and Seaborn (version 0.13.2).
Overview of GeneAgent
GeneAgent is a language agent built on GPT-4 that autonomously interacts with domain-specific databases to annotate the functions of gene sets. It comprises four key steps: generation, self-verification, modification and summarization. Each module is triggered by a specific instruction tailored to its function (Supplementary Document 1). The goal of GeneAgent is to generate a representative biological process name (P) for a set of genes, denoted as \(D={\{{g}_{i}\}}_{i=1}^{N}\). Each gene \({g}_{i}\) in this set is identified by its unique name, and D is associated with a specific curated biological term, that is, the ground truth (G). Given a D, GeneAgent outputs a \(P\) accompanied by analytical texts (A) detailing the functions of the genes involved, which can be formally defined as GeneAgent(D) = (P, A).
Pipeline of generating prominent biological process names
The genes in the dataset D are concatenated as a comma-separated (‘,’) string and serve as the input parameter for the instruction of the generation (g) step. After the generation stage, D is assigned an initial process name \(({P}_{\rm{ini}})\) and corresponding analytical narratives \(({A}_{\rm{ini}})\), that is, \({\rm{GeneAgent}}_{g}(D)=({P}_{\rm{ini}}\,,\,{A}_{\rm{ini}})\).
Afterwards, GeneAgent generates a list of claims for \({P}_{\rm{ini}}\) by using statements such as ‘be involved in’ and ‘related to’ to form hypotheses linking the gene set to its process name. GeneAgent then activates selfVeri-Agent (Fig. 1c) to verify each claim in the list. First, selfVeri-Agent extracts all gene symbols and the process name from the claims. It then uses the gene symbols to invoke the appropriate APIs for autonomous interaction with domain-specific databases, drawing on their established knowledge to validate the accuracy of the process name. Finally, it assembles a verification report (\({{\mathcal{R}}}_{P}\)) containing the findings and a decision (that is, ‘supported’, ‘partially supported’ or ‘refuted’) for the input claim.
Next, GeneAgent initiates the modification (m) step to either revise or retain \({P}_{\rm{ini}}\) based on the findings in \({{\mathcal{R}}}_{P}\). If GeneAgent determines that \({P}_{\rm{ini}}\) should be revised, \({A}_{\rm{ini}}\) is modified accordingly, that is, \({\rm{GeneAgent}}_{m}\left({P}_{\rm{ini}},{{A}}_{\rm{ini}},{{\mathcal{R}}}_{P}\right)=({P}_{\mathrm{mod}},\,{A}_{\mathrm{mod}})\). GeneAgent then applies self-verification to \({A}_{\mathrm{mod}}\) to verify the gene functions in the explanatory analyses while checking the updated process name again. This step likewise begins by generating a list of claims for the gene names and their function names and finishes by deriving a new verification report (\({{\mathcal{R}}}_{A}\)) containing decisions of ‘supported’, ‘partially supported’ or ‘refuted’ made by the selfVeri-Agent.
Finally, based on the report \({{\mathcal{R}}}_{A}\), both \({P}_{\mathrm{mod}}\) and \({A}_{\mathrm{mod}}\) are modified according to the summarization (s) instruction to generate the final biological process name (P) and the analytical narratives (A) of gene functions, that is, \({\rm{GeneAgent}}_{s}\,({P}_{\mathrm{mod}},{A}_{\mathrm{mod}},\,{{\mathcal{R}}}_{A})=(P,A)\).
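The four steps can be summarized schematically as follows (a pseudocode sketch; the helper names and instruction strings are illustrative and do not reflect the released implementation):

```python
def llm(instruction: str, *inputs: str) -> str:
    """Placeholder for one GPT-4 call with a step-specific instruction."""
    raise NotImplementedError  # wire to the Azure OpenAI client shown earlier

def gene_agent(genes: list[str]) -> tuple[str, str]:
    """Schematic GeneAgent pipeline; instruction strings are illustrative."""
    D = ",".join(genes)                                     # comma-separated input
    P_ini = llm("generation: name the process", D)          # GeneAgent_g -> P_ini
    A_ini = llm("generation: analyze gene functions", D)    # GeneAgent_g -> A_ini
    R_P = llm("self-verification of process-name claims", P_ini, D)
    P_mod = llm("modification: revise or retain the name", P_ini, A_ini, R_P)
    A_mod = llm("modification: update the narrative", P_ini, A_ini, R_P)
    R_A = llm("self-verification of analysis claims", A_mod)
    P = llm("summarization: final process name", P_mod, A_mod, R_A)
    A = llm("summarization: final narrative", P_mod, A_mod, R_A)
    return P, A
```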
Domain-specific databases configured in the selfVeri-Agent
In the self-verification stage, we configured four Web APIs to access 18 domain databases (Fig. 3c and Supplementary Document 4).
(1) g:Profiler27 (https://biit.cs.ut.ee/gprofiler/page/apis/) is an open-source tool for gene set enrichment analysis (GSEA). In GeneAgent, we used eight of its domain-specific databases: GO, KEGG41, Reactome42, WikiPathways43, Transfac44, miRTarBase45, CORUM46 and Human Phenotype Ontology47 to perform enrichment analysis for the gene set. For each gene set, we used the g:GOSt interface to retrieve the top five enrichment terms along with their descriptions.
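For example, the g:GOSt endpoint can be queried over REST roughly as follows (a sketch based on the public g:Profiler API; the example gene set and selection of fields are illustrative):

```python
import requests

genes = ["TP53", "BRCA1", "ATM"]  # example gene set
resp = requests.post(
    "https://biit.cs.ut.ee/gprofiler/api/gost/profile/",
    json={"organism": "hsapiens", "query": genes},
)
resp.raise_for_status()
results = resp.json()["result"]
# Keep the five most significant enrichment terms with their descriptions.
top5 = sorted(results, key=lambda r: r["p_value"])[:5]
for r in top5:
    print(r["source"], r["native"], r["name"], r["p_value"])
```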
(2) Enrichr28,29 (https://maayanlab.cloud/Enrichr/help#api/) is another valuable tool for GSEA. We configured four pathway-analysis databases in the Enrichr API: KEGG_2021_Human, Reactome_2022, BioPlanet_2019 (ref. 48) and MSigDB_Hallmark_2020. In GeneAgent, we chose to return the top five standard pathway names from these databases.
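A query against a single Enrichr library might look as follows (a sketch following Enrichr's documented two-step API; only KEGG_2021_Human is shown):

```python
import requests

BASE = "https://maayanlab.cloud/Enrichr"
genes = ["TP53", "BRCA1", "ATM"]  # example gene set

# Step 1: upload the gene list and obtain a user list ID.
add = requests.post(f"{BASE}/addList",
                    files={"list": (None, "\n".join(genes)),
                           "description": (None, "example set")})
user_list_id = add.json()["userListId"]

# Step 2: enrich against one configured library, e.g. KEGG_2021_Human.
res = requests.get(f"{BASE}/enrich",
                   params={"userListId": user_list_id,
                           "backgroundType": "KEGG_2021_Human"})
# Each row: [rank, term name, p-value, z-score, combined score, genes, ...].
for row in res.json()["KEGG_2021_Human"][:5]:  # top five pathway names
    print(row[1], row[2])
```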
(3) E-utils30,31 (https://www.ncbi.nlm.nih.gov/) is an API for accessing NCBI databases containing various biological data. In GeneAgent, we augment our repository of functional information for an individual gene by invoking its Gene and PubMed databases. Different databases can be queried by setting the database parameter to Gene or PubMed in the base API.
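For instance, a gene-centric record can be retrieved with ESearch followed by ESummary (a sketch of the public E-utilities; switching the db parameter to pubmed queries PubMed instead):

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# ESearch: find the NCBI Gene ID for a human gene symbol.
search = requests.get(f"{EUTILS}/esearch.fcgi",
                      params={"db": "gene",
                              "term": "TP53[sym] AND human[orgn]",
                              "retmode": "json"})
gene_id = search.json()["esearchresult"]["idlist"][0]

# ESummary: fetch the record; setting db="pubmed" queries PubMed instead.
summary = requests.get(f"{EUTILS}/esummary.fcgi",
                       params={"db": "gene", "id": gene_id, "retmode": "json"})
doc = summary.json()["result"][gene_id]
print(doc["name"], "-", doc["description"])
```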
(4) AgentAPI is our custom API library, developed using four gene-centric databases covering gene–disease, gene–domain, protein–protein interaction (PPI) and gene–complex relationships. GeneAgent calls the appropriate database by specifying the desired interface at the end of the base API and subsequently retrieves the top ten IDs relevant to gene functions. These IDs are then used to match their names in the corresponding database.
Notably, we implemented a masking strategy for APIs and databases during the self-verification stage to ensure unbiased assessments across various gene sets. Specifically, we removed the g:Profiler API when assessing gene sets collected from the GO dataset because it can perfectly derive their ground truths by accessing the GO database. Similarly, we masked the ‘MSigDB_Hallmark_2020’ database within the Enrichr API when evaluating gene sets collected from the MSigDB database.
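In code, this masking amounts to filtering the tool and database lists by the dataset source before verification, as in the following illustrative sketch (names are hypothetical, not from the released implementation):

```python
# Illustrative: drop any tool/database that could leak a dataset's ground truth.
ALL_TOOLS = ["gProfiler", "Enrichr", "Eutils", "AgentAPI"]
MASKED_TOOLS = {"GO": {"gProfiler"}}  # g:Profiler can read GO directly
ENRICHR_DBS = ["KEGG_2021_Human", "Reactome_2022",
               "BioPlanet_2019", "MSigDB_Hallmark_2020"]

def tools_for(dataset: str) -> list[str]:
    return [t for t in ALL_TOOLS if t not in MASKED_TOOLS.get(dataset, set())]

def enrichr_dbs_for(dataset: str) -> list[str]:
    dbs = list(ENRICHR_DBS)
    if dataset == "MSigDB":
        dbs.remove("MSigDB_Hallmark_2020")  # mask the Hallmark library
    return dbs
```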
Calculation of ROUGE score
Three distinct ROUGE metrics25 are used to assess the recall of generated names relative to ground truths: ROUGE-1 and ROUGE-2, which are based on n-grams, and ROUGE-L, which utilizes the longest common subsequence (LCS). The calculation formulas are as follows:
$${\rm{ROUGE}}-{\rm{N}}=\frac{{\sum }_{S\in {\rm{ref}}}{\sum }_{{g}_{N}\in S}{\rm{count}}_{\rm{match}}\left({g}_{N}\right)}{{\sum }_{S\in {\rm{ref}}}{\sum }_{{g}_{N}\in S}{\rm{count}}\left({g}_{N}\right)},\text{N}=1,\,2$$
$$\left\{\begin{array}{c}{R}_{\rm{lcs}}=\frac{\rm{LCS}\left({\rm{ref}},{\rm{hyp}}\right)}{m}\\ {P}_{\rm{lcs}}=\frac{{\rm{LCS}}\left({\rm{ref}},{\rm{hyp}}\right)}{n}\\ {\text{ROUGE}}-{\text{L}}\,=\frac{\left(1+{\beta }^{2}\right){R}_{\rm{lcs}}{P}_{\rm{lcs}}}{{R}_{\rm{lcs}}+{\beta }^{2}{P}_{\rm{lcs}}}\end{array}\right.$$
Here, ‘ref’ denotes the reference terms and ‘hyp’ denotes the generated names; m and n are the token lengths of ‘ref’ and ‘hyp’, respectively. β is a hyper-parameter that balances recall and precision.
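These metrics can be reproduced with, for example, the rouge-score package (the specific implementation used in this work is not stated; this is one common choice):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
ref = "regulation of DNA repair"       # ground truth G
hyp = "DNA damage repair regulation"   # generated name P
scores = scorer.score(ref, hyp)        # signature: score(target, prediction)
for name, s in scores.items():
    print(name, f"recall={s.recall:.3f}", f"precision={s.precision:.3f}")
```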
Calculation of semantic similarity
After generating the biological process name (P) for the gene set D, the semantic similarity between \(P\) and its ground truth (G) is computed by MedCPT26, a state-of-the-art model for language representation in the biomedical domain. It is built on PubMedBERT49 and further trained on 255 million query–article pairs from PubMed search logs. Compared with SapBERT50 and BioBERT51, MedCPT performs better at encoding the semantics of biomedical texts.
a. Calculation of semantic similarity between P and G
First, P and G are encoded by MedCPT into embeddings, and the cosine similarity between the embeddings is calculated, yielding a score in the interval [−1, 1]. Finally, we average the similarity scores over all gene sets in each dataset to evaluate the performance of GeneAgent.
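A minimal sketch of this computation, assuming the MedCPT query encoder released on Hugging Face (ncbi/MedCPT-Query-Encoder):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
model = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")

def embed(texts: list[str]) -> torch.Tensor:
    """Encode short biomedical terms; use the [CLS] token as the embedding."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=64, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).last_hidden_state[:, 0, :]

P_vec, G_vec = embed(["DNA damage repair regulation",   # generated name P
                      "regulation of DNA repair"])      # ground truth G
similarity = torch.nn.functional.cosine_similarity(P_vec, G_vec, dim=0)
print(float(similarity))  # score in [-1, 1]
```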
b. Calculation of background semantic similarity distribution
First, P is paired with every candidate term \({G}_{i}{\mathcal{\in}}{\mathcal{Q}}\), where \({\mathcal{Q}}\) denotes 12,320 candidate terms consisting of 12,214 GO biological process terms, 50 NeST terms and 56 MSigDB terms. Then, P and \({G}_{i}\) are fed into MedCPT to obtain the embeddings \(\mathop{P}\limits^{ \rightharpoonup }\) and \(\mathop{{G}_{i}}\limits^{ \rightharpoonup }\). Afterwards, we calculated the cosine similarity for each \(\langle \mathop{P}\limits^{ \rightharpoonup },\mathop{{G}_{i}}\limits^{ \rightharpoonup }\rangle\) pair. Finally, we ranked all cosine scores from large to small and recorded the position at which the pair \(\langle P,{G}_{p}\rangle\) (\({G}_{p}\) is the ground truth for P) is located. A higher position indicates that the generated name has a higher similarity score to its ground truth than to the other candidate terms.
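Given precomputed MedCPT embeddings, the rank of the ground-truth pair in the background distribution can be computed as in the following sketch (the helper is hypothetical; embeddings are obtained as in the previous sketch):

```python
import torch

def background_rank(P_vec: torch.Tensor, Gp_vec: torch.Tensor,
                    cand_vecs: torch.Tensor) -> int:
    """Rank of cos(P, G_p) among cos(P, G_i) over all candidates G_i in Q.

    P_vec: (1, d) embedding of the generated name P.
    Gp_vec: (1, d) embedding of the ground truth G_p.
    cand_vecs: (|Q|, d) embeddings of the 12,320 candidate terms.
    """
    sims = torch.nn.functional.cosine_similarity(P_vec, cand_vecs)  # (|Q|,)
    target = torch.nn.functional.cosine_similarity(P_vec, Gp_vec)   # (1,)
    # Position of <P, G_p> when all scores are sorted from large to small.
    return int((sims > target).sum().item()) + 1
```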
Calculation of hierarchical semantic similarity
We first collected the hierarchical structures of all GO terms from the go-basic.obo file in the GO database (2023-11-15 version), yielding 1,951,615 GO term pairs across five relationships: ‘is a’, ‘part of’, ‘regulates’, ‘negatively regulates’ and ‘positively regulates’. Next, we extracted the ancestral terms for all ground truths in the evaluation datasets used in this study, limiting the distance from each ground truth to its ancestors to within three hops. Finally, we calculated the semantic similarity between each ancestral term and its corresponding ground truth to assess whether the generated names achieved a higher similarity score with ancestral terms. For our evaluation, we computed hierarchical semantic similarity exclusively for gene sets in the GO dataset.
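Ancestor extraction within three hops can be sketched with the obonet package (an assumption; the parser used in this work is not stated), which loads the OBO file as a networkx multigraph whose edges point from child to parent:

```python
import obonet

RELATIONS = {"is_a", "part_of", "regulates",
             "negatively_regulates", "positively_regulates"}

graph = obonet.read_obo("go-basic.obo")  # 2023-11-15 release

def ancestors_within(term: str, max_hops: int = 3) -> set[str]:
    """GO ancestors of `term` reachable in at most `max_hops` edges."""
    frontier, found = {term}, set()
    for _ in range(max_hops):
        nxt = set()
        for t in frontier:
            # Edges point from more specific to less specific terms.
            for _, parent, key in graph.out_edges(t, keys=True):
                if key in RELATIONS:
                    nxt.add(parent)
        found |= nxt
        frontier = nxt
    return found

print(ancestors_within("GO:0006281"))  # e.g. ancestors of 'DNA repair'
```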
Pipeline of enrichment term test using verification reports
For gene sets in MSigDB, we first collected the verification reports produced by the selfVeri-Agent of GeneAgent. Each gene set and its associated report were then used as the parameters of the instruction (Supplementary Document 1) for GPT-4, which summarizes multiple enrichment terms for the given gene set. Finally, we used exact matching to evaluate the accuracy of the tested terms summarized by GPT-4. Specifically, for each gene set in the evaluation, we first used g:Profiler to perform GSEA with a P-value threshold of 0.05 and took the resulting significant enrichment terms as the ground truth. We then counted the number of tested terms summarized by GPT-4 that correctly matched the significant enrichment terms of each gene set. A tested term is considered accurate only when all of its words exactly match one term in the ground truth.
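The exact-match criterion reduces to a whole-string comparison after basic normalization, as in this illustrative sketch (lower-casing and whitespace normalization are assumptions):

```python
def exact_match_count(tested_terms: list[str], ground_truth: list[str]) -> int:
    """Count tested terms whose words all exactly match one ground-truth term."""
    normalized = {" ".join(t.lower().split()) for t in ground_truth}
    return sum(" ".join(t.lower().split()) in normalized for t in tested_terms)

# Example: one of the two tested terms matches a significant enrichment term.
gt = ["p53 signaling pathway", "cell cycle"]
print(exact_match_count(["Cell cycle", "apoptosis"], gt))  # -> 1
```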
Human checking for the decisions of selfVeri-Agent
We randomly selected ten gene sets from NeST, comprising 132 claims, for human inspection. The verification report contains two parts: the claims, and the decisions on those claims along with supporting evidence (Supplementary Document 2). Annotators were asked to review the selfVeri-Agent decision (that is, support, partially support or refute) for each claim and judge whether the decision is correct, partially correct or incorrect, following studies of natural language inference52 and fact verification53. For each claim, the annotators made a judgment based on the assertions of gene (set) functions provided in the evidence:
a. Correct: This category applies when GeneAgent’s decision completely aligns with the evidence supporting the input claim. The decision is considered correct if it accurately reflects the documented evidence, demonstrating a clear and direct connection between the claim and the supporting data.
b. Partially correct: This category is designated when GeneAgent’s decision requires indirect reasoning or when the decision, although related, does not completely align with the direct evidence provided. This occurs when the decision is somewhat supported by the evidence but requires additional inference or context to be fully understood as supporting the input claim.
c. Incorrect: This category is used when GeneAgent’s decision either contradicts the evidence or lacks any substantiation from the verification report.
Melanoma gene sets in the preclinical study
The mouse B2905 melanoma cell line was derived from a tumor in the M4 model, in which melanoma is induced by ultraviolet irradiation of pups of hepatocyte growth factor-transgenic C57BL/6 mice54.
Specifically, 24 single cells were isolated from the parental B2905 melanoma line and expanded into individual clonal sublines (that is, C1 to C24)55. Each of these 24 sublines was subjected to whole-exome sequencing and full-transcript single-cell RNA sequencing using the Smart-seq2 protocol. The single nucleotide variants called from the exome sequencing results were used to build the tumor progression tree for all 24 sublines. Based on the in vivo growth and therapeutic responses of the sublines in the clusters, three clades were named ‘high aggressiveness and resistant (HA-R)’, ‘high aggressiveness and sensitive (HA-S)’ and ‘low aggressiveness and sensitive (LA-S)’32. Afterwards, EvoGeneX56 was applied to the single-cell RNA-sequencing data of the 24 clonal sublines, with the phylogenetic relation defined by the mutation-based tumor progression tree, to identify adaptively upregulated and downregulated genes in each of the HA-R, HA-S and LA-S clades. The adaptively upregulated and downregulated gene lists were then subjected to Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis, and the genes in the enrichments together with their enriched terms were used to test GeneAgent. In our case study, we utilized only the seven gene sets derived from the clonal sublines as the evaluation data for GeneAgent; we did not access or process any original data from clinical experiments.
Human annotation for outputs in the case study
For the assessment of the different outputs of GeneAgent and GPT-4, we established four criteria following existing studies on the evaluation of LLMs57,58:
a. Relevance: Assess whether the content about genes pertinently reflects their functions, providing value to biologists.
b. Readability: Evaluate the fluency and clarity of the writing, ensuring it is easily understandable.
c. Consistency: Determine whether the analytical narratives align consistently with the specified process name.
d. Comprehensiveness: Verify whether the outputs provide a comprehensive understanding of gene functions.
Based on these four criteria, two experts were tasked with evaluating the final responses generated by GPT-4 and GeneAgent. They performed the annotation under a blind assessment protocol, unaware of which algorithm produced each response. Their main responsibility was to annotate and compare preferences for the outputs generated by GPT-4 versus GeneAgent: they carefully reviewed and selected the more effective response, justifying their selections with relevant comments. After a comprehensive synthesis of all feedback, the two experts made a definitive judgment on which output most effectively satisfied the users’ requirements (Supplementary Document 3).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.