{"id":206473,"date":"2025-06-23T01:22:46","date_gmt":"2025-06-23T01:22:46","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/206473\/"},"modified":"2025-06-23T01:22:46","modified_gmt":"2025-06-23T01:22:46","slug":"few-shot-learning-for-phenotype-driven-diagnosis-of-patients-with-rare-genetic-diseases","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/206473\/","title":{"rendered":"Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases"},"content":{"rendered":"<p>Overview of the undiagnosed diseases network patient cohort<\/p>\n<p>We assemble a cohort of 465 patients in the Undiagnosed Diseases Network (UDN) with molecular diagnoses. Most patients are diagnosed with a single causal gene that explains their symptoms; 14 patients (3%) have two causal genes, and two patients (0.4%) have three causal genes. Most patients in the UDN receive an extensive clinical workup and whole genome or exome sequencing (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>a). Sequencing data is analyzed with the involvement of clinicians and genetic counselors to identify candidate genes that harbor variants likely to explain the patient\u2019s symptoms. Once one to five strong candidates are identified, causality is assessed by searching for genotype- and phenotype-matched individuals in human and animal databases or by introducing candidates into model organisms to determine in vivo impact<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Kobren, S. N. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet. Med. 
23, 1075&#x2013;1085 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR50\" id=\"ref-link-section-d356388952e670\" target=\"_blank\" rel=\"noopener\">50<\/a>.<\/p>\n<p><b id=\"Fig1\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 1: Overview of SHEPHERD in the rare disease diagnosis pipeline.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01749-1\/figures\/1\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig1\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/41746_2025_1749_Fig1_HTML.png\" alt=\"figure 1\" loading=\"lazy\" width=\"685\" height=\"468\"\/><\/a><\/p>\n<p><b>a<\/b> After years of failed diagnostic attempts, once a patient is accepted to the UDN, they receive a thorough clinical workup and genetic sequencing, and their case is analyzed in an iterative process to identify the candidate genes likely to explain the patient\u2019s symptoms. SHEPHERD can be used throughout the diagnostic process: after the clinical workup to find similar patients, after the sequencing analysis to identify strong candidate genes, and after the case review to further prioritize candidate genes, characterize the patient\u2019s disease, and\/or validate candidate genes by finding phenotype- and genotype-matched patients. <b>b<\/b> SHEPHERD takes in as input the patient\u2019s set of phenotype terms and leverages an external rare disease knowledge graph to perform multi-faceted rare disease diagnosis. SHEPHERD can optionally consider a list of candidate genes (either variant-filtered or expert-curated) or external patient cohort(s), depending on the prediction task of interest (e.g., causal gene discovery, patients-like-me identification). 
For simplicity, the knowledge graph is depicted using three shapes: circles as genes, squares as phenotypes, and pentagons as diseases; refer to Methods for all node types. <b>c<\/b> Number of HPO phenotype terms and candidate genes in each of the two candidate gene lists across patients in our UDN cohort. <b>d<\/b> Overlap of phenotype terms, genes, and diseases across patients. Most phenotype terms, genes, and diseases are found in only a single UDN patient. <b>e<\/b>\u2013<b>h<\/b> Number of patients in each <b>e<\/b> UDN clinical site, <b>f<\/b> age category, <b>g<\/b> primary presenting symptom, and <b>h<\/b> evaluation year. Figure adapted from images created in BioRender<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Noori, A. Created in BioRender. &#10;                  https:\/\/BioRender.com\/zkbpoj9&#10;                  &#10;                 (2025).\" href=\"#ref-CR129\" id=\"ref-link-section-d356388952e716\">129<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Noori, A. Created in BioRender. &#10;                  https:\/\/BioRender.com\/26t5d3v&#10;                  &#10;                 (2025).\" href=\"#ref-CR130\" id=\"ref-link-section-d356388952e716_1\">130<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 131\" title=\"Noori, A. Created in BioRender. 
&#010;                  https:\/\/BioRender.com\/z7vfgnl&#010;                  &#010;                 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR131\" id=\"ref-link-section-d356388952e719\" target=\"_blank\" rel=\"noopener\">131<\/a>.<\/p>\n<p>Through this diagnostic process, patients are annotated with a set of Human Phenotype Ontology (HPO) phenotype terms describing their clinical features and a set of candidate genes that may explain the patient\u2019s syndrome. Clinical experts additionally annotate diagnosed patients with an Online Mendelian Inheritance in Man (OMIM) identifier describing their disease (if available). Each patient is characterized by 23.9 HPO terms on average (SD\u2009=\u200916.1; Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>c). The candidate genes are patient-specific and include genes in which the patient has a mutation. For each patient, the diagnostic process creates two sets of candidate gene lists. The lists contain genes considered at two different phases in the UDN diagnosis pipeline (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>a): VARIANT-FILTERED, a list produced by performing initial variant-based filtering of candidate genes, and EXPERT-CURATED, a list including genes marked by clinical experts as strong candidates for the patient (Methods 3). The VARIANT-FILTERED gene lists are produced using Exomiser<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"Smedley, D. et al. Next-generation diagnostics and disease-gene discovery with the exomiser. Nat. Protoc. 
10, 2004&#x2013;2015 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR24\" id=\"ref-link-section-d356388952e740\" target=\"_blank\" rel=\"noopener\">24<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Zemojtel, T. et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci. Transl. Med. 6, 252ra123 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR51\" id=\"ref-link-section-d356388952e743\" target=\"_blank\" rel=\"noopener\">51<\/a>, a variant-based tool used in parallel to existing pipelines at three UDN sites<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Kobren, S. N. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet. Med. 23, 1075&#x2013;1085 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR50\" id=\"ref-link-section-d356388952e747\" target=\"_blank\" rel=\"noopener\">50<\/a>. The two candidate gene lists contain 244.3 and 13.3 genes on average, respectively (SD\u2009=\u2009244.0 and SD\u2009=\u20098.0; Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>c). 
Each gene list is input to SHEPHERD to predict the causal gene (i.e., the gene harboring variants that cause the patient\u2019s disease) from both a long list of candidate genes derived from automated filtering (i.e., VARIANT-FILTERED) and a short list of the strongest candidate genes that are more challenging to prioritize (i.e., EXPERT-CURATED).<\/p>\n<p>UDN patients have heterogeneous disease presentations: 378 unique genes and 299 unique diseases are represented in the cohort, and 48% of phenotype terms, 79% of genes, and 83% of diseases are represented in only a single patient (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>d). This reinforces the need for machine learning models that can learn from sparsely labeled datasets. 11.4% of patients have a diagnosis in common with at least one other patient. Patients with the same disease have only 67% of phenotype terms in common on average (SD\u2009=\u200943%), and the closest shared ancestor (i.e., lowest common ancestor) in the Human Phenotype Ontology between their phenotype terms is 2.67 hops away on average (SD\u2009=\u20090.81). In addition, 7% of patients have novel genetic diseases, and only 28% of each patient\u2019s phenotypic features have any known association with the causal gene on average (SD\u2009=\u200921%). The assembled cohort of UDN patients has been evaluated at 12 clinical sites across the United States (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>e). While 75.9% of patients are under 5 years old, patients can present to the UDN with suspected genetic diseases in their 40s or 50s (Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>f). The cohort is predominantly White (80.6%) and non-Hispanic (70.8%); smaller proportions of patients identify as Asian (9.2%), Black or African American (4.5%), or other racial and ethnic backgrounds (5.6%; Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">1a, b<\/a>). The sex distribution is relatively balanced, with 47.7% male and 52.0% female patients (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">1c<\/a>). Most patients present with neurological symptoms but can exhibit cardiac, musculoskeletal, rheumatic, and many other symptoms (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>g). Due to the lag between starting the process at the UDN and receiving the diagnosis, most patients included in the analysis were evaluated by UDN clinicians in 2016\u20132018 (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>h). The phenotypic heterogeneity and presence of novel and atypical diseases pose a challenge for diagnosis, requiring diagnostic technology that can accommodate previously unseen phenotypes, genes, and diseases and leverage knowledge beyond direct gene, phenotype, and disease associations (Supplementary Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">2<\/a>). The UDN patients represent a diverse, independent cohort used exclusively for model evaluation. Importantly, these patients are not used to train SHEPHERD.<\/p>\n<p>Overview of SHEPHERD algorithm<\/p>\n<p>SHEPHERD takes as input a patient\u2019s set of phenotype terms and candidate disease(s) or candidate gene(s) harboring causal variants, and performs multi-faceted diagnosis of the patient to identify causal genes, retrieve \u201cpatients-like-me\u201d with the same causal gene or disease, and provide interpretable characterizations of novel disease presentations (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>b). SHEPHERD can be integrated into the rare disease diagnostic workflow at multiple points: (1) to find similar patients after the patient\u2019s clinical workup, (2) to identify strong candidate causal genes after the initial sequencing analysis or in conjunction with the clinical case review, and (3) to characterize the patient\u2019s disease and find similar patients for experimental or cohort validation after candidate causal genes are identified (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>a, b).<\/p>\n<p>SHEPHERD is a few-shot geometric deep learning approach for rare disease diagnosis. Few-shot learning, which can make predictions when very few (if any) labeled data points are available, is central to rare disease diagnosis because of the low prevalence of each disease. 
SHEPHERD can provide diagnostic predictions when zero or at most a few labeled (diagnosed) patients per disease are available because it uses a biomedical knowledge graph containing gene, phenotype, and disease relationships. SHEPHERD represents each patient as a set of phenotype terms from the knowledge graph, which we refer to as a phenotype subgraph to emphasize that these terms are embedded within the graph\u2019s structure (Methods 1). It leverages a graph neural network to jointly embed each patient\u2019s phenotype subgraph and candidate genes or diseases into a latent representation space such that the generated embeddings are informed by the structure of the knowledge graph, and patients are embedded near their causal gene(s), disease(s), and other similar patients (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>a, b). Further, SHEPHERD uses an attention mechanism to aggregate each patient\u2019s phenotype terms into a patient embedding. While not intended as a clinical interpretability tool, the attention weights can be inspected post hoc to probe how the model prioritizes different phenotypic features during training and inference.<\/p>\n<p><b id=\"Fig2\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
2: SHEPHERD architecture, training, and generalizability.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01749-1\/figures\/2\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig2\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/41746_2025_1749_Fig2_HTML.png\" alt=\"figure 2\" loading=\"lazy\" width=\"685\" height=\"385\"\/><\/a><\/p>\n<p><b>a<\/b>, <b>b<\/b> SHEPHERD is trained in a two-step process. <b>a<\/b> First, the model is pretrained to embed the biomedical knowledge in the knowledge graph. <b>b<\/b> Then, the pretrained model is applied to the task of rare disease diagnosis. Patient information is overlaid on the knowledge graph, and SHEPHERD generates an embedding for the patient phenotype terms and each candidate gene, disease, or patient. The model is trained via a loss function that encourages patient embeddings to be close to the embeddings of their causal gene or disease or other patients with the same causal gene or disease. <b>c<\/b> SHEPHERD is trained on a large cohort of simulated patients (pink). It can be further trained on real-world patients (blue) and then evaluated on an independent cohort of real-world patients (green). Alternatively, SHEPHERD can directly be evaluated on real-world patients (green) without any additional training. <b>d<\/b> We leverage real patient data derived from three distinct cohorts: the Undiagnosed Diseases Network (UDN; N\u2009=\u2009465), MyGene2 (N\u2009=\u2009146), and Deciphering Developmental Disorders study (DDD; N\u2009=\u20091431). 
For simplicity, the KG is depicted using three shapes: circles as genes, squares as phenotypes, and pentagons as diseases; refer to Methods for all node types.<\/p>\n<p>SHEPHERD is trained in a two-step process to learn embeddings of biomedical concepts and patients with rare genetic diseases. First, SHEPHERD is pretrained via self-supervised learning to embed genes, phenotypes, and diseases by predicting the relationships (structure) of the biomedical knowledge graph (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>a; Methods 7). This step produces compact embeddings that can be adapted for a range of analyses and are generalizable by accounting for complementarity between diseases. Then, using the pretrained model as initialization, SHEPHERD is trained for multi-faceted diagnosis of patients with rare diseases via a novel objective function (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>b; Methods 7). We train SHEPHERD in a disease-stratified manner (i.e., in which patients with the same disease are assigned either to the training or validation set, but not both) to enable SHEPHERD to generalize to diseases unseen during training.<\/p>\n<p>Due to the scarcity of data for patients with rare diseases, we leverage simulated but realistic rare disease patients for training SHEPHERD (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>c). We train SHEPHERD on a cohort of more than 40,000 synthetic rare disease patients representing over 2000 rare diseases in Orphanet (Methods 6). 
There are 20 synthetic patients generated for each rare disease. The simulated patients were generated using an approach designed to generate realistic rare disease patients grounded in medical knowledge, and they have been shown to phenotypically and genetically resemble real-world rare disease patients<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Alsentzer, E. et al. Simulation of undiagnosed patients with novel genetic conditions. Nat. Commun. 14, 6403 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR48\" id=\"ref-link-section-d356388952e864\" target=\"_blank\" rel=\"noopener\">48<\/a>. The synthetic cohort is essential for training SHEPHERD, as it is considerably larger, more diverse, and more representative of phenotype and genotype heterogeneity than any real-world dataset of rare disease patients (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>c)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Alsentzer, E. et al. Simulation of undiagnosed patients with novel genetic conditions. Nat. Commun. 14, 6403 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR48\" id=\"ref-link-section-d356388952e871\" target=\"_blank\" rel=\"noopener\">48<\/a>. This dataset, together with knowledge-guided learning on the rare disease knowledge graph, enables deep learning for rare disease diagnosis. 
A notable byproduct of training the model on synthetic data is that SHEPHERD\u2019s model can be publicly released without the risk of exposing patient information<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. &amp; Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493&#x2013;497 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR52\" id=\"ref-link-section-d356388952e875\" target=\"_blank\" rel=\"noopener\">52<\/a>. After training, SHEPHERD can be further trained on real-world patient cohorts or leveraged directly for rare disease diagnosis.<\/p>\n<p>We leverage real patient data from three cohorts in this study (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>d): (1) the UDN patient cohort (Methods 3); (2) a cohort of 146 patients from MyGene2, an online portal through which families with rare genetic conditions can share their health information to connect with clinicians and other patients<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Genomics, U. o. W. C. f. M. MyGene2. 
&#010;                  https:\/\/mygene2.org\/MyGene2\/&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR53\" id=\"ref-link-section-d356388952e886\" target=\"_blank\" rel=\"noopener\">53<\/a> (Methods 4); (3) a cohort of 1431 patients derived from the Deciphering Developmental Disorders study, an initiative from the United Kingdom and Ireland designed to diagnose patients with undiagnosed developmental disorders<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Firth, H. V. &amp; Wright, C. F. The deciphering developmental disorders (DDD) study. Dev. Med. Child Neurol. 53, 702&#x2013;703 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR54\" id=\"ref-link-section-d356388952e890\" target=\"_blank\" rel=\"noopener\">54<\/a> (Methods 5). Results are described in the following sections.<\/p>\n<p>SHEPHERD can perform causal gene discovery<\/p>\n<p>A critical step in rare disease diagnosis is identifying the gene(s) that are strong candidates for causing the patient\u2019s syndrome (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>a). Given a patient\u2019s set of phenotype terms and a list of genes in which the patient has a mutation, SHEPHERD predicts genes that harbor variants most likely to explain the patient\u2019s presenting symptoms. SHEPHERD produces a score for each candidate gene in the patient that fuses two complementary aspects of information: an embedding-based metric that captures the global network topology and a network-based metric computed using knowledge graph distance that captures local network information (Methods 11). 
We use SHEPHERD to prioritize genes in each of the EXPERT-CURATED and VARIANT-FILTERED candidate gene lists (Methods 3). In both instances, SHEPHERD performs granular prioritization by refining lists of patients\u2019 candidate genes output by bioinformatics pipelines. For this analysis, the simulated, MyGene2, and DDD cohorts are used for training, and the UDN cohort is used for validation.<\/p>\n<p>We report SHEPHERD\u2019s performance in causal gene discovery as the average recall at k, defined as the fraction of a patient\u2019s causal genes that appear among the top k ranked genes, averaged over all patients in the cohort. On the EXPERT-CURATED gene lists, SHEPHERD ranks the causal gene first for 40% of UDN patients and achieves an average recall of 0.69 at k\u2009=\u20093 and 0.85 at k\u2009=\u20095 (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig3\" target=\"_blank\" rel=\"noopener\">3<\/a>a). On the much longer VARIANT-FILTERED gene lists, SHEPHERD achieves an average recall of 0.21, 0.38, and 0.48 for k\u2009=\u20091, 5, and 10, respectively (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig3\" target=\"_blank\" rel=\"noopener\">3<\/a>d).<\/p>\n<p><b id=\"Fig3\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
3: SHEPHERD performs generalizable causal gene discovery.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01749-1\/figures\/3\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig3\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/41746_2025_1749_Fig3_HTML.png\" alt=\"figure 3\" loading=\"lazy\" width=\"685\" height=\"398\"\/><\/a><\/p>\n<p><b>a<\/b> Performance of SHEPHERD, four domain-specific approaches, five language model, traditional machine learning, and network science baselines, and a random baseline. The performance metric is average recall at k for k\u2009=\u20091, 3, and 5. Error bars denote standard deviation over models trained with five random seeds. <b>b<\/b>, <b>c<\/b> Performance of SHEPHERD in ranking causal genes stratified by <b>b<\/b> clinical sites and <b>c<\/b> primary presenting symptoms. Each boxplot shows the median and interquartile range of the rank of the causal gene. Whiskers extend to \u00b11.5\u2009\u00d7\u2009IQR. <b>d<\/b> Performance of SHEPHERD, six domain-specific approaches, five large language model, traditional machine learning, network science baselines, and a random baseline. The performance metric is average recall at k for k\u2009=\u20091, 5, 10, 25, and 50. Error bars denote standard deviation over models trained with five random seeds. <b>e<\/b>, <b>f<\/b> Performance of SHEPHERD against domain-specific algorithms in four extremely hard-to-diagnose scenarios on <b>e<\/b> EXPERT-CURATED and <b>f<\/b> VARIANT-FILTERED gene lists. 
Shown is the win rate, the proportion of patients where SHEPHERD performs the same as or better than the benchmark algorithms.<\/p>\n<p>We find no significant difference in performance across UDN sites throughout the United States, patients with varying presenting symptoms, and the year of evaluation by UDN clinicians (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig3\" target=\"_blank\" rel=\"noopener\">3<\/a>b, c and Supplementary Figs. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">3a<\/a>, <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">4a\u2013c<\/a>) on both the EXPERT-CURATED and VARIANT-FILTERED gene lists. These results indicate that SHEPHERD can generalize across clinical sites and diseases over time. Furthermore, we find that SHEPHERD\u2019s performance does not correlate with the number of annotated phenotype terms for each patient (Spearman\u2019s \u03c1\u2009=\u20090.02 and \u03c1\u2009=\u2009\u22120.11 for EXPERT-CURATED and VARIANT-FILTERED lists, respectively; Supplementary Figs. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">3c<\/a>, <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">4e<\/a>). Finally, we evaluate SHEPHERD\u2019s performance as a function of the prevalence of the rare disease. 
We leverage the number of submissions to ClinVar as a proxy for prevalence. We find that SHEPHERD\u2019s performance does not strongly correlate with the prevalence of the genetic condition (Spearman\u2019s \u03c1\u2009=\u2009\u22120.17 and \u03c1\u2009=\u2009\u22120.16 for EXPERT-CURATED and VARIANT-FILTERED lists, respectively; Supplementary Figs. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">3d<\/a>, <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">4f<\/a>). SHEPHERD\u2019s ability to generalize represents an important capability because rare disease patients are heterogeneous, and developing separate predictive models that perform well for each patient subgroup is not feasible due to the low prevalence of the disorders.<\/p>\n<p>We evaluate SHEPHERD against 12 baseline approaches (Methods 21). We select a network science algorithm and two supervised machine learning approaches as benchmarks to quantify the advantages of SHEPHERD\u2019s graph neural network approach. We also identify six domain-specific algorithms developed for causal gene discovery that leverage information theory (Phrank<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 16\" title=\"Jagadeesh, K. A. et al. Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. Genet. Med. 
21, 464&#x2013;470 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR16\" id=\"ref-link-section-d356388952e1033\" target=\"_blank\" rel=\"noopener\">16<\/a>, PhenIX<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"Smedley, D. et al. Next-generation diagnostics and disease-gene discovery with the exomiser. Nat. Protoc. 10, 2004&#x2013;2015 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR24\" id=\"ref-link-section-d356388952e1037\" target=\"_blank\" rel=\"noopener\">24<\/a>, and ERIC<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 25\" title=\"Li, Q., Zhao, K., Bustamante, C. D., Ma, X. &amp; Wong, W. H. Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis. Genet. Med. 21, 2126&#x2013;2134 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR25\" id=\"ref-link-section-d356388952e1041\" target=\"_blank\" rel=\"noopener\">25<\/a>), likelihood ratios (LIRICAL<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 21\" title=\"Robinson, P. N. et al. Interpretable clinical genomics with a likelihood ratio paradigm. Am. J. Hum. Genet. 107, 403&#x2013;417 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR21\" id=\"ref-link-section-d356388952e1045\" target=\"_blank\" rel=\"noopener\">21<\/a>), shallow graph embeddings (CADA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 19\" title=\"Peng, C. et al. CADA: phenotype-driven gene prioritization based on a case-enriched knowledge graph. NAR Genom. Bioinform. 
3, lqab078 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR19\" id=\"ref-link-section-d356388952e1049\" target=\"_blank\" rel=\"noopener\">19<\/a>), and information-theoretic and random walk methods (HiPhive<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"Smedley, D. et al. Next-generation diagnostics and disease-gene discovery with the exomiser. Nat. Protoc. 10, 2004&#x2013;2015 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR24\" id=\"ref-link-section-d356388952e1054\" target=\"_blank\" rel=\"noopener\">24<\/a>). We further evaluate SHEPHERD against two large language models (LlaMa 3.1 8B and 70B<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Dubey, A. et al. The llama 3 herd of models. Preprint at arXiv:2407.21783 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR55\" id=\"ref-link-section-d356388952e1058\" target=\"_blank\" rel=\"noopener\">55<\/a>; Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">8<\/a>). SHEPHERD performs comparably or significantly better than all benchmarking approaches on the EXPERT-CURATED and VARIANT-FILTERED gene lists (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig3\" target=\"_blank\" rel=\"noopener\">3<\/a>a, d and Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">8<\/a>). 
SHEPHERD outperforms the strongest domain-specific algorithms, LIRICAL and HiPhive, in prioritizing causal genes overall on both EXPERT-CURATED (p value\u2009=\u20094.27\u2009\u00d7\u200910<sup>\u22122<\/sup> for LIRICAL) and VARIANT-FILTERED (p value\u2009=\u20092.05\u2009\u00d7\u200910<sup>\u22124<\/sup> for LIRICAL and p value\u2009=\u20092.70\u2009\u00d7\u200910<sup>\u22125<\/sup> for HiPhive) gene lists (Wilcoxon signed rank-sum test). SHEPHERD significantly outperforms the other domain-specific approaches in retrieving the causal gene first by up to 24.4% (p value\u2009=\u20094.92\u2009\u00d7\u200910<sup>\u221216<\/sup>) and 7.7% (p value\u2009=\u20091.55\u2009\u00d7\u200910<sup>\u22123<\/sup>) of patients on the EXPERT-CURATED and VARIANT-FILTERED gene lists, respectively (McNemar\u2019s test). Furthermore, SHEPHERD significantly outperforms large language models in retrieving the causal gene first by up to 20.1% (p value\u2009=\u20091.42\u2009\u00d7\u200910<sup>\u22129<\/sup>) and 7.9% (p value\u2009=\u20096.85\u2009\u00d7\u200910<sup>\u22123<\/sup>) of patients on the EXPERT-CURATED and VARIANT-FILTERED gene lists, respectively, and the other machine learning approaches by up to 29.0% (p value\u2009=\u20091.73\u2009\u00d7\u200910<sup>\u221217<\/sup>) and 20.4% (p value\u2009=\u20091.44\u2009\u00d7\u200910<sup>\u221215<\/sup>) of patients, respectively (McNemar\u2019s test). For these statistical tests, we apply the Benjamini\u2013Hochberg procedure for multiple testing correction.<\/p>\n<p>SHEPHERD\u2019s strong performance demonstrates that SHEPHERD can complement existing variant-based approaches for gene prioritization while leveraging the extensive knowledge sources of gene-phenotype associations. Using SHEPHERD, rare disease experts would need to evaluate 1026 genes from the EXPERT-CURATED lists or 18,005 genes from the VARIANT-FILTERED lists to arrive at the causal gene for all 465 UDN patients. 
In contrast, with non-guided ranking, experts would need to evaluate a total of 2231 EXPERT-CURATED genes or 27,727 VARIANT-FILTERED genes, suggesting that SHEPHERD has the potential to improve diagnostic efficiency 2.2-fold and 1.5-fold, respectively. Compared to the best domain-specific approaches, LIRICAL and HiPhive, SHEPHERD reduces the number of genes that experts need to consider by 97 (8.6%) and 5495 (23.3%) on the EXPERT-CURATED and VARIANT-FILTERED gene lists, respectively (LIRICAL), and by 1878 (9.4%) on the VARIANT-FILTERED gene list (HiPhive).<\/p>\n<p>SHEPHERD can diagnose patients with atypical and novel genetic diseases<\/p>\n<p>Patients in the UDN have atypical or novel disease presentations, which makes them challenging to diagnose because there are no direct associations between patients\u2019 genes, symptoms, and the correct diagnosis. Because patients\u2019 phenotypic features are not directly linked to the correct diagnosis (causal genes), a lookup against medical knowledge bases is ineffective for diagnosis. We find that SHEPHERD can identify the causal gene even when the patient\u2019s presenting phenotypic abnormalities are multiple hops away from the gene causing the disease in the knowledge graph. For 77.8% of patients whose phenotype terms are far away from their causal genes in the knowledge graph (i.e., more than two hops away), SHEPHERD identifies the correct causal gene among its top five predictions from the EXPERT-CURATED gene list. No strong correlation exists between SHEPHERD\u2019s performance and the distance between the patient\u2019s phenotype terms and causal gene (Supplementary Figs. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">3b<\/a>, <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">4d<\/a>; R<sup>2<\/sup>\u2009=\u20090.102, Spearman\u2019s \u03c1\u2009=\u20090.37 and R<sup>2<\/sup>\u2009=\u20090.0004, Spearman\u2019s \u03c1\u2009=\u20090.12 for the EXPERT-CURATED and VARIANT-FILTERED gene lists, respectively).<\/p>\n<p>We evaluate SHEPHERD against the domain-specific models in four hard-to-diagnose scenarios (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig3\" target=\"_blank\" rel=\"noopener\">3<\/a>e). We identify patients from the UDN whose causal genes lack known associations with phenotype terms or diseases in the literature (based on our rare disease knowledge graph) and who have been identified by UDN experts as having novel disease genes or novel diseases. SHEPHERD achieves win rates (i.e., ranks the causal gene the same as or higher than a baseline) of up to 82% and 83% for patients whose causal genes have no known phenotype or disease associations, respectively, on the EXPERT-CURATED gene lists. On the VARIANT-FILTERED gene lists, the win rates are up to 80% and 74%, respectively. For patients designated by UDN experts as having a novel disease or a novel disease gene, SHEPHERD achieves win rates of up to 67% and 83%, respectively, on the EXPERT-CURATED gene lists, and up to 86% on the VARIANT-FILTERED gene lists. The only subset of patients for which a baseline performs slightly better than SHEPHERD consists of patients with novel disease genes, according to human experts in the UDN. 
In all other scenarios, SHEPHERD outperforms all baseline approaches, demonstrating SHEPHERD\u2019s ability to diagnose patients with atypical and novel genetic diseases.<\/p>\n<p>We further demonstrate the use of SHEPHERD for patients diagnosed with an atypical presentation of a known disease or a novel syndrome through two case studies on patients from the UDN. Patient UDN-P1 (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig4\" target=\"_blank\" rel=\"noopener\">4<\/a>a; <a href=\"https:\/\/huggingface.co\/spaces\/emilyalsentzer\/SHEPHERD\" target=\"_blank\" rel=\"noopener\">SHEPHERD Tool<\/a>, Tab 1, Patient UDN-P1) received a diagnosis for POLR3-related leukodystrophy three years after acceptance into the UDN. While the involvement of gene POLR3A with leukodystrophy (MIM:607694) is known, the patient\u2019s case was challenging due to her atypical clinical presentation. Several of her presenting clinical features, including lack of tear production, premature adrenarche, laryngeal cleft, hearing loss, and high blood pressure, are not typical of leukodystrophy. Further, only 28.3% (13 out of 46) of the patient\u2019s phenotype terms are directly linked to POLR3A in the knowledge graph, and the patient phenotype terms are 1.98 hops away from the causal gene in the knowledge graph on average. The POLR3A gene is associated with five other diseases, and 93.7% (192 out of 205) of phenotype terms directly linked to POLR3A are not found in the patient, further complicating the diagnosis. Despite this atypical disease presentation, SHEPHERD identifies the patient\u2019s causal gene in the top 1 out of 17 and 86 candidate genes in the EXPERT-CURATED and VARIANT-FILTERED gene lists, respectively. 
Strikingly, SHEPHERD can disambiguate diseases by optimally up- and down-weighting phenotypic features using an attention mechanism, and correctly down-weights phenotypic features that are atypical of leukodystrophy.<\/p>\n<p><b id=\"Fig4\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 4: Causal gene discovery case studies for patients with novel genetic conditions.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01749-1\/figures\/4\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig4\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/41746_2025_1749_Fig4_HTML.png\" alt=\"figure 4\" loading=\"lazy\" width=\"685\" height=\"447\"\/><\/a><\/p>\n<p>SHEPHERD identifies the causal gene even in atypical or novel disease presentations. Each patient case study, shown in (<b>a<\/b>, <b>b<\/b>), includes the subset of the knowledge graph containing all nodes in the shortest path between the patient\u2019s phenotype terms, causal gene, and disease; a table of the patient\u2019s phenotype terms and attention weights learned by SHEPHERD; and bar plots of scores SHEPHERD assigned to each candidate gene in the EXPERT-CURATED and VARIANT-FILTERED lists. The top and bottom five ranked genes in the VARIANT-FILTERED list are shown. The causal gene is highlighted in orange. The direct phenotypic neighbors of the causal gene are emphasized. In patient UDN-P1&#8217;s network, the patient\u2019s causal gene is directly connected to the disease in the knowledge graph. In patient UDN-P2&#8217;s network, there is no disease node because the patient has a novel, uncharacterized syndrome. 
All panels, except those labeled as a \u201cpatient card&#8221; (colored box with the information provided by the UDN), depict SHEPHERD&#8217;s predictions or analyses performed on outputs of SHEPHERD.<\/p>\n<p>SHEPHERD can also identify strong candidate genes for patients with novel, uncharacterized syndromes. Patient UDN-P2 (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig4\" target=\"_blank\" rel=\"noopener\">4<\/a>b; <a href=\"https:\/\/huggingface.co\/spaces\/emilyalsentzer\/SHEPHERD\" target=\"_blank\" rel=\"noopener\">SHEPHERD Tool<\/a> Tab 1, Patient UDN-P2) was accepted into the UDN with congenital hypotonia and developmental delay. While no diagnosis was identified in the primary genomic and clinical evaluation, the patient was diagnosed three years later with a novel PRKAR1B-related neurodevelopmental disorder. The PRKAR1B gene is not associated with known diseases. None of the 21 phenotype terms directly linked to PRKAR1B are found in the patient, and the average shortest path length from the patient\u2019s phenotype terms to the causal gene is 2.4. Nevertheless, SHEPHERD identifies the suspected causal gene among the top 3 in the EXPERT-CURATED candidate list and the top 4 in the VARIANT-FILTERED candidate list, illustrating how SHEPHERD can assist in recognizing novel genetic diseases.<\/p>\n<p>SHEPHERD learns meaningful patient representations that capture patient similarity<\/p>\n<p>Another critical consideration for rare disease diagnosis is finding patients that share the same disease or causal gene, commonly referred to as \u201cpatients-like-me\u201d<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Wicks, P. et al. Sharing health data for better outcomes on patientslikeme. J. Med. Internet Res. 12, e19. 
(2010).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR56\" id=\"ref-link-section-d356388952e1241\" target=\"_blank\" rel=\"noopener\">56<\/a> (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>a). Starting from a set of patient phenotype terms, SHEPHERD flags other patients in the cohort with similar genetic diseases suitable for follow-up diagnostic analysis. Concretely, SHEPHERD finds similar patients through a deep embedding scorer optimized to represent patients with the same causal genes or disease as nearby points in the embedding space. For this analysis, we leverage patients from three cohorts: the simulated cohort is used for training, and the UDN and MyGene2 cohorts are used for validation.<\/p>\n<p>SHEPHERD represents each patient as a point in the embedding space colored by the disease category of their diagnosed disease (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig5\" target=\"_blank\" rel=\"noopener\">5<\/a>). The categories correspond to the 33 disease categories outlined in Orphanet (Methods 2). Robust clustering of patients by disease area (AMI\u2009=\u20090.304; p value 5).<\/p>\n<p><b id=\"Fig5\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
5: SHEPHERD identifies patients-like-me from UDN and MyGene2 cohorts.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01749-1\/figures\/5\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig5\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/41746_2025_1749_Fig5_HTML.png\" alt=\"figure 5\" loading=\"lazy\" width=\"685\" height=\"701\"\/><\/a><\/p>\n<p><b>a<\/b> Performance of SHEPHERD in retrieving MyGene2 patients with the same causal gene as a UDN patient (n\u2009=\u200975 UDN patients with at least one matching patient in the MyGene2 cohort). SHEPHERD is benchmarked against Phrank, a domain-specific algorithm. The performance metric is average recall at k for k\u2009=\u20091, 5, 10, 25, and 50. <b>b<\/b> Heatmap of the average distance between the phenotype embeddings of pairs of patients across disease categories. Darker colors indicate smaller distances and lighter colors indicate larger distances between patients of each pair of disease categories. <b>c<\/b> Two-dimensional UMAP plot of SHEPHERD&#8217;s embedding space of all simulated (circle), UDN (up-facing triangle), and MyGene2 (down-facing triangle) patients colored by their Orphanet disease category. Each of the four case studies consists of a zoomed-in UMAP displaying the query patient (star) and all patients with the same causal gene as the query (colored circles) and a table containing information regarding the top five most similar patients retrieved by SHEPHERD. Patients are bolded in the table if they share the same causal gene. 
All panels, except those labeled as a \u201cpatient card\u201d (colored box with the information provided by the UDN), depict SHEPHERD&#8217;s predictions or analyses performed on outputs of SHEPHERD.<\/p>\n<p>To further evaluate patient embeddings, we compare embedding distances between patients diagnosed with either the same or different disease (i.e., comparing diagonal vs. off-diagonal entries, Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig5\" target=\"_blank\" rel=\"noopener\">5<\/a>b). We find that distances between patients of the same category are significantly smaller than between patients of different categories (p value 5b and Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">7<\/a>). For example, patients with neoplastic diseases and gastroenterologic diseases cluster together. Similarly, patients with hematologic and hepatic diseases and patients with odontologic and renal diseases cluster together in the embedding space. These clusters represent real co-occurrences of symptoms in disease presentations. For instance, patients with odontologic diseases, atypical dentin dysplasia, and orofaciodigital syndrome I, have both orofacial and renal disease presentations. Atypical dentin dysplasia is caused by a mutation in SMOC2, a matricellular protein involved in both craniofacial development and kidney fibrosis<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Gerarduzzi, C. et al. Silencing SMOC2 ameliorates kidney fibrosis by inhibiting fibroblast to myofibroblast transformation. 
JCI Insight 2, e90299 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR57\" id=\"ref-link-section-d356388952e1316\" target=\"_blank\" rel=\"noopener\">57<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Morkmued, S. et al. Deficiency of the SMOC2 matricellular protein impairs bone healing and produces age-dependent bone loss. Sci. Rep.10, 14817 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR58\" id=\"ref-link-section-d356388952e1319\" target=\"_blank\" rel=\"noopener\">58<\/a>. Orofaciodigital syndrome I is caused by a mutation in OFD1, which is involved in organogenesis and plays a vital role in the normal growth of orofacial and kidney tissues<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Romio, L. et al. OFD1, the gene mutated in oral-facial-digital syndrome type 1, is expressed in the metanephros and in human embryonic renal mesenchymal cells. J. Am. Soc. Nephrol. 14, 680&#x2013;689. (2003).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR59\" id=\"ref-link-section-d356388952e1326\" target=\"_blank\" rel=\"noopener\">59<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 60\" title=\"Saal, S. et al. Renal insufficiency, a frequent complication with age in oral-facial-digital syndrome type I. Clin. Genet. 77, 258&#x2013;265 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR60\" id=\"ref-link-section-d356388952e1329\" target=\"_blank\" rel=\"noopener\">60<\/a>. 
These relationships reflect that diseases often involve multiple organ systems and indicate that the embedding space can capture the relationship between patients with similar symptoms even when their diagnoses differ.<\/p>\n<p>SHEPHERD can identify \u201cpatients-like-me\u201d with similar genetic diseases<\/p>\n<p>We next examine SHEPHERD\u2019s ability to identify \u201cpatients-like-me\u201d from a large cohort of rare disease patients. We either rank all simulated, UDN, and MyGene2 patients (UDN-P3 and UDN-P4 cases) or all UDN and MyGene2 patients (UDN-P5 and UDN-P6 cases; Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig5\" target=\"_blank\" rel=\"noopener\">5<\/a>c; <a href=\"https:\/\/huggingface.co\/spaces\/emilyalsentzer\/SHEPHERD\" target=\"_blank\" rel=\"noopener\">SHEPHERD Tool<\/a> Tab 2) to identify patients most similar to the query UDN patient. We locate each query patient and all similar patients with the same causal gene in SHEPHERD\u2019s embedding space, and find that patients with the same causal gene are embedded nearby. In all four patient cases, SHEPHERD retrieves patients with the same causal gene and disease as the query patient among the top five predictions. Patients ranked above the patient with the same causal gene have very similar disease presentations to the query patient. For UDN-P4 and UDN-P5, the patients have a variant of the same disease caused by a different gene (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig5\" target=\"_blank\" rel=\"noopener\">5<\/a>c). For UDN-P6, patients with Coffin-Siris syndrome 8 (ranked first) and GATAD2B-associated syndrome (ranked second) both exhibit impaired intellectual development, hypotonia, feeding difficulties, and hypertelorism, among other phenotypic abnormalities. 
For UDN-P3, patients with X-linked intellectual disability due to GRIA3 (ranked first) and Coffin-Lowry syndrome (ranked second) share impaired intellectual development, seizures, scoliosis, and other phenotypic abnormalities.<\/p>\n<p>The most similar patients identified by SHEPHERD do not necessarily have the most phenotype terms in common with the query patient. This reflects SHEPHERD\u2019s ability to capture phenotypic similarity rather than just calculating a direct overlap in phenotype terms, as is typical of some information-theoretic approaches used in practice. In particular, patients who share the same causal gene have two to four phenotype terms in common. Only 10.0%, 9.0%, 26.6%, and 7.7% of the phenotype terms found in query patients UDN-P3, UDN-P4, UDN-P5, and UDN-P6, respectively, are also found in the most similar genotype-matched individual. In contrast, patients with the most phenotype terms in common with the query are ranked at positions 366, 463, 41, and 16, respectively. For example, one patient shares ten phenotype terms with UDN-P6, which is 38.5% of UDN-P6\u2019s phenotypes, yet has a different causal gene and is ranked 16th. This capability of SHEPHERD to consider indirect, deep associations between genes and phenotypic features makes SHEPHERD highly complementary to graph-theoretic techniques and statistical tests that can only score direct associations, which can be ineffective for poorly characterized diseases.<\/p>\n<p>We next quantify SHEPHERD\u2019s ability to identify \u201cpatients-like-me\u201d for each UDN patient from all patients in the real-world MyGene2 cohort. As before, we evaluate the average recall at k, here defined as the number of MyGene2 patients with the same causal gene as the query correctly predicted in the top-k ranked patients on average for all UDN patients in the cohort. 
SHEPHERD ranks a patient with the same causal gene first in 11.5% of UDN patients, achieving a recall of 0.31, 0.43, 0.49, and 0.53 for k\u2009=\u20095, 10, 25, and 50, respectively (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig5\" target=\"_blank\" rel=\"noopener\">5<\/a>a). We compare SHEPHERD to Phrank, an alternative approach that can calculate phenotypic similarity. Phrank uses information theory to calculate the similarity between two sets of phenotype terms based on shared ancestors in the Human Phenotype Ontology. We find that SHEPHERD performs significantly better than Phrank in identifying \u201cpatients-like-me\u201d (Mann\u2013Whitney p value\u2009=\u20090.04). SHEPHERD ranks a patient with the same causal gene first for 7.4% more patients and reduces the number of patients that clinicians need to consider by 703 (17.2%) compared to Phrank.<\/p>\n<p>Finally, we evaluate whether SHEPHERD embeds patients with the same disease (rather than gene) closer to each other than to patients with different diseases. Again, we compare UDN patients to MyGene2 patients. We find that embedding distances between patients diagnosed with the same disease are significantly smaller compared to patients with different diseases (p value\u2009=\u20092.42\u2009\u00d7\u200910\u22128; Kolmogorov\u2013Smirnov test; Supplementary Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#MOESM1\" target=\"_blank\" rel=\"noopener\">6<\/a>), further strengthening the evidence that SHEPHERD can capture similarities between different diseases with similar presenting symptoms, but can nevertheless differentiate patients that have the same diagnosed disease.<\/p>\n<p>SHEPHERD provides an interpretable characterization of novel diseases<\/p>\n<p>In addition to supporting causal gene discovery and patients-like-me identification, SHEPHERD can help characterize novel clinical presentations through our current knowledge of rare diseases (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>a). Given a patient\u2019s set of HPO phenotype terms, SHEPHERD provides an interpretable summary of the patient\u2019s disease based on its similarity to each disease in the KG. SHEPHERD produces a ranked list of all diseases using the embedding similarity between each disease and the patient\u2019s phenotype terms, which are then summarized to generate a distribution of similarities to disease categories. More concretely, SHEPHERD learns an embedding space in which the similarity between a patient and a disease is inversely proportional to the embedding distance between the patient and their diagnosed disease (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig6\" target=\"_blank\" rel=\"noopener\">6<\/a>a). Aggregating SHEPHERD-generated similarities of individual diseases by their disease category enables interpretable characterization of the patient\u2019s disease. 
For example, a patient\u2019s presenting syndrome may be w1% similar to rare neurologic diseases, w2% similar to rare bone diseases, w3% similar to rare developmental defects during embryogenesis, etc. SHEPHERD can leverage gene-phenotype-disease associations to generate granular descriptions of a patient\u2019s disease. For this analysis, we leverage patients from two cohorts: the simulated cohort is used for training, and the UDN cohort is used for validation.<\/p>\n<p><b id=\"Fig6\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 6: SHEPHERD performs novel disease characterization.<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41746-025-01749-1\/figures\/6\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig6\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/06\/41746_2025_1749_Fig6_HTML.png\" alt=\"figure 6\" loading=\"lazy\" width=\"685\" height=\"634\"\/><\/a><\/p>\n<p><b>a<\/b> Bar plots of the similarity between UDN patients and diseases found in each disease category. We group UDN patients by the disease category of their true disease and show plots for all categories with at least five patients. The bars that do not correspond to the disease category of each patient\u2019s true disease are colored gray. 
<b>b<\/b> The column for each of the four case studies contains: the percent similarity distributions of the patient\u2019s phenotype terms to diseases in each disease category based on a phenotype search via the KG (top) or SHEPHERD (bottom), a table of the five most similar diseases according to SHEPHERD, and a table of the patient\u2019s five phenotypic features that are most highly attended by SHEPHERD.<\/p>\n<p>We observe that SHEPHERD learns to embed patients near diseases of the same category; on average, 45.7% of the top ten ranked diseases with a known disease category belong to the same category as the patient\u2019s disease, which is nearly three times the random expectation (16.4%). To evaluate SHEPHERD\u2019s ability to provide interpretable disease names for patients with known rare diseases, we first calculate the similarity between UDN patients and all diseases. This allows us to assess whether the patients are most similar to diseases that share the same disease category as the patient\u2019s disease (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig6\" target=\"_blank\" rel=\"noopener\">6<\/a>a). Concretely, we stratify patients by their primary disease category and calculate the average similarity of each patient to all disease nodes under each disease category. As expected, we find that patients tend to be most similar to diseases of the same disease category as their own. For example, patients with a rare bone disease are predicted to be most similar to diseases under the category of rare bone disease (13.0% similarity), followed by rare developmental defects during embryogenesis (10.2%), rare inborn errors of metabolism (9.6%), and rare odontology diseases (8.2%). 
Similarly, patients with a disease categorized as a rare developmental defect during embryogenesis, a rare inborn error of metabolism, or a rare neurologic disease tend to be most similar to other diseases of the same category.<\/p>\n<p>We examine two patients in depth to interrogate SHEPHERD\u2019s predictive capabilities for characterizing known rare diseases: UDN-P7 and UDN-P8. Patient UDN-P7 (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig6\" target=\"_blank\" rel=\"noopener\">6<\/a>b; <a href=\"https:\/\/huggingface.co\/spaces\/emilyalsentzer\/SHEPHERD\" target=\"_blank\" rel=\"noopener\">SHEPHERD Tool<\/a> Tab 3, Patient UDN-P7) received a diagnosis for limb-girdle muscular dystrophy 3 (sarcoglycanopathy; MIM:608099) due to variants in SGCA. SHEPHERD compares the patient\u2019s clinical presentation to diseases across 19 disease categories and finds that the patient is most similar to rare neurologic diseases, as expected. From SHEPHERD\u2019s predictions, two of the top five most similar diseases are other types of AR limb-girdle muscular dystrophy, and all five are related to muscular dystrophy. We compare SHEPHERD to a simple phenotypic search of the patient\u2019s HPO terms to generate a distribution of similarities to disease categories. This phenotype search approach can correctly identify the patient\u2019s disease as a rare neurologic disease, but cannot produce disease-level rankings. Patient UDN-P8 (Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig6\" target=\"_blank\" rel=\"noopener\">6<\/a>b; <a href=\"https:\/\/huggingface.co\/spaces\/emilyalsentzer\/SHEPHERD\" target=\"_blank\" rel=\"noopener\">SHEPHERD Tool<\/a> Tab 3, Patient UDN-P8) was diagnosed four years after acceptance to the UDN with the bone disease spondyloepimetaphyseal dysplasia caused by a mutation in RPL13. Again, SHEPHERD can ascertain that the patient\u2019s symptoms are similar to other bone diseases; all of the top five ranked disorders are rare bone diseases with overlapping phenotype terms found in the query patient. In contrast, the phenotype search approach does not identify UDN-P8\u2019s disease as a rare bone disease; rather, it predicts that the patient has a disease due to a rare developmental defect during embryogenesis. These findings on our case studies of two patients with known rare diseases suggest that SHEPHERD can produce correct and granular hypotheses about a patient\u2019s rare disorder.<\/p>\n<p>We also investigate SHEPHERD\u2019s hypotheses for two patients with novel genetic diseases, UDN-P9 and UDN-P10. UDN-P9 (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig6\" target=\"_blank\" rel=\"noopener\">6<\/a>b; <a href=\"https:\/\/huggingface.co\/spaces\/emilyalsentzer\/SHEPHERD\" target=\"_blank\" rel=\"noopener\">SHEPHERD Tool<\/a> Tab 3, Patient UDN-P9) was diagnosed with ATP5PO-related Leigh syndrome caused by a novel mutation in ATP5PO, a gene previously unassociated with any disease<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 61\" title=\"Ganapathi, M. et al. 
A homozygous splice variant in ATP5PO, disrupts mitochondrial complex V function and causes Leigh syndrome in two unrelated families. J. Inherit. Metab. Dis. 45, 996&#x2013;1012 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR61\" id=\"ref-link-section-d356388952e1499\" target=\"_blank\" rel=\"noopener\">61<\/a>. As Leigh syndrome is a metabolic disorder with neuropathological features, SHEPHERD correctly identifies UDN-P9\u2019s disease as most similar to diseases under the categories of rare inborn errors of metabolism and rare neurologic diseases. In contrast, the phenotype search method incorrectly predicts a tie between a disorder due to a rare inborn error of metabolism and a rare neoplastic disease, failing to label the patient\u2019s disease as a neurological disorder. Three of the top five diseases\u2014combined oxidative phosphorylation deficiency 39 (MIM:618397; ranked by SHEPHERD as #1), pyruvate dehydrogenase E3-binding protein deficiency (MIM:245349; ranked by SHEPHERD as #3), and combined oxidative phosphorylation defect type 26 (MIM:616672; ranked by SHEPHERD as #5)\u2014are mitochondrial diseases that affect the same pathway as ATP5PO and result in a defect in aerobic energy production. These diseases\u2019 causal genes co-localize with ATP5PO<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Chen, H., Morris, M. A., Rossier, C., Blouin, J.-L. &amp; Antonarakis, S. E. Cloning of the cDNA for the human ATP synthase OSCP subunit (ATP50) by exon trapping and mapping to chromosome 21q22.1-q22.2. Genomics 28, 470&#x2013;476 (1995).\" href=\"#ref-CR62\" id=\"ref-link-section-d356388952e1509\">62<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Aggeler, R. et al. 
A functionally active human F1F0 ATPase can be purified by immunocapture from heart tissue and fibroblast cell lines: subunit structure and activity studies. J. Biol. Chem. 277, 33906&#x2013;33912 (2002).\" href=\"#ref-CR63\" id=\"ref-link-section-d356388952e1509_1\">63<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Brautigam, C. A., Wynn, R. M., Chuang, J. L. &amp; Chuang, D. T. Subunit and catalytic component stoichiometries of an in vitro reconstituted human pyruvate dehydrogenase complex. J. Biol. Chem. 284, 13086&#x2013;13098 (2009).\" href=\"#ref-CR64\" id=\"ref-link-section-d356388952e1509_2\">64<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 65\" title=\"Jiang, Y. et al. Component co-expression and purification of recombinant human pyruvate dehydrogenase complex from baculovirus infected SF9 cells. Protein Expr. Purif. 97, 9&#x2013;16 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR65\" id=\"ref-link-section-d356388952e1512\" target=\"_blank\" rel=\"noopener\">65<\/a>. Combined oxidative phosphorylation deficiency 39 and combined oxidative phosphorylation defect type 26 are associated with neurological presentations of mitochondrial disease, including hypotonia, seizures, and features of Leigh syndrome<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 66\" title=\"Glasgow, R. I. et al. Novel GFM2 variants associated with early-onset neurological presentations of mitochondrial disease and impaired expression of oxphos subunits. Neurogenetics 18, 227&#x2013;235 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR66\" id=\"ref-link-section-d356388952e1516\" target=\"_blank\" rel=\"noopener\">66<\/a>. 
The remaining two most similar diseases (ranked by SHEPHERD as #2 and #4) are rare neurologic diseases with phenotype terms identical to UDN-P9\u2019s. The causal gene, CNP, for the second-ranked disease, hypomyelinating leukodystrophy-20 (MIM:619071), is three hops away from ATP5PO in the physical protein interaction network<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 67\" title=\"Warde-Farley, D. et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 38, W214&#x2013;W220 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR67\" id=\"ref-link-section-d356388952e1527\" target=\"_blank\" rel=\"noopener\">67<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 68\" title=\"Franz, M. et al. GeneMANIA update 2018. Nucleic Acids Res. 46, W60&#x2013;W64 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR68\" id=\"ref-link-section-d356388952e1530\" target=\"_blank\" rel=\"noopener\">68<\/a>, suggesting that the two genes may be functionally related<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Ispolatov, I., Yuryev, A., Mazo, I. &amp; Maslov, S. Binding properties and evolution of homodimers in protein&#x2013;protein interaction networks. Nucleic Acids Res. 33, 3629&#x2013;3635 (2005).\" href=\"#ref-CR69\" id=\"ref-link-section-d356388952e1534\">69<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Keskin, O., Tuncbag, N. &amp; Gursoy, A. Predicting protein&#x2013;protein interactions from the molecular to the proteome level. Chem. Rev. 
116, 4884&#x2013;4909 (2016).\" href=\"#ref-CR70\" id=\"ref-link-section-d356388952e1534_1\">70<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 71\" title=\"Zitnik, M., Sosi&#x10D;, R., Feldman, M. W. &amp; Leskovec, J. Evolution of resilience in protein interactomes across the tree of life. Proc. Natl Acad. Sci. USA 116, 4426&#x2013;4433 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR71\" id=\"ref-link-section-d356388952e1537\" target=\"_blank\" rel=\"noopener\">71<\/a> or operate together<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 72\" title=\"Westermarck, J., Ivaska, J. &amp; Corthals, G. L. Identification of protein interactions involved in cellular signaling. Mol. Cell. Proteomics 12, 1752&#x2013;1763 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR72\" id=\"ref-link-section-d356388952e1541\" target=\"_blank\" rel=\"noopener\">72<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 73\" title=\"Luck, K. et al. A reference map of the human binary protein interactome. Nature 580, 402&#x2013;408 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#ref-CR73\" id=\"ref-link-section-d356388952e1544\" target=\"_blank\" rel=\"noopener\">73<\/a> to mediate phenotypic features associated with UDN-P9\u2019s disease and hypomyelinating leukodystrophy-20.<\/p>\n<p>Patient UDN-P10 (Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41746-025-01749-1#Fig6\" target=\"_blank\" rel=\"noopener\">6<\/a>b; <a href=\"https:\/\/huggingface.co\/spaces\/emilyalsentzer\/SHEPHERD\" target=\"_blank\" rel=\"noopener\">SHEPHERD Tool<\/a> Tab 3, Patient UDN-P10) is characterized by SHEPHERD as most similar to diseases under the categories of rare inborn errors of metabolism, rare hepatic disease, rare gastroenterological disease, and rare endocrine disease. These top categories are aligned with many of the patient\u2019s symptoms, particularly duodenal atresia, intestinal malrotation, pancreatic exocrine insufficiency, liver disease, and developmental delay. In contrast, the phenotype search approach predicts that the patient\u2019s disease is most similar to diseases due to rare developmental defects during embryogenesis. Three of the top five most similar individual diseases from SHEPHERD\u2019s outputs\u2014methylmalonic acidemia with homocystinuria type cblF (MIM:277380; ranked by SHEPHERD as #1), neonatal hemochromatosis (MIM:231100; ranked by SHEPHERD as #2), and ALG8-CDG (MIM:608104; ranked by SHEPHERD as #4)\u2014are also due to inborn errors of metabolism and are associated with phenotypes similar to those seen in the patient, including abnormalities in liver and gastrointestinal function and developmental delay. Notably, the rare respiratory disease category is the third lowest-ranked category. UDN clinicians hypothesized that the patient\u2019s GLYR1 variants cause a mislocalization of the cystic fibrosis transmembrane conductance regulator (CFTR), which is associated with cystic fibrosis. While the patient has gastrointestinal and pancreatic symptoms similar to those in cystic fibrosis, the patient does not have any of the pulmonary features classic for that condition. 
Such granularity in SHEPHERD\u2019s predictions is a reflection of SHEPHERD\u2019s ability to differentiate between diseases despite partially overlapping phenotypes and causal genes sharing the same pathway.<\/p>\n","protected":false},"excerpt":{"rendered":"Overview of the undiagnosed diseases network patient cohort We assemble a cohort of 465 patients in the Undiagnosed&hellip;\n","protected":false},"author":2,"featured_media":206474,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3846],"tags":[3967,12848,8668,7371,21371,3968,267,3941,3690,1096,20181,70,20774,16,15],"class_list":{"0":"post-206473","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-genetics","8":"tag-biomedicine","9":"tag-biotechnology","10":"tag-computer-science","11":"tag-diagnosis","12":"tag-diseases","13":"tag-general","14":"tag-genetics","15":"tag-health-care","16":"tag-machine-learning","17":"tag-medical-research","18":"tag-medicine-public-health","19":"tag-science","20":"tag-translational-research","21":"tag-uk","22":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/114730064659551709","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/206473","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=206473"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/206473\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/206474"}],"wp:atta
chment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=206473"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=206473"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=206473"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}