{"id":765718,"date":"2026-05-01T10:21:24","date_gmt":"2026-05-01T10:21:24","guid":{"rendered":"https:\/\/www.europesays.com\/us\/765718\/"},"modified":"2026-05-01T10:21:24","modified_gmt":"2026-05-01T10:21:24","slug":"genetic-association-and-machine-learning-improve-the-prediction-of-type-1-diabetes-risk","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/us\/765718\/","title":{"rendered":"Genetic association and machine learning improve the prediction of type 1 diabetes risk"},"content":{"rendered":"<p>Ethics statement<\/p>\n<p>The use of human genetic data in this study was approved by the University of California, San Diego, Institutional Review Board.<\/p>\n<p>Genome-wide association and genotype imputation<\/p>\n<p>For the MHC analysis, we compiled genotype data from 10,107 T1D and 19,639 nondiabetic individuals of European ancestry across five cohorts (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>). Cohorts were selected based on the availability of genome-wide genotyping array data for imputation into the TOPMed r3 panel<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"Hu, X. et al. Additive and interaction effects at three amino acid positions in HLA-DQ and HLA-DR molecules drive type 1 diabetes risk. Nat. Genet. 47, 898&#x2013;905 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR24\" id=\"ref-link-section-d123039683e2631\" rel=\"nofollow noopener\" target=\"_blank\">24<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 37\" title=\"Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. 
Nature 590, 290&#x2013;299 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR37\" id=\"ref-link-section-d123039683e2634\" rel=\"nofollow noopener\" target=\"_blank\">37<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 38\" title=\"McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279&#x2013;1283 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR38\" id=\"ref-link-section-d123039683e2637\" rel=\"nofollow noopener\" target=\"_blank\">38<\/a> and the MHC locus using the Michigan HLA reference panel<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 39\" title=\"Luo, Y. et al. A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat. Genet. 53, 1504&#x2013;1516 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR39\" id=\"ref-link-section-d123039683e2641\" rel=\"nofollow noopener\" target=\"_blank\">39<\/a> and several cohorts (GENIE ROI, CSGNM) were further excluded due to lower quality imputation at this locus. T1D cases were matched to nondiabetic individuals by ancestry and, where possible, genotype array<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 9\" title=\"Chiou, J. et al. Interpreting type 1 diabetes risk with genetics and single-cell epigenomics. Nature 594, 398&#x2013;402 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR9\" id=\"ref-link-section-d123039683e2645\" rel=\"nofollow noopener\" target=\"_blank\">9<\/a>. 
We performed quality control on variants using the HRC imputation preparation program (v4.2.9, <a href=\"https:\/\/www.well.ox.ac.uk\/~wrayner\/tools\/\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/www.well.ox.ac.uk\/~wrayner\/tools\/<\/a>) and PLINK (v1.9) to remove variants with MAF\u2009&lt;\u20091%, missing genotypes &gt;5%, violation of Hardy-Weinberg equilibrium (HWE, P\u2009&lt;\u20091\u2009\u00d7\u200910\u22125 in unaffected cohorts; HWE, P\u2009&lt;\u20091\u2009\u00d7\u200910\u221210 in case cohorts), allele ambiguity or a difference in allele frequency &gt;0.2 compared to the HRC r1.1 reference panel<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 38\" title=\"McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279&#x2013;1283 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR38\" id=\"ref-link-section-d123039683e2667\" rel=\"nofollow noopener\" target=\"_blank\">38<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 40\" title=\"Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559&#x2013;575 (2007).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR40\" id=\"ref-link-section-d123039683e2670\" rel=\"nofollow noopener\" target=\"_blank\">40<\/a>. We imputed 55,615 variants from the four-digit multi-ethnic HLA reference panel (v1). We retained variants with imputation accuracy (r2)\u2009&gt;\u20090.5 and a standard deviation in nondiabetic allele frequency &lt;\u20090.055 for association testing<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 41\" title=\"Das, S. et al. 
Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284&#x2013;1287 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR41\" id=\"ref-link-section-d123039683e2679\" rel=\"nofollow noopener\" target=\"_blank\">41<\/a>.<\/p>\n<p>To examine genetic risk outside the MHC, we compiled association data from 20,355 T1D cases and 797,363 nondiabetic European ancestry individuals, matched by ancestry and genotype array where possible (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">11<\/a>). For the FinnGen cohort, we downloaded the summary statistics from the r10 version of \u2018T1D_Early\u2019, which includes 2,832 individuals diagnosed with T1D under the age of 20 years and excludes individuals with T2D.<\/p>\n<p>We used genotyping data for 263 individuals from nPOD, including 115 T1D cases and 148 individuals without T1D<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 42\" title=\"Campbell-Thompson, M. et al. Network for Pancreatic Organ Donors with Diabetes (nPOD): developing a tissue biobank for type 1 diabetes. Diabetes Metab. Res. Rev. 28, 608&#x2013;617 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR42\" id=\"ref-link-section-d123039683e2692\" rel=\"nofollow noopener\" target=\"_blank\">42<\/a>. Additionally, we used 1,999 T2D individuals from the WTCCC1 cohort<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 43\" title=\"Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. 
Nature 447, 661&#x2013;678 (2007).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR43\" id=\"ref-link-section-d123039683e2696\" rel=\"nofollow noopener\" target=\"_blank\">43<\/a>. To examine the predictive ability of T1GRS in individuals of African ancestry, we used 284 T1D individuals from SEARCH and 404 nondisease individuals from CLEAR<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"SEARCH Study Group. SEARCH for Diabetes in Youth: a multicenter study of the prevalence, incidence and classification of diabetes mellitus in youth. Control. Clin. Trials 25, 458&#x2013;471 (2004).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR44\" id=\"ref-link-section-d123039683e2700\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Danila, M. I. et al. Dense genotyping of immune-related regions identifies loci for rheumatoid arthritis risk and damage in African Americans. Mol. Med. 23, 177&#x2013;187 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR45\" id=\"ref-link-section-d123039683e2703\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>. For all cohorts, we performed variant quality control as described above before imputation using the Michigan HLA and TOPMed r3 panels.<\/p>\n<p>We considered participants with whole-genome sequencing and EHR data from the AoURP Controlled Tier Dataset v7 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 46\" title=\"The All of Us Research Program Investigators et al. The &#x2018;All of Us&#x2019; Research Program. N. Engl. J. Med. 
381, 668&#x2013;676 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR46\" id=\"ref-link-section-d123039683e2710\" rel=\"nofollow noopener\" target=\"_blank\">46<\/a>). T1D and nondiabetic individuals were identified using a combination of diagnosis codes (ICD-9\/ICD-10) and drug exposures extracted from the EHR data. Briefly, a participant was classified as having T1D if they met all of the following criteria: (1) an ICD-9\/ICD-10 diagnosis code for T1D on at least three visits or self-reported T1D on the enrollment survey; (2) three or more instances of a recorded insulin prescription; (3) no diagnosis of cystic fibrosis, secondary diabetes mellitus or drug- or chemical-induced diabetes mellitus; and (4) no prescription of any other oral or injectable hypoglycemic agents. A participant was identified as nondiabetic if they did not have any of the following: (1) an ICD-9\/ICD-10 code corresponding to T1D\/T2D or self-reported T1D, or (2) an insulin prescription. Diagnosis codes used for classification were as follows: T1D\u2014ICD-9-CM 250.x1, 250.x3; ICD-10-CM E10.x; T2D\u2014ICD-9-CM 250.x0, 250.x2; ICD-10-CM E11.x; cystic fibrosis\u2014ICD-9-CM 277.00\u2013277.09; ICD-10-CM E84.x; secondary diabetes mellitus\u2014ICD-9-CM 249.x and ICD-10-CM E08.x, E13.x; and drug-induced or chemical-induced diabetes mellitus\u2014ICD-10-CM E09.x.<\/p>\n<p>For insulin drug exposures, we included all medications classified under the ATC code A10A, including fast-acting, intermediate-acting, long-acting, combination and inhaled insulins. Noninsulin hypoglycemic agents were defined by ATC code A10B and include biguanides, sulfonylureas, \u03b1-glucosidase inhibitors, thiazolidinediones, dipeptidyl peptidase-4 inhibitors, glucagon-like peptide 1 analogs, sodium-glucose cotransporter 2 inhibitors, meglitinides and their combinations.<\/p>\n<p>Complications were defined by the presence of an ICD-9\/ICD-10 code related to the complication. 
Cardiovascular complications were defined as myocardial infarction, cerebral infarction, heart failure or major vessel atherosclerosis. Nephropathy was defined as a diagnostic code of diabetes and any severity of renal disease, or a separate ICD-9\/ICD-10 code for proteinuria. Neuropathy was defined as any diagnostic code listing diabetes and any severity of neurologic complication. The codes used to define each complication, including retinopathy, were as follows: cardiovascular\u2014ICD-9-CM 410.7, 410.1x, 410.4x, 410.6x, 410.9x, 410.9, 428.xx, 434.x1, 437.0, 440.0, 440.1, 440.8; ICD-10-CM I21.x, I50.x, I63.x, I67.2, I70.0, I70.1, I70.8; nephropathy\u2014ICD-9-CM 249.4x, 250.4x, 791.0; ICD-10-CM E08.2x, E09.2x, E10.2x, E11.2x, E13.2x, R80.x, N06.x; neuropathy\u2014ICD-9-CM 250.6x, 249.6x; ICD-10-CM E08.4x, E09.4x, E10.4x, E11.4x, E13.4x; and retinopathy\u2014ICD-9-CM 362.0x and 250.5x; ICD-10-CM E08.3x, E09.3x, E10.3x, E11.3x, E13.3x.<\/p>\n<p>Details of the generation and quality control of the genomic data are provided in the AoURP Genomic Quality Report, release C2022Q4R9. Briefly, we used computed genetic ancestries provided by AoU to identify participants of European ancestry. Genomic quality control and imputation were performed as described above for research participants.<\/p>\n<p>Association testing and meta-analysis<\/p>\n<p>We used the Firth bias-corrected LogReg in EPACTS (v3.3.0, <a href=\"https:\/\/genome.sph.umich.edu\/wiki\/EPACTS\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/genome.sph.umich.edu\/wiki\/EPACTS<\/a>) to test variants for association with MAF\u2009&gt;\u20091% at the MHC locus and MAF\u2009&gt;\u20090.1% genome-wide, including the first four genotype PCs and sex as covariates. In the MHC locus, we tested variants for association across a 4 Mb locus on chromosome 6 spanning 30\u201334\u2009Mb (hg19). 
For both MHC-specific and genome-wide association analyses, we combined summary statistics from all tested cohorts using a fixed-effects inverse variance-weighted meta-analysis. For new loci, we considered variants with P\u2009&lt;\u20091\u2009\u00d7\u200910\u22128, identified as a more appropriate significance threshold for MAF\u2009&gt;\u20090.1% variants<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 47\" title=\"Fadista, J., Manning, A. K., Florez, J. C. &amp; Groop, L. The (in)famous GWAS P-value threshold revisited and updated for low-frequency variants. Eur. J. Hum. Genet. 24, 1202&#x2013;1205 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR47\" id=\"ref-link-section-d123039683e2744\" rel=\"nofollow noopener\" target=\"_blank\">47<\/a>. For replication analyses, we updated the meta-analysis by replacing the summary statistics for the \u2018T1D_EARLY\u2019 phenotype in FinnGen with those from the more recent r12 version, which includes an additional 219 T1D cases and 409,349 nondiabetic individuals. From the meta-analysis, we estimated the degree of confounding using LD score regression<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 
47, 291&#x2013;295 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR48\" id=\"ref-link-section-d123039683e2748\" rel=\"nofollow noopener\" target=\"_blank\">48<\/a> by calculating the deviation of the intercept from one (0.0844), which supported a minimal effect of residual population structure on the results.<\/p>\n<p>Conditional analysis of independent signals<\/p>\n<p>To identify independent signals at the MHC locus, we performed stepwise conditional analyses by including the most significant variant from each meta-analysis as a covariate in the association tests for each cohort, followed by repeating the meta-analysis. We repeated this process by iteratively adding each new variant to the model until no variants remained significant at P\u2009&lt;\u20095\u2009\u00d7\u200910\u22128, where this threshold was selected based on testing variants with MAF\u2009&gt;\u20091%. We also performed a \u2018preconditional\u2019 analysis to examine the effect of signals outside of 70 established class I and II HLA risk alleles and 20 DR3\/DR4 pairwise HLA haplotypic and allelic interactions<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"Hu, X. et al. Additive and interaction effects at three amino acid positions in HLA-DQ and HLA-DR molecules drive type 1 diabetes risk. Nat. Genet. 47, 898&#x2013;905 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR24\" id=\"ref-link-section-d123039683e2765\" rel=\"nofollow noopener\" target=\"_blank\">24<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Noble, J. A. et al. HLA class I and genetic susceptibility to type 1 diabetes: results from the Type 1 Diabetes Genetics Consortium. 
Diabetes 59, 2972&#x2013;2979 (2010).\" href=\"#ref-CR49\" id=\"ref-link-section-d123039683e2768\">49<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Varney, M. D. et al.HLA DPA1, DPB1 alleles and haplotypes contribute to the risk associated with type 1 diabetes: analysis of the type 1 diabetes genetics consortium families. Diabetes 59, 2055&#x2013;2062 (2010).\" href=\"#ref-CR50\" id=\"ref-link-section-d123039683e2768_1\">50<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Erlich, H. et al. HLA DR-DQ haplotypes and genotypes and type 1 diabetes risk: analysis of the type 1 diabetes genetics consortium families. Diabetes 57, 1084&#x2013;1092 (2008).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR51\" id=\"ref-link-section-d123039683e2771\" rel=\"nofollow noopener\" target=\"_blank\">51<\/a>. We added 90 covariates into the model to capture the additive effect of each alternate allele across 70 HLA risk alleles, and a binary column for each of the 20 interactions, before performing association testing and meta-analysis. We then performed stepwise conditional analyses by iteratively adding the most significant variant as an additional covariate in the model and reperforming the meta-analysis until no variants remained significant at P\u2009&lt;\u20095\u2009\u00d7\u200910\u22128.<\/p>\n<p>Credible set generation for signals in the MHC<\/p>\n<p>For all independent signals identified at the MHC locus through stepwise conditional analysis, we generated 95% credible sets. 
We first identified variants in linkage disequilibrium with the lead variant for each signal (r2\u2009&gt;\u20090.1) and we calculated the Bayes Factor for each variant based on the effect and standard error as described by Wakefield<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Wakefield, J. Bayes factors for genome-wide association studies: comparison with P-values. Genet. Epidemiol. 33, 79&#x2013;86 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR52\" id=\"ref-link-section-d123039683e2793\" rel=\"nofollow noopener\" target=\"_blank\">52<\/a>. We generated PIP scores by dividing the Bayes Factor by the total sum of the Bayes Factors for all variants in the set. We included variants up to the 95% threshold in each credible set.<\/p>\n<p>SuSiE fine-mapping of the non-MHC loci<\/p>\n<p>We used SuSiE (v0.11.42) to fine-map loci identified through two rounds of variant clumping using PLINK (v1.9; \u2018--clump-p1 5e-8 --clump-p2 0.05 --clump-r2 0.1 --clump-kb 10000\u2019; \u2018--clump-p1 5e-8 --clump-r2 0 --clump-kb 500\u2019). We generated loci 500\u2009kb around the variants in each clumped region, considering all variants regardless of MAF. We identified 87 significant loci and included 10 additional known loci not reaching genome-wide significance (CDKN1C, CYP27B1, LMO7, CCR7, 17q24, ACOXL, CCR5, IRF2, TAGAP, 6q27) for a total of 97 loci. We created 95% credible sets for each locus in SuSiE using genotypes from six dbGAP cohorts, including 32,518 individuals (DCCT, GENIE ROI, GENIE UK, GoKIND, T1DGC and WTCCC1) to define the LD matrix and set parameters to \u2018L\u2009=\u200910, coverage\u2009=\u20090.95, min_abs_corr\u2009=\u20090.01, max_iter\u2009=\u200950,000\u2019. 
For complex loci with multiple signals identified in a previous study (DLK1, IFIH1, TYK2, IL10, PTPN2, AIRE, UBASH3A, CTLA4, IL2RA), we recomputed the meta-analysis using only the six dbGAP cohorts with genotype data included in the LD matrix above<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 9\" title=\"Chiou, J. et al. Interpreting type 1 diabetes risk with genetics and single-cell epigenomics. Nature 594, 398&#x2013;402 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR9\" id=\"ref-link-section-d123039683e2862\" rel=\"nofollow noopener\" target=\"_blank\">9<\/a>. Lead variants were defined as the variant with the largest PIP for the signal. We defined new loci as variants that reached genome-wide significance and mapped &gt;\u2009500\u2009kb from other known loci.<\/p>\n<p>Annotations of credible sets<\/p>\n<p>We leveraged genomic datasets to examine preferential TF binding to annotate new credible sets. We overlapped all credible set variants with accessible chromatin peaks in 46 immune and 12 pancreatic cell types<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Calderon, D. et al. Landscape of stimulation-responsive chromatin across diverse human immune cells. Nat. Genet. 51, 1494&#x2013;1505 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR26\" id=\"ref-link-section-d123039683e2874\" rel=\"nofollow noopener\" target=\"_blank\">26<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. 
Cell 184, 5985&#x2013;6001 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR53\" id=\"ref-link-section-d123039683e2877\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a>. We tested each variant for preferential allelic binding to TF motifs using FIMO (v4.12.0)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Grant, C. E., Bailey, T. L. &amp; Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017&#x2013;1018 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR54\" id=\"ref-link-section-d123039683e2881\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a>. We also leveraged databases such as GTEx and JASPAR to annotate credible set variants<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318&#x2013;1330 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR55\" id=\"ref-link-section-d123039683e2885\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Castro-Mondragon, J. A. et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 
50, D165&#x2013;D173 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR56\" id=\"ref-link-section-d123039683e2888\" rel=\"nofollow noopener\" target=\"_blank\">56<\/a>.<\/p>\n<p>Constructing a nonlinear machine learning polygenic risk score<\/p>\n<p>We leveraged 199 variants, including the lead variants from 27 MHC signals, 70 established HLA-associated alleles and 102 non-MHC lead variants (including five putative loci). We developed two models based on the CatBoost classifier framework (v1.0.6), one with the 199 variants alone (\u2018T1GRS-var\u2019 model) and another with additional covariates of sex, PC1-4 and binary covariates for each cohort (\u2018T1GRS-cov\u2019 model). This approach generates a probability ranging from 0 to 1 that represents the model\u2019s confidence that an individual has T1D, which can be treated as a GRS. We refer to this probability as a \u2018score\u2019 to avoid confusion that it represents the actual probability of T1D. The discovery dataset, comprising five cohorts with 10,107 T1D and 19,639 unaffected individuals, was combined into a single genotype matrix, which was randomly split into ten folds for cross-validation. Across ten iterations, a model was trained on 90% of the data and evaluated on the remaining 10% (as outlined previously<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Dutschmann, T.-M., Kinzel, L., ter Laak, A. &amp; Baumann, K. Large-scale evaluation of k-fold cross-validation ensembles for uncertainty estimation. J. Cheminform. 15, 49 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR57\" id=\"ref-link-section-d123039683e2900\" rel=\"nofollow noopener\" target=\"_blank\">57<\/a>). Hyperparameters were determined by exhaustive grid search on the first cross-validation fold of the discovery dataset. 
Briefly, we used a binary CatBoostClassifier with 254 estimators, a depth of 5, a learning rate of 0.12 and a gradient boosting method. Specific hyperparameter settings are available at <a href=\"https:\/\/github.com\/Gaulton-Lab\/t1d-grs-analysis-catboost\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/github.com\/Gaulton-Lab\/t1d-grs-analysis-catboost<\/a>.<\/p>\n<p>The probability scores for individuals in each testing fold were recorded and used to calculate the overall AUC of the model. This process was identical in the T1GRS-cov and T1GRS-var models, including all variant, MHC-only and non-MHC submodels. A representative model for each evaluation was trained on all individuals for validation purposes. Independent validation was performed on the NIH AoU Research Cohort containing 234 T1D and 78,658 nondiabetic individuals and the nPOD cohort, comprising 115 T1D and 148 nondiabetic individuals. A standard random seed was set to ensure reproducibility and a frozen model with identical hyperparameters was used for every validation.<\/p>\n<p>Generation of a LogReg comparison model<\/p>\n<p>As a comparison to our T1GRS model, we also built a LogReg classifier. This model learns a single weight for each feature and performs a linear combination to predict the probability of T1D. The model outputs a probability score between 0 and 1. To prevent overfitting and allow the model to learn weaker non-MHC-based predictors of T1D, we applied strong L2 regularization (C\u2009=\u20090.001). 
The LogReg model was trained and evaluated using the same tenfold cross-validation format as all T1GRS models, using the same seed and training data.<\/p>\n<p>Feature importance and interaction analysis<\/p>\n<p>Feature importance and interaction within nonlinear models were calculated using the SHAP machine learning interpretability suite (v0.41.0)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Lundberg, S. &amp; Lee, S.-I. A unified approach to interpreting model predictions. Preprint at &#010;                https:\/\/arxiv.org\/abs\/1705.07874&#010;                &#010;               (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR58\" id=\"ref-link-section-d123039683e2933\" rel=\"nofollow noopener\" target=\"_blank\">58<\/a>. SHAP is an approach to explain the output of any machine learning model based on cooperative game theory and the concept of Shapley values. SHAP values assign each feature an importance value for a particular prediction in the context of a specific model. The magnitude of feature importance is determined by the mean absolute value of all SHAP values for a given feature. SHAP values also capture nonlinear interactions between features on a per-individual basis and enable ranking pairwise feature interactions by magnitude. Each model was run through the standard SHAP pipeline and feature importance was recorded. Feature interaction analysis was performed using the shap_interaction_values function. To identify significant feature interactions, we converted all pairwise interaction values to z scores and calculated P values for each interaction using a z test. 
We then applied FDR correction to P values.<\/p>\n<p>Complexity scores<\/p>\n<p>To understand how complex a model\u2019s decision is for an individual\u2019s disease classification, we developed a \u2018complexity\u2019 score, which is the total displacement of SHAP values for an individual, calculated by summing the absolute values of each feature, resulting in a single score per individual. A lower complexity score suggests a more straightforward classification, for example, driven by one or a few highly influential factors. For example, an individual with a strong MHC signal as the primary driver for a positive classification and minimal influence from other non-MHC features would likely exhibit a low score. Conversely, a higher complexity score indicates that many features, each contributing a smaller amount, collectively lead to the disease outcome. Individuals were assigned to deciles based on their complexity scores.<\/p>\n<p>Defining genetic subtypes and validation<\/p>\n<p>Individual-specific SHAP feature contribution vectors from the T1GRS-var model formed the basis of this analysis for both discovery and validation cohorts. All analyses were made reproducible by using a fixed random seed. Starting with the discovery cohort, high-dimensional SHAP vectors were first reduced using PCA, retaining up to 175 principal components. We then employed high-dimensional clustering in ScanPy (v1.8.2)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Wolf, F. A., Angerer, P. &amp; Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR59\" id=\"ref-link-section-d123039683e2966\" rel=\"nofollow noopener\" target=\"_blank\">59<\/a> as follows: in PCA space, a k-nearest neighbor graph was constructed (using 120 neighbors). 
This graph was then used to generate a two-dimensional UMAP embedding, with a \u2018min_dist\u2019 parameter of 0.25. Subgroups of individuals within the discovery cohort were then identified by applying the Leiden community detection algorithm to the kNN graph, using a resolution of 0.05. Next, we used the ScanPy ingest workflow to project the validation cohort onto our existing clusters. To assign validation individuals to the clusters, PCA representations were projected into the discovery cohort\u2019s UMAP space and then assigned to clusters using the same Leiden algorithm.<\/p>\n<p>Analysis of age of onset, clinical complications and T2D loci in T1D subtypes<\/p>\n<p>To identify differences in age of onset between clusters, we performed a log-rank test and considered significant differences at P\u2009&lt;\u20090.05. To identify differences in clinical complications between clusters, the OR was calculated in the discovery and AoU datasets and P values were calculated using a Cox proportional hazards test. To determine enrichment of T2D loci within clusters, we defined T1GRS variants in LD (r2\u2009&gt;\u20090.2) with reported T2D variants<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 60\" title=\"Suzuki, K. et al. Genetic drivers of heterogeneity in type 2 diabetes pathophysiology. Nature 627, 347&#x2013;357 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR60\" id=\"ref-link-section-d123039683e2991\" rel=\"nofollow noopener\" target=\"_blank\">60<\/a>. 
We performed LogReg for each cluster using mean-normalized SHAP values for each locus as the predictor and T2D association as the binary outcome.<\/p>\n<p>Analysis of T1D GRS<\/p>\n<p>We calculated GRS2 using 60 exact TOPMed variants, 2 exact Michigan HLA variants (rs116522341 and <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs1281934\" rel=\"nofollow noopener\" target=\"_blank\">rs1281934<\/a>), and the proxy variants DQB1*06:02, B*18:01, DPB1*03:01, <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs1611547\" rel=\"nofollow noopener\" target=\"_blank\">rs1611547<\/a> and rs114170382 from Michigan HLA for <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs17843689\" rel=\"nofollow noopener\" target=\"_blank\">rs17843689<\/a>, rs371250843, rs559242105, rs144530872 and rs149663102, respectively. In GRS2, we excluded individuals with more than two HLA-DR\/DQ proxy SNPs according to the published methods<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 16\" title=\"Sharp, S. A. et al. Development and standardization of an improved type 1 diabetes genetic risk score for use in newborn screening and incident diagnosis. Diabetes Care 42, 200&#x2013;207 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR16\" id=\"ref-link-section-d123039683e3037\" rel=\"nofollow noopener\" target=\"_blank\">16<\/a>. Within each GRS, we examined the total GRS and its MHC and non-MHC components. We calculated the AUC of the receiver operating characteristic curve to assess the discriminative power of each GRS for T1D. We then tested the differences between AUCs using the DeLong test. First, we compared T1GRS and GRS2 in individuals with T1D and those without using both the \u2018T1GRS-cov\u2019 and the \u2018T1GRS-var\u2019 models. 
Next, we validated T1GRS using individuals in the nPOD biorepository to differentiate between T1D and nondiabetes and using the T1D from nPOD and 1,999 T2D from WTCCC1 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 43\" title=\"Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661&#x2013;678 (2007).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR43\" id=\"ref-link-section-d123039683e3041\" rel=\"nofollow noopener\" target=\"_blank\">43<\/a>).<\/p>\n<p>We calculated a published African ancestry risk score in 284 T1D and 404 nondiabetic individuals from SEARCH and CLEAR to compare with T1GRS<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"SEARCH Study Group. SEARCH for Diabetes in Youth: a multicenter study of the prevalence, incidence and classification of diabetes mellitus in youth. Control. Clin. Trials 25, 458&#x2013;471 (2004).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR44\" id=\"ref-link-section-d123039683e3048\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Danila, M. I. et al. Dense genotyping of immune-related regions identifies loci for rheumatoid arthritis risk and damage in African Americans. Mol. Med. 23, 177&#x2013;187 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR45\" id=\"ref-link-section-d123039683e3051\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>. 
We used TOPMed to impute <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs34850435\" rel=\"nofollow noopener\" target=\"_blank\">rs34850435<\/a>, <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs9271594\" rel=\"nofollow noopener\" target=\"_blank\">rs9271594<\/a>, <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs9273363\" rel=\"nofollow noopener\" target=\"_blank\">rs9273363<\/a>, <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs2290400\" rel=\"nofollow noopener\" target=\"_blank\">rs2290400<\/a> and rs689, while Michigan HLA was used to impute <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs2187668\" rel=\"nofollow noopener\" target=\"_blank\">rs2187668<\/a>. The variant <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs9268838\" rel=\"nofollow noopener\" target=\"_blank\">rs9268838<\/a> was used as a proxy for <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/snp\/?term=rs34303755\" rel=\"nofollow noopener\" target=\"_blank\">rs34303755<\/a> (r2\u2009=\u20090.849, D\u2032\u2009=\u20091.0 in African ancestry).<\/p>\n<p>Lastly, we generated a scale for T1GRS scores using the number of individuals with T1D at various percentiles and calculated a diagnostic for each GRS value using the Youden index (sensitivity\u2009+\u2009specificity\u2009\u2212\u20091). We calculated sensitivity at each GRS score on the scale as TP\/(TP\u2009+\u2009FN) and specificity as TN\/(TN\u2009+\u2009FP)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 61\" title=\"Florkowski, C. M. Sensitivity, specificity, receiver-operating characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests. Clin. Biochem. Rev. 29, S83&#x2013;S87 (2008).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR61\" id=\"ref-link-section-d123039683e3115\" rel=\"nofollow noopener\" target=\"_blank\">61<\/a>. 
We defined DR3\/DR4 individuals using four-digit HLA alleles imputed from the T1DGC reference panel using SNP2HLA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 32\" title=\"McGrail, C. et al. Genetic discovery and risk prediction for type 1 diabetes in individuals without high-risk HLA-DR3\/DR4 haplotypes. Diabetes Care 48, 202&#x2013;211 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR32\" id=\"ref-link-section-d123039683e3119\" rel=\"nofollow noopener\" target=\"_blank\">32<\/a>. DR3 status was classified as HLA-DRB1*03:01\u2013DQB1*02:01, while DR4 as HLA-DRB1*04:01\/02\/04\/05\/08\u2013DQB1*03:02\/04\/02:02 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 33\" title=\"Inshaw, J. R. J., Cutler, A. J., Crouch, D. J. M., Wicker, L. S. &amp; Todd, J. A. Genetic variants predisposing most strongly to type 1 diabetes diagnosed under age 7 years lie near candidate genes that function in the immune system and in pancreatic &#x3B2;-cells. Diabetes Care 43, 169&#x2013;177 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR33\" id=\"ref-link-section-d123039683e3155\" rel=\"nofollow noopener\" target=\"_blank\">33<\/a>).<\/p>\n<p>Variant category classification<\/p>\n<p>To assign variants used in T1GRS to cell types, we first intersected credible sets with the cell-type atlas of cis-regulatory elements (CATlas)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. 
Cell 184, 5985&#x2013;6001 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR53\" id=\"ref-link-section-d123039683e3170\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a> for pancreatic and immune cell types. For loci that did not intersect these cis-regulatory elements, we determined credible set intersection with gene bodies in GENCODE GRCh38.p14 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 62\" title=\"Frankish, A. et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 51, D942&#x2013;D949 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#ref-CR62\" id=\"ref-link-section-d123039683e3177\" rel=\"nofollow noopener\" target=\"_blank\">62<\/a>) and annotated loci with cell types based on gene expression patterns. Lastly, for several loci previously linked to specific genes, we annotated loci to cell types based on the expression patterns of these genes. The full list of links between loci in T1GRS and cell types, as well as the references for the links, is provided in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">18<\/a>.<\/p>\n<p>Intersection of cluster loci and cell-type regulatory elements<\/p>\n<p>To identify the top loci for each cluster, SHAP values were averaged across all individuals within the cluster. For each T1GRS variant, SHAP values were normalized across clusters to sum to 1. We identified loci for each cluster with a normalized SHAP value greater than 0.75. Using the top loci for each cluster, credible sets for these loci were intersected with regulatory elements for 12 immune and pancreatic endocrine cell types derived from ENCODE. 
For each cluster, the posterior probability of association for all variants intersecting a regulatory element in a cell type was summed. Permutation testing was performed by shuffling the cumulative probability of associations across all cell types 10,000 times for each cluster and calculating a P value from this null distribution.<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-026-02578-y#MOESM2\" rel=\"nofollow noopener\" target=\"_blank\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>\n","protected":false},"excerpt":{"rendered":"Ethics statement The use of human genetic data in this study was approved by the University of California,&hellip;\n","protected":false},"author":3,"featured_media":765719,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[2906,21939,15576,6958,21938,834,29197,31718,210,15577,49197,67,132,68],"class_list":{"0":"post-765718","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-health","8":"tag-agriculture","9":"tag-animal-genetics-and-genomics","10":"tag-biomedicine","11":"tag-cancer-research","12":"tag-gene-function","13":"tag-general","14":"tag-genetics-research","15":"tag-genome-wide-association-studies","16":"tag-health","17":"tag-human-genetics","18":"tag-type-1-diabetes","19":"tag-united-states","20":"tag-unitedstates","21":"tag-us"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@us\/116498824349435184","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/765718","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/us\
/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/comments?post=765718"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/765718\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media\/765719"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media?parent=765718"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/categories?post=765718"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/tags?post=765718"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}