{"id":109480,"date":"2025-05-17T16:43:13","date_gmt":"2025-05-17T16:43:13","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/109480\/"},"modified":"2025-05-17T16:43:13","modified_gmt":"2025-05-17T16:43:13","slug":"tracing-human-genetic-histories-and-natural-selection-with-precise-local-ancestry-inference","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/109480\/","title":{"rendered":"Tracing human genetic histories and natural selection with precise local ancestry inference"},"content":{"rendered":"<p>Orchestra<\/p>\n<p>Orchestra is a LAI algorithm that consists of a two-stage pipeline: an original deterministic base layer and a smoothing module based on a deep learning architecture that combines elements of convolutional neural networks<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 47\" title=\"Krizhevsky, A., Sutskever, I. &amp; Hinton, G. Imagenet classification with deep convolutional neural networks. Commun. ACM 60, 84&#x2013;90 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR47\" id=\"ref-link-section-d132520604e1077\" target=\"_blank\" rel=\"noopener\">47<\/a> and attention-based (transformer) mechanisms<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998&#x2013;6008 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR48\" id=\"ref-link-section-d132520604e1081\" target=\"_blank\" rel=\"noopener\">48<\/a>. Orchestra\u2019s base layer is a deterministic algorithm modeled on recombination, built on the assumption that shared homologous haplotypes are identical by descent (IBD). 
We expect that a person would share longer DNA segments with individuals from the same population and shorter segments with individuals from remote populations. The base layer looks at each window on a chromatid and finds the minimum number of segments that would be needed to reconstruct that window when those segments are sampled from specific, carefully selected reference populations. We refer to this value as the recombination distance. In this study, each chromosome was divided into windows spanning 600 SNPs.<\/p>\n<p>The base layer works as a greedy search algorithm. In each window, the algorithm starts at the first position, looking for the longest continuous matching haplotype in a reference population. Where the match stops, the algorithm starts again from that position to find the longest local match, and so on. This is carried out using a NumPy array of the (boolean) matches at each position between the sample and the reference sequences (as rows) and taking the product along the rows. This approach trades some memory for extra speed and allows the calculation to be parallelized easily across the data. The procedure is repeated until it produces a vector of recombination distances against all reference populations.<\/p>\n<p>The smoothing layer uses the vector of recombination distances produced by the base layer as an ancestry fingerprint that is converted into ancestry probabilities. This layer is designed to give higher weights to low-frequency classes (populations) in the loss function to handle class imbalance effectively. Information from surrounding and more remote windows is then factored in to output a final ancestry label. To do this effectively, the smoothing layer consists of two different types of layers: convolutional and attention-based. There are five convolutional and two attention layers. 
The attention layers are sandwiched between the third and fourth, and the fourth and fifth convolutional layers. The convolutional layers are moving filters that generate different insights from the base layer output and retain the information in parallel. Whereas in a normal convolutional neural network we would have pooling layers in between the convolutional layers<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 47\" title=\"Krizhevsky, A., Sutskever, I. &amp; Hinton, G. Imagenet classification with deep convolutional neural networks. Commun. ACM 60, 84&#x2013;90 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR47\" id=\"ref-link-section-d132520604e1091\" target=\"_blank\" rel=\"noopener\">47<\/a>, we use attention layers that process the result of the parallel filters as a (similarity) vector space that would typically be the output of an embedding into an n-dimensional vector space in a transformer architecture<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998&#x2013;6008 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR48\" id=\"ref-link-section-d132520604e1095\" target=\"_blank\" rel=\"noopener\">48<\/a> to provide global information flow using proximity in this convolutional vector space. The purpose of the convolutional layers is to process information at the window level in the case of the first two convolutional layers and to bring local information concerning other nearby windows in the case of the third, fourth and fifth convolutional layers. 
The closest windows tend to have a larger impact because windows tend to form blocks of a given ancestry, but the attention mechanism allows the use of a comprehensive context to weight the base layer outputs from all other windows. In this regard, attention augments the local convolutional layers with global information flow. The simplification of the attention layer relative to a regular transformer architecture allows the use of a context long enough to span the entirety of the windowed data for a chromosome pack. The final convolutional layer provides the output in terms of probabilities. The population that is assigned the maximum probability is given as the final ancestry.<\/p>\n<p>Owing to computational limitations, the smoothing layer was not trained on all chromosomes at once, but was instead trained on chromosome packs (1\/2, 3\/4,\u2026, 17\/18, 19\/20\/21\/22). Training on a larger set of chromosomes at a time, or even the entirety of the genome, is expected to further increase the accuracy of the smoothing layer.<\/p>\n<p>During the training phase, in which Orchestra adjusts its smoothing layer parameters (weights and bias terms) using simulated admixed individuals, it is inevitable that subsequent generations will include direct descendants of the original samples (see \u2018Simulated Admixed Individuals\u2019). This means the same haplotypes are present in both the reference and target sets, potentially leading to an overfitted model. To minimize bias from using source samples of the synthetic admixed data as reference sequences for population classification, Orchestra\u2019s base layer algorithm was modified to remove the best matching haplotypes from the entire training set using a greedy matching algorithm, under the assumption that these best matches represent the \u2018ancestral\u2019 samples of the simulated genomes. 
Once the model is trained, it can be applied to any separate testing cohort.<\/p>\n<p>Reference panel<\/p>\n<p>Despite the increasing number of publicly available datasets, obtaining a comprehensive and balanced reference panel remains a challenging step in ancestry deconvolution. Here, we used 1KGP-16pops (N\u2009=\u20091365), composed of unrelated non-admixed individuals collected by 1KGP, as the gold standard dataset, which enables easy accuracy comparisons with previously published studies. Next we created the custom-35pops dataset (N\u2009=\u200910,169), where we aimed to assemble a genome-wide dataset of non-admixed modern samples from diverse populations around the globe<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426&#x2013;3440(2022).\" href=\"#ref-CR49\" id=\"ref-link-section-d132520604e1114\">49<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Behar, D. M. et al. The genome-wide structure of the Jewish people. Nature 466, 238&#x2013;242 (2010).\" href=\"#ref-CR50\" id=\"ref-link-section-d132520604e1114_1\">50<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Tambets, K. et al. Genes reveal traces of common recent demographic history for most of the Uralic-speaking populations. Genome Biol. 19, 139 (2018).\" href=\"#ref-CR51\" id=\"ref-link-section-d132520604e1114_2\">51<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. 
Nature 562, 203&#x2013;209 (2018).\" href=\"#ref-CR52\" id=\"ref-link-section-d132520604e1114_3\">52<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Carmi, S. et al. Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nat. Commun. 5, 4835 (2014).\" href=\"#ref-CR53\" id=\"ref-link-section-d132520604e1114_4\">53<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Kim, J. et al. KoVariome: Korean national standard reference Variome database of whole genomes with comprehensive SNV, indel, CNV, and SV analyses. Sci. Rep. 8, 5677 (2018).\" href=\"#ref-CR54\" id=\"ref-link-section-d132520604e1114_5\">54<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Lowy-Gallego, E. et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res 4, 50 (2019).\" href=\"#ref-CR55\" id=\"ref-link-section-d132520604e1114_6\">55<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201&#x2013;206 (2016).\" href=\"#ref-CR56\" id=\"ref-link-section-d132520604e1114_7\">56<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Almarri, M. A. et al. The genomic history of the Middle East. Cell 184, 4612&#x2013;4625.e14 (2021).\" href=\"#ref-CR57\" id=\"ref-link-section-d132520604e1114_8\">57<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Malaria Genomic Epidemiology Network. 
Insights into malaria susceptibility using genome-wide data on 17,000 individuals from Africa, Asia and Oceania. Nat. Commun. 10, 5732 (2019).\" href=\"#ref-CR58\" id=\"ref-link-section-d132520604e1114_9\">58<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Zhang, W. et al. Whole genome sequencing of 35 individuals provides insights into the genetic architecture of Korean population. BMC Bioinforma. 15, S6 (2014).\" href=\"#ref-CR59\" id=\"ref-link-section-d132520604e1114_10\">59<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Wang, C. C. et al. Genomic insights into the formation of human populations in East Asia. Nature 591, 413&#x2013;419 (2021).\" href=\"#ref-CR60\" id=\"ref-link-section-d132520604e1114_11\">60<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Jeong, C. et al. The genetic history of admixture across inner Eurasia. Nat. Ecol. Evol. 3, 966&#x2013;976 (2019).\" href=\"#ref-CR61\" id=\"ref-link-section-d132520604e1114_12\">61<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Biagini, S. A. et al. People from Ibiza: an unexpected isolate in the Western Mediterranean. Eur. J. Hum. Genet. 27, 941&#x2013;951 (2019).\" href=\"#ref-CR62\" id=\"ref-link-section-d132520604e1114_13\">62<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Vyas, D. N., Al-Meeri, A. &amp; Mulligan, C. J. Testing support for the northern and southern dispersal routes out of Africa: an analysis of Levantine and southern Arabian populations. Am. J. Phys. Anthropol. 
164, 736&#x2013;749 (2017).\" href=\"#ref-CR63\" id=\"ref-link-section-d132520604e1114_14\">63<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Skoglund, P. et al. Reconstructing prehistoric African population structure. Cell 171, 59&#x2013;71 (2017).\" href=\"#ref-CR64\" id=\"ref-link-section-d132520604e1114_15\">64<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Skoglund, P. et al. Genomic insights into the peopling of the Southwest Pacific. Nature 538, 510&#x2013;513 (2016).\" href=\"#ref-CR65\" id=\"ref-link-section-d132520604e1114_16\">65<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Lazaridis, I. et al. Genomic insights into the origin of farming in the ancient Near East. Nature 536, 419&#x2013;424 (2016).\" href=\"#ref-CR66\" id=\"ref-link-section-d132520604e1114_17\">66<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Lazaridis, I. et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513, 409&#x2013;413 (2014).\" href=\"#ref-CR67\" id=\"ref-link-section-d132520604e1114_18\">67<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Pickrell, J. K. et al. The genetic prehistory of southern Africa. Nat. Commun. 3, 1143 (2012).\" href=\"#ref-CR68\" id=\"ref-link-section-d132520604e1114_19\">68<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Anagnostou, P. et al. Berbers and Arabs: tracing the genetic diversity and history of Southern Tunisia through genome wide analysis. Am. J. Phys. Anthropol. 
173, 697&#x2013;708 (2020).\" href=\"#ref-CR69\" id=\"ref-link-section-d132520604e1114_20\">69<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Henn, B. M. et al. Genomic ancestry of North Africans supports back-to-Africa migrations. PLoS Genet 8, e1002397 (2012).\" href=\"#ref-CR70\" id=\"ref-link-section-d132520604e1114_21\">70<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Arauna, L. R. et al. Recent Historical Migrations Have Shaped the Gene Pool of Arabs and Berbers in North Africa. Mol. Biol. Evol. 34, 318&#x2013;329 (2017).\" href=\"#ref-CR71\" id=\"ref-link-section-d132520604e1114_22\">71<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Hollfelder, N. et al. Northeast African genomic variation shaped by the continuity of indigenous groups and Eurasian migrations. PLoS Genet 13, e1006976 (2017).\" href=\"#ref-CR72\" id=\"ref-link-section-d132520604e1114_23\">72<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Dobon, B. et al. The genetics of East African populations: a Nilo-Saharan component in the African genetic landscape. Sci. Rep. 5, 9996 (2015).\" href=\"#ref-CR73\" id=\"ref-link-section-d132520604e1114_24\">73<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Behar, D. M. et al. No evidence from genome-wide data of a Khazar origin for the Ashkenazi Jews. Hum. Biol. 85, 859&#x2013;900 (2013).\" href=\"#ref-CR74\" id=\"ref-link-section-d132520604e1114_25\">74<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Yunusbayev, B. et al. 
The Caucasus as an asymmetric semipermeable barrier to ancient human migrations. Mol. Biol. Evol. 29, 359&#x2013;365 (2012).\" href=\"#ref-CR75\" id=\"ref-link-section-d132520604e1114_26\">75<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Yunusbayev, B. et al. The genetic legacy of the expansion of Turkic-speaking nomads across Eurasia. PLoS Genet 11, e1005068 (2015).\" href=\"#ref-CR76\" id=\"ref-link-section-d132520604e1114_27\">76<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Botigu&#xE9;, L. R. et al. Gene flow from North Africa contributes to differential human genetic diversity in southern Europe. Proc. Natl Acad. Sci. USA. 110, 11791&#x2013;11796 (2013).\" href=\"#ref-CR77\" id=\"ref-link-section-d132520604e1114_28\">77<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Flores-Bello, A. et al. Genetic origins, singularity, and heterogeneity of Basques. Curr. Biol. 31, 2167&#x2013;2177 (2021).\" href=\"#ref-CR78\" id=\"ref-link-section-d132520604e1114_29\">78<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Pathak, A. K. et al. The genetic ancestry of modern Indus Valley populations from Northwest India. Am. J. Hum. Genet. 103, 918&#x2013;929 (2018).\" href=\"#ref-CR79\" id=\"ref-link-section-d132520604e1114_30\">79<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Nelson, M. R. et al. The population reference sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am. J. Hum. Genet. 
83, 347&#x2013;358 (2008).\" href=\"#ref-CR80\" id=\"ref-link-section-d132520604e1114_31\">80<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Changmai, P. et al. Indian genetic heritage in Southeast Asian populations. PLoS Genet 18, e1010036 (2022).\" href=\"#ref-CR81\" id=\"ref-link-section-d132520604e1114_32\">81<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"T&#xE4;tte, K. et al. The genetic legacy of continental scale admixture in Indian Austroasiatic speakers. Sci. Rep. 9, 3818 (2019).\" href=\"#ref-CR82\" id=\"ref-link-section-d132520604e1114_33\">82<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 83\" title=\"M&#xF6;rseburg, A. et al. Multi-layered population structure in Island Southeast Asians. Eur. J. Hum. Genet. 24, 1605&#x2013;1611 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR83\" id=\"ref-link-section-d132520604e1117\" target=\"_blank\" rel=\"noopener\">83<\/a>. A detailed list of all data mined for this study can be found in the Supplementary Table\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">1<\/a>. The granularity we were able to achieve on each continent was largely dependent on the number and diversity of samples we had available. Where samples were limited or not divergent enough based on initial experiments, we grouped populations into meaningful, broader geographic regions with shared genetic ancestry. 
Precise information about dataset preparation and merging can be found in Supplementary Methods.<\/p>\n<p>Our custom-35pops dataset combines data from many studies, including genotyping arrays and WGS, and the resulting batch effects introduced artifacts that interfered with the detection of biological patterns<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 84\" title=\"Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733&#x2013;739 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR84\" id=\"ref-link-section-d132520604e1127\" target=\"_blank\" rel=\"noopener\">84<\/a>. To our knowledge, there are no effective and systematic algorithms to remove them, so we applied conventionally recommended quality control measures<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 85\" title=\"Laurie, C. C. et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet. Epidemiol. 34, 591&#x2013;602 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR85\" id=\"ref-link-section-d132520604e1131\" target=\"_blank\" rel=\"noopener\">85<\/a> (see Supplementary Methods).<\/p>\n<p>After batch effect removal, we used two complementary strategies to identify admixed individuals. The first was PCA-UMAP, an existing protocol for dimensionality reduction<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 86\" title=\"Diaz-Papkovich, A., Anderson-Trocm&#xE9;, L., Ben-Eghan, C. &amp; Gravel, S. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. 
PLoS Genet 15, e1008432 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR86\" id=\"ref-link-section-d132520604e1138\" target=\"_blank\" rel=\"noopener\">86<\/a> that allowed us to visualize relatedness among individuals. The second was GNN-tSNE, in which we used tsinfer<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 5\" title=\"Kelleher, J. et al. Inferring whole-genome histories in large population datasets. Nat. Genet. 51, 1330&#x2013;1338 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR5\" id=\"ref-link-section-d132520604e1142\" target=\"_blank\" rel=\"noopener\">5<\/a> to infer the tree sequences for our dataset and compute the genealogical nearest neighbors (GNN), to which we applied t-distributed stochastic neighbor embedding (t-SNE), another non-linear dimensionality reduction technique. Outliers (i.e., admixed genomes) were removed automatically using a K-Nearest Neighbor (KNN) algorithm (see Supplementary Methods).<\/p>\n<p>To detect Native American ancestry in target genomes, we generated a reference panel of 100 pure in silico Native American genomes (see Supplementary Methods).<\/p>\n<p>Simulated admixed individuals<\/p>\n<p>SLiM v3.7<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Haller, B. C. &amp; Messer, P. W. SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Mol. Biol. Evol. 36, 632&#x2013;637 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR23\" id=\"ref-link-section-d132520604e1157\" target=\"_blank\" rel=\"noopener\">23<\/a> was used to generate admixed genomes based on single-ancestry populations from the reference panel. We forward simulated 1\u20136 generations of admixture. 
The true local ancestry at every position in every simulated individual was tracked across generations using the tree-sequence recording function, and browsed with the tskit and pyslim packages<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 87\" title=\"Kelleher, J., Etheridge, A. M. &amp; McVean, G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol. 12, e1004842 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR87\" id=\"ref-link-section-d132520604e1161\" target=\"_blank\" rel=\"noopener\">87<\/a>. The HapMap recombination map was supplied to model non-uniform recombination across the genome.<\/p>\n<p>Given that SLiM does not support simulations with more than one chromosome, we ran each chromosome independently but followed the same mating scheme over generations to obtain whole-genome simulations with the same ancestors for all chromosomes. We tracked the pedigree obtained in the first chromosome run by tagging each simulated individual and each pair\u2019s offspring, and reproduced the same genealogy for the other chromosomes.<\/p>\n<p>We simulated a fully intermixed scenario where all individuals from populations in the reference panel had an equal probability of contributing to mating, with no specific rates assigned to population mixing, as individuals were chosen entirely at random for each generation. By generation 6, each haploid genome could contain ancestry from up to 32 populations. The expected median number of admixed populations per haploid genome is ~1, 2, 4, 7, 13, and 21 for generations 1 through 6. We also implemented a more realistic non-random model where individuals preferentially mate within their continent (e.g., Europeans, Western Asians, and North Africans; Sub-Saharan Africans; Central and South Asians; and East and Southeast Asians), with a 10% migration rate per generation. 
We found that the benchmarking results were nearly identical. Thus, we opted to train Orchestra in the fully intermixed random-mating scenario to ensure robustness.<\/p>\n<p>We ran a non-Wright-Fisher model of evolution, since we implemented a couple of modifications. Despite choosing parents via random sampling, we avoided inbreeding by recording pedigree information and thus identifying the degree of relatedness of each pair selected. Only individuals that were not close relatives were allowed to mate (at least a coefficient of relationship <\/p>\n<p>Benchmarking<\/p>\n<p>For benchmarking, we inferred local ancestry with RFMix v2.03<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 9\" title=\"Maples, B. K., Gravel, S., Kenny, E. E. &amp; Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278&#x2013;288 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR9\" id=\"ref-link-section-d132520604e1182\" target=\"_blank\" rel=\"noopener\">9<\/a>, FLARE v0.1.0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 10\" title=\"Browning, S. R., Waples, R. K. &amp; Browning, B. L. Fast, accurate local ancestry inference with FLARE. Am. J. Hum. Genet. 110, 326&#x2013;335 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR10\" id=\"ref-link-section-d132520604e1186\" target=\"_blank\" rel=\"noopener\">10<\/a> and Gnomix<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 11\" title=\"Hilmarsson, H. et al. High resolution ancestry deconvolution for next generation genomic data. 
Preprint at bioRxiv 2021.09.19.460980 &#010;                  https:\/\/doi.org\/10.1101\/2021.09.19.460980&#010;                  &#010;                 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR11\" id=\"ref-link-section-d132520604e1190\" target=\"_blank\" rel=\"noopener\">11<\/a>. We split the samples randomly into training (80%) and testing (20%) cohorts, and measured performance globally, per generation, and per population. We report precision and recall as accuracy measures with scikit-learn<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 88\" title=\"Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825&#x2013;2830(2011).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR88\" id=\"ref-link-section-d132520604e1194\" target=\"_blank\" rel=\"noopener\">88<\/a>, which are computed per population; i.e., the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) is calculated with respect to the selected population as the positive class. In the case of TN and FN, the prediction can be anything other than the population under consideration. We additionally report a commonly used alternative measure for accuracy by assessing each method with Pearson\u2019s r2. We calculated the coefficient of determination by comparing both the inferred and the true local ancestry label per haplotype, as well as the diploid ancestry dose by counting the number of ancestry labels (0, 1 or 2) per site. Values of r2 were estimated for each ancestry separately and the weighted mean r2 across all ancestries was reported per generation.<\/p>\n<p>All programs were supplied with the same training (as reference panel) and test (as target) genotype data, and the HapMap recombination map. 
Statistical parameters were kept at default values with the following exceptions: the Gnomix model was trained from scratch and its accuracy was optimized in our scenario using Large mode, smooth_size: 100, context_ratio: 0.10 and window_size_cM: 0.5, and the number of generations for RFMix was specified with -G 6. For the 1KGP-16pops WGS panel, we processed variants using a MAF filter (\u2009<\/p>\n<p>We assembled a new panel using UKBB samples for additional benchmarking, providing a distinct and independent validation dataset to assess Orchestra\u2019s ability to generalize beyond the original test set. We picked samples that (1) closely aligned with their respective ancestral groups based on both the PCA-UMAP and GNN-tSNE dimensionality reduction approaches but were not incorporated into the final reference panel due to sample size limits, and\/or (2) were close to their population cluster but overlapped with one or more others and were therefore excluded at the quality control stage. For regions like the Republic of Ireland, Wales, Scotland, and England, which had a larger sample base in this panel, we considered a maximum of 1000 samples per region. Altogether, we collected 10,241 samples spanning 103 countries. Accuracy was evaluated by attributing the appropriate ancestry label from our ancestry reference panel to each country and then comparing it with the LAI results from each method (Supplementary Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">2c<\/a>).<\/p>\n<p>Retracing genetic histories<\/p>\n<p>We simulated Latin American individuals from Southern (SPP and ITA) and Northern (FRG and BRI) Europeans, Western (GSE, GLS and NGE) and Central and Southern (SAF) Africans and artificially reconstructed Native Americans (NAM) from our custom ancestry panel. 
Simulations were performed by emulating genetic intermixing for 12 generations using SLiM<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Haller, B. C. &amp; Messer, P. W. SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Mol. Biol. Evol. 36, 632&#x2013;637 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR23\" id=\"ref-link-section-d132520604e1235\" target=\"_blank\" rel=\"noopener\">23<\/a>. Details about the simulations can be found in the Supplementary Methods section. For our benchmarking evaluation, we partitioned the samples randomly into training and testing groups, allocating 80% and 20% of the simulations, respectively, to assess Orchestra against other LAI tools. We ensured that training and testing samples never coincided across the three simulated regional datasets, guaranteeing consistent comparison metrics. We evaluated the performance in terms of precision and recall for each Latin American region using scikit-learn.<\/p>\n<p>For real-world data analysis, we identified UKBB participants born in the Americas using birthplace codes (data-field f.20115), specifically from South America (\u2009&gt;C.600) and North America (C.400-C.500). Additionally, we incorporated admixed American 1KGP samples (codes: MXL, PUR, CLM and PEL). We then extracted the composite SNP set from both the imputed UKBB and whole-genome 1KGP sets and ran Orchestra for LAI assessment. 
Samples with ancestry patterns closely matching British ancestry (BRI\u2009+\u2009FRG\u2009+\u2009SCA\u2009&gt;\u200980%) were discarded (Supplementary Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">23<\/a>).<\/p>\n<p>Finally, we applied Orchestra to all UKBB samples not selected for our reference panel, and to samples obtained from various datasets, especially from the Reich lab, that were not included in our reference panel because they belonged to ethnic groups other than those found in our 35 reference populations (Supplementary Figs.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">5<\/a>&#8211;<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">7<\/a>).<\/p>\n<p>Ancestral mapping<\/p>\n<p>We created 35 different reference panels, one for each target population, by excluding the target population from its own reference panel. Orchestra was then trained on each reference panel and applied to the target population. Due to computational limitations, training on these 35 datasets was limited to chromosomes 17\u201322. The admixture proportions obtained for each population were converted into a matrix of distances, which was projected onto two-dimensional space using the scikit-learn implementation of the SMACOF (Scaling by MAjorizing a COmplicated Function) algorithm. We used the sklearn.manifold.MDS function with the option metric=True to convert a non-Euclidean symmetric dissimilarity matrix into coordinates in two-dimensional space. 
The SMACOF algorithm uses a random initialization followed by stress majorization to iteratively minimize a stress function given by the squared differences between the dissimilarity matrix entries and the Euclidean distances between the points in two-dimensional space. To create the matrix of distances, the obtained series of ancestry proportions were first averaged over the samples for each population. These averages were then assembled into a square matrix whose two dimensions each equal the number of populations, with 0 on the diagonal. A small constant (0.00000000000000001) was then added to every entry of the matrix to remove any 0 entries and allow the reciprocal to be taken. The matrix was then added to its transpose to make it symmetric, which is required to apply the algorithm. The reciprocal of each entry was then taken to convert the similarity measures of proportions into dissimilarities, and the resulting matrix was input into the sklearn.manifold.MDS function to produce the coordinates.<\/p>\n<p>Detecting natural selection signatures<\/p>\n<p>To detect natural selection in admixed populations, we focused on the Fadm and LAD statistics described in Cuadros-Espinoza et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 34\" title=\"Cuadros-Espinoza, S., Laval, G., Quintana-Murci, L. &amp; Patin, E. The genomic signatures of natural selection in admixed human populations. Am. J. Hum. Genet. 109, 710&#x2013;726 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR34\" id=\"ref-link-section-d132520604e1272\" target=\"_blank\" rel=\"noopener\">34<\/a>. We explain the steps in detail in the Supplementary Methods section. 
We first scanned the genomes of seven admixed populations already analyzed in Cuadros-Espinoza et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 34\" title=\"Cuadros-Espinoza, S., Laval, G., Quintana-Murci, L. &amp; Patin, E. The genomic signatures of natural selection in admixed human populations. Am. J. Hum. Genet. 109, 710&#x2013;726 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR34\" id=\"ref-link-section-d132520604e1276\" target=\"_blank\" rel=\"noopener\">34<\/a>, leveraging either publicly available genotypes<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Pierron, D. et al. Strong selection during the last millennium for African ancestry in the admixed population of Madagascar. Nat. Commun. 9, 932 (2018).\" href=\"#ref-CR89\" id=\"ref-link-section-d132520604e1280\">89<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Hudjashov, G. et al. Complex patterns of admixture across the Indonesian archipelago. Mol. Biol. Evol. 34, 2439&#x2013;2452 (2017).\" href=\"#ref-CR90\" id=\"ref-link-section-d132520604e1280_1\">90<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Moreno-Estrada, A. et al. Human genetics. the genetics of Mexico recapitulates Native American substructure and affects biomedical traits. Science 344, 1280&#x2013;1285 (2014).\" href=\"#ref-CR91\" id=\"ref-link-section-d132520604e1280_2\">91<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Vicente, M. et al. Population history and genetic adaptation of the Fulani nomads: inferences from genome-wide data and the lactase persistence trait. 
BMC Genomics 20, 915 (2019).\" href=\"#ref-CR92\" id=\"ref-link-section-d132520604e1280_3\">92<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 93\" title=\"Laso-Jadart, R. et al. The genetic legacy of the Indian Ocean slave trade: recent admixture and post-admixture selection in the Makranis of Pakistan. Am. J. Hum. Genet. 101, 977&#x2013;984 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#ref-CR93\" id=\"ref-link-section-d132520604e1283\" target=\"_blank\" rel=\"noopener\">93<\/a> or an appropriate proxy from the UKBB or the Reich lab. These served as positive controls to gauge the accuracy in replicating previously identified signals. Detailed information on the datasets and the respective references for these populations is provided in Supplementary Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">15<\/a>. External datasets were lifted over to hg38 and imputed, followed by the extraction of the composite SNP set for LAI assessment with Orchestra. In the absence of a direct proxy for Malagasy, we considered African, Black or mixed participants from the UKBB with a significant proportion of Austronesian (SEA\u2009+\u2009FIL\u2009&gt;\u20091%) and South African (SAF\u2009&gt;\u200920%) ancestry and low Indian (PNI\u2009+\u2009BEI\u2009+\u2009SSI\u2009<\/p>\n<p>We then extended this methodology to the White British from the UK, where we treated the British population as admixed, and looked at the Scandinavian component in British genomes. We analyzed 415,859 British participants from the UKBB dataset with detailed birth location information within the UK (north and east coordinates, data-fields f.129 and f.130). 
Birth locations were categorized according to the 2018 NUTS Level 2 boundaries (counties) from the Office for National Statistics (<a href=\"https:\/\/data.gov.uk\/\" target=\"_blank\" rel=\"noopener\">https:\/\/data.gov.uk\/<\/a>). Shapefiles were loaded and processed using the rgdal R package. The average Scandinavian (SCA) ancestry percentage was then computed for the individuals mapped to each geographic area in Britain. Additionally, we retrieved the index of place names in Great Britain (July 2016) to pinpoint those towns or cities with Viking-origin names, suggesting past Viking settlements (evidenced by suffixes such as -by, -thorpe, or -toft). Next, we took 287,346 British samples that showed traces of SCA ancestry. Because a signal emerged on chromosome 10, we checked the validity of this observation by analyzing three additional sets of British samples, each varying in their Scandinavian ancestry enrichment. Specifically, we assessed British samples bearing &gt;1%, &gt;5%, and &gt;20% Scandinavian ancestry (Supplementary Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">16<\/a>). Given that SCA is the population with the lowest accuracy in our panel (Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#Fig1\" target=\"_blank\" rel=\"noopener\">1c<\/a>), we removed 10 windows from the telomeric regions. We further refined the signal by adjusting the averaged SCA ancestry percentage present in every window against the standard deviation observed in its chromosome pack. 
The chr10 signal remained consistent across sets, as illustrated in Supplementary Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM1\" target=\"_blank\" rel=\"noopener\">16<\/a>.<\/p>\n<p>OpenTargets<\/p>\n<p>We selected variants that surpassed the established P value significance threshold and explored the Open Targets database (<a href=\"https:\/\/www.opentargets.org\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.opentargets.org\/<\/a>) for potential target genes. These genes were then ranked based on the aggregate score obtained from all variants.<\/p>\n<p>GWAS enrichment<\/p>\n<p>To investigate whether the number of GWAS Catalog (<a href=\"http:\/\/www.ebi.ac.uk\/gwas\/\" target=\"_blank\" rel=\"noopener\">http:\/\/www.ebi.ac.uk\/gwas\/<\/a>) [release 2023-07-20] hits in the selected region is higher than expected by chance, we crossed GWAS Catalog signals (P\u2009-8) with 1000 Genomes Project variants and grouped together those in high LD (r\u00b2\u2009\u2009\u2265\u2009\u20090.8) and associated with the same phenotype. Enrichment P values were calculated by comparison with a null distribution from random genomic regions as a background model, controlling for region size and excluding gaps, sex chromosomes and the major histocompatibility complex region, known to harbor a vast number of associations. Since the size of the region to be studied is ~2.5\u2009Mb, the number of simulations cannot be too large without covering the entire genome and having overlapping simulations. Thus, we also tested GWAS enrichment with a Fisher\u2019s exact test, which showed very similar results (odds ratio correlation: R\u00b2\u2009=\u20090.97; P value correlation: R\u00b2\u2009=\u20090.69). We selected this latter statistic for the analyses due to its higher statistical power. 
GWAS Catalog reported traits were grouped by parent categories according to EFO terms from the ontologyIndex R package. P values were adjusted by Bonferroni correction.<\/p>\n<p>To determine whether this region in chr10 of Scandinavian ancestry, as opposed to British, would result in an increase or decrease for each GWAS phenotype, we computed the frequency difference of the GWAS variants between these two populations using our custom ancestry panel, providing insight into the potential direction of the effect resulting from this shift in ancestry.<\/p>\n<p>Phenotype mapping<\/p>\n<p>Following the association strategy of admixture mapping, we contrasted the proportion of SCA ancestry among cases and controls to evaluate the influence of the chr10 locus on phenotypes gathered by the UKBB. For that, we took main and secondary ICD10 codes and self-reported illness codes (data-fields f.41202, f.41204 and f.20002, respectively). Phenotypes represented by fewer than 100 cases were excluded. Given the absence of significant phenotypes using the Fisher exact test and subsequent P value correction, we filtered phenotypes based on their nominal P value (P\u2009<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41467-025-59936-3#MOESM2\" target=\"_blank\" rel=\"noopener\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>\n","protected":false},"excerpt":{"rendered":"Orchestra Orchestra is a LAI algorithm that consists of a two-stage pipeline: an original deterministic base layer 
and&hellip;\n","protected":false},"author":2,"featured_media":109481,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3846],"tags":[3926,267,3900,3965,3966,12517,70,16,15],"class_list":{"0":"post-109480","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-genetics","8":"tag-evolution","9":"tag-genetics","10":"tag-genomics","11":"tag-humanities-and-social-sciences","12":"tag-multidisciplinary","13":"tag-population-genetics","14":"tag-science","15":"tag-uk","16":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/114524179881696888","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/109480","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=109480"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/109480\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/109481"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=109480"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=109480"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=109480"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}