{"id":779766,"date":"2026-05-07T14:02:27","date_gmt":"2026-05-07T14:02:27","guid":{"rendered":"https:\/\/www.europesays.com\/us\/779766\/"},"modified":"2026-05-07T14:02:27","modified_gmt":"2026-05-07T14:02:27","slug":"genome-wide-sweeps-create-ecological-units-in-the-human-gut-microbiome","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/us\/779766\/","title":{"rendered":"Genome-wide sweeps create ecological units in the human gut microbiome"},"content":{"rendered":"<p>Isolate genome collection<\/p>\n<p>A total of 19,837 publicly available isolate genomes were collected and consisted of the entire data in the Unified Human Gastrointestinal Genome (UHGG) catalogue<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105&#x2013;114 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR45\" id=\"ref-link-section-d163628604e1231\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a> (v.1.0, 10,648 genomes) and 4 large-scale culturomic studies<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 28\" title=\"Groussin, M. et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053&#x2013;2067 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR28\" id=\"ref-link-section-d163628604e1235\" rel=\"nofollow noopener\" target=\"_blank\">28<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Poyet, M. et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nat. Med. 25, 1442&#x2013;1452 (2019).\" href=\"#ref-CR46\" id=\"ref-link-section-d163628604e1238\">46<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Huang, Y. et al. High-throughput microbial culturomics using automation and machine learning. Nat. Biotechnol. 41, 1424&#x2013;1433 (2023).\" href=\"#ref-CR47\" id=\"ref-link-section-d163628604e1238_1\">47<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Lin, X. et al. The genomic landscape of reference genomes of cultivated human gut bacteria. Nat. Commun. 14, 1663 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR48\" id=\"ref-link-section-d163628604e1241\" rel=\"nofollow noopener\" target=\"_blank\">48<\/a>. Furthermore, a group of 186 isolates from Austrian individuals were newly collected and sequenced. A total of 50 patients undergoing CRC screening colonoscopy at the Vienna General Hospital were enrolled in the study, which comprised 24 individuals with irritable bowel syndrome, 5 with UC and 21 healthy participants. None of the participants were found to have carcinomas during the colonoscopy.<\/p>\n<p>For bacterial isolation of participants in the Austrian study, brush samples and biopsy samples were collected during colonoscopy from the ileal or caecal mucosa or the ascending colon with or without an endoscopically visible biofilm and were immediately processed<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Baumgartner, M. et al. Mucosal biofilms are an endoscopic feature of irritable bowel syndrome and ulcerative colitis. Gastroenterology 161, 1245&#x2013;1256 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR49\" id=\"ref-link-section-d163628604e1248\" rel=\"nofollow noopener\" target=\"_blank\">49<\/a>. Brush and biopsy samples were vortexed or homogenized in 0.6\u2009ml of 0.9% NaCl, and the suspensions were subsequently plated on one of the following six culturing conditions: Columbia agar with 5% sheep blood, MacConkey agar, Columbia CNA agar with 5% sheep blood or CPS agar (Becton Dickinson) under aerobic conditions at 37\u2009\u00b0C; Brucella agar with 5% horse blood or Schaedler KV agar with 5% sheep blood (Becton Dickinson) under anaerobic conditions at 37\u2009\u00b0C. Aerobic cultures were assessed after 18\u2009h and 48\u2009h, and anaerobic cultures were assessed after 48\u2009h and 72\u2009h. Colonies identified as Bacteroides or Parabacteroides by matrix-assisted laser desorption ionization time-of-flight mass spectrometry analysis on a MALDI Biotyper MBT smart instrument (Bruker) or by 16S rRNA gene sequencing on a capillary sequencer (SeqStudio Genetic Analyzer, Applied Biosystems by Thermo Fischer Scientific) were saved as glycerol stocks at \u221280\u2009\u00b0C.<\/p>\n<p>Glycerol stocks of Bacteroides and Parabacteroides were cultured in brain heart infusion medium with supplements for 24\u2009h before DNA isolation. DNA isolation was performed on a King Fisher Flex instrument (Thermo Fisher Scientific) using a MagMA DNA Multi-Sample Ultra 2.0 kit (Thermo Fisher Scientific), which included an initial proteinase\u2009K digestion step and a RNase treatment step. Sequencing libraries were prepared using a NEBNext Ultra II FS DNA Library Prep kit, with NEBNext Multiplex Oligos for Illumina barcodes. Sequencing was performed on an Illumina NovaSeq 6000 platform using SP flow cells (300 cycles, 2\u2009\u00d7\u2009150\u2009bp paired-end reads). Reads were trimmed, filtered and merged with BBMap<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Bushnell, B. BBMap: A Fast, Accurate, Splice-Aware Aligner. OSTI &#010;                https:\/\/www.osti.gov\/biblio\/1241166&#010;                &#010;               (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR50\" id=\"ref-link-section-d163628604e1267\" rel=\"nofollow noopener\" target=\"_blank\">50<\/a> (v.38.90; ktrim=r k=21 mink=11 hdist=2 minlen=125 qtrim=r trimq=15), and de novo genome assembly was performed using Spades (v.3.15.5)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455&#x2013;477 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR51\" id=\"ref-link-section-d163628604e1271\" rel=\"nofollow noopener\" target=\"_blank\">51<\/a> under isolate mode.<\/p>\n<p>Isolate quality filtering and taxonomic assignment<\/p>\n<p>A total of 20,023 genomes were collected and evaluated using CheckM (v.1.2.2)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. &amp; Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043&#x2013;1055 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR52\" id=\"ref-link-section-d163628604e1283\" rel=\"nofollow noopener\" target=\"_blank\">52<\/a> to screen for genomes that met the following criteria: &gt;85% genome completeness, &lt;5% contamination and N50\u2009&gt;\u200950\u2009kb. This selection process produced a total of 16,864 genomes that were of sufficient quality for downstream analyses. We assigned each genome to a SGB according to the MetaPhlAn4 reference genome database (v.Jan 2022)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR23\" id=\"ref-link-section-d163628604e1287\" rel=\"nofollow noopener\" target=\"_blank\">23<\/a>. Each genome was assigned to the SGB it was most closely related to based on FastANI (v.1.33)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. &amp; Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR53\" id=\"ref-link-section-d163628604e1291\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a>, with the centroid genome as the primary reference or, if unavailable, a representative genome. To account for potential mis-assignments in species with boundaries slightly lower than the commonly used threshold (95%)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Olm, M. R. et al. Consistent metagenome-derived metrics verify and delineate bacterial Species Boundaries. mSystems 5, e00731-19 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR54\" id=\"ref-link-section-d163628604e1295\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a>, we adopted a more relaxed species boundary of 94% average nucleotide identity (ANI) for SGB assignment. Specifically, SGB assignments were only made for genomes that were less than 6% divergent from at least one reference genome with over 30% sequence alignment. As a result, the total divergence in each SGB could be as high as 10\u201312%. Genomes failing to meet the 94% ANI cutoff with their closest relatives were converted to synthetic fastq reads (ART-2016.06.05, -ss HS25)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Huang, W., Li, L., Myers, J. R. &amp; Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593&#x2013;594 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR55\" id=\"ref-link-section-d163628604e1299\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a> and assigned by MetaPhlAn4 (v.4.0.3)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR23\" id=\"ref-link-section-d163628604e1304\" rel=\"nofollow noopener\" target=\"_blank\">23<\/a>. Genomes that could not be assigned by either method were excluded. We further checked whether the ANI-based and MetaPhlAn4-based SGB assignments are generally consistent by converting a random subset of genomes in each ANI-based SGB to synthetic fastq reads (ART-2016.06.05, -ss HS25)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Huang, W., Li, L., Myers, J. R. &amp; Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593&#x2013;594 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR55\" id=\"ref-link-section-d163628604e1308\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a> and assigning them by MetaPhlAn4. Although the majority of ANI-based and MetaPhlAn4-based SGB assignments were congruent, certain ANI-based SGBs had all genomes assigned to another SGB in MetaPhlAn4. All genomes in these SGBs were reassigned to their corresponding MetaPhlAn4-assigned SGB (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">4<\/a>).<\/p>\n<p>Isolate genome filtering based on metadata<\/p>\n<p>A key step to ensure unbiased genome-wide sweep identification is to filter the genomes so that each SGB only contains isolates that originate from different individuals. For each isolate, we retrieved information on human participant identifiers, age, sex, health status, country, year and creator of collection, and BioProject accession number from the UHGG database<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105&#x2013;114 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR45\" id=\"ref-link-section-d163628604e1323\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>, as well as from the text or supplementary materials of the respective publications. For each SGB, we only retained isolates that originated from a different human participant based on either a unique identifier, country of sample or BioProject number (representing a different study; studies with multiple BioProject numbers were checked for and manually corrected). For studies with more than ten genomes but no human participant identifier or country information, we created a human participant identifier as a combination of the following five factors: age (or age group when the exact age was not available), sex, health status of the participant, year and creator of collection. Human participants with different combination-based identifiers were considered as different individuals. When multiple genomes from a single SGB were isolated from the same individual, we chose the genome with the highest quality score according to dRep (v.3.4.1)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Olm, M. R., Brown, C. T., Brooks, B. &amp; Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864&#x2013;2868 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR56\" id=\"ref-link-section-d163628604e1327\" rel=\"nofollow noopener\" target=\"_blank\">56<\/a>. This procedure resulted in a final collection of 6,411 high-quality isolate genomes that originated from different human participants (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">5<\/a>). These isolates spanned 995 SGBs, of which 176 contained more than 5 genomes. As one SGB, SGB10068, assigned as Escherichia coli, contained 1,053 genomes and was larger than any other SGB, we randomly subsampled this SGB to 25% of its original size so that it became comparable in size to the second-largest SGB. This SGB was renamed as SGB10068s to indicate the subsampling process. Each SGB was assigned to its corresponding family, genus and species-level taxonomy in the MetaPhlAn4 database.<\/p>\n<p>Estimation of recombined and clonal genome fractions via mixture modelling<\/p>\n<p>To estimate the recombined and clonal fractions in a pair of genomes, we developed a method that uses a combination of maximum-likelihood estimations (MLEs) and hidden Markov models (HMMs). This method is conceptually similar to a previously published method<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 22\" title=\"Sakoparnig, T., Field, C. &amp; van Nimwegen, E. Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species. eLife 10, e65366 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR22\" id=\"ref-link-section-d163628604e1345\" rel=\"nofollow noopener\" target=\"_blank\">22<\/a>, but with several technical adjustments (detailed at the end of the model validation section below). The major rationale behind the method is that SNPs introduced by mutations between a pair of genomes should be randomly distributed across the genomes, whereas recombination generates regions in the genomes that have an increased or decreased number of SNPs depending on whether the recombined genome fragment stems from a distant or close relative (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#Fig5\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>). It is therefore possible to partition genome alignments into regions that have been vertically inherited or recombined based on SNP distributions across the alignment.<\/p>\n<p>The SNP distribution for each pair of strains was determined by sliding 500\u2009bp windows across pairwise genome alignments with a step size of 50, which resulted in a probability mass function P(x\u2009=\u2009n) of 500\u2009bp windows that have n SNPs. This probability mass function was modelled as a fractional sum of a Poisson distribution that represents the clonal fraction of the genome and as a negative binomial distribution that represents the recombined fraction of the genome using the following equation:<\/p>\n<p>$$P(x=n)={f}_{c}\\frac{({{\\mu }_{c})}^{n}}{n!}{e}^{-{\\mu }_{c}}+{(1-f}_{c})\\frac{\\varGamma (n+\\alpha )\\,}{\\varGamma (\\alpha )\\,n!}{\\left(\\frac{\\alpha }{\\alpha +{\\mu }_{{nc}}}\\right)}^{\\alpha }{\\left(\\frac{{\\mu }_{{nc}}}{\\alpha +{\\mu }_{{nc}}}\\right)}^{n}$$<\/p>\n<p>where \u00b5c and \u00b5nc are the means for the SNPs per window in the clonal fraction (Poisson distribution) and the recombined fraction (negative binomial distribution), respectively,\u00a0and \u03b1 is the dispersion parameter of the negative binomial distribution. The fraction of the genome that is clonally inherited or recombined is represented by fc and 1\u2009\u2013\u2009fc,\u00a0respectively. The observed SNP distribution was fitted to the equation using MLE with the L-BFGS-B algorithm in the Python package SciPy<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261&#x2013;272 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR57\" id=\"ref-link-section-d163628604e1619\" rel=\"nofollow noopener\" target=\"_blank\">57<\/a>. As the negative binomial distribution can be interpreted as a gamma mixture of Poisson distributions with the dispersion parameter \u03b1, and approaches the Poisson distribution when \u03b1\u2009=\u20091, the lower bound of \u03b1 was set as 2 so that the distribution was sufficiently distinct from the Poisson distribution that represents the clonal fraction. The effective recombination rate r\/m, which is the number of SNPs exchanged by recombination relative to the number of SNPs introduced by mutation, can be calculated as \\({\\mu }_{{nc}}({1-f}_{c})\\)\/\\({\\mu }_{c}{f}_{c}\\).<\/p>\n<p>The estimated recombination and clonal fractions were further validated through the use of a HMM with the Python package pomegranate<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Schreiber, J. Pomegranate: fast and flexible probabilistic modeling in Python. J. Mach. Learn. Res. 18, 5992&#x2013;5997 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR58\" id=\"ref-link-section-d163628604e1722\" rel=\"nofollow noopener\" target=\"_blank\">58<\/a>, in which the two fractions serve as hidden states (C, clonal state; R,\u00a0recombined state). The Viterbi training algorithm was applied to the spatial SNP profiles for each pairwise alignment, with fc and 1\u2009\u2013\u2009fc from the MLE as the starting proportion of C:R. Similarly, the initial parameters for the HMM emission matrix were generated from the MLE estimated Poisson and negative binomial distributions. Although in most cases, the HMM produced results that were highly consistent with the MLE, the HMM validation step was effective in correcting occasional MLE failures at low clonal divergences. Furthermore, the relative occurrence rates of recombination to mutation (\u03c1\/\u03b8) could be estimated as the total number of R states divided by the total number of SNPs in the C state.<\/p>\n<p>Validation of model performance with simulated data<\/p>\n<p>To assess the performance of our mixture model across the expected biological ranges of evolutionary processes in gut bacteria, we evaluated it on sets of simulated genomes that covered a wide range of population genomic parameters. This approach also enabled us to optimize our model so that it would be most effective in the divergence range for which recombination is expected to influence the detection of GWSSs.<\/p>\n<p>We generated sets of genomes (n\u2009=\u200964) with the program CoreSimul<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Bobay, L.-M. CoreSimul: a forward-in-time simulator of genome evolution for prokaryotes modeling homologous recombination. BMC Bioinformatics 21, 264 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR59\" id=\"ref-link-section-d163628604e1775\" rel=\"nofollow noopener\" target=\"_blank\">59<\/a>, a forward simulation program that evolves a single genome along a phylogenetic tree to generate derived genomes while incorporating recombination. For each phylogenetic tree, 144 parameter combinations were tested: (1) the scale (that is, maximum pairwise distance) of the tree s\u2009=\u20090.0002, 0.001, 0.005, 0.02, 0.036 and 0.05; (2) the size of the recombination fragment, exponential distributions with mean \u03b4\u2009=\u2009200, 500 and 1,000; (3) the relative occurrence rates of recombination to mutation \u03c1\/\u03b8\u2009=\u20090.01, 0.1, 0.2 and 1; and (4) the rate of exponential decay with divergence for success of recombination \u03a6\u2009=\u20099,18, when Psuccess\u2009=\u200910\u2212\u03c0\u03a6. We simulated the evolutionary process of a 2\u2009million base-pair genome diverging into 64-genome populations with 2 different types of phylogenetic structures: one in which multiple genome-wide sweeps have occurred (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#Fig6\" rel=\"nofollow noopener\" target=\"_blank\">2a<\/a>) and another one in which the tree is fully balanced (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#Fig6\" rel=\"nofollow noopener\" target=\"_blank\">2g<\/a>). During each time segment (that is, time between consecutive nodes) on the tree, each branch on the tree receives mutations (Jukes\u2013Cantor 69 substitution model) and recombination events based on a Poisson process, but only branches that overlap in time are allowed to recombine with each other, and the probability of successful recombination exponentially decreases with sequence divergence<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 60\" title=\"Fraser, C., Hanage, W. P. &amp; Spratt, B. G. Recombination and the nature of bacterial speciation. Science 315, 476&#x2013;480 (2007).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR60\" id=\"ref-link-section-d163628604e1808\" rel=\"nofollow noopener\" target=\"_blank\">60<\/a>. For each pair of genomes, we tracked all regions that have undergone recombination since their last recent common ancestor, and regions with overlapping recombination events were merged and treated as a single event. Finally, we applied our mixture model to the simulated genomes and conducted a comparison between the estimated recombined fraction of the genome, the clonal divergence and two measures of recombination (relative effect of recombination and mutation r\/m, and relative recombination to mutation occurrence rate \u03c1\/\u03b8) with their corresponding values in the simulation.<\/p>\n<p>We found that our method performed well when there were &gt;1,500 total SNPs (&gt;0.075% divergence, including SNPs introduced by both recombination and mutation) in the pairwise alignment, when the majority of recombination fragments were &gt;500\u2009bp and when the overall recombined fraction of the genome was more than two-thirds of the genome. We propose several reasons for these limitations. First, the method becomes noisy when the overall number of SNPs falls under 1,500, which is probably due to a lack of sufficient SNP-containing windows for either the MLE or the HMM to perform efficiently. Second, our method considerably overestimates the recombined fraction of genomes when the length distribution has a mean of 200 because too many windows (500\u2009bp) are only partially recombined, and the fraction of windows that are identified as recombined no longer equals the fraction of the genome that is simulated as recombined. Third, the method also loses accuracy when the mean divergence of the recombined fraction is less than 2.5 times that of the clonal fraction as the SNP distributions in these two fractions overlap too much to be sufficiently resolved from each other. This mostly occurs when more than two-thirds of the genome is recombined under our parameter settings.<\/p>\n<p>Considering the above results, we optimized our method so that it would perform best in a range characterized by low-to-intermediate levels of genome recombination as expected in GWSS clusters that are of relatively recent origin and hence still retain a high fraction of vertically inherited genome (clonal frame). Our optimization strategy involved using an intermediate window size for counting SNPs and filtering at both ends of the recombination spectrum\u00a0where either a very small or very large fraction of the genome was recombined. We opted for a window size of 500\u2009bp because the recombination fragment size in bacteria is estimated to range from tens to thousands of base pairs<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 22\" title=\"Sakoparnig, T., Field, C. &amp; van Nimwegen, E. Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species. eLife 10, e65366 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR22\" id=\"ref-link-section-d163628604e1824\" rel=\"nofollow noopener\" target=\"_blank\">22<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 61\" title=\"Touchon, M. et al. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet. 5, e1000344 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR61\" id=\"ref-link-section-d163628604e1827\" rel=\"nofollow noopener\" target=\"_blank\">61<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 62\" title=\"McKane, M. &amp; Milkman, R. Transduction, restriction and recombination patterns in Escherichia coli. Genetics 139, 35&#x2013;43 (1995).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR62\" id=\"ref-link-section-d163628604e1830\" rel=\"nofollow noopener\" target=\"_blank\">62<\/a>. Moreover, further using a smaller window size could compromise the resolution of the method at low divergences owing to insufficient numbers of SNPs per window. Validation of our method using simulated data enabled us to establish robust filters to ensure accurate parameter estimation, free from the influence of degenerate parameter sets resulting from the MLE being confined to a local minimum. These filters were set for genome pairs that were expected to be very highly or lowly recombined. All genome pairs with less than 1,500 SNPs were considered 100% clonal, with the divergence of the clonal fraction deemed as 10\u22125. Meanwhile, all genome pairs for which the estimated mean of the recombined fraction is less than 2.5 times that of the clonal fraction were considered as 100% recombined, with both the clonal and recombined fractions of the genome sharing the same divergence as the overall genome alignment.<\/p>\n<p>As a precaution against sporadic failures of the MLE, we also implemented a corrective measure. We cross-checked whether the MLE-estimated recombination fraction exceeded twice that determined by the HMM. If such a discrepancy occurred, we substituted the MLE-estimated parameters with those derived from the HMM. Conversely, if no such discrepancy was observed, the MLE-derived clonal divergence and recombination fractions were deemed the final estimated parameters. As the HMM was also used to determine the spatial information of the recombined regions, we found that most recombined regions that stretched for less than eight consecutive sliding windows in the HMM were falsely identified. Therefore, we reassigned them as clonal regions after completion of the HMM.<\/p>\n<p>Identifying putative GWSSs from the isolate collection<\/p>\n<p>We established two criteria for the conservative identification of putative GWSSs in the isolate genome collection and encapsulated the relevant workflow into a package called PopCoGenomeS<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"cusoiv. cusoiv\/PopCoGenomeS: The tool PopCoGenomeS. Zenodo &#010;                https:\/\/doi.org\/10.5281\/zenodo.19139639&#010;                &#010;               (2026).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR24\" id=\"ref-link-section-d163628604e1848\" rel=\"nofollow noopener\" target=\"_blank\">24<\/a>. First, to ensure a sufficiently large clonal frame for confident phylogenetic analysis and downstream metagenomic mapping, we only considered genomes that were predominantly vertically inherited (that is, the pairwise recombined portion is &lt;50%). Second, the divergence among the genomes considered should satisfy the 5\u00d7 rule, which is a stricter variant of the previously established 4\u00d7 rule<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 25\" title=\"William Birky, C. &amp; Barraclough, T. G. in Lost Sex: The Evolutionary Biology of Parthenogenesis (eds Sch&#xF6;n, I., Martens, K. &amp; Dijk, P.) 201&#x2013;216 (Springer, 2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR25\" id=\"ref-link-section-d163628604e1852\" rel=\"nofollow noopener\" target=\"_blank\">25<\/a>. According to the 4\u00d7 rule, if sister clades on a tree with the same sample size are separated by a distance gap that exceeds 4 times the within-clade distance, there is less than 5% probability that the clades are formed due to random drift. The 5\u00d7 rule decreases the probability of drift to less than 1% and allows for uneven sample sizes, including cases when the sister clade is represented by a single genome<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Rosenberg, N. A. The shapes of neutral gene genealogies in two species: probabilities of monophyly, paraphyly, and polyphyly in a coalescent model. Evolution 57, 1465&#x2013;1477 (2003).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR26\" id=\"ref-link-section-d163628604e1856\" rel=\"nofollow noopener\" target=\"_blank\">26<\/a>.<\/p>\n<p>To first identify groups of isolates with mostly vertically inherited genomes (clonal frame &gt;50%), we applied our mixture model and its associated filters to each of the 176 SGBs that contained more than 5 genomes. In each SGB, we identified vertically inherited groups of genomes using the package micropan (bClust, average linkage)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 63\" title=\"Snipen, L. &amp; Liland, K. H. micropan: an R-package for microbial pan-genomics. BMC Bioinformatics 16, 79 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR63\" id=\"ref-link-section-d163628604e1863\" rel=\"nofollow noopener\" target=\"_blank\">63<\/a> in R to generate networks of genomes for which pairwise vertical inheritance averaged &gt;50%, as determined by our mixture model. In some SGBs, the fraction of recombined genomes plateaued or gradually decreased with clonal divergence after the initial increase, which may be due to the model nearing the limits of its suitable range (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>). Therefore, from each vertically inherited genome cluster identified, we removed genomes for which the average divergence from other cluster members exceeded that of genomes outside the cluster.<\/p>\n<p>We then checked whether entire groups of vertically inherited genomes could be putative GWSS clusters. We applied the 5\u00d7 rule to each vertically inherited group of genomes in a SGB by determining whether the most closely related isolate outside the group was more than 5\u00d7 distant compared with the average clonal divergence in the group. If this condition was met, the entire genome group was considered a putative GWSS cluster. Subsequently, we scanned all vertically inherited groups of genomes for evidence of GWSSs in the group. Each clade in a maximum-likelihood tree, constructed based on whole genome alignments of a vertically inherited group (phyml, GTR\u2009+\u2009G\u2009+\u2009I model)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 64\" title=\"Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307&#x2013;321 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR64\" id=\"ref-link-section-d163628604e1873\" rel=\"nofollow noopener\" target=\"_blank\">64<\/a>, was evaluated according to the 5\u00d7 rule. If the average clonal divergence in a clade was less than one-fifth of that between it and its sister clade, then the clade was identified as a GWSS cluster.<\/p>\n<p>Validation of putative GWSS clusters in metagenomes<\/p>\n<p>We sought to validate the structure of GWSS clusters and the extent of their occurrence in metagenomes representing a large diversity of host conditions and geographical locations. To confirm that the 5\u00d7 distance gaps for putative isolate GWSS clusters were not due to incomplete or biased sampling, we developed a pipeline that enabled testing of the 5\u00d7 rule by combining isolate genomes and metagenomes. To this end, we identified a consensus clonal frame (CCF) for each putative GWSS cluster based on the isolate genomes and then implemented the 5\u00d7 rule twice using two distinct distances: first, the distance of each genome and metagenome sample to the CCF; and second, the pairwise distances between all isolate genomes and metagenome samples based on their alignments to the CCF. This procedure is described in detail below.<\/p>\n<p>First, we constructed a database in which each isolate-based GWSS cluster was represented by its clonal frame to ensure that the distances we calculated between metagenomes and each putative GWSS cluster reflected only vertically acquired substitutions. We determined whether the clusters are nested (if one cluster completely encompasses another), and only kept the encompassing cluster. For putative GWSS clusters that consisted of three or more genomes, we extracted the clonal frame of each cluster by removing all recombined fragments in the core genome alignment (Mugsy (v.1.2.3)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 65\" title=\"Angiuoli, S. V. &amp; Salzberg, S. L. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics 27, 334&#x2013;342 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR65\" id=\"ref-link-section-d163628604e1888\" rel=\"nofollow noopener\" target=\"_blank\">65<\/a>) of the sweep with ClonalFrameML (v.1.12)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 66\" title=\"Didelot, X. &amp; Wilson, D. J. clonalframeml: efficient inference of recombination in whole bacterial genomes. PLoS Comput. Biol. 11, e1004041 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR66\" id=\"ref-link-section-d163628604e1892\" rel=\"nofollow noopener\" target=\"_blank\">66<\/a> and constructed a CCF by selecting the major allele of each SNP. For GWSSs with only two genomes, we extracted the clonal frame by removing all recombined segments (+500\u2009bp upstream and downstream) identified by our previous bipartitioning HMM model and randomly assigned the clonal frame of one genome as the CCF. We then clustered all of the CCFs with fastANI (v.1.33)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. &amp; Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR53\" id=\"ref-link-section-d163628604e1896\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a> and sorted CCFs with ANI\u2009&gt;\u200999% into separate databases. This resulted in 6 CCF databases each containing 53\u201375 clonal frames.<\/p>\n<p>Second, to ensure that the addition of metagenomic samples to the GWSS clusters successfully mitigated potential isolate sampling bias, we acquired a subset of metagenomes that covered many host phenotypes from the curatedMetagenomicData (v.3.4.2)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 67\" title=\"Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023&#x2013;1024 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR67\" id=\"ref-link-section-d163628604e1903\" rel=\"nofollow noopener\" target=\"_blank\">67<\/a> database. Stool metagenomes were first dereplicated by human participants so that for samples sequenced over a time series from the same participant, only the sample with the maximum number of reads was kept. We then grouped the samples by study, age category, disease and country, and selected up to five metagenomes from each unique group combination. If multiple participants from the same family were included, we only kept the metagenome for one adult member. This process resulted in a collection of 1,477 metagenomes representative of 74 datasets (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a>).<\/p>\n<p>Third, to ensure that calculation of sequence distances between the metagenomes and the CCF do not represent a mixture of strains, we filtered for metagenomes that were dominated by a single strain from each sweep-containing SGB. Metagenomes were aligned against the MetaPhlAn4 reference database (v.Jan 2022) with MetaPhlAn (v.4.0.6)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR23\" id=\"ref-link-section-d163628604e1913\" rel=\"nofollow noopener\" target=\"_blank\">23<\/a> default settings, and all polymorphic sites with a Phred quality score \u226520 and coverage \u22653 were identified. The allele frequency spectrum was generated for each SGB with &gt;40 polymorphic sites in every metagenome, and SGBs for which the 5th percentile of the spectrum exceeded 0.8 was considered as single-strain dominance in that metagenome. This cutoff roughly corresponds to a ratio of 9:1 between major and minor strains and reduces the number of usable metagenome samples per SGB to between 1 and 742 (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a>).<\/p>\n<p>Fourth, we established a consistent metric to determine the distances of isolate genomes and metagenomes to the CCFs. To determine the distances between single-strain metagenomes and the CCFs, we aligned metagenomes to each of the 6 CCF databases with bowtie (v.2.5.1; -X 2000, &#8211;no-mixed, &#8211;very-sensitive)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 68\" title=\"Langmead, B., Trapnell, C., Pop, M. &amp; Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR68\" id=\"ref-link-section-d163628604e1924\" rel=\"nofollow noopener\" target=\"_blank\">68<\/a> and calculated the distances as 1\u2013 the popANI metric in inStrain (v.1.7.5; default settings)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 69\" title=\"Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727&#x2013;736 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR69\" id=\"ref-link-section-d163628604e1928\" rel=\"nofollow noopener\" target=\"_blank\">69<\/a>. The popANI metric takes into account both major and minor alleles when calculating the distance between metagenomic samples aligned to a reference, which makes it a population-level measurement of ANI. Furthermore, to enable direct comparisons between the distances of isolate genomes and metagenomes to CCFs, we converted isolate genomes to synthetic fastq reads (HiSeq 2500 platform, 10\u00d7 coverage, ART-2016.06.05)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Huang, W., Li, L., Myers, J. R. &amp; Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593&#x2013;594 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR55\" id=\"ref-link-section-d163628604e1932\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a> and calculated their distances to the CCFs in the same manner as the metagenomes. We included all isolates that were in the putative sweep clusters as well as up to six isolates that were most closely related to the sweep (\u2018sister isolates\u2019) in the calculation. Eventually, only distances calculated from samples with\u00a02\u00d7 coverage and 25% coverage breadth (50% for isolate genomes originating from the sweep) were retained for each clonal frame.<\/p>\n<p>Finally, we applied the 5\u00d7 rule to the sample-to-CCF distances for each GWSS cluster putatively identified on the basis of isolate genomes. All samples, processed as described above, were sorted by their distance to the CCF from the nearest to the furthest. Then, beginning with the third sample in proximity to the CCF, we progressively examined samples in increasing order of distance until a sample for which the distance from the CCF exceeded five times the average distance of all samples closer to the CCF was identified. Therefore, a GWSS consisted of least three samples. All samples in closer proximity to the CCF than the identified sample were deemed part of the sweep. To avoid identifying GWSSs based on outlier samples that aligned to the corresponding CCF (that is, genomes or metagenomes misassigned to a certain SGB), we eliminated all sweeps with fewer than three samples found outside the sweep. Moreover, as some SGBs are known to be mixtures of recently diverged species, and distance gaps can arise from samples far away from the reference CCF not surpassing the coverage threshold, we excluded sweeps if the number of samples in the sweep exceeded two-thirds of the total number of samples and if, simultaneously, the distance of the closest sample outside of the sweep was more than 1.5 times further from the CCF compared with that determined only by isolates. Finally, in cases when a sample was assigned to two sweeps from the same SGB, we removed the sample from the sweep with less coverage and reran the sweep assignment for the corresponding CCF.<\/p>\n<p>As a final verification of the GWSSs identified by sample-to-CCF distances, we performed an additional 5\u00d7 test based on pairwise distances. This test included all samples (isolate genomes and metagenomes) in the GWSS cluster as well as up to six of the most closely related isolate genomes and metagenomes to the GWSS cluster. For GWSSs occurring in more than 200 samples, we subsampled across the range of distances to the corresponding CCF to obtain a final set of approximately 100 samples for analysis. Pairwise genetic distances between samples (isolate genomes and metagenomes) were calculated by obtaining the 1\u2009\u2013\u2009popANI between samples when mapping to the same CCF using inStrain (v.1.7.5)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 69\" title=\"Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727&#x2013;736 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR69\" id=\"ref-link-section-d163628604e1942\" rel=\"nofollow noopener\" target=\"_blank\">69<\/a>. We only considered pairwise samples for which more than 25% of the reference CCF was covered by both samples. We then assessed whether the average pairwise distance for all samples in the GWSS cluster was less than one-fifth of the average minimal distance of each within-GWSS sample from its closest relative outside the GWSS cluster. GWSSs that passed this additional 5\u00d7 test were confirmed as true GWSS clusters.<\/p>\n<p>The phylogenetic structure of all confirmed GWSSs was determined by extracting the bac120 marker gene set in the GTDB-Tk database (R214)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. &amp; Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315&#x2013;5316 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR44\" id=\"ref-link-section-d163628604e1949\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a> from their corresponding CCFs and inferring a phylogenetic tree from the marker gene alignments using default settings in GTDB-Tk (v.2.3.2)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. &amp; Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315&#x2013;5316 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR44\" id=\"ref-link-section-d163628604e1953\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a>.<\/p>\n<p>Robustness of GWSS assignments<\/p>\n<p>We evaluated the sensitivity of GWSS identification to the adjustments of key parameters used to detect them. We focused on three types of parameters: (1) how the recombination fraction cutoff used to determine vertical inheritance affects the detection of isolate-based sweeps; (2) how the number of metagenomes included influences the number of GWSSs detected; and (3) how changes to coverage cutoffs, both in depth and breadth, for samples mapping to CCFs affect GWSS detection. We tested 54 combinations of recombination and coverage cutoffs (parameter types\u20091 and 3) spanning all stages of the GWSS pipeline and found that the number of total and young (&lt;100\u2009years old) GWSSs identified changed by less than 15% (Supplementary Table\u2009<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">8<\/a> and Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">3c<\/a>), with the dominant effect arising from the recombination fraction cutoff (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">3a<\/a>). For parameter\u2009type\u00a02, rarefaction analysis showed that although the number of GWSSs detected increases with the number of metagenomes included, this trend plateaued between 20 and 40% of the total dataset (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">3b<\/a>), which suggests that the number of metagenomes used in the current analysis is sufficient to recover nearly all GWSSs detectable with the available isolate genomes. Together, these findings indicate that the GWSS assignments are robust to parameter adjustments and remain stable across reasonable changes to cutoff values and sampling effort. Further details on the parameter adjustments and their effects are provided in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">8<\/a>.<\/p>\n<p>Sensitivity of GWSS assignments to low-level occurrence of closely related strains<\/p>\n<p>We assessed the sensitivity of GWSS assignments to the low-level co-occurrence of closely related strains. Using the largest of our 6 CCF databases (comprising 75 CCFs), we introduced simulated reads, derived from the SGBs containing the CCFs at 0.5\u00d7 coverage of an isolate randomly sampled in the SGB, into all 1,477 metagenomes used for sweep confirmation. We found that the introduction of these simulated reads did not alter the number or identity of the GWSSs detected. Examination of two key parameters, the maximum divergence in the GWSS (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">4a<\/a>) and the distance of the GWSS to the closest relative (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">4b<\/a>), indicated that these distances also changed minimally. Overall, this analysis suggests that the existence of closely related strains at a low level has a limited impact on GWSS detection and does not substantially affect the robustness of our approach.<\/p>\n<p>SGB category assignments<\/p>\n<p>All 176 SGBs for which we performed GWSS searches were classified into commensals, pathogens or commensal SGBs that are frequently found in fermented and functional foods (probiotics) according to the following standards. An SGB was categorized as a pathogen if we found that at least a portion of isolate genomes originated from an outbreak by checking the source studies of the isolates. The criterion was that multiple (\u22653) genomes were sequenced from the outbreak, as this is also the lowest number of genomes required for the identification of a GWSS. An SGB was designated as a probiotic if literature searches of the corresponding taxa revealed a species commonly found in probiotic products or fermented foods. The remaining SGBs were classified as commensals. The classification of SGBs is provided in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>.<\/p>\n<p>Calculation of Tajima\u2019s D for GWSS clusters<\/p>\n<p>For calculation of the Tajima\u2019s D of a GWSS cluster, a clonal frame was reconstructed for each isolate and metagenome sample associated with the GWSS based on the SNP profile of the sample when mapped against the CCF of the GWSS cluster. All reconstructed clonal frames in a GWSS cluster were used to calculate the Tajima\u2019s D of the cluster and its significance level, assuming that D follows a beta distribution using the tajima.test function in the pegas package (v.1.3)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 70\" title=\"Paradis, E. pegas: an R package for population genetics with an integrated&#x2013;modular approach. Bioinformatics 26, 419&#x2013;420 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR70\" id=\"ref-link-section-d163628604e2027\" rel=\"nofollow noopener\" target=\"_blank\">70<\/a> in R.<\/p>\n<p>GWSS cluster age estimation<\/p>\n<p>The age of each GWSS cluster was calculated using two independent methods: (1) dividing the maximum pairwise SNP distances in each cluster by 2 and then with a constant molecular clock of 1\u201310\u2009mutations per genome per year; and (2) estimating a molecular clock from strains in twin metagenomes or metagenomic time series. In each GWSS cluster, the SNP distances between strains in two samples were calculated by normalizing the population_SNPs metric (inStrain (v.1.7.5)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 69\" title=\"Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727&#x2013;736 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR69\" id=\"ref-link-section-d163628604e2039\" rel=\"nofollow noopener\" target=\"_blank\">69<\/a>, defined as sites for which coverage is &gt;5\u00d7 with no shared alleles between the samples) by the fraction of reference CCF with &gt;5\u00d7 coverage in both samples. This pairwise SNP distance calculation was performed exclusively on samples for which more than 25% of the reference CCF had &gt;5\u00d7 coverage.<\/p>\n<p>We were able to estimate the metagenomic molecular clock of nine SGBs from the metagenome time series or twin metagenomes by finding all strains that persisted in individuals over a period of time or were shared between twins. We retrieved all metagenomes from the same human participant who was at least sampled 1\u2009year apart from curatedMetagenomicData (v.3.4.2)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 67\" title=\"Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023&#x2013;1024 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR67\" id=\"ref-link-section-d163628604e2046\" rel=\"nofollow noopener\" target=\"_blank\">67<\/a>. If multiple time points were sampled for the same human participant, we selected the two time points that were furthest apart. We also retrieved all metagenomes and their related metadata from 250 adult twins from the TwinsUK study<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 71\" title=\"Xie, H. et al. Shotgun metagenomics of 250 adult twins reveals genetic and environmental impacts on the gut microbiome. Cell Syst. 3, 572&#x2013;584 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR71\" id=\"ref-link-section-d163628604e2050\" rel=\"nofollow noopener\" target=\"_blank\">71<\/a>. Twins were assumed to have identical strains when they were living in the same household, and the years that the twins had lived apart were assumed as the time that the strains had to accumulate mutations. The genetic distance between strains in each metagenome and the CCF of their corresponding GWSS were calculated in the same manner as in the previous section (\u2018Validation of putative GWSS clusters in metagenomes\u2019). To account for shifts in strain dominance and strain replacement events over time, we only considered metagenomes from the same person or twin pair to be sharing strains from the same GWSS if one metagenome was more closely related to the reference CCF of the GWSS compared with the threshold previously used to establish the GWSS cluster, whereas the other metagenome was closer to the CCF than half of the minimum distance observed for metagenomes outside of the GWSS cluster.<\/p>\n<p>For each SGB, the metagenomic molecular clock was expressed as a linear function, with the SNP distance between shared strains in metagenomes as the independent variable and the time difference between the metagenomes as the dependent variable. When SGBs contained shared strains in both the metagenome time series and twin metagenome datasets, the linear function was determined as the best fit across all data points. For SGBs with shared strains in only one dataset, the function was defined as the average slope of lines constrained to pass through each data point and the origin. We eventually estimated the age of every GWSS cluster belonging to the nine SGBs by extrapolating the corresponding metagenomic molecular clock to the maximum pairwise SNP distance of the GWSS cluster.<\/p>\n<p>Validation of GWSS detection and age with pathogen datasets<\/p>\n<p>We evaluated how well our sweep detection and age estimation pipeline performed on pathogens with well-documented pandemics, as many of these can be considered as rapid, global genome-wide sweeps. We selected Vibrio cholerae as a validation case because its ongoing seventh pandemic, which includes all currently circulating pandemic strains, originated from a single source population in the Bay of Bengal followed by local diversification<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 72\" title=\"Mutreja, A. et al. Evidence for several waves of global transmission in the seventh cholera pandemic. Nature 477, 462&#x2013;465 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR72\" id=\"ref-link-section-d163628604e2068\" rel=\"nofollow noopener\" target=\"_blank\">72<\/a>. These seventh pandemic strains form a distinct clonal group known as the L2 phyletic lineage. Given these features, all modern V.\u2009cholerae L2 isolates (that is, collected after 1995) should be identifiable as a GWSS. We briefly summarize the evaluation results here, with the full details provided as\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">Supplementary Text<\/a>.<\/p>\n<p>Using dereplicated datasets of modern L2 isolates (post-1995) and non-L2 controls (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">9<\/a>), we showed that when these isolate genomes are converted into simulated human gut metagenomes with V.\u2009cholerae infection, our method accurately identified the L2 lineage as a distinct clonal GWSS cluster with a clear divergence gap from non-L2 isolates (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#Fig13\" rel=\"nofollow noopener\" target=\"_blank\">9a<\/a>). We also estimated sweep ages for the currently circulating strains and individual waves of the seventh pandemic using a molecular clock of one to ten SNPs per genome per year. Historically, the seventh cholera pandemic comprises three global waves, with the first wave now extinct and only strains from the second and third waves still circulating<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 72\" title=\"Mutreja, A. et al. Evidence for several waves of global transmission in the seventh cholera pandemic. Nature 477, 462&#x2013;465 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR72\" id=\"ref-link-section-d163628604e2090\" rel=\"nofollow noopener\" target=\"_blank\">72<\/a>. We found that the estimated sweep ages (5\u201351\u2009years for the overall circulating pandemic strains and 4\u201346\u2009years for the third wave) closely matched historical estimates of 45 and 35 years<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 72\" title=\"Mutreja, A. et al. Evidence for several waves of global transmission in the seventh cholera pandemic. Nature 477, 462&#x2013;465 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR72\" id=\"ref-link-section-d163628604e2094\" rel=\"nofollow noopener\" target=\"_blank\">72<\/a>, respectively, and the neighbour-joining tree based on the clonal divergence among isolates resolved waves\u20092 and 3 as discrete, nested clusters (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#Fig13\" rel=\"nofollow noopener\" target=\"_blank\">9b<\/a>).<\/p>\n<p>We also used this dataset to test whether using the CCF as the representative of a sweep cluster affects sweep detection or age estimation. Specifically, we compared sweep detection and age estimation results using the CCF for the L2 lineage to those obtained using clonal frames derived from ten randomly sampled L2 isolates. The difference was minimal: the maximum pairwise distance in the entire L2 lineage (representing the seventh pandemic) calculated using the CCF was 103 SNPs, which corresponded to an estimated sweep age of 5.1\u201351.5\u2009years. By comparison, the maximum pairwise distance calculated using 10 randomly sampled L2 isolates as references was 116\u2009\u00b1\u20094.6 SNPs, which corresponded to a sweep age of 5.8\u201358 years. Thus, any potential bias introduced by using the CCF is negligible, particularly given that our goal was to estimate the sweep age on the correct order of magnitude.<\/p>\n<p>Curve fitting for measuring recombination rates<\/p>\n<p>Because in most SGBs the fraction of the genome that had undergone recombination increased linearly as the number of mutations in the clonal region increased, and subsequently plateaued, the slope of the linear segment of the recombined fraction\u2013mutation plots is a measure of recombination rate. We therefore segmented all recombined fraction\u2013mutation plots (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>) for commensal bacteria to find their linearly increasing regions using the R\u2009package dpseg (v.0.1.1)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 73\" title=\"Machne, R. &amp; Stadler, P. F. dpseg: Piecewise Linear Segmentation by Dynamic Programming. R package version 0.1.1 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR73\" id=\"ref-link-section-d163628604e2116\" rel=\"nofollow noopener\" target=\"_blank\">73<\/a>. Sometimes a large number of data points clustered at low divergence and this could lead to oversegmentation of the scatter plot. Therefore, we subsampled the plot to 100 data points when there were more than 100 data points with fewer than 2,000 mutations. A total of 4 parameter combinations that included a breakpoint penalty, a\u00a0minimal segment length\u00a0and a\u00a0maximal segment length were tested for the curve fitting ((0.2,20,40), (0.1,10,20), (0.1,5,10), (0.2,20, all data points)). If the first linear fragment of the fit had R2\u2009&gt;\u20090.8, then this fragment was determined as the linearly increasing region. If R2\u2009&gt;\u20090.8 was satisfied under multiple parameter combinations, then the combination that had the maximum R2 or allowed all data points to fit to a single linear fragment with R2\u2009&gt;\u20090.8 was used. Otherwise, consecutive linear fragments with similar (within 75%) slopes were combined and refit as one fragment, and the first fragment with R2\u2009&gt;\u20090.33 was determined as the linearly increasing region. For SGBs in which automated segmentation was not satisfactory, we manually identified the linear range of increase. Finally, for all the identified linearly increasing regions of each SGB, we added a point (0,0) to the data points in the region and applied a linear regression model passing through the origin. The slope of the linear regression model was used as the recombination rate, and we were able to measure the recombination rates for 45 out 46 of the commensal SGBs with sweeps, and 52 out 95 of those without sweeps. The lower fraction of satisfactory fits in the SGBs with no confirmed sweeps was due to both fewer genomes per SGB and a more frequent absence of data points in the linearly increasing fraction of the recombined fraction\u2013mutation plots. All curve fittings are shown in Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>, and all measured recombination rates are in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">10<\/a>.<\/p>\n<p>GWSS identification from StrainPhlAn marker gene trees<\/p>\n<p>Because we were interested in testing associations of GWSS clusters with human disease or physiological states, we explored the feasibility of identifying GWSSs based on phylogenetic distances of marker genes extracted from the metagenomes as this approach can be more easily scaled up to large metagenomic datasets. As\u00a0our SGB classifications were based on the MetaPhlAn4 database,\u00a0we performed strain-level marker gene profiling for SGBs in metagenomes with StrainPhlAn4 and tested how and to what extent the 5\u00d7 rule could be extended to StrainPhlAn4 marker gene trees to identify GWSS clusters. We set up two criteria for calling a GWSS cluster from the marker gene tree: (1) the normalized average genetic distance in a marker gene based GWSS cluster needs to be smaller than a normalized cutoff based on previously identified GWSS clusters; and (2) the phylogenetic distance between the proposed GWSS clade and its sister clade exceeds five times the average distance in the GWSS clade.<\/p>\n<p>To define the cutoff for the first criterion, we constructed mock metagenomes for all isolate genomes in GWSS clusters. Each mock metagenome for an isolate genome consisted of synthetic fastq reads for the target genome at 20\u00d7 coverage (ART-2016.06.05, -ss HS25 -f 20)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Huang, W., Li, L., Myers, J. R. &amp; Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593&#x2013;594 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR55\" id=\"ref-link-section-d163628604e2159\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a>, and a randomly selected isolate genome from every other GWSS-containing SGB at 1\u00d7 coverage. Therefore, the total number of mock metagenomes for each SGB is the number of isolate genomes identified in GWSS clusters for that SGB. For each SGB, strain-profiling was performed for each mock metagenome with StrainPhlAn4 against the MetaPhlAn4 reference database (v.Jan 2022)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR23\" id=\"ref-link-section-d163628604e2163\" rel=\"nofollow noopener\" target=\"_blank\">23<\/a>, which resulted in a tree that was built using marker genes from all the isolate-based mock metagenomes and marker genes extracted directly from all other isolate genomes in the SGB. The cutoff was set as SGB-specific normalized phylogenetic distance (nGD) thresholds that optimally separated isolate pairs in GWSS clusters from isolate pairs that had only one isolate genome in the GWSS cluster. nGDs were calculated as leaf-to-leaf branch lengths on the SGB marker gene tree normalized by their median. For SGBs with at least 50 pairs of isolates in the GWSS cluster, nGD cutoff thresholds were defined based on the value that would maximize the Youden\u2019s index (R package cutpointr, v.1.2.0)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 74\" title=\"Thiele, C. &amp; Hirschfeld, G. cutpointr: improved estimation and validation of optimal cutpoints in R. J. Stat. Softw. 98, 1&#x2013;27 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR74\" id=\"ref-link-section-d163628604e2167\" rel=\"nofollow noopener\" target=\"_blank\">74<\/a>, unless the value exceeded the 5th percentile of the isolate pairs that had only one isolate genome in the GWSS cluster. For SGBs with fewer than 50 total within-GWSS isolate pairs, the nGD corresponding to the 3rd percentile of the isolate pairs with only one isolate genome in the GWSS cluster was used as the cutoff.<\/p>\n<p>StrainPhlAn marker gene trees were constructed for each SGB with the same isolate and metagenome samples previously used to identify GWSSs. For 32 out 46 of the commensal bacteria, at least 2 out 3 of the samples in previously identified GWSS clusters were retained in GWSS clusters called from the StrainPhlAn marker gene trees (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#Fig9\" rel=\"nofollow noopener\" target=\"_blank\">5a<\/a>). Also, because GWSS clusters identified from marker gene trees are not necessarily the expansion of isolate based GWSS clusters but can be purely metagenome based, for the majority of the SGBs (33 out 46), GWSS clusters called from the StrainPhlAn marker gene trees included more samples than those included in the previously identified GWSS clusters (Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#Fig9\" rel=\"nofollow noopener\" target=\"_blank\">5b<\/a>).<\/p>\n<p>Association studies of GWSS clusters in metagenomes<\/p>\n<p>Associations between GWSS clusters and five human health metrics, advanced age (&gt;65\u2009years old), CRC, UC, CD and T2D, were examined for each SGB. These five metrics were chosen because they represented different types of disease and health states that are related to gut microbiome dysbiosis and the availability of sufficient samples across diverse biogeographies. To start, we assembled a baseline metagenome database comprising 12 datasets, which collectively included 2,084 samples from 1,446 human participants in the curatedMetagenomicData database (v.3.4.2)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 67\" title=\"Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023&#x2013;1024 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR67\" id=\"ref-link-section-d163628604e2188\" rel=\"nofollow noopener\" target=\"_blank\">67<\/a>. This database includes all adult samples from the 12 datasets and excludes individuals who were on antibiotics. Eventually, this baseline database included 654 healthy individuals and 792 individuals with various diseases: atherosclerotic cardiovascular disease (n\u2009=\u2009187), CRC (n\u2009=\u2009132), inflammatory bowel disease (IBD, n\u2009=\u2009186), glucose metabolism-related diseases (n\u2009=\u2009131), rheumatoid arthritis (n\u2009=\u200989) or adenoma (n\u2009=\u200967). We further expanded the database by incorporating five additional CRC datasets, six IBD datasets (including both UC and CD) and six additional T2D datasets, applying the same filtering criteria. This extended dataset of 6,783 samples from 4,614 individuals (including 646 patients with CRC, 749 patients with T2D, 467 patients with CD and 342 patients with UC) captures all large-scale studies available so far for these diseases (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">11<\/a>).<\/p>\n<p>The full 6,783 sample dataset was used to identify GWSSs. For all 46 commensal SGBs with previously confirmed GWSSs, strain-profiling was performed with StrainPhlAn4 against the MetaPhlAn4 reference database (v.Jan 2022)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR23\" id=\"ref-link-section-d163628604e2217\" rel=\"nofollow noopener\" target=\"_blank\">23<\/a>. Markers for each SGB were extracted from all isolate genomes and all metagenomes with single-strain dominance for the SGB (see the section \u2018Validation of putative GWSS clusters in metagenomes\u2019). For each SGB, all extracted markers were aligned, filtered and constructed into a maximum-likelihood tree according to the default settings under the accurate mode of StrainPhlAn4. The two criteria for identifying GWSS clusters in StrainPhlAn marker gene trees were applied to each SGB. A total of 1,479 GWSS clusters were identified in 40 out of the 46 commensal SGBs examined.<\/p>\n<p>As further preparation for the association analysis, we performed additional filtering and metadata curation for all the samples involved in each SGB marker tree. As age and disease information were often unavailable for isolate genomes, we removed all isolate genomes from the SGB trees. For samples originating from the same participant or from participants in the same family, we only kept one sample at random from each participant or family member for each SGB tree. For each target disease (CRC, UC, CD and T2D), we compared samples from affected individuals with samples from participants\u00a0without the corresponding disease (control group). To prevent the control group from being dominated by samples from other disease cohorts, the association analysis for each disease was restricted to samples from the baseline dataset and the corresponding disease-specific expanded datasets. For age-related analyses, participants aged &gt;65\u2009years were classified as \u2018advanced age\u2019 and the remainder as \u2018normal age\u2019; associations were assessed between healthy individuals in these two age groups.<\/p>\n<p>Associations between GWSS clusters in each SGB and age and disease were examined by building a general linear model with stepwise, forward variable selection and false-discovery rate correction (Benjamini\u2013Hochberg procedure). We asked whether being in a certain sweep or not has a positive or negative impact on the sample being from patients with a disease or those of advanced age with the formula Y (diseased or advanced age)\u2009=\u2009\u03b21S1\u2009+\u2009\u03b22S2\u2009+\u2009\u2026\u03b2nSn\u2009+\u2009\u03bc, where S1, S2, \u2026 Sn represent GWSS clusters detected in each SGB. The forward selection was performed with the R package SignifReg (v.4.3)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Zambom, A. Z. &amp; Kim, J. Consistent significance controlled variable selection in high-dimensional regression. Stat 7, e210 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR75\" id=\"ref-link-section-d163628604e2274\" rel=\"nofollow noopener\" target=\"_blank\">75<\/a> under the criteria that a new predictor is added to the model if the addition of the predictor further minimizes the model P\u2009value, and every individual predictor remains significant at Padj\u2009&lt;\u20090.05 after correcting for multiple hypothesis testing with the Benjamini\u2013Hochberg procedure. All selected sweeps were further tested for geographical biases to ask whether sweeps are dominated by samples from certain countries by performing a chi-squared test for the country distribution in each sweep.<\/p>\n<p>Associations between SGBs and disease or age were performed in the same manner as for GWSS clusters in each SGB, with the formula Y (disease or advanced age)\u2009=\u2009\u03b21SGB1\u2009+\u2009\u03b22SGB2\u2009+\u2009\u2026\u03b2nSGBn\u2009+\u2009\u03bc, where SGB1, SGB2, \u2026 SGBn represent individual SGBs. Owing to the large number of metagenomes associated with each SGB, we did not test for geographical bias in each SGB.<\/p>\n<p>Identification of sweep-specific genes<\/p>\n<p>To identify sweep-specific genes that were specific to each GWSS cluster (genes that are both highly differentiated from other GWSS clusters and missing from sister genomes), we first predicted all protein-coding genes in the CCF and isolate genomes (Prodigal v.2.6.3)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 76\" title=\"Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR76\" id=\"ref-link-section-d163628604e2337\" rel=\"nofollow noopener\" target=\"_blank\">76<\/a> from each of the isolate-based sweep clusters. The protein-coding genes in each CCF were then pairwise aligned at the protein and nucleotide level (BLAST v.2.15.0+)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 77\" title=\"Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR77\" id=\"ref-link-section-d163628604e2341\" rel=\"nofollow noopener\" target=\"_blank\">77<\/a>, and proteins found in a single CCF were selected as sweep-specific genes. Specifically, the selected proteins had no homologue in the other CCFs after filtering for alignments with over 60% amino acid identity and alignment length. We then required that the genes encoding the selected proteins in each CCF are identical at the nucleotide level in isolate genomes in the corresponding sweep cluster and share less than 60% nucleotide identity and alignment length with sister genomes (up to six isolates that were most closely related to the sweep). Selected protein-coding genes were annotated using EggNOG (emapper v.2.1.12, database v.5.0.2)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 78\" title=\"Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309&#x2013;D314 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR78\" id=\"ref-link-section-d163628604e2345\" rel=\"nofollow noopener\" target=\"_blank\">78<\/a> and Prokka (v.1.14.6)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 79\" title=\"Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068&#x2013;2069 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR79\" id=\"ref-link-section-d163628604e2349\" rel=\"nofollow noopener\" target=\"_blank\">79<\/a>. Finally, to test for COG categories or Pfam families enriched in the sweep-specific gene clusters, annotations of the sweep-specific genes were compared with those in the entire CCFs using a Fisher\u2019s exact test with Bonferroni correction.<\/p>\n<p>Statistical analysis<\/p>\n<p>Statistical analyses and graphical representations were performed in R (v.4.2.1)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 80\" title=\"The R Development Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR80\" id=\"ref-link-section-d163628604e2361\" rel=\"nofollow noopener\" target=\"_blank\">80<\/a> using base R statistical functions and ggplot2 (v.3.5.1)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 81\" title=\"Wickham, H. Ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR81\" id=\"ref-link-section-d163628604e2365\" rel=\"nofollow noopener\" target=\"_blank\">81<\/a>, ggpubr (v.0.6.0)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 82\" title=\"Kassambara, A. ggpubr: &#x2018;ggplot2&#x2019; based publication ready plots. R package version 0.6.0 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR82\" id=\"ref-link-section-d163628604e2369\" rel=\"nofollow noopener\" target=\"_blank\">82<\/a>, ggtree (v.3.4.4)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 83\" title=\"Yu, G., Smith, D. K., Zhu, H., Guan, Y. &amp; Lam, T. T.-Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 8, 28&#x2013;36 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR83\" id=\"ref-link-section-d163628604e2373\" rel=\"nofollow noopener\" target=\"_blank\">83<\/a>, ggtreeExtra (v.1.6.1)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 84\" title=\"Xu, S. et al. ggtreeExtra: compact visualization of richly annotated phylogenetic data. Mol. Biol. Evol. 38, 4039&#x2013;4042 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR84\" id=\"ref-link-section-d163628604e2377\" rel=\"nofollow noopener\" target=\"_blank\">84<\/a> and ComplexHeatmap (v.2.12.1)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 85\" title=\"Gu, Z., Eils, R. &amp; Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847&#x2013;2849 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#ref-CR85\" id=\"ref-link-section-d163628604e2382\" rel=\"nofollow noopener\" target=\"_blank\">85<\/a>. Correction for multiple testing (Benjamini\u2013Hochberg procedure) was applied when appropriate and significance was defined at Padj\u2009&lt;\u20090.05. All tests were two-sided, except for those assessing functional enrichment of genes specific to GWSS clusters. To access differences between two groups, Student\u2019s t-test was performed on data that passed the Shapiro\u2013Wilk normality test; otherwise, a Wilcoxon rank-sum test was performed. Correlations were assessed with Spearman\u2019s tests. All geographical biases in the datasets were accessed either with a chi-squared test or a Fisher\u2019s exact test.<\/p>\n<p>Ethical compliance<\/p>\n<p>For the Austrian isolate collection, study approval was granted by the ethics committee of the Medical University of Vienna (EK-Nr: 1617\/2014, 1910\/2019). All study participants gave written informed consent before study inclusion. The study was conducted in accordance with the ethical principles of the Declaration of Helsinki. The analysis of the Global Microbiome Conservancy isolate dataset was conducted with authorized access to data from the database of Genotypes and Phenotypes (accession <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/projects\/gap\/cgi-bin\/study.cgi?study_id=phs002235.v1.p1\" rel=\"nofollow noopener\" target=\"_blank\">phs002235.v1.p1<\/a>) under approval from the US National Human Genome Research Institute.<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-026-10476-w#MOESM2\" rel=\"nofollow noopener\" target=\"_blank\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>\n","protected":false},"excerpt":{"rendered":"Isolate genome collection A total of 19,837 publicly available isolate genomes were collected and consisted of the entire&hellip;\n","protected":false},"author":3,"featured_media":779767,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[16278,210,10046,3814,20302,10047,159,67,132,68],"class_list":{"0":"post-779766","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-health","8":"tag-evolution","9":"tag-health","10":"tag-humanities-and-social-sciences","11":"tag-microbial-ecology","12":"tag-microbiome","13":"tag-multidisciplinary","14":"tag-science","15":"tag-united-states","16":"tag-unitedstates","17":"tag-us"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@us\/116533667597155975","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/779766","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/comments?post=779766"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/779766\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media\/779767"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media?parent=779766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/categories?post=779766"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/tags?post=779766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}