{"id":11771,"date":"2025-08-20T14:43:11","date_gmt":"2025-08-20T14:43:11","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/11771\/"},"modified":"2025-08-20T14:43:11","modified_gmt":"2025-08-20T14:43:11","slug":"a-comparison-of-27-arabidopsis-thaliana-genomes-and-the-path-toward-an-unbiased-characterization-of-genetic-polymorphism","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/11771\/","title":{"rendered":"A comparison of 27 Arabidopsis thaliana genomes and the path toward an unbiased characterization of genetic polymorphism"},"content":{"rendered":"<p>DNA extraction and sequencing<\/p>\n<p>For long-read sequencing, we began with 3-week-old plants grown in soil that had been transferred to darkness for 24\u201348\u2009h before harvesting to reduce the starch content. A total of 20\u201330\u2009g of flash-frozen rosette tissue, pooled from individuals, was ground in liquid nitrogen with a pestle and mortar. Nuclei were isolated as described for accession Ey15-2 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Rabanal, F. A. et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 50, 12309&#x2013;12327 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR75\" id=\"ref-link-section-d23733814e1942\" rel=\"nofollow noopener\" target=\"_blank\">75<\/a>), and high-molecular-weight (HMW) DNA purified with Genomic-tips 100G (Qiagen, 10243) following the manufacturer\u2019s instructions. Ten micrograms of HMW DNA were sheared with either Megaruptor 3 (Diagenode, B06010003) or a FINE-JECT 26G\u00d7 1\u2033 needle (0.45\u2009\u00d7\u200925\u2009mm; 14-13651) to ca. 75\u2009kb, and used as input for long-read library preparation with the SMRTbell Express Template Preparation Kit 2.0 (Pacific Biosciences, 101-693-800). 
These libraries were size-selected with the BluePippin system (Sage Science) with a 30\u2009kb cutoff in a 0.75% DF Marker U1 high-pass 30\u201340\u2009kb vs3 gel cassette (Biozym, BLF7510). Libraries for accessions 9981 (Angit-1; CS76366) and 10002 (TueWal-2; CS76405) were sequenced on a Sequel II system (Pacific Biosciences), and the others on a Sequel I system.<\/p>\n<p>To prepare PCR-free libraries for short-read sequencing, the genomic DNA was fragmented to 250\u2013350\u2009bp using a Covaris S2 Focused Ultrasonicator (Covaris). The libraries were prepared according to the manufacturer\u2019s instructions with either the TruSeq DNA PCR-free kit (Illumina, 20015962) or the NxSeq AmpFREE Low DNA Library kit (Lucigen, 14000-2). In total, libraries for 89 accessions (including the main 27 for which we assembled their genomes) were sequenced in paired-end mode on a HiSeq 3000 system (Illumina).<\/p>\n<p>The ultra-HMW DNA extraction and sample preparation for optical maps were performed as described<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Rabanal, F. A. et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 50, 12309&#x2013;12327 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR75\" id=\"ref-link-section-d23733814e1952\" rel=\"nofollow noopener\" target=\"_blank\">75<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 76\" title=\"Ou, S. et al. Effect of sequence depth and length in long-read assembly of the maize inbred NC358. Nat. Commun. 
11, 2288 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR76\" id=\"ref-link-section-d23733814e1955\" rel=\"nofollow noopener\" target=\"_blank\">76<\/a> at Corteva Agriscience using the Direct Label and Stain technology (Bionano Genomics).<\/p>\n<p>Assembly<\/p>\n<p>The CLR subreads were assembled with Canu v1.71 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 77\" title=\"Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722&#x2013;736 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR77\" id=\"ref-link-section-d23733814e1967\" rel=\"nofollow noopener\" target=\"_blank\">77<\/a>). Since accessions 9981 and 10002 had been sequenced at higher coverage on a Sequel II instrument, only about 200\u00d7 genome coverage worth of reads were used for assembly. We performed two rounds of polishing on the resulting contigs of all assemblies\u2014first with the CLR subreads and Arrow v2.3.2 (<a href=\"https:\/\/github.com\/PacificBiosciences\/gcpp\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/github.com\/PacificBiosciences\/gcpp<\/a>), and then with PCR-free short reads and Pilon v1.22 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 78\" title=\"Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. 
PLoS ONE 9, e112963 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR78\" id=\"ref-link-section-d23733814e1978\" rel=\"nofollow noopener\" target=\"_blank\">78<\/a>).<\/p>\n<p>For scaffolding, we generated hybrid scaffolds with optical maps for eight accessions (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM4\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>) using Bionano Access v1.5 and <a href=\"https:\/\/bionanogenomics.com\/support\/software-downloads\" rel=\"nofollow noopener\" target=\"_blank\">Bionano Solve<\/a> v3.6. The assembly was performed in pre-assembly mode with parameters nonhaplotype and no-CMPR-cut, without extend-split. Based on what we learned from these hybrid assemblies, we set the parameters for in silico scaffolding of the other genomes. We scaffolded contigs &gt;150\u2009kb with RagTag v1.1.1 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 79\" title=\"Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23, 258 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR79\" id=\"ref-link-section-d23733814e1995\" rel=\"nofollow noopener\" target=\"_blank\">79<\/a>; scaffold -q 60 -f 10000 -I 0.6 --remove-small) using the TAIR10 reference with hard-masked centromeres, rDNAs, telomeres and nuclear insertions of organelles to prevent misplacement of contigs due to reference bias<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Rabanal, F. A. et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 
50, 12309&#x2013;12327 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR75\" id=\"ref-link-section-d23733814e1999\" rel=\"nofollow noopener\" target=\"_blank\">75<\/a>. All scaffolded assemblies were manually curated to specifically discard low-confidence centromere satellite-rich contigs or to invert contigs with satellite repeats at their edges, indicative of their correct orientation. These edits were implemented in the AGP files, which were converted to FASTA format with the RagTag agp2fa function<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 79\" title=\"Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23, 258 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR79\" id=\"ref-link-section-d23733814e2003\" rel=\"nofollow noopener\" target=\"_blank\">79<\/a>. To detect traces of residual heterozygosity, we aligned the original long reads to their corresponding chromosome scaffolds using pbmm2 v1.3.0 with the parameters align --sort --log-level DEBUG --preset SUBREAD --min-length 5000. Unmapped reads, as well as secondary and supplementary alignments, were filtered out using samtools v1.9 (view -b -F 2308 Chr1 Chr2 Chr3 Chr4 Chr5). The resulting BAM file was then analyzed with NucFreq v0.1 (--minobed 2) to assess genome-wide coverage of primary and secondary alleles<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 80\" title=\"Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88&#x2013;94 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR80\" id=\"ref-link-section-d23733814e2008\" rel=\"nofollow noopener\" target=\"_blank\">80<\/a>. 
AGP files, both before and after manual curation, as well as NucFreq plots, are available in the GitHub repository of this project.<\/p>\n<p>Repeat annotation<\/p>\n<p>Repetitive elements were annotated as described<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Rabanal, F. A. et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 50, 12309&#x2013;12327 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR75\" id=\"ref-link-section-d23733814e2020\" rel=\"nofollow noopener\" target=\"_blank\">75<\/a>. We ran RepeatMasker v4.0.9 (-cutoff 200 -nolow -gff -xsmall) using a custom library that included various consensus sequences for the CEN178 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 81\" title=\"Maheshwari, S., Ishii, T., Brown, C. T., Houben, A. &amp; Comai, L. Centromere location in Arabidopsis is unaltered by extreme divergence in CENH3 protein sequence. Genome Res. 27, 471&#x2013;478 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR81\" id=\"ref-link-section-d23733814e2024\" rel=\"nofollow noopener\" target=\"_blank\">81<\/a>), 5S rDNA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 82\" title=\"Simon, L. et al. Genetic and epigenetic variation in 5S ribosomal RNA genes reveals genome dynamics in Arabidopsis thaliana. Nucleic Acids Res. 
46, 3019&#x2013;3033 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR82\" id=\"ref-link-section-d23733814e2028\" rel=\"nofollow noopener\" target=\"_blank\">82<\/a>, 45S rDNA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 83\" title=\"Rabanal, F. A. et al. Unstable inheritance of 45S rRNA genes in Arabidopsis thaliana. G3 7, 1201&#x2013;1209 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR83\" id=\"ref-link-section-d23733814e2032\" rel=\"nofollow noopener\" target=\"_blank\">83<\/a> and telomere repeats. We annotated tRNAs with tRNAscan-SE v2.0.6 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 84\" title=\"Chan, P. P. &amp; Lowe, T. M. tRNAscan-SE: searching for tRNA genes in genomic sequences. Methods Mol. Biol. 1962, 1&#x2013;14 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR84\" id=\"ref-link-section-d23733814e2036\" rel=\"nofollow noopener\" target=\"_blank\">84<\/a>) and TEs with Extensive de novo TE Annotator v1.9.7 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 85\" title=\"Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 
20, 275 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR85\" id=\"ref-link-section-d23733814e2041\" rel=\"nofollow noopener\" target=\"_blank\">85<\/a>; --step all --sensitive 1 --anno 1), a pipeline that combines several TE annotation tools (LTRharvest, LTR_FINDER, LTR_retriever, TIR-Learner, HelitronScanner and TEsorter)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Ellinghaus, D., Kurtz, S. &amp; Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).\" href=\"#ref-CR86\" id=\"ref-link-section-d23733814e2045\">86<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Ou, S. &amp; Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410&#x2013;1422 (2018).\" href=\"#ref-CR87\" id=\"ref-link-section-d23733814e2045_1\">87<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Xu, Z. &amp; Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265&#x2013;W268 (2007).\" href=\"#ref-CR88\" id=\"ref-link-section-d23733814e2045_2\">88<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Ou, S. &amp; Jiang, N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mob. DNA 10, 48 (2019).\" href=\"#ref-CR89\" id=\"ref-link-section-d23733814e2045_3\">89<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Su, W., Gu, X. &amp; Peterson, T. 
TIR-Learner, a new ensemble method for TIR transposable element annotation, provides evidence for abundant new transposable elements in the maize genome. Mol. Plant 12, 447&#x2013;460 (2019).\" href=\"#ref-CR90\" id=\"ref-link-section-d23733814e2045_4\">90<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Shi, J. &amp; Liang, C. Generic repeat finder: a high-sensitivity tool for genome-wide de novo repeat detection. Plant Physiol. 180, 1803&#x2013;1815 (2019).\" href=\"#ref-CR91\" id=\"ref-link-section-d23733814e2045_5\">91<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Xiong, W., He, L., Lai, J., Dooner, H. K. &amp; Du, C. HelitronScanner uncovers a large overlooked cache of helitron transposons in many plant genomes. Proc. Natl Acad. Sci. USA 111, 10263&#x2013;10268 (2014).\" href=\"#ref-CR92\" id=\"ref-link-section-d23733814e2045_6\">92<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 93\" title=\"Zhang, R.-G. et al. TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes. Hortic. Res. 9, uhac017 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR93\" id=\"ref-link-section-d23733814e2048\" rel=\"nofollow noopener\" target=\"_blank\">93<\/a>. 
Finally, to understand the causes of contig breaks, we determined the type of repetitive element closest to each contig edge, considering the first 2\u2009kb from each edge in contigs &gt;10\u2009kb.<\/p>\n<p>Pannagram<\/p>\n<p>Pannagram is a toolkit designed for reference-free pangenome alignment, annotation and analysis, as well as for generating diagrams<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 46\" title=\"Igolkina, A. A., Bezlepsky, A. D. &amp; Nordborg, M. Pannagram: unbiased pangenome alignment and the mobilome calling. Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/2025.02.07.637071&#010;                &#010;               (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR46\" id=\"ref-link-section-d23733814e2060\" rel=\"nofollow noopener\" target=\"_blank\">46<\/a>.<\/p>\n<p>We represent the WGA as a matrix of corresponding positions, where rows represent accessions and columns represent homologous positions. The construction of the alignment is done in a reference-free manner (see below). However, to visualize the alignment in genome browsers, columns must be sorted in some manner, for example, to correspond to the TAIR10 sequence order. Then, columns of the pangenome are used as positions in the pangenome coordinate system.<\/p>\n<p>To perform reference-free WGA, we developed a three-step pipeline. First, we use several accessions as references and build draft pairwise alignments between each and all other accessions. This process results in several reference-based matrices of corresponding positions. Next, we intersect these matrices, selecting only those columns that are present in all reference-biased matrices, which produces reliable and reference-independent correspondences. 
In the final step, we resolve unaligned sequences between blocks of corresponding positions using multiple sequence alignment tools. Once the reference-free alignment is complete, it can be sorted according to the desired order of accessions. In our case, we employ an alphabetical order, with the TAIR10 genome first.<\/p>\n<p>For the pairwise alignments between a reference genome (not necessarily TAIR10) and another accession, the focal accession genome is divided into blocks of 5,000\u2009bp, and each block is then mapped to the corresponding chromosomes of the reference genome using BLAST, with exactly one best hit retained for each block through this process. Next, the BLAST hits that are not in close proximity to each other in both genomes are removed. An additional BLAST search is performed to align corresponding unaligned sequences between remaining hits.<\/p>\n<p>To resolve any unaligned blocks after the reference-randomization procedure, MAFFT<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 94\" title=\"Katoh, K., Misawa, K., Kuma, K.-I. &amp; Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059&#x2013;3066 (2002).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR94\" id=\"ref-link-section-d23733814e2077\" rel=\"nofollow noopener\" target=\"_blank\">94<\/a> is used. Blocks longer than 30\u2009kb cannot be aligned within a reasonable time using MAFFT, so they are considered to be highly diverged. We found the final unaligned regions to be primarily associated with centromeric regions, rDNA clusters, telomeres and complex regions of multiple and long insertions and deletions, which are regions that are not of primary interest in this paper.<\/p>\n<p>Given the WGA, SNPs can simply be output as sequence differences. 
However, sequence differences can arise from ambiguities in local alignment and do not necessarily correspond to SNPs (Supplementary Note <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a>). If we consider all sequence differences as SNPs, a pair of accessions differs at over 800,000 positions on average; however, if we restrict ourselves to isolated sequence differences, the number shrinks to 600,000.<\/p>\n<p>Pangenome graph<\/p>\n<p>Graph construction<\/p>\n<p>We constructed genome graphs for each of the five chromosomes using the PGGB pipeline<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 29\" title=\"Garrison, E. et al. Building pangenome graphs. Nat. Methods 21, 2008&#x2013;2012 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR29\" id=\"ref-link-section-d23733814e2100\" rel=\"nofollow noopener\" target=\"_blank\">29<\/a>. First, we prepared the assemblies by splitting them into chromosomes and removing all unplaced contigs. To enforce linearity for simpler analysis and comparison, we used a modified version of accession 22001 with the genome rearranged to a consensus pan-genomic order (suffix: \u2018f\u2019). We added the TAIR10 reference genome to the graph to enable anchoring and presentation of results in a reference framework.<\/p>\n<p>We executed the PGGB pipeline (downloaded on 25 January 2024) with the following parameters: -s 10000 -p 90 -n 27. PGGB consists of the following three methods: an all-against-all alignment with wfmash (v0.12.4-5-g0b191bb), graph induction using seqwish (v0.7.9-2-gf44b402) and two rounds of pangenome ordering (odgi v0.8.3-26-gbc7742ed) followed by normalization with smoothxg (v0.7.2-11-g9970e0d). 
The graph was used for analyzing the pangenome and synteny, as well as detecting variation using vg deconstruct<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 95\" title=\"Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875&#x2013;879 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR95\" id=\"ref-link-section-d23733814e2107\" rel=\"nofollow noopener\" target=\"_blank\">95<\/a>.<\/p>\n<p>Similarity<\/p>\n<p>We exploited graph properties to classify different levels of similarity between genomes. Nodes traversed in all accessions are labeled as core, nodes traversed in only one accession are private and all other nodes (traversed in more than one but not all accessions) are shell.<\/p>\n<p>Synteny windows<\/p>\n<p>Every node in the graph can be translated to its exact position for each path. This direct mapping allows us to compute graph-based statistics in sliding windows along each sample\/path. Here we used nonoverlapping windows of 300\u2009kb and calculated the average similarity (see above) of these regions. This was performed for each graph and path independently and the results were represented in a heat map.<\/p>\n<p>Saturation analysis<\/p>\n<p>A saturation analysis was performed using a bootstrapping approach. In each iteration, we removed a specific number of paths from our graph and performed the same pangenome categorization as above (\u2018Similarity\u2019). In addition, we report the total pangenome, which describes the total amount of sequence (core\u2009+\u2009shell\u2009+\u2009private sequence). 
We evaluated 20 unique combinations of genomes for each size (number of genomes).<\/p>\n<p>Deconstructing the graph<\/p>\n<p>To capture all variation and cover every bubble in the graph, vg deconstruct was run once per accession, each time using that accession as the reference path (vg deconstruct -a -e; vg v1.54.0 \u2018Parafada\u2019). The resulting VCF files were then converted to BED files retaining all relevant information. In addition, the per-chromosome files were merged, and genotype information was derived and added. Bubbles were identified by their start and end positions, and all traversals within these bubbles were also reported. Scripts can be found in the repository.<\/p>\n<p>sSVs and cSVs in the graphs were defined as follows:<\/p>\n<ul class=\"u-list-style-bullet\">\n<li>\n<p>All SVs represent indels, having one very small traversal (deletion) and a large one containing the SV sequence (insertion).<\/p>\n<\/li>\n<li>\n<p>Bubbles were identified as sSVs if the bubble was shared by all accessions in the graph (here 28), and as cSVs if not. Traversals covering the insertion must be at least 15\u2009bp long and exhibit high similarity to one another (95% sequence similarity). The deletion part of the bubble should be small, at most 5% of the length of the inserted sequence.<\/p>\n<\/li>\n<li>\n<p>Most cSVs correspond to bubbles that have a complex structure and\/or are sub-bubbles of larger bubbles.<\/p>\n<\/li>\n<\/ul>\n<p>General pangenome<\/p>\n<p>To perform a reference-free pangenome analysis, we used genome graphs built separately for each chromosome. The complete graph contains 18.3 million nodes and 20.9 million edges, with a total size of 225\u2009Mb, and has a mean compression rate of 6.75% across all chromosomes. 
Similar to other genome-wide analyses in this study, the large-scale reciprocal translocation in accession 22001 was masked to maximize linearity and increase resolution in the variation graph.<\/p>\n<p>The mobile-ome<\/p>\n<p>The mobile-ome refers to the collection of insertions and deletions that are likely to have occurred recently and are therefore not fixed in our sample. We hypothesize that each mobile event results in an SV, specifically a presence\u2013absence polymorphism at the location of the insertion or deletion. Consequently, our initial approach involves extracting all presence\u2013absence SVs and systematically decomposing them step by step. To distinguish between simple bi-allelic presence\u2013absence polymorphisms (indels) and cSVs, we analyzed the lengths of alleles within the SVs. We distinguish the two types based on the similarity threshold s, with s\u2009=\u20090.9 in our case. We consider an SV a simple indel if it contains alleles of only two length types: those that are shorter than (1\u2009\u2212\u2009s) of the SV length (absence allele) and those that are longer than s of the SV length (presence allele). The distinction between simple and complex presence\u2013absence polymorphisms is partially a computational construct to filter SVs and simplify further analysis. Simple indels and complex presence\u2013absence polymorphisms form a continuum, and by relaxing the similarity threshold (s\u2009A. thaliana TEs, as well as against themselves. 
The indels that exhibited some similarity to known TEs were divided into the following groups: \u2018is complete\u2019, for indels with significant similarity to known TEs that can themselves be classified as TEs; \u2018contains complete\u2019, for indels that contain regions with similarity to known TEs, but also additional sequences; \u2018is fragment\u2019, for indels that contain only partial sequences with similarity to known TEs; and \u2018contains fragment\u2019, for indels partially covered by BLAST hits to TE segments, but also containing additional sequences unrelated to known TEs.<\/p>\n<p>We consider all these indels as parts of the mobile-ome. Indels without similarity to known TEs but showing nested similarities within the indel data set (where one sequence is a subsequence of another) were considered as potential candidates for new mobile-ome elements. To investigate their potential function, we translated each of these indels in all six reading frames. From each translated sequence, we selected either all continuous stretches without stop codons that were longer than 100 codons or the longest stretch that exceeded 30 codons without a stop. Subsequently, we performed a BLAST search using the obtained amino acid sequences against the NCBI protein database and classified the potential proteins into four categories. If the BLAST results for an sSV contained keywords related to TEs, we assigned the sequence to the TE-like category. These keywords were \u2018transcriptase\u2019, \u2018reverse\u2019, \u2018transpos\u2019, \u2018gag-\u2019, \u2018pol-\u2019, \u2018integrase\u2019, \u2018gag\/pol\u2019, \u2018gagpol\u2019, \u2018retrovirus\u2019, \u2018RNA-directed DNA polymerase\u2019 and \u2018RNA-dependent DNA polymerase\u2019. sSVs that only had BLAST hits with descriptions such as \u2018hypothetical protein\u2019, \u2018unnamed protein product\u2019, \u2018uncharacterized protein\u2019, \u2018predicted protein\u2019, \u2018PREDICTED:\u2019, \u2018putative protein\u2019 and \u2018unknown\u2019 were categorized as \u2018undefined proteins\u2019. 
Indels without any BLAST hits were classified as \u2018no protein\u2019. In all other cases, the sSV was categorized as a \u2018defined protein\u2019.<\/p>\n<p>Gene annotation<\/p>\n<p>Preliminary annotation<\/p>\n<p>Gene annotation was mainly based on Augustus (v3.3.3)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 96\" title=\"Stanke, M., Diekhans, M., Baertsch, R. &amp; Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637&#x2013;644 (2008).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR96\" id=\"ref-link-section-d23733814e2218\" rel=\"nofollow noopener\" target=\"_blank\">96<\/a>. Augustus gene prediction relied on training parameters and \u2018hints\u2019 obtained from three different sources. First, we ran BUSCO (v4.0.1)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 97\" title=\"Seppey, M., Manni, M. &amp; Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness. Methods Mol. Biol. 1962, 227&#x2013;245 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR97\" id=\"ref-link-section-d23733814e2222\" rel=\"nofollow noopener\" target=\"_blank\">97<\/a> with the -m genome option. Second, the A. thaliana reference gene annotation was projected onto each genome using Liftoff<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 98\" title=\"Shumate, A. &amp; Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639&#x2013;1643 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR98\" id=\"ref-link-section-d23733814e2232\" rel=\"nofollow noopener\" target=\"_blank\">98<\/a> with the -exclude_partial and -copies options. 
Third, the RNA-seq data for each accession were used: wiggle hints were generated using bam2wig and wig2hints, and EST hints were generated using bam2hints (all three tools provided by Augustus). Augustus was run with the following nondefault parameters:<\/p>\n<ul class=\"u-list-style-none\">\n<li>\n<p class=\"c-code-block\">\n<p>                        --softmasking=1<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --species=BUSCO_retraining<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --gff3=on<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --extrinsicCfgFile=Custom_Config<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --hintsfile=Liftoff_hints<\/p>\n<\/li>\n<\/ul>\n<p>For every accession, the GFF3 output of Augustus was run through the Augustus-provided tool getAnno.pl to translate gene annotations into protein sequences. Finally, for each annotation, the Augustus output was combined and evaluated using augustus_GFF3_to_EVM_GFF3.pl (provided by EVidenceModeler<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 99\" title=\"Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR99\" id=\"ref-link-section-d23733814e2302\" rel=\"nofollow noopener\" target=\"_blank\">99<\/a>).<\/p>\n<p>In addition to the Augustus-generated annotations, we used two types of independent evidence for gene models: from the SNAP de novo annotation tool<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 100\" title=\"Campbell, M. S. et al. 
MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 164, 513&#x2013;524 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR100\" id=\"ref-link-section-d23733814e2309\" rel=\"nofollow noopener\" target=\"_blank\">100<\/a> and Cufflinks transcriptome assemblies<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 101\" title=\"Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562&#x2013;578 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR101\" id=\"ref-link-section-d23733814e2313\" rel=\"nofollow noopener\" target=\"_blank\">101<\/a>. Annotations produced by Augustus, SNAP and Cufflinks were combined and then subdivided into 1-Mb windows with 1-kb overlap using partition_EVM_input.pl (provided by EVidenceModeler). We ran EVidenceModeler with annotation GFF files, the assembly fasta file, the partitions and a weight matrix. We chose weights for each input based on their ability to recreate the Araport11 gene annotation. Running EVidenceModeler produced the final annotation compilation for each accession. We retained only the longest isoform for each gene using gffread<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 102\" title=\"Pertea, G. &amp; Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 
9, ISCB Comm J-304 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR102\" id=\"ref-link-section-d23733814e2317\" rel=\"nofollow noopener\" target=\"_blank\">102<\/a>.<\/p>\n<p>Reconciling annotations<\/p>\n<p>To enable comparison between the independent annotations, we used the pangenome coordinate system, reconciling discrepancies using majority voting (Supplementary Note <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a>, \u2018Details about reconciling annotations and gene filtering\u2019). Additionally, we compared the sequences of each gene across different accessions. If a gene showed significant variation because it was located in regions heavily influenced by SVs, we excluded it from the analysis. In total, 3,438 genes in our annotation were the result of splitting preliminary annotations and 1,020 were the result of merges. Lastly, we added to our annotation 1,789 TAIR10 genes that had not been detected by our annotation pipeline (most likely because our RNA-seq data covered only four tissues\/stages). For these 1,789 genes, the same pangenome coordinate approach was used to map the TAIR10 annotation onto the other genomes. Our approach generated a total of 34,153 putative genes. 
For the details of annotation reconciliation and filtering, see Supplementary Note <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a>\u2014\u2018Details about reconciling annotations and gene filtering\u2019.<\/p>\n<p>Ancestry analysis<\/p>\n<p>All PC sequences from all accessions were compared using DIAMOND\u2019s blastp module<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 103\" title=\"Buchfink, B., Reuter, K. &amp; Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366&#x2013;368 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR103\" id=\"ref-link-section-d23733814e2343\" rel=\"nofollow noopener\" target=\"_blank\">103<\/a> (version 2.0.11) against the A. lyrata MN47 proteome (version 2, GenBank: GCA_944990045.1), and the best hit was considered as the A. lyrata homolog. To avoid bias due to mis-annotated genes in the A. lyrata proteome, we further applied Liftoff v1.63 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 98\" title=\"Shumate, A. &amp; Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639&#x2013;1643 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR98\" id=\"ref-link-section-d23733814e2356\" rel=\"nofollow noopener\" target=\"_blank\">98<\/a>) to annotate all A. thaliana genes from all accessions on A. lyrata MN47 (v2, <a href=\"https:\/\/doi.org\/10.6084\/m9.figshare.22285444.v1\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/doi.org\/10.6084\/m9.figshare.22285444.v1<\/a>) and A. 
lyrata NT1 (v2, <a href=\"https:\/\/doi.org\/10.6084\/m9.figshare.22293196.v1\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/doi.org\/10.6084\/m9.figshare.22293196.v1<\/a>) assemblies. Next, each annotation group from A. thaliana was assigned to the A. lyrata homolog (by Liftoff or proteome similarity) that was common to at least 50% of its members, sharing at least 80% identity and covering at least 80% of the A. thaliana coding sequence. A. thaliana annotation groups were defined to be ancestral relative to the A. lyrata gene if they were part of a colinear segment of at least two genes. To that end, all A. thaliana genes were ordered according to their relative position in the pangenome coordinate system. Each pair of consecutive genes in A. thaliana was assigned to the same colinear segment as its homologs in A. lyrata if the homologs were separated by fewer than six genes. The ancestral state was defined as \u2018similar\u2019 for cases where the genes from A. lyrata and A. thaliana were not part of the same colinear segment but shared at least 80% sequence identity over at least 80% of the length of the A. thaliana gene. Further details are available in <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">Supplementary Note 6<\/a>, under \u2018Genes and TEs\u2019 for TE analysis and \u2018New genes\u2019 for the origin of new genes.<\/p>\n<p>Expression analysis<\/p>\n<p>RNA-seq read mapping and gene expression calculation<\/p>\n<p>Raw RNA-seq reads from 7-day-old seedlings, 9-leaf rosettes, flowers and pollen<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Kornienko, A. E., Nizhynska, V., Molla Morales, A., Pisupati, R. &amp; Nordborg, M. 
Population-level annotation of lncRNAs in Arabidopsis reveals extensive expression variation associated with transposable element-like silencing. Plant Cell 36, 85&#x2013;111 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR54\" id=\"ref-link-section-d23733814e2435\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a> were aligned either to the TAIR10 reference genome or to the genome of the corresponding accession using STAR v2.7.1 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 104\" title=\"Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15&#x2013;21 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR104\" id=\"ref-link-section-d23733814e2439\" rel=\"nofollow noopener\" target=\"_blank\">104<\/a>) with the following custom options:<\/p>\n<ul class=\"u-list-style-none\">\n<li>\n<p class=\"c-code-block\">\n<p>                        --alignIntronMax 6000<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --alignMatesGapMax 6000<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --outFilterIntronMotifs RemoveNoncanonical<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --outFilterMismatchNoverReadLmax 0.1<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --outFilterMismatchNmax 999<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --outFilterMismatchNoverLmax 0.3<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --outFilterMultimapNmax 1<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --alignSJoverhangMin 8<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                        --outSAMattributes NH HI AS nM NM MD jM jI XS<\/p>\n<\/li>\n<\/ul>\n<p>Read alignment statistics 
are provided in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM7\" rel=\"nofollow noopener\" target=\"_blank\">4<\/a>. Expression levels were assessed using featureCounts from Subread v2.0.1 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 105\" title=\"Liao, Y., Smyth, G. K. &amp; Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923&#x2013;930 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR105\" id=\"ref-link-section-d23733814e2558\" rel=\"nofollow noopener\" target=\"_blank\">105<\/a>) on each RNA-seq sample with either the TAIR10 gene annotation or the accession-specific annotations from this study. The entire locus, including exons and introns, was used for expression estimates. Expression levels were normalized as transcripts per million (TPM): the read count for each gene was divided by the gene length in kilobases, and this value was then divided by the sum of counts per kilobase across all genes and multiplied by 1 million.<\/p>\n<p>Mapping to TAIR10 versus own genome<\/p>\n<p>To determine whether the gene expression calculation was consistent between RNA-seq mapping to TAIR10 and mapping to accession-specific genomes, we focused on the annotation groups with a one-to-one correspondence with an Araport11 gene. For each RNA-seq sample, we calculated Pearson\u2019s correlation coefficient between the number of exonic counts obtained from TAIR10 mapping and accession-specific mapping. We also determined the number of genes that were correctly or wrongly estimated using TAIR10 mapping. We called a gene \u2018wrong\u2019 if the counts in TAIR10 and the counts in its own genome differed by more than 30% (Ncounts_min\/Ncounts_max \u2264 0.7). 
Only genes with at least six counts in either calculation were analyzed.<\/p>\n<p>Chromatin immunoprecipitation followed by sequencing analysis<\/p>\n<p>We used chromatin immunoprecipitation followed by sequencing (ChIP\u2013seq) data from 6 accessions and sRNA-seq data from 14 accessions<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Kornienko, A. E., Nizhynska, V., Molla Morales, A., Pisupati, R. &amp; Nordborg, M. Population-level annotation of lncRNAs in Arabidopsis reveals extensive expression variation associated with transposable element-like silencing. Plant Cell 36, 85&#x2013;111 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR54\" id=\"ref-link-section-d23733814e2579\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a>. We used STAR<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 104\" title=\"Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. 
Bioinformatics 29, 15&#x2013;21 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR104\" id=\"ref-link-section-d23733814e2583\" rel=\"nofollow noopener\" target=\"_blank\">104<\/a> to map ChIP\u2013seq reads with these nondefault options:<\/p>\n<ul class=\"u-list-style-none\">\n<li>\n<p class=\"c-code-block\">\n<p>                      --alignIntronMax 5<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterMismatchNmax 10<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterMultimapNmax 1<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --alignEndsType EndToEnd<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --twopassMode Basic<\/p>\n<\/li>\n<\/ul>\n<p>The ChIP\u2013seq data were log2-normalized to input using bamCompare (deeptools package<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 106\" title=\"Ram&#xED;rez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160&#x2013;W165 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR106\" id=\"ref-link-section-d23733814e2654\" rel=\"nofollow noopener\" target=\"_blank\">106<\/a>) using<\/p>\n<p>The ChIP\u2013seq coverage was estimated using bedtools map -mean<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 107\" title=\"Quinlan, A. R. &amp; Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841&#x2013;842 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR107\" id=\"ref-link-section-d23733814e2711\" rel=\"nofollow noopener\" target=\"_blank\">107<\/a>. 
The ChIP\u2013seq coverage was further normalized to make value ranges comparable across accessions. For this, we applied quantile-normalization using an R function:<\/p>\n<ul class=\"u-list-style-none\">\n<li>\n<p>function(x) { (x-quantile(x,.20)) \/ (quantile(x,.80) - quantile(x,.20)) }<\/p>\n<\/li>\n<\/ul>\n<p>which equalized the 20% and 80% quantile values of each ChIP\u2013seq sample. After quantile-normalization, replicate samples were averaged.<\/p>\n<p>sRNA-seq analysis<\/p>\n<p>We used sRNA-seq data for 14 accessions<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Kornienko, A. E., Nizhynska, V., Molla Morales, A., Pisupati, R. &amp; Nordborg, M. Population-level annotation of lncRNAs in Arabidopsis reveals extensive expression variation associated with transposable element-like silencing. Plant Cell 36, 85&#x2013;111 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR54\" id=\"ref-link-section-d23733814e2736\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a>. To process the sRNA-seq data, we trimmed the reads using cutadapt<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 108\" title=\"Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10&#x2013;12 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR108\" id=\"ref-link-section-d23733814e2740\" rel=\"nofollow noopener\" target=\"_blank\">108<\/a>: cutadapt -a AACTGTAGGCACCATCAAT --minimum-length 18. We then used STAR<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 104\" title=\"Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. 
Bioinformatics 29, 15&#x2013;21 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR104\" id=\"ref-link-section-d23733814e2744\" rel=\"nofollow noopener\" target=\"_blank\">104<\/a> with the following nondefault options to map sRNA-seq reads to the corresponding genome:<\/p>\n<ul class=\"u-list-style-none\">\n<li>\n<p class=\"c-code-block\">\n<p>                      --runRNGseed 12345<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --alignEndsType Extend5pOfRead1<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --alignIntronMax 5000 --alignSJDBoverhangMin 1<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outReadsUnmapped Fastx --outSAMmultNmax 100<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outSAMprimaryFlag AllBestScore<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outSAMattributes NH HI AS nM NM MD jM jI XS<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterMultimapNmax 10<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterMatchNmin 16<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterMatchNminOverLread 0.66<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterMismatchNmax 2<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterMismatchNoverReadLmax 0.05<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --outFilterIntronMotifs RemoveNoncanonicalUnannotated<\/p>\n<\/li>\n<li>\n<p class=\"c-code-block\">\n<p>                      --twopassMode None<\/p>\n<\/li>\n<\/ul>\n<p>We extracted 24-nt reads, calculated read coverage for each position of the genome using genomeCoverageBed (bedtools v.2.27.1), normalized it by the total number of uniquely mapped reads in each 
sample, and calculated 24-nt sRNA coverage for each locus of interest using the bedtools map -mean function.<\/p>\n<p>DNA methylation analysis<\/p>\n<p>To estimate DNA methylation levels, we used published BS-seq data for 12 accessions<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Kawakatsu, T. et al. Epigenomic diversity in a global collection of Arabidopsis thaliana accessions. Cell 166, 492&#x2013;505 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR53\" id=\"ref-link-section-d23733814e2920\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a>. After trimming with TrimGalore (<a href=\"https:\/\/github.com\/FelixKrueger\/TrimGalore\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/github.com\/FelixKrueger\/TrimGalore<\/a>) with --clip_r1 10 --clip_r2 15 --three_prime_clip_r1 10 --three_prime_clip_r2 10, reads for each accession were mapped to its corresponding genome with Bismark<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 109\" title=\"Krueger, F. &amp; Andrews, S. R. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 27, 1571&#x2013;1572 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR109\" id=\"ref-link-section-d23733814e2931\" rel=\"nofollow noopener\" target=\"_blank\">109<\/a> with --score_min L,0,-0.5 for a relaxed mismatch threshold and the --un --ambiguous parameters to obtain additional unmapped and multiply-mapping reads. Methylation was called as described<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 110\" title=\"Pisupati, R., Nizhynska, V., Moll&#xE1; Morales, A. &amp; Nordborg, M. 
On the causes of gene-body methylation variation in Arabidopsis thaliana. PLoS Genet. 19, e1010728 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#ref-CR110\" id=\"ref-link-section-d23733814e2935\" rel=\"nofollow noopener\" target=\"_blank\">110<\/a>. CG, CHG and CHH methylation levels for genes and SVs in each accession were then calculated for each locus by taking all Cs in the specific context within the locus and calculating the ratio between the total number of methylated and unmethylated reads across all sites.<\/p>\n<p>Mapping to TAIR10 versus own genome<\/p>\n<p>To estimate reference bias, we mapped BS-seq data for all accessions to the TAIR10 genome and estimated CG, CHG and CHH methylation levels in the same way as for the accessions\u2019 own genomes. We then focused on annotation groups with a one-to-one correspondence with an Araport11 gene (the current annotation of the TAIR10 genome). We calculated Pearson\u2019s correlation coefficient between the methylation level estimates obtained from TAIR10 mapping and accession-genome mapping. We also estimated the number of genes that were correctly or wrongly estimated using TAIR10 mapping. For each methylation context, we called a gene \u2018wrongly estimated\u2019 if the methylation level in TAIR10 and in the accession\u2019s own genome differed by more than 50% (methlevel_min\/methlevel_max\u2009\u2264\u20090.5). 
For a more refined analysis of reference bias, see Supplementary Note <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">8<\/a>.<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41588-025-02293-0#MOESM2\" rel=\"nofollow noopener\" target=\"_blank\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>\n","protected":false},"excerpt":{"rendered":"DNA extraction and sequencing For long-read sequencing, we began with 3-week-old plants grown in soil that had been&hellip;\n","protected":false},"author":2,"featured_media":11772,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[272],"tags":[2567,2569,2564,2566,18,2568,910,458,2565,19,17,11358,11216,133],"class_list":{"0":"post-11771","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-genetics","8":"tag-agriculture","9":"tag-animal-genetics-and-genomics","10":"tag-biomedicine","11":"tag-cancer-research","12":"tag-eire","13":"tag-gene-function","14":"tag-general","15":"tag-genetics","16":"tag-human-genetics","17":"tag-ie","18":"tag-ireland","19":"tag-plant-genetics","20":"tag-population-genetics","21":"tag-science"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/11771","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/
users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=11771"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/11771\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/11772"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=11771"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=11771"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=11771"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}