{"id":355067,"date":"2025-08-18T20:58:14","date_gmt":"2025-08-18T20:58:14","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/355067\/"},"modified":"2025-08-18T20:58:14","modified_gmt":"2025-08-18T20:58:14","slug":"how-ai-is-decoding-the-grammar-of-the-genome","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/355067\/","title":{"rendered":"how AI is decoding the grammar of the genome"},"content":{"rendered":"\n<p>In 1862, Victor Hugo reportedly wrote to his publisher to ask how his newly published novel Les Mis\u00e9rables was selling, with a single character query: \u201c?\u201d The response: \u201c!\u201d<\/p>\n<p>This story of one of the world\u2019s most concise correspondences is apocryphal. But some genome-focused artificial intelligence (AI) systems can, like the French writer\u2019s publisher, respond meaningfully to equally short prompts.<\/p>\n<p>Instead of the detailed queries required to use the chatbot ChatGPT effectively, Evo, an AI model trained on some 300 billion nucleotide bases, including 80,000 microbial whole-genome sequences, will \u2014 prompted with \u2018#\u2019 \u2014 dream up a new sequence of mobile DNA. It does so on the basis of other such biological systems that the model has been exposed to (see <a href=\"http:\/\/go.nature.com\/3jvp922\" data-track=\"click\" data-label=\"http:\/\/go.nature.com\/3jvp922\" data-track-category=\"body text link\" target=\"_blank\" rel=\"noopener\">go.nature.com\/3jvp922<\/a>). 
Given a prompt such as \u2018030\u2019, an AI tool called regLM can spit out 200-base sequences that are predicted to exhibit regulatory activity in any of three human cell lines (<a href=\"http:\/\/go.nature.com\/4jpttm8\" data-track=\"click\" data-label=\"http:\/\/go.nature.com\/4jpttm8\" data-track-category=\"body text link\" target=\"_blank\" rel=\"noopener\">go.nature.com\/4jpttm8<\/a>).<\/p>\n<p>Evo and regLM are part of a fast-growing suite of tools that aim to internalize, decode, interpret and build on the grammar of the genome \u2014 especially the vast portion that does not code for proteins. Think AlphaFold, but for regulatory DNA, the sequences that control gene expression.<\/p>\n<p>When Google DeepMind released AlphaFold in 2020, the company claimed it had solved a decades-old \u2018grand challenge\u2019 in biology \u2014 predicting a protein\u2019s 3D shape from its sequence alone. But the non-coding fraction of the genome could prove to be an even grander challenge.<\/p>\n<p>A given sequence of amino acids will generally fold into the same shape, whatever the cellular context. That predictability is not true of the genome, in which short, functional sequence motifs \u2014 gene promoters and enhancers, transcription start and stop sites and so on \u2014 can be scattered across long stretches of seemingly purposeless DNA. 
These motifs might overlap, interact over long distances, bind to competing protein factors or respond to signals that are only present in specific cells or at certain times in development. They are also tightly wrapped within chromatin, a complex of DNA and protein, which might be more or less accessible to external proteins depending on what the cell is doing.<\/p>\n<p>\u201cHow proteins are encoded in the genome, the code of how genes are expressed, when and where, how much \u2014 is one of the most fascinating problems in biology,\u201d says Stein Aerts, a computational biologist at the VIB Center for AI &amp; Computational Biology and the Catholic University of Leuven (KU Leuven) in Belgium. But with training, AI tools can detect subtle differences between sequences and predict what they do and how they behave, identifying crucial motifs and even estimating the impact of altering them. From there, AI models can attempt to predict the physiological impact of genetic variants and even guide the design of new sequences with specified functions.<\/p>\n<p>These tools are not perfect, and researchers cannot even agree on how best to assess their performance. But that makes the field exciting. 
\u201cIt\u2019s so clear that it\u2019s a solvable problem,\u201d says Julia Zeitlinger, a developmental and computational biologist at the Stowers Institute for Medical Research in Kansas City, Missouri, who developed an AI model called BPNet and uses it to decode the mechanistic sequence rules of gene regulation, \u201cbut it\u2019s not clear how\u201d.<\/p>\n<p><b>Of puppies and puffins<\/b><\/p>\n<p>DeepSEA, one of the first genomic AI tools, was published<a href=\"#ref-CR1\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">1<\/a> ten years ago this month by computational biologists Jian Zhou and Olga Troyanskaya at Princeton University in New Jersey.<\/p>\n<p>DeepSEA is a convolutional neural network (CNN) \u2014 the same kind of deep-learning architecture used to teach computers to classify images as, say, a cat or a dog. Zhou and Troyanskaya trained a model on epigenetics data, including transcription-factor binding, chromatin accessibility and histone modifications, from a public research project called the Encyclopedia of DNA Elements (ENCODE). The model learnt to predict the presence of such features in 1,000-base segments of DNA it had never encountered.<\/p>\n<p>DeepSEA\u2019s training enabled it to tease apart the biological consequence and severity of sequence variants associated with human disease. 
For instance, one breast-cancer-associated sequence variant called rs4784227 seems to strengthen the binding of a DNA-binding protein called FOXA1, whereas a variant associated with the blood condition \u03b1-thalassemia creates a possible binding site for GATA1, a transcription factor involved in blood-cell development.<\/p>\n<p>Since then, the field has exploded. David Kelley, a principal investigator at the biotechnology company Calico Life Sciences in South San Francisco, California, has created or co-created <a href=\"https:\/\/github.com\/calico\" data-track=\"click\" data-label=\"https:\/\/github.com\/calico\" data-track-category=\"body text link\" target=\"_blank\" rel=\"noopener\">multiple AI models<\/a>, many with canine-inspired names. 
These include Akita<a href=\"#ref-CR2\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">2<\/a> (for predicting 3D genome folding), Basset<a href=\"#ref-CR3\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">3<\/a> and Basenji<a href=\"#ref-CR4\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">4<\/a> (for regulatory-sequence prediction) and Borzoi<a href=\"#ref-CR5\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">5<\/a>, which predicts gene expression across the length of a gene.<\/p>\n<p>These models raised a litter of variants: Basset begat Malinois, and Borzoi begat Scooby. Other researchers have built their own (non-canine) models including Puffin, ChromBPNet and more.<\/p>\n<p>Not all are CNNs. Enformer \u2014 a model that predicts both gene expression and epigenetic data over long distances \u2014 and Borzoi, for instance, \u201cuse both convolution blocks and transformer blocks\u201d, says Kelley, whose laboratory developed both models. \u201cThe convolution blocks are great for capturing the local sequence patterns, and then the transformer blocks help look around a larger region to consider the local patterns in a broader context before predicting the data.\u201d But whatever the architecture, they come in two basic forms, says Anshul Kundaje, who researches computational genomics at Stanford University in California. Supervised \u2018sequence-to-function\u2019 models are trained on functional genomic data \u2014 gene expression or chromatin accessibility, for instance \u2014 and learn to predict the function of DNA sequences they have never encountered. 
Often working at or near single-nucleotide resolution, these models can identify key motifs, such as functionally important protein-binding sites, and predict the significance of altering them. DeepSEA is one; Kundaje\u2019s ChromBPNet, which <a href=\"https:\/\/www.synapse.org\/Synapse:syn59449898\/wiki\/628018\" data-track=\"click\" data-label=\"https:\/\/www.synapse.org\/Synapse:syn59449898\/wiki\/628018\" data-track-category=\"body text link\" target=\"_blank\" rel=\"noopener\">predicts regions of chromatin accessibility<\/a>, is another.<\/p>\n<p>The other class comprises unsupervised or self-supervised \u2018genomic language models\u2019 (gLMs). Like ChatGPT, they are trained on vast quantities of text \u2014 in this case, genomic sequence data \u2014 and are tasked with either predicting the next base (or \u2018token\u2019) in a sequence or filling in missing bases on the basis of surrounding context. These models \u201care not trying to predict the activity of a sequence, they\u2019re trying to predict the composition of a sequence\u201d, says Avantika Lal, a machine-learning scientist at biotechnology firm Genentech in South San Francisco.<\/p>\n<p>With machine-learning scientist G\u00f6k\u00e7en Eraslan and their colleagues at Genentech, Lal co-developed regLM, a language model that they trained by labelling regulatory sequences with succinct markers of activity<a href=\"#ref-CR6\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">6<\/a> \u2014 for instance, \u201804\u2019 to indicate strong expression in one cell line and low activity in another. The model is therefore not strictly unsupervised, says Eraslan \u2014 he calls it a \u2018function-to-sequence\u2019 model. 
But those same labels can then be used to prompt regLM to create new sequences with predicted behaviours.<\/p>\n<p>Evo 2, announced in February<a href=\"#ref-CR7\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">7<\/a>, was trained on 9.3 trillion DNA base pairs \u2014 \u201ca representative snapshot of genomes spanning all observed evolution\u201d, as the resulting bioRxiv preprint paper puts it. It could then identify intron\u2013exon boundaries, predict the impacts of mutations and generate \u2018realistic\u2019 gene and genomic sequences, among other things.<\/p>\n<p><b>Models made simple<\/b><\/p>\n<p>Genomic AI models can also be distinguished by the type of regulatory interactions they predict, Kundaje says. Sequence-to-function models mostly identify important DNA motifs (which, because their function depends on their proximity to the regulated gene, are said to act in cis) without regard to the biology that occurs there.<\/p>\n<p>Trans models, by contrast, aim to identify which genes regulate which other genes, for instance, to tease apart networks of gene regulation. (They are called trans because the factors that mediate this regulation act at a distance.) But this, says Kundaje, \u201cis still very fraught and very problematic\u201d because trans models \u2014 which are generally trained on data such as RNA expression \u2014 must infer causal relationships without data that can reveal causality. There\u2019s no guarantee that two genes are directly linked just because their expression rises and falls in tandem. Even if they are, it\u2019s not necessarily obvious in which direction the relationship works: does A regulate B or vice versa? 
If these models are then asked to predict the impact of a perturbation \u2014 for example, what happens if a given gene is knocked out \u2014 the models often fail.<\/p>\n<p>Models can include both cis and trans elements, says Sushmita Roy, a computational biologist at the University of Wisconsin\u2013Madison, for instance by building regulatory networks on the basis of chromatin accessibility data and weighting those predictions by gene expression. But perhaps the first model to truly bridge the divide, Kundaje says, is Scooby \u2014 a single-cell version of Borzoi (<a href=\"http:\/\/go.nature.com\/3upffnp\" data-track=\"click\" data-label=\"http:\/\/go.nature.com\/3upffnp\" data-track-category=\"body text link\" target=\"_blank\" rel=\"noopener\">go.nature.com\/3upffnp<\/a>). By leveraging both chromatin accessibility and transcriptional data from the same cells, Scooby predicts genome features and cell state simultaneously. \u201cIt is one of the first cis\u2013trans models,\u201d he says.<\/p>\n<p>Sequence-to-function models can also probe other aspects of gene regulation. 
In 2024, teams led by Zhou (who is now at the University of Texas Southwestern Medical Center in Dallas), Kundaje and Charles Danko, a computational biologist at Cornell University in Ithaca, New York, independently described sequence-to-function models capable of predicting sites of transcription initiation<a href=\"#ref-CR8\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">8<\/a>\u2013<a href=\"#ref-CR10\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">10<\/a>.<\/p>\n<p>Zhou used his team\u2019s model, Puffin, to identify the common features and placement of key regulatory elements around sites of transcription initiation, including binding sites for the transcription factors YY1, SP1, CREB and Initiator. Danko\u2019s team trained its AI model on matched genome sequences and transcription initiation data from 58 individuals, creating a suite of models that were, he says, \u201cfor the first time aware of how differences between individuals in their genome sequence influence the pattern\u201d of transcription initiation.<\/p>\n<p>Collectively, says Zhou, these studies begin to tease apart the motifs that regulate the positioning and strength of transcription initiation, including that of the transcription factor TFIID. TFIID is an essential protein complex that binds to the promoter element known as a TATA box \u2014 despite the fact that most eukaryotic promoters don\u2019t seem to contain a TATA box. \u201cOne mechanistic interpretation is that TFIID is binding the best available of the \u2018bad options\u2019 when it picks a site\u201d in a TATA-less promoter, Danko explains.<\/p>\n<p>Most genomic models make these predictions from relatively small inputs \u2014 anywhere from a few hundred to a few thousand bases. 
But gene regulation can occur over much longer tracts of genome space, and some models are able to make predictions at or near those scales. Borzoi, for instance, accepts 524 kilobases of input DNA, and Evo 2 and Google DeepMind\u2019s newly announced <a href=\"https:\/\/www.nature.com\/articles\/d41586-025-01998-w\" data-track=\"click\" data-label=\"https:\/\/www.nature.com\/articles\/d41586-025-01998-w\" data-track-category=\"body text link\" target=\"_blank\" rel=\"noopener\">AlphaGenome<\/a> can work with a megabase.<\/p>\n<p>These models can transform those sequences into vast collections of estimated data. Given an input sequence of 196,608 bases of human DNA, for instance, Enformer outputs 2,131 predictions of transcription factor binding, 1,860 of histone modifications, 684 of chromatin accessibility and 638 of gene expression, at 128-base resolution (<a href=\"http:\/\/go.nature.com\/4mbe42h\" data-track=\"click\" data-label=\"http:\/\/go.nature.com\/4mbe42h\" data-track-category=\"body text link\" target=\"_blank\" rel=\"noopener\">go.nature.com\/4mbe42h<\/a>).<\/p>\n<p><b>A finite genome <\/b><\/p>\n<p>Yet despite these models\u2019 extensive \u2018receptive fields\u2019, they can still miss things, says Jacob Schreiber, a computational biologist at the Research Institute of Molecular Pathology in Vienna, because enhancers might exert effects that are biologically meaningful but invisible to the AI tool. \u201cWe have not cracked long-range regulation,\u201d he says.<\/p>\n<p>Another challenge is that, as vast as it is, the human genome is finite \u2014 there are only about 20,000\u201325,000 genes, for instance, and only a fraction of those are regulated in a cell-type-specific manner. 
That means that for all those billions of bases, there are relatively few examples of regulatory strategies from which a model can learn.<\/p>\n<p><img decoding=\"async\" class=\"figure__image\" alt=\"Carl de Boer Headshot inside the School of Biomedical Engineering, UBC.\" loading=\"lazy\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/08\/d41586-025-02621-8_51345198.jpg\"\/><\/p>\n<p class=\"figure__caption u-sans-serif\">Carl de Boer is a biomedical engineer at the University of British Columbia in Canada. Credit: Paul Joseph<\/p>\n<p>\u201cThere\u2019s just so many different biochemical mechanisms that could happen on DNA that there are probably a very large number of them that only occur once or even zero times in our genome sequence,\u201d says biomedical engineer Carl de Boer at the University of British Columbia in Vancouver, Canada.<\/p>\n<p>One approach to broadening an AI model\u2019s knowledge base is to feed it more than just reference genomes. Some model builders, for instance, train their tools on data from multiple individuals or from across the phylogenetic tree to give the models a sense of genetic diversity.<\/p>\n<p>Another approach, advanced by de Boer and Jussi Taipale, a systems biologist at the University of Cambridge, UK, is to look beyond natural genomes to fully artificial DNAs<a href=\"#ref-CR11\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">11<\/a>.<\/p>\n<p>As a postdoc at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, de Boer and his colleagues tested some 100 million random sequences, each of which was 80 nucleotides in length \u2014 \u201cabout a human genome\u2019s worth\u201d \u2014 for their ability to drive expression of a fluorescent protein in yeast (Saccharomyces cerevisiae)<a href=\"#ref-CR12\" data-track=\"click\" data-action=\"anchor-link\" data-track-label=\"go to reference\" data-track-category=\"references\">12<\/a>. (The yeast genome is made up of about 12 million bases, compared with roughly 3 billion in the human genome.) This approach, de Boer says, \u201cis actually much better\u201d for understanding the grammar of the genome than using genomic DNA, \u201cbecause all of the signals you see in the random DNA are causal\u201d. If you see fluorescence, the sequence is active. The genome, by contrast, is a product of evolution, meaning elements might be positioned owing to selective pressures as well as function.<\/p>\n<p>According to de Boer, the yeast exercise yielded two key insights. First, it reinforced that \u201cthere are probably widespread biophysical interactions happening in regulatory regions\u201d. Functional motifs were not randomly arranged in active sequences; they were positioned in specific configurations \u2014 for instance, to conform to the helical spacing of the DNA double helix.<\/p>\n<p>The second insight involved the importance of low-affinity transcription-factor\u2013DNA interactions. 
Even weak interactions, the team found, could exert a large influence on gene regulation, just as relatively weak chemical interactions can hold two proteins together.<\/p>\n","protected":false},"excerpt":{"rendered":"In 1862, Victor Hugo reportedly wrote to his publisher to ask how his newly published novel Les Mis\u00e9rables&hellip;\n","protected":false},"author":2,"featured_media":355068,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[3900,3965,3690,3966,70,49793,53,16,15],"class_list":{"0":"post-355067","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-science","8":"tag-genomics","9":"tag-humanities-and-social-sciences","10":"tag-machine-learning","11":"tag-multidisciplinary","12":"tag-science","13":"tag-synthetic-biology","14":"tag-technology","15":"tag-uk","16":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/115051778325680140","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/355067","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/comments?post=355067"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/355067\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/355068"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=355067"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?po
st=355067"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=355067"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}