Scientists have built an AI tool that reads genetic code the way ChatGPT reads text – scanning DNA for mutation patterns to trace genes back through time to their common ancestors.
It’s faster than anything currently available, works with incomplete data, and could change how researchers study everything from malaria-carrying mosquitoes to human evolutionary history.
The tool was developed at the University of Oregon (UO) by computational biologist Andrew Kern and his lab.
Genomes as language
The comparison between DNA and written language isn’t just a metaphor. Genomes really are built like text – a four-letter alphabet of A, T, C and G, combined in different sequences to form genes and chromosomes.
What Kern’s lab is most interested in is the misspellings: mutations, changes in DNA sequences that accumulate over time and get passed down from generation to generation, leaving a trail that researchers can follow backwards through evolutionary history.
Traditional methods for doing this – based on math and statistics – are the gold standard, and in most cases they’re hard to beat. But they’re slow, and they struggle with large or incomplete datasets.
A single mosquito chromosome can take hours or even days to decode. That’s a real bottleneck when you’re working at scale.
Borrowing from ChatGPT
To get around this, Kern and his team modified GPT-2 – an earlier language model from the same family as the models behind ChatGPT.
Instead of training it on volumes of English text, they trained it on simulations of genetic evolution across a range of species, including bacteria, rodents, mosquitoes, and primates.
“We can’t repeat evolution, so one of the key workflows we have is developing simulations,” said Kevin Korfmann, lead author of the study.
“The simulations mimic evolutionary processes, and then we use the outcomes as training data for our deep learning models.”
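To make the idea of simulated training data concrete, here is a minimal Python sketch. It is not the simulator used in the study. It assumes the textbook case of two gene copies in a constant-size diploid population with no selection or recombination, where the time back to a common ancestor is roughly exponential with a mean of 2N generations and mutations accumulate at a constant rate. The function names and all parameter values are illustrative assumptions.

```python
import random


def sample_poisson(rng, mean):
    """Draw a Poisson count by counting unit-rate arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_training_pairs(n_pairs, pop_size=10_000, seq_length=100_000,
                            mutation_rate=1e-8, seed=0):
    """Return a list of (num_mutations, coalescence_time) examples.

    Assumed toy model, not the study's simulator:
    - coalescence time of two lineages ~ Exponential(mean = 2 * pop_size) generations
    - mutations separating them ~ Poisson(2 * mutation_rate * seq_length * time)
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        time = rng.expovariate(1.0 / (2 * pop_size))
        expected_mutations = 2 * mutation_rate * seq_length * time
        pairs.append((sample_poisson(rng, expected_mutations), time))
    return pairs


# Each example pairs an observable (mutation count) with the hidden answer (time).
training_data = simulate_training_pairs(2_000)
```

Because the simulation knows the true time for every example, a model trained on such pairs can be scored against the right answer, which is not possible with real genomes.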
The model learns to recognise mutation patterns and use them to estimate when two genes last shared a common ancestor – a measure geneticists call “coalescence time.” Stretches of DNA with many mutations tend to trace back to a distant common ancestor.
Those with fewer mutations likely diverged more recently. It’s the same principle that explains why chimpanzees are considered our closest living relatives, while sea sponges – genetically diverged more than 700 million years ago – are among the most distant.
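The link between mutation counts and ancestry can be written as a back-of-the-envelope calculation. The sketch below assumes a constant mutation rate per site per generation (a simple molecular clock). Two sequences that split T generations ago are separated by 2T generations of evolution, so the expected number of differences over L sites is 2 × rate × L × T, and inverting that gives a rough estimate of T. The function name and the numbers are illustrative assumptions, not values from the study.

```python
def estimate_coalescence_time(num_differences, seq_length, mutation_rate):
    """Rough time (in generations) since two sequences shared an ancestor.

    Assumes a constant per-site, per-generation mutation rate and that
    every mutation is visible as a difference (no repeated hits).
    """
    if seq_length <= 0 or mutation_rate <= 0:
        raise ValueError("seq_length and mutation_rate must be positive")
    return num_differences / (2 * mutation_rate * seq_length)


# Illustrative numbers: 100,000 sites, rate of 1e-8 per site per generation.
many = estimate_coalescence_time(40, 100_000, 1e-8)  # about 20,000 generations
few = estimate_coalescence_time(4, 100_000, 1e-8)    # about 2,000 generations
```

A single estimate like this is noisy, since mutations arrive at random. That is one reason statistical methods, and the new model, look at patterns across many stretches of DNA rather than one count.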
When the team tested the tool against existing state-of-the-art statistical methods, it performed just as well – which came as a genuine surprise.
“You never really know what’s going to work when you’re essentially borrowing techniques from a totally different world and applying them to a new problem,” Kern said. “But this was a case where things worked really well.”
The speed difference, though, was dramatic. Where traditional methods can take hours or days to process a single mosquito chromosome, the new tool does it in minutes.
The reason, Korfmann noted, is that the heavy statistical lifting happens during training rather than during each individual analysis.
“It just reads the patterns because all of the expensive statistical work was done up front, during training, which sidesteps the bottleneck,” he said.
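A toy version of that trade-off, under loudly stated assumptions: the "training" step below runs many simulations once and stores the average ancestor age for each mutation count, and "inference" is then a simple lookup. The lookup table stands in for the trained neural network purely for illustration. The simulator is a simplified two-lineage, constant-population model, and all names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def simulate_pair(rng, pop_size, seq_length, mutation_rate):
    """One simulated example: (mutation count, coalescence time in generations)."""
    time = rng.expovariate(1.0 / (2 * pop_size))
    expected = 2 * mutation_rate * seq_length * time
    # Poisson draw: count unit-rate arrivals that land before `expected`.
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < expected:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count, time


def train(n_sims=5_000, pop_size=10_000, seq_length=100_000,
          mutation_rate=1e-8, seed=0):
    """Expensive step, done once: average simulated time per mutation count."""
    rng = random.Random(seed)
    times_by_count = defaultdict(list)
    for _ in range(n_sims):
        count, time = simulate_pair(rng, pop_size, seq_length, mutation_rate)
        times_by_count[count].append(time)
    return {c: sum(ts) / len(ts) for c, ts in times_by_count.items()}


def predict(table, num_mutations):
    """Cheap step, repeated per query: look up the nearest trained count."""
    nearest = min(table, key=lambda c: abs(c - num_mutations))
    return table[nearest]


table = train()                 # pay the simulation cost up front
estimate = predict(table, 40)   # each later query is a lookup
```

The point is the shape of the computation rather than the accuracy of this toy: the cost of `train` is paid once, while `predict` stays cheap no matter how many stretches of DNA are analyzed afterwards.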
The tool also handles incomplete data – a common headache in genetics research – without falling apart. For Kern, who regularly works with patchy mosquito genetic databases in his malaria research, that’s not a minor convenience.
Why mosquitoes matter
Insecticides have long been one of the main weapons against malaria-spreading mosquitoes. But mosquitoes, like everything else, evolve.
Resistance to insecticides is now showing up across mosquito populations worldwide, and understanding how and when that resistance emerged is critical to staying ahead of it.
“A major challenge in preventing the spread of malaria has been understanding the evolution of insecticide resistance,” Kern said.
“Now, we can go in with our AI model, ask how long ago these resistance genes arose in the population, and learn about the evolutionary history of this critical carrier of malaria.”
Future research directions
Right now, the model traces ancestry between pairs of genes. The next goal is to scale that up by reconstructing full genealogical trees across multiple lineages simultaneously.
Some traditional methods can already do this, but Kern and Korfmann want to get there from a machine learning angle.
“There’s so much going on in the machine learning field that we haven’t applied yet in our field,” Korfmann said. “There’s tons of translational work to do to get these novel algorithms working in biology.”
The gap between AI research and biological application, in other words, is still wide. But it’s closing.
The research is published in the journal Proceedings of the National Academy of Sciences.