{"id":8096,"date":"2026-04-20T09:10:13","date_gmt":"2026-04-20T09:10:13","guid":{"rendered":"https:\/\/www.europesays.com\/ai\/8096\/"},"modified":"2026-04-20T09:10:13","modified_gmt":"2026-04-20T09:10:13","slug":"this-protein-engineering-breakthrough-generates-over-10m-data-points-and-turbocharges-ai-in-just-three-days","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ai\/8096\/","title":{"rendered":"This protein-engineering breakthrough generates over 10M data points and turbocharges AI in just three days"},"content":{"rendered":"<p>            <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.europesays.com\/ai\/wp-content\/uploads\/2026\/04\/scientists-discover-ne-11.jpg\" alt=\"Scientists Discover New AI Protein Dataset Method\" title=\"The process of generating protein activity data (top) and reading the output and training AI models (bottom). Credit: Linqi Cheng \/ Rice University\" width=\"800\" height=\"459\"\/><\/p>\n<p>                The process of generating protein activity data (top) and reading the output and training AI models (bottom). Credit: Linqi Cheng \/ Rice University<\/p>\n<p>Protein engineering is a field primed for artificial intelligence research. Each protein is made up of amino acids; to optimize a protein&#8217;s function, researchers modify proteins by switching out one of 20 different amino acids for another. For a protein that is just 50 amino acids in length, this leads to approximately 1.13&#215;10<sup>65<\/sup> potential combinations to test: roughly 1 followed by 65 zeroes, or more than five times as many zeroes as a trillion has.<\/p>\n<p>This number of potential combinations, impossible to test in the lab, makes protein engineering an ideal challenge for AI. Modeling which of these combinations will give the best results is a perfect problem for the technology&#8217;s massive computing power. 
But AI is only as good as the data used to train it, and in some areas of protein engineering, the right data just didn&#8217;t exist.<\/p>\n<p>&#8220;One of the biggest bottlenecks in AI-guided protein engineering is not coming up with machine-learning models. It is generating the right and enough experimental data to train them,&#8221; said Han Xiao, Rice University professor of chemistry, biosciences and bioengineering and director of the SynthX Center. &#8220;For engineering protein activity, which optimizes what a protein does, we had a very clear problem: There simply were not enough datasets to train accurate models.&#8221;<\/p>\n<p>To build AI models that could accurately predict how to optimize a protein&#8217;s function (its activity), Xiao&#8217;s team first had to generate enough activity data about a given protein to train them. In a recent Nature Biotechnology <a href=\"https:\/\/www.nature.com\/articles\/s41587-026-03087-3\" target=\"_blank\" rel=\"nofollow noopener\">publication<\/a>, Xiao&#8217;s team and collaborators from Johns Hopkins University and Microsoft report doing just that, describing an approach that produced the needed data and yielded accurate models in just three days.<\/p>\n<p>            <img decoding=\"async\" src=\"https:\/\/www.europesays.com\/ai\/wp-content\/uploads\/2026\/04\/scientists-discover-ne-10.jpg\" alt=\"Scientists Discover New AI Protein Dataset Method\" title=\"Development of the Sequence Display platform for the evolution of UGI and rAPOBEC1. Credit: Nature Biotechnology (2026). DOI: 10.1038\/s41587-026-03087-3\"\/><\/p>\n<p>                Development of the Sequence Display platform for the evolution of UGI and rAPOBEC1. Credit: Nature Biotechnology (2026). DOI: 10.1038\/s41587-026-03087-3<\/p>\n<p>This approach, called Sequence Display, can generate more than 10 million data points in a single experiment. 
These data points are then fed into protein language models, which use them to predict which amino acid changes will produce the desired change in the protein&#8217;s activity.<\/p>\n<p>&#8220;We were able to develop an activity-based barcoding system that records the activity of individual protein variants and generates the kind of dataset needed to train a machine learning model,&#8221; said Linqi Cheng, a Rice graduate student and first author on the study. &#8220;Then the model was able to predict mutations that significantly improved the activity of the protein we were studying.&#8221;<\/p>\n<p>The team chose a <a href=\"https:\/\/phys.org\/news\/2025-04-bespoke-enzymes-machine-crispr-toolbox.html?utm_source=embeddings&amp;utm_medium=related&amp;utm_campaign=internal\" rel=\"related nofollow noopener\" target=\"_blank\">small CRISPR-Cas protein<\/a> for proof of concept. This protein was valued for its small size but limited in the stretches of DNA it could target and cut. The researchers wanted to identify a version that could cut a wider variety of DNA targets.<\/p>\n<p>First, they mutated the DNA that codes for the Cas9 protein, creating many variants. A blank <a href=\"https:\/\/phys.org\/news\/2025-10-sharper-gene-scissors-biotechnology-toolbox.html?utm_source=embeddings&amp;utm_medium=related&amp;utm_campaign=internal\" rel=\"related nofollow noopener\" target=\"_blank\">DNA barcode<\/a> was attached to each variant, along with a special editor that would change the barcode in response to the protein&#8217;s activity level. As the protein&#8217;s activity level increased, so did the editor&#8217;s. This meant that the most active protein variants had the biggest changes in their barcodes. 
The DNA barcodes were then read by next-generation sequencing, which would essentially scan the barcode and classify each sequence by level of activity.<\/p>\n<p>&#8220;The AI is not replacing the experiment here. It instead depends on the experiment,&#8221; Cheng said. &#8220;Sequence Display gives us the data foundation, and the models help us search a much larger data space for strong candidates.&#8221;<\/p>\n<p>The team successfully repeated this process with other proteins, including aminoacyl-tRNA synthetases, cytosine deaminase and uracil glycosylase inhibitor. In each case, the barcoding experiment generated enough data points to train AI models.<\/p>\n<p>&#8220;What this approach provides is a practical framework for integrating AI with protein engineering,&#8221; said Xiao, who is also a Cancer Prevention and Research Institute Scholar. &#8220;Rather than relying on machine learning as a stand-alone solution, we couple it with an experimental platform that generates high-quality training data. This synergy enables more efficient discovery of advanced research tools and next-generation therapeutic proteins.&#8221;<\/p>\n<p>Publication details<\/p>\n<p>Linqi Cheng et al, Sequence Display enables large-scale sequence\u2013activity datasets for rapid protein evolution, Nature Biotechnology (2026). 
<a data-doi=\"1\" href=\"https:\/\/dx.doi.org\/10.1038\/s41587-026-03087-3\" target=\"_blank\" rel=\"nofollow noopener\">DOI: 10.1038\/s41587-026-03087-3<\/a><\/p>\n<p>Citation:<br \/>\nThis protein-engineering breakthrough generates over 10M data points and turbocharges AI in just three days (2026, April 19)<br \/>\nretrieved 20 April 2026<br \/>\nfrom https:\/\/phys.org\/news\/2026-04-protein-breakthrough-generates-10m-turbocharges.html<\/p>\n","protected":false},"excerpt":{"rendered":"The process of generating protein activity data (top) and reading the output and training AI models (bottom). 
Credit:&hellip;\n","protected":false},"author":2,"featured_media":8097,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[24,25,165,166,164,161,160,162,134,163],"class_list":{"0":"post-8096","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-ai","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-materials","11":"tag-nanotech","12":"tag-physics","13":"tag-physics-news","14":"tag-science","15":"tag-science-news","16":"tag-technology","17":"tag-technology-news"},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/8096","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/comments?post=8096"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/8096\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media\/8097"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media?parent=8096"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/categories?post=8096"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/tags?post=8096"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}