{"id":102896,"date":"2025-10-04T15:33:13","date_gmt":"2025-10-04T15:33:13","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/102896\/"},"modified":"2025-10-04T15:33:13","modified_gmt":"2025-10-04T15:33:13","slug":"massive-datasets-meet-their-match","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/102896\/","title":{"rendered":"Massive Datasets Meet Their Match"},"content":{"rendered":"<p>Newswise \u2014 Just as streaming services have replaced CDs and DVDs by letting people watch or listen without downloading the content first, data streaming lets scientists analyze raw data from tools like microscopes and drones without needing to save the entire dataset beforehand.<\/p>\n<p>Recent award-winning research by <a href=\"https:\/\/www.pnnl.gov\/people\/s-m-ferdous\" rel=\"nofollow noopener\" target=\"_blank\">S. M. Ferdous<\/a> at <a href=\"https:\/\/www.pnnl.gov\/\" rel=\"nofollow noopener\" target=\"_blank\">Pacific Northwest National Laboratory (PNNL)<\/a> and his collaborators Ahammed Ullah and Alex Pothen at Purdue University makes analyzing large data streams significantly faster and, in the process, can make extreme-scale data AI-ready. By combining streaming with parallel computing, the team developed an algorithm that speeds up data analysis by nearly two orders of magnitude.<\/p>\n<p>Powering a poly-streaming model with parallel computing<\/p>\n<p>For algorithm design and analysis, researchers use models of computation, which are agreed-upon rules for what operations an algorithm can perform and what resources it can use. Different models capture different settings. The classic random-access machine (RAM) model does not impose a strict memory limit, whereas a streaming model restricts the memory an algorithm may use.<\/p>\n<p>A streaming algorithm processes a large dataset sequentially, often in one or a few passes, while maintaining a compact summary that fits in its limited memory. 
These summaries are designed to recover a high-quality solution for the entire input.<\/p>\n<p>\u201cThe poly-streaming model generalizes streaming to many processors and streams,\u201d said Ullah. \u201cEach processor maintains a small local summary of what it sees. Processors communicate as needed, which helps them choose summaries of good quality while limiting the number of passes. With suitably designed algorithms, the combined summaries suffice to obtain a high-quality solution.\u201d<\/p>\n<p>Ullah formulated the poly-streaming model as part of his PhD thesis in collaboration with Ferdous and Pothen. Within this framework, algorithms can jointly optimize time via parallel computing and space via data summarization. The researchers demonstrated its effectiveness using the maximum weight matching problem in graphs, which is a classical optimization problem with many applications.<\/p>\n<p>Making large datasets manageable<\/p>\n<p>\u201cThe size of data is getting larger and larger,\u201d said Ferdous, a staff scientist and past Linus Pauling Fellow at PNNL. \u201cWhen the datasets get too large, we can\u2019t easily store them on a computer. At the same time, we need to solve larger and larger problems involving these datasets.\u201d<\/p>\n<p>One solution has been to use supercomputers, such as exascale computers developed by the Department of Energy (DOE). However, some problems are too large for even the supercomputers to handle, and the large number of memory accesses increases the time needed to solve them. 
Streaming the datasets circumvents these memory storage issues: only a small summary of the data is kept, so the memory needed to analyze the dataset is much smaller.<\/p>\n<p>\u201cWhile this doesn\u2019t give an exact solution, we can prove that the approximations are accurate; they are a factor of two off the best solution, in the worst case,\u201d said Pothen, professor of computer science at Purdue University and Ullah\u2019s PhD advisor.<\/p>\n<p>Making extreme-scale data ready for AI<\/p>\n<p>Optimization problems such as the maximum weight matching problem have many applications. One such application is in the field of AI, where the data may need to be denoised and reduced in size before it can be analyzed. Solving the maximum weight matching problem can play a crucial role in processing data for AI tasks by identifying significant subsets of the data. This preprocessing step makes the data more relevant and leads to greater accuracy in reasoning tasks.<\/p>\n<p>Making large datasets \u201cAI-ready\u201d can be a challenge. Running raw data through an AI model without first denoising the data or reducing its size may lead to inaccurate results or make the computations infeasible.<\/p>\n<p>\u201cThe poly-streaming model has the ability to process extreme-scale data,\u201d said Ferdous. 
\u201cOur model can act as the mediator between the raw data and the AI model by processing and making sense of the data before the AI model analyzes it further.\u201d<\/p>\n<p>Looking ahead, the research team sees the model as especially applicable to processing the large amounts of data from DOE\u2019s scientific user facilities and preparing it for AI analysis, bridging the gap between AI and instrumentation.<\/p>\n<p>The theoretical contributions, practical performance, and applicability of the poly-streaming model were recognized with <a href=\"https:\/\/algo-conference.org\/2025\/schedule\/#tuesday\" rel=\"nofollow noopener\" target=\"_blank\">the \u2018best paper\u2019 prize at the recent European Symposium on Algorithms<\/a>, which took place in Warsaw, Poland, from September 15 to 17, 2025. This work was supported by the Advanced Scientific Computing Research program of the DOE Office of Science and by PNNL\u2019s Linus Pauling Distinguished Postdoctoral Fellowship.<\/p>\n<p class=\"text-center\">###<\/p>\n<p><strong>About PNNL<\/strong><\/p>\n<p><a href=\"https:\/\/www.pnnl.gov\/\" rel=\"nofollow noopener\" target=\"_blank\">Pacific Northwest National Laboratory<\/a> draws on its distinguishing strengths in chemistry, Earth sciences, biology and data science to advance scientific knowledge and address challenges in energy resiliency and national security. Founded in 1965, PNNL is operated by Battelle and supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. 
For more information, visit the <a href=\"https:\/\/www.energy.gov\/science\/office-science?__hstc=249664665.0e04b6d5539702b3e4f6217ddb79b38a.1739301739041.1759266154791.1759335549106.294&amp;__hssc=249664665.3.1759335549106&amp;__hsfp=5555856\" rel=\"nofollow noopener\" target=\"_blank\">DOE Office of Science website.<\/a> For more information on PNNL, visit <a href=\"https:\/\/www.pnnl.gov\/news?news[0]=type:994&amp;news[1]=type:165&amp;news[2]=type:24&amp;news[3]=type:23\" rel=\"nofollow noopener\" target=\"_blank\">PNNL&#8217;s News Center<\/a>. Follow us on <a href=\"https:\/\/twitter.com\/PNNLab\" rel=\"nofollow noopener\" target=\"_blank\">Twitter<\/a>, <a href=\"https:\/\/www.facebook.com\/PNNLgov\" rel=\"nofollow noopener\" target=\"_blank\">Facebook<\/a>, <a href=\"https:\/\/www.linkedin.com\/company\/pacific-northwest-national-laboratory\" rel=\"nofollow noopener\" target=\"_blank\">LinkedIn<\/a> and <a href=\"https:\/\/www.instagram.com\/pnnlab\/\" rel=\"nofollow noopener\" target=\"_blank\">Instagram<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"Newswise \u2014 Just as streaming services have replaced CDs and DVDs by letting people watch or listen 
without&hellip;\n","protected":false},"author":2,"featured_media":42004,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[262],"tags":[289,314,53750,18,9656,19,17,14998,941,57935,64837,64838,7905,82],"class_list":{"0":"post-102896","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-computing","8":"tag-artificial-intelligence","9":"tag-computing","10":"tag-doe-science-news-source","11":"tag-eire","12":"tag-engineering","13":"tag-ie","14":"tag-ireland","15":"tag-mathematics","16":"tag-newswise","17":"tag-pacific-northwest-national-laboratory","18":"tag-parallel-computingalgorithm-optimizationscientific-computing","19":"tag-stem-education","20":"tag-supercomputing","21":"tag-technology"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/102896","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=102896"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/102896\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/42004"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=102896"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=102896"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=102896"}],"curies":[{"name":"wp","href":"https:\
/\/api.w.org\/{rel}","templated":true}]}}