{"id":19999,"date":"2026-04-28T12:24:19","date_gmt":"2026-04-28T12:24:19","guid":{"rendered":"https:\/\/www.europesays.com\/ai\/19999\/"},"modified":"2026-04-28T12:24:19","modified_gmt":"2026-04-28T12:24:19","slug":"googles-deepminds-new-approach-to-distributed-training-of-ai-models","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ai\/19999\/","title":{"rendered":"Google\u2019s DeepMind\u2019s New Approach to Distributed Training of AI Models"},"content":{"rendered":"<p style=\"font-weight: 400;\">Google\u2019s AI research arm DeepMind has introduced a decoupled distributed low-communication (DiLoCo) training architecture that could train advanced AI models across distributed datacentres.<\/p>\n<p style=\"font-weight: 400;\">The architecture isolates local disruptions such as hardware failures and network issues so that other parts of the system can continue learning efficiently. This is achieved by partitioning large-scale training workloads into decoupled compute islands that exchange data asynchronously.<\/p>\n<p style=\"font-weight: 400;\">\u201cThis enables large language model pre-training across geographically distant datacentres without requiring the tight synchronization that makes conventional approaches brittle at scale,\u201d says a blog post by DeepMind.<\/p>\n<p style=\"font-weight: 400;\">DeepMind has shared a <a href=\"https:\/\/arxiv.org\/abs\/2604.21428v1\" rel=\"nofollow noopener\" target=\"_blank\">research paper<\/a> titled \u201cDecoupled DiLoCo for Resilient Distributed Pre-training\u201d in which it explains how Decoupled DiLoCo breaks down a global cluster of processors into independent asynchronous learners.<\/p>\n<p style=\"font-weight: 400;\">The paper explains that frontier AI models are trained on a large, tightly coupled system where identical chips must stay in near-perfect synchronisation. 
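<\/p>\n<p style=\"font-weight: 400;\">In lock-step training of this kind, every step finishes only when the slowest chip does, so a single straggler or failure stalls the entire run. A minimal sketch of that bottleneck (illustrative code, not DeepMind\u2019s implementation):<\/p>

```python
# Illustrative sketch (not DeepMind's code): in tightly synchronised
# training, every chip must reach a barrier before the next step begins,
# so the per-step time is the maximum over all chips.
def synchronous_step_time(per_chip_times):
    """Lock-step training: each step lasts as long as the slowest chip."""
    return max(per_chip_times)

# Nine fast chips and one straggler: the straggler sets the pace for all.
step = synchronous_step_time([1.0] * 9 + [5.0])
print(step)  # 5.0, even though most chips finished in 1.0
```

<p style=\"font-weight: 400;\">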
\u201cThis approach is highly effective for today\u2019s models, but as we look towards the future of scale, achieving such levels of sync across thousands of chips becomes a logistical challenge,\u201d it says.<\/p>\n<p style=\"font-weight: 400;\">By dividing large training runs across decoupled \u201cislands\u201d of compute, with asynchronous data flowing between them, this architecture isolates local disruptions so that other parts of the system can keep learning efficiently.<\/p>\n<p style=\"font-weight: 400;\">\u201cThe result is a more resilient and flexible way to train advanced models across globally distributed datacentres. And crucially, Decoupled DiLoCo does not suffer the communication delays that made previous distributed methods like Data-Parallel impractical at global scale,\u201d the paper says.<\/p>\n<p style=\"font-weight: 400;\">In this process, each learner group works on its own data shard at its own speed while communicating parameter fragments to a central lightweight synchroniser that aggregates them in an asynchronous manner. 
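<\/p>\n<p style=\"font-weight: 400;\">One simple way to picture that aggregation step is as a token-weighted average, in which each learner\u2019s state counts in proportion to how much data it has processed. A minimal sketch under that assumption (hypothetical names, scalars standing in for parameter tensors, not DeepMind\u2019s implementation):<\/p>

```python
# Hypothetical sketch of token-weighted merging: the synchroniser reconciles
# divergent learner states by averaging them, weighted by the number of
# tokens each learner has processed. Scalars stand in for parameter tensors.
def token_weighted_merge(states):
    """states: list of (params, tokens_processed) pairs from the learners."""
    total_tokens = sum(tokens for _, tokens in states)
    return sum(params * tokens for params, tokens in states) / total_tokens

# A learner that processed 3x the tokens pulls the merged state toward itself.
merged = token_weighted_merge([(1.0, 3000), (4.0, 1000)])
print(merged)  # 1.75 = (1.0 * 3000 + 4.0 * 1000) / 4000
```

<p style=\"font-weight: 400;\">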
The central synchroniser defines a threshold: the minimum number of independent learners that must complete their work before the system moves the training forward.<\/p>\n<p style=\"font-weight: 400;\">This \u201cminimum quorum\u201d strategy is supported by an adaptive grace window (a buffer designed to maximise sample efficiency without giving up the system\u2019s speed) and token-weighted merging, a weighting scheme used to reconcile the divergent states of different learners.<\/p>\n<p style=\"font-weight: 400;\">As a result, faster learners, or those capable of processing more data, are weighted appropriately, maintaining model stability without sacrificing speed.<\/p>\n<p style=\"font-weight: 400;\">DeepMind claims that the approach allows for zero global downtime and maintains a training goodput (Google\u2019s metric for measuring AI system efficiency) of nearly 90% even when hardware failures are simulated aggressively. The blog says this contrasts with traditional elastic methods, where goodput can fall by up to 40% during runtime.<\/p>\n<p style=\"font-weight: 400;\">Decoupled DiLoCo builds on two earlier advances:\u00a0<a href=\"https:\/\/blog.google\/innovation-and-ai\/products\/introducing-pathways-next-generation-ai-architecture\/\" rel=\"nofollow noopener\" target=\"_blank\">Pathways<\/a>, which introduced a distributed AI system based on asynchronous data flow, and\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2311.08105\" rel=\"nofollow noopener\" target=\"_blank\">DiLoCo<\/a>, which dramatically reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-221919 size-large\" src=\"https:\/\/www.europesays.com\/ai\/wp-content\/uploads\/2026\/04\/DiLoCo-1024x452.jpg\" alt=\"deepmind-diloco\" width=\"1024\" height=\"452\"  \/>Left: The Decoupled DiLoCo approach requires orders of 
magnitude less bandwidth than conventional training methods, making it very efficient. Middle: With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of \u201cgoodput\u201d, or useful training, while that of other approaches nosedives. (The first two charts are based on simulated training runs). Right: In real-world experiments, the benchmarked ML performance of Gemma 4 models trained using Decoupled DiLoCo equalled the performance attained with conventional training approaches.<\/p>\n<p style=\"font-weight: 400;\">\u201cDecoupled DiLoCo brings those ideas together to train AI models more flexibly at scale. Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units) so that a chip failure in one area doesn\u2019t interrupt the progress of the others,\u201d says the research note.<\/p>\n<p style=\"font-weight: 400;\">The infrastructure is also self-healing. In testing, the researchers used a method called \u201cchaos engineering\u201d to introduce artificial hardware failures during training runs. Decoupled DiLoCo continued the training process after the loss of entire learner units, and then seamlessly reintegrated them when they came back online.<\/p>\n<p style=\"font-weight: 400;\">The system was tested using Google\u2019s Gemma 4 AI models, with DeepMind researcher Arthur Douillard claiming the team trained a 12-billion parameter model across four separate U.S. regions using 2-5 Gb\/s of wide-area networking.<\/p>\n<p style=\"font-weight: 400;\">\u201cDecoupled DiLoCo is not only more resilient to failures, but is also practical for executing production-level, fully distributed pre-training. We successfully trained a 12 billion parameter model across four separate U.S. 
regions using 2-5 Gbps of wide-area networking (relatively achievable using existing internet connectivity between datacentre facilities, rather than requiring new custom network infrastructure between facilities),\u201d a <a href=\"https:\/\/deepmind.google\/blog\/decoupled-diloco\/\" rel=\"nofollow noopener\" target=\"_blank\">blog post<\/a> by DeepMind says.<\/p>\n<p style=\"font-weight: 400;\">The system achieved this training result more than 20 times faster than conventional synchronization methods. This is because, the post says, the system incorporates required communication into longer periods of computation, avoiding the \u201cblocking\u201d bottlenecks where one part of the system must wait for another.<\/p>\n<p style=\"font-weight: 400;\">The system also allowed the researchers to mix different generations of Tensor Processing Units (TPUs) in a single training run. This could substantially extend the shelf life of existing hardware and enhance the total compute available for model training in the future.<\/p>\n<p style=\"font-weight: 400;\">During their experiments, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, demonstrating that even older hardware can meaningfully accelerate AI training, DeepMind says.<\/p>\n<p style=\"font-weight: 400;\">What\u2019s more, because new generations of hardware don\u2019t arrive everywhere all at once, being able to train across generations can alleviate recurring logistical and capacity bottlenecks.<\/p>\n","protected":false},"excerpt":{"rendered":"Google\u2019s AI research arm DeepMind has introduced a decoupled distributed low-communication (DiLoCo) training architecture that could train 
advanced&hellip;\n","protected":false},"author":2,"featured_media":20000,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[24,5044,14163,132,7543,1160,3202],"class_list":{"0":"post-19999","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-google","8":"tag-ai","9":"tag-deepmind","10":"tag-diloco","11":"tag-google","12":"tag-google-deepmind","13":"tag-models","14":"tag-training"},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/19999","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/comments?post=19999"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/posts\/19999\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media\/20000"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/media?parent=19999"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/categories?post=19999"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ai\/wp-json\/wp\/v2\/tags?post=19999"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}