Google’s AI research arm DeepMind has introduced a Decoupled Distributed Low-Communication (Decoupled DiLoCo) training architecture that can train advanced AI models across distributed datacentres.

The architecture isolates local disruptions such as hardware failures and network issues so that other parts of the system can continue learning efficiently. It achieves this by partitioning large-scale training workloads into decoupled compute islands that exchange data asynchronously.

“This enables large language model pre-training across geographically distant datacentres without requiring the tight synchronization that makes conventional approaches brittle at scale,” says a blog post by DeepMind.

DeepMind has shared a research paper titled “Decoupled DiLoCo for Resilient Distributed Pre-training”, in which it explains how Decoupled DiLoCo breaks a global cluster of processors into independent asynchronous learners.

The paper explains that frontier AI models are typically trained on a large, tightly coupled system where identical chips must stay in near-perfect synchronisation. “This approach is highly effective for today’s models, but as we look towards the future of scale, achieving such levels of sync across thousands of chips becomes a logistical challenge,” it says.

By dividing large training runs across decoupled “islands” of compute, with asynchronous data flowing between them, this architecture isolates local disruptions so that other parts of the system can keep learning efficiently.

“The result is a more resilient and flexible way to train advanced models across globally distributed datacentres. And crucially, Decoupled DiLoCo does not suffer the communication delays that made previous distributed methods like Data-Parallel impractical at global scale,” the paper says.

In this process, each learner group trains on its own data shard at its own pace while communicating parameter fragments to a lightweight central synchroniser that aggregates them asynchronously. The synchroniser defines a threshold: the minimum number of independent learners that must complete a step before the system moves the training forward.
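The quorum mechanism described in the paper can be sketched roughly as follows. This is a minimal illustration, not DeepMind’s implementation: the class name `Synchroniser`, the `min_quorum` parameter, and the plain-averaging rule are all assumptions.

```python
# Minimal sketch of quorum-based asynchronous aggregation.
# Hypothetical names and logic; not DeepMind's actual implementation.

class Synchroniser:
    def __init__(self, min_quorum):
        self.min_quorum = min_quorum   # minimum learners required per step
        self.pending = {}              # learner_id -> submitted update

    def submit(self, learner_id, update):
        """A learner reports its parameter update asynchronously."""
        self.pending[learner_id] = update

    def try_advance(self):
        """Advance global training only once enough learners have reported."""
        if len(self.pending) < self.min_quorum:
            return None                # below quorum: keep waiting
        # Average whatever arrived; stragglers are not blocked on.
        n = len(self.pending)
        merged = [sum(vals) / n for vals in zip(*self.pending.values())]
        self.pending.clear()
        return merged
```

The key property is that `try_advance` never blocks: a slow or failed learner simply misses the quorum, and the global step proceeds without it.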

This “minimum quorum” strategy is supported by an adaptive grace window (a buffer designed to maximise sample efficiency without giving up the system’s speed) and token-weighted merging, a weighting scheme used to reconcile the divergent states of different learners.

As a result, faster learners, or those that process more data, are weighted appropriately, maintaining model stability without sacrificing speed.
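Token-weighted merging could look something like the sketch below, where each learner’s contribution to the merged parameters is proportional to the number of tokens it processed. The function name and the exact weighting rule are assumptions, not DeepMind’s published formula.

```python
# Sketch of token-weighted merging: learners that processed more
# tokens contribute proportionally more to the merged parameters.
# Hypothetical function; the precise weighting rule is an assumption.

def token_weighted_merge(updates):
    """updates: list of (tokens_processed, parameter_vector) pairs."""
    total_tokens = sum(tokens for tokens, _ in updates)
    dim = len(updates[0][1])
    merged = [0.0] * dim
    for tokens, params in updates:
        weight = tokens / total_tokens   # weight by share of tokens seen
        for i, p in enumerate(params):
            merged[i] += weight * p
    return merged
```

A learner that saw three times as many tokens thus pulls the merged state three times as hard towards its own parameters, which is one plausible way to reconcile divergent learner states.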

DeepMind claims that the approach allows for zero global downtime and maintains a training goodput (Google’s metric for measuring AI system efficiency) of nearly 90% even when hardware failures are simulated aggressively. The blog says this contrasts with traditional elastic methods, where goodput can fall by up to 40% during runtime.
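Goodput can be understood, roughly, as the fraction of wall-clock time spent doing useful training rather than stalled on failures or synchronisation. The sketch below captures that reading; Google’s exact definition may differ.

```python
# Rough illustration of "goodput": the fraction of wall-clock time
# spent on useful training rather than stalls, restarts, or waits.
# Google's precise definition may differ from this simplification.

def goodput(useful_seconds, total_seconds):
    return useful_seconds / total_seconds

# e.g. 54 minutes of useful training within a 60-minute window
# corresponds to a goodput of 0.9, i.e. 90%.
```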

The Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed datacentres, making it practical to train large language models across distant locations.

Left: The Decoupled DiLoCo approach requires orders of magnitude less bandwidth than conventional training methods, making it very efficient. Middle: With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of “goodput”, or useful training, while that of other approaches nosedives. (The first two charts are based on simulated training runs.) Right: In real-world experiments, the benchmarked ML performance of Gemma 4 models trained using Decoupled DiLoCo equalled the performance attained with conventional training approaches.

“Decoupled DiLoCo brings those ideas together to train AI models more flexibly at scale. Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units) so that a chip failure in one area doesn’t interrupt the progress of the others,” says the research note.

The infrastructure is also self-healing. In testing, the researchers used a method called “chaos engineering” to introduce artificial hardware failures during training runs. Decoupled DiLoCo continued the training process after losing entire learner units, then seamlessly reintegrated them when they came back online.

The system was tested using Google’s Gemma 4 AI models, with DeepMind researcher Arthur Douillard claiming the team trained a 12-billion-parameter model across four separate U.S. regions using 2-5 Gb/s of wide-area networking.

“Decoupled DiLoCo is not only more resilient to failures, but is also practical for executing production-level, fully distributed pre-training. We successfully trained a 12 billion parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking (relatively achievable using existing internet connectivity between datacentre facilities, rather than requiring new custom network infrastructure between facilities),” a blog post by DeepMind says.

The system achieved this training result more than 20 times faster than conventional synchronisation methods. This is because the system folds required communication into longer periods of computation, avoiding the “blocking” bottlenecks where one part of the system must wait for another, the post said.
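The overlap idea can be illustrated with a toy sketch in which parameter exchange runs in a background thread while local training steps continue. Real systems use asynchronous collectives rather than Python threads, and all names below are illustrative assumptions.

```python
# Toy sketch of overlapping communication with computation: the
# "network transfer" runs in a background thread while compute steps
# proceed, so no step blocks waiting on the network.
# Illustrative only; production systems use async collectives.

import threading
import time

def communicate(payload, done):
    time.sleep(0.05)              # stand-in for a slow wide-area transfer
    done.append(payload)

def train_with_overlap(steps):
    results, done = [], []
    comm = None
    for step in range(steps):
        if comm is None or not comm.is_alive():
            # Kick off communication in the background.
            comm = threading.Thread(target=communicate, args=(step, done))
            comm.start()
        results.append(step * step)   # local compute proceeds unblocked
    comm.join()                       # drain the outstanding transfer
    return results, done
```

The point of the sketch is that the compute loop never waits on `communicate`; the transfer completes during later computation, which is the “non-blocking” behaviour the post describes.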

Additionally, the system allowed the researchers to mix different generations of Tensor Processing Units (TPUs) in a single training run. This means the shelf life of existing hardware would increase substantially, enhancing the total compute available for model training in the future.

During their experiments, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, showing that even older hardware can meaningfully accelerate AI training, DeepMind says.

What’s more, because new generations of hardware don’t arrive everywhere all at once, being able to train across generations can alleviate recurring logistical and capacity bottlenecks.