DiLoCo: A Distributed Free Lunch
Streaming DiLoCo matches co-located data-parallel training while running on distributed accelerators
Paper: Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
Researchers from Google DeepMind, Google Research and Apple want to speed up LLM training by distributing it across many accelerators. The standard approach, however, requires all devices to be co-located and connected by low-latency, high-bandwidth links, because internal states and parameter gradients are exchanged at every gradient step. Recent methods such as DiLoCo relax this co-location constraint by grouping accelerators into "workers" that synchronize only infrequently.
Hmm... What’s the background?
The standard approach to training these models remains mini-batch stochastic gradient descent with backpropagation, run in parallel across multiple hardware accelerators. Modern LLM training runs can use tens of thousands of accelerators, which creates complex engineering challenges.
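To see why this demands co-location, here is a minimal NumPy sketch (an illustration, not code from the paper) of synchronous data-parallel SGD on a toy linear-regression problem: each worker computes a gradient on its own data shard, and a simulated all-reduce averages those gradients at every single step.

```python
# Minimal sketch of synchronous data-parallel SGD (toy example, not from the paper).
# The per-step gradient averaging stands in for an all-reduce over a fast interconnect.
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 8, 0.1
params = rng.normal(size=dim)                       # shared model parameters
shards = [rng.normal(size=(32, dim)) for _ in range(num_workers)]
targets = [s @ np.ones(dim) for s in shards]        # toy regression targets

def local_gradient(w, x, y):
    """Gradient of mean squared error for a linear model on one worker's shard."""
    return 2 * x.T @ (x @ w - y) / len(x)

for step in range(100):
    grads = [local_gradient(params, x, y) for x, y in zip(shards, targets)]
    avg_grad = np.mean(grads, axis=0)               # the per-step all-reduce
    params -= lr * avg_grad                         # every replica applies the same update
```

Because this exchange happens at every optimizer step, any slow link between workers immediately stalls the whole run.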
So what is proposed in the research paper?
Here are the main insights:
Instead of synchronizing all parameters at once, only subsets of parameters are synchronized in sequence, reducing peak bandwidth
Workers continue training while synchronizing, which decreases wall clock time
The data exchanged by workers is quantized to four bits per parameter, which further reduces bandwidth
Together, these modifications reduce the required bandwidth by two orders of magnitude while maintaining similar learning quality. Streaming DiLoCo is shown to outperform the original DiLoCo and to match the performance of bandwidth-costly data-parallel training.
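To make the communication pattern concrete, below is a rough NumPy sketch under simplifying assumptions; it is not the authors' implementation. Workers take many local steps between synchronizations, the parameters are split into fragments that are synchronized one per round, and the exchanged deltas are quantized to 4 bits. The crude quantizer, the noisy toy gradient, and the plain-averaging outer update are placeholders for the paper's actual choices.

```python
# Rough sketch of a Streaming-DiLoCo-style communication pattern (assumptions, not the paper's code).
import numpy as np

rng = np.random.default_rng(1)
num_workers, dim, num_fragments = 4, 16, 4
inner_steps, lr = 8, 0.05

global_params = rng.normal(size=dim)
local_params = [global_params.copy() for _ in range(num_workers)]
fragments = np.array_split(np.arange(dim), num_fragments)   # index sets, one per fragment

def quantize_4bit(x):
    """Crude symmetric 4-bit quantization of a delta vector (a simplifying assumption)."""
    scale = np.max(np.abs(x)) / 7 + 1e-12
    return np.clip(np.round(x / scale), -8, 7) * scale

def toy_gradient(w, noise):
    """Gradient of a toy quadratic loss; the noise stands in for each worker's own data shard."""
    return 2 * (w - 1.0) + noise

for outer_round in range(20):
    # Inner phase: each worker trains locally with no cross-worker traffic.
    for w in range(num_workers):
        for _ in range(inner_steps):
            noise = rng.normal(scale=0.1, size=dim)
            local_params[w] -= lr * toy_gradient(local_params[w], noise)

    # Streaming sync: only one fragment is exchanged per round, so peak bandwidth
    # is roughly 1/num_fragments of a full synchronization.
    idx = fragments[outer_round % num_fragments]
    deltas = [quantize_4bit(local_params[w][idx] - global_params[idx])
              for w in range(num_workers)]
    global_params[idx] += np.mean(deltas, axis=0)   # simplified outer update (plain averaging)
    for w in range(num_workers):
        local_params[w][idx] = global_params[idx]
```

In the paper, the fragment exchange additionally overlaps with ongoing local training, which is what hides the remaining communication behind computation and reduces wall-clock time.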
What’s next?
The paper suggests that the ubiquity of co-located data-parallel training is due to the "hardware lottery" rather than any inherent superiority of the method. Future research should focus on bringing ideas from federated learning to large-scale LLM training, an area that remains under-studied.
Learned something new? Consider sharing it!