DiLoCo: Distributed Model Training in Gonka

Large language models like GPT or Qwen are trained on huge clusters of GPUs connected by ultra-fast channels. DiLoCo (Distributed Low-Communication training) changes the game: it makes it possible to train such models over the regular internet, without a single data center.

Why Distributed Training is Needed

Modern AI models contain hundreds of billions of parameters. Training such a model requires hundreds of GPUs working synchronously. The traditional approach is to assemble all GPUs in one data center and connect them with InfiniBand. This is expensive, limits scale, and creates a single point of failure. DiLoCo allows distributed training across clusters in different parts of the world.

How DiLoCo Works

Each GPU cluster (for example, 8×H100) trains the model locally with the AdamW optimizer. Roughly every 1,000 local steps, the clusters synchronize through a global (outer) optimizer that applies Nesterov momentum to the averaged parameter updates. Because only these infrequent updates cross the network, synchronization needs minimal bandwidth, and a regular internet connection is enough. This is radically different from the classic approach, where GPUs exchange gradients at every step.
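The two-level loop above can be sketched in a few lines. This is a minimal toy illustration under stated assumptions, not Gonka's implementation: plain SGD stands in for each cluster's AdamW, and every `grad_fn` is a hypothetical stand-in for a worker's local gradient computation.

```python
import numpy as np

def diloco_round(global_params, workers, inner_steps, inner_lr,
                 outer_lr, outer_momentum, velocity):
    """One DiLoCo round: every worker trains locally from the same
    starting point, then one outer Nesterov-momentum step is applied
    to the averaged 'pseudo-gradient'."""
    local_replicas = []
    for grad_fn in workers:
        # Local phase: copy the global weights and run many optimizer
        # steps with NO network traffic (SGD here as a stand-in for
        # the AdamW each cluster uses).
        theta = global_params.copy()
        for _ in range(inner_steps):
            theta -= inner_lr * grad_fn(theta)
        local_replicas.append(theta)

    # Pseudo-gradient: how far the averaged replicas drifted from the
    # global weights. Only one such vector per round crosses the
    # internet, which is why the bandwidth requirement is small.
    pseudo_grad = global_params - np.mean(local_replicas, axis=0)

    # Outer phase: Nesterov momentum on the pseudo-gradient.
    velocity = outer_momentum * velocity + pseudo_grad
    new_global = global_params - outer_lr * (outer_momentum * velocity
                                             + pseudo_grad)
    return new_global, velocity

# Toy demo: two "clusters" whose local losses pull toward different
# targets; the outer loop converges to the consensus solution.
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
workers = [lambda th, t=t: 2.0 * (th - t) for t in targets]

theta, velocity = np.zeros(2), np.zeros(2)
for _ in range(30):
    theta, velocity = diloco_round(theta, workers, inner_steps=10,
                                   inner_lr=0.05, outer_lr=0.7,
                                   outer_momentum=0.9, velocity=velocity)
# theta ends up near the consensus optimum [0.5, 0.5]
```

The outer learning rate and momentum values here (0.7 and 0.9) are illustrative choices for the toy problem, not Gonka's settings.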

What this means for the Gonka network

Thanks to DiLoCo, Gonka can train models with 30-50 billion parameters using host GPUs scattered all over the world. No single data center is needed – just clusters of 8 GPUs with an internet connection. This makes AI training truly decentralized and opens the way for models trained by the community itself.

DiLoCo is a technology for training AI models over the internet. GPU clusters work independently and synchronize rarely, allowing Gonka to train models without a centralized data center.
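To make "minimal bandwidth" concrete, here is a back-of-the-envelope estimate of the communication volume. All numbers are illustrative assumptions (fp16 weights, the 30B size and ~1,000-step interval mentioned above), not measured Gonka figures.

```python
PARAMS = 30e9              # a 30B-parameter model, as in the text
BYTES_PER_PARAM = 2        # assuming fp16/bf16 weights
SYNC_INTERVAL = 1000       # inner steps between synchronizations

# Data each cluster ships per synchronization round.
payload_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"per-sync payload: {payload_gb:.0f} GB")        # 60 GB

# A per-step gradient exchange would move a comparable payload
# 1,000x as often over the same training window, so rare
# synchronization cuts communication volume roughly 1,000-fold.
per_step_gb = payload_gb * SYNC_INTERVAL
print(f"same window, per-step exchange: {per_step_gb:.0f} GB")
```

The ~1,000-fold reduction in traffic is what lets an ordinary internet link replace an InfiniBand fabric.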

Want to learn more?

Understand the GNK economy or start earning right now.