Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates
Training frontier AI models is, at its core, a coordination problem. Thousands of chips must communicate with one another repeatedly, synchronizing every gradient update across the network. When one chip fails or even slows down, the entire training run can stall. As models scale toward hundreds of billions of parameters, that fragility turns…
