A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

Sparse Mixture-of-Experts (s-MoE) architectures have become a prominent strategy for scaling large AI systems by activating only a small subset of specialized sub-networks, referred to as "experts," for each input token. A central operational challenge in these systems is load balancing: distributing tokens across experts in a way that minimizes idle capacity, ensures efficient use of computational resources, and promotes thorough training across all model components. This study develops a theoretical framework for analyzing Auxiliary-Loss-Free Load Balancing (ALF-LB), a procedure introduced by researchers at DeepSeek, by recasting it as a primal-dual optimization method. In this formulation, load balancing is treated as an assignment problem solved via a lightweight, constant-time update at each training iteration, avoiding the computational overhead of auxiliary loss terms used in prior approaches. In a stylized deterministic setting, the framework yields several structural insights: a monotonic improvement condition for the optimization objective, a preference rule that systematically redirects tokens from overloaded to underloaded experts, and a formal approximate-balancing guarantee. The analysis is then extended to account for the stochastic and dynamic nature of real AI training through an online optimization formulation. In this setting, a strong convexity property of the objective is established, yielding a logarithmic expected regret bound under appropriate step-size choices. Empirical experiments on one-billion-parameter DeepSeekMoE models complement the theoretical findings, supporting the practical relevance of the framework. Together, these results provide a principled foundation for understanding and analyzing load balancing in sparse mixture-of-experts AI architectures.
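
To make the per-iteration update concrete, the following toy simulation sketches a bias-based balancing loop of the kind the abstract describes. It is not the paper's formulation: it assumes the commonly described ALF-LB rule, in which a per-expert bias is added to routing scores for top-k selection only and is then nudged by a fixed step toward underloaded experts. The function names, the synthetic skewed scores, and the step size are all hypothetical choices made for illustration.

```python
import random


def route_top_k(scores, bias, k):
    """Select k experts per token by biased score (bias affects selection only)."""
    assignments = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        ranked = sorted(range(len(biased)), key=lambda e: biased[e], reverse=True)
        assignments.append(ranked[:k])
    return assignments


def expert_loads(assignments, num_experts):
    """Count how many token slots each expert received."""
    loads = [0] * num_experts
    for chosen in assignments:
        for e in chosen:
            loads[e] += 1
    return loads


def update_bias(bias, loads, step):
    """Assumed sign-based update: lower the bias of overloaded experts,
    raise it for underloaded ones, leave exactly balanced experts alone."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_tokens=256, num_experts=8, k=2, step=0.01, iters=100, seed=0):
    """Run the loop on synthetic scores that favor low-index experts.

    Returns the final bias vector and the per-iteration imbalance
    (max load minus min load), measured before each bias update.
    """
    rng = random.Random(seed)
    # Hypothetical skew: expert 0 gets the largest score offset, the last expert the smallest.
    skew = [0.5 * (num_experts - e) / num_experts for e in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(iters):
        scores = [
            [rng.random() + skew[e] for e in range(num_experts)]
            for _ in range(num_tokens)
        ]
        loads = expert_loads(route_top_k(scores, bias, k), num_experts)
        history.append(max(loads) - min(loads))
        bias = update_bias(bias, loads, step)
    return bias, history
```

Calling `bias, history = simulate()` and comparing `history[0]` with `history[-1]` shows how the imbalance evolves in this toy setting. The update itself touches each expert's bias once per iteration and involves no gradient computation, which is the sense in which the sketch is "lightweight"; it does not model the stochastic training dynamics or the regret analysis discussed in the abstract.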
