Auxiliary-Loss-Free Load Balancing
Strategy for Mixture-of-Experts
Abstract
For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead.
Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance.
In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, an auxiliary-loss-free load balancing strategy.
To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert.
By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load.
In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training.
We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens.
Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
1 Introduction
Mixture-of-Experts (MoE) architectures have emerged as a promising solution for managing computational costs when scaling up parameters in large language models (LLMs).
Recent applications of MoE in Transformer-based models (Vaswani et al., 2017) have led to successful attempts at scaling language models to substantial sizes (Shao et al., 2024; DeepSeek-AI et al., 2024; Dai et al., 2024; Fedus et al., 2021; Lepikhin et al., 2020), resulting in remarkable performance improvements.
However, the training of MoE models always faces the risk of load imbalance, which may result in routing collapse (Shazeer et al., 2017) or increased computational overhead (Fedus et al., 2021; Lepikhin et al., 2020; Shazeer et al., 2017).
In order to avoid imbalanced routing, existing methods (Fedus et al., 2021; Lepikhin et al., 2020) commonly use an auxiliary loss to encourage balanced expert load.
Although the auxiliary loss can alleviate load imbalance during training, it also introduces undesired gradients that conflict with the language modeling objective.
These interference gradients will impair the model performance, so existing MoE methods always need to consider the trade-off between load balance and model performance.
In this paper, we propose Loss-Free Balancing, an auxiliary-loss-free load balancing strategy, aiming at maintaining control over expert load balance while not introducing interference gradients.
Loss-Free Balancing features an iterative process of token routing and bias updating.
As illustrated in Figure 1, before the top-K routing decision of MoE, Loss-Free Balancing will first apply expert-wise biases to the original routing scores to produce biased gating scores, which determine the actual routing targets of each token during training.
These expert-wise biases will keep updating according to the expert load observed on recent training tokens, where the biases of heavy-load experts will be depressed and those of lite-load experts will be elevated.
Through this dynamic updating strategy, Loss-Free Balancing ensures that the biased gating scores can consistently lead to balanced routing results.
Compared with auxiliary-loss-controlled load balancing strategies, Loss-Free Balancing does not introduce undesired gradients that disrupt the primary language modeling objective, so its training process is free of gradient noise from balance control.
In order to validate the performance of Loss-Free Balancing, we train MoE language models with 1B parameters on 100B tokens and 3B parameters on 200B tokens from scratch.
Experimental results demonstrate that Loss-Free Balancing produces MoE models with better validation loss than traditional auxiliary-loss-controlled models.
While keeping this performance advantage, Loss-Free Balancing also achieves a significantly better load balance at the global and batch levels, and it is naturally compatible with expert parallelism, which is usually employed for training extremely large MoE models.
2 Background
2.1 Mixture-of-Experts
Current dominant MoE architectures (Lepikhin et al., 2020; Fedus et al., 2021; Dai et al., 2024) replace the MLP layers in standard Transformers with MoE layers. In an MoE layer, top-K routing is employed to select the experts for each token. Let $\mathbf{u}_t$ denote the input of the $t$-th token to an $N$-expert MoE layer; the output $\mathbf{h}_t$ is computed as follows:

$$
\mathbf{h}_t = \mathbf{u}_t + \sum_{i=1}^{N} g_{i,t} \, \mathrm{FFN}_i(\mathbf{u}_t),
$$
$$
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \mathrm{Topk}(\{ s_{j,t} \mid 1 \le j \le N \}, K), \\
0, & \text{otherwise},
\end{cases}
\tag{1}
$$
$$
s_{i,t} = G(\mathbf{u}_t^{\top} \mathbf{e}_i),
$$

where $G$ is a nonlinear gating function and $\mathbf{e}_i$ is the centroid of the $i$-th expert.
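To make the routing concrete, the following is a minimal PyTorch sketch of Equation (1), assuming a sigmoid gating function as in our main experiments; the tensor names and the per-expert dispatch loop are our own illustration, not an efficient reference implementation.

```python
# A minimal sketch of top-K routing in an MoE layer (Equation 1).
# Assumes a sigmoid gate G; names and the dispatch loop are illustrative.
import torch

def moe_layer(u, expert_centroids, experts, k):
    # u: [T, d] token inputs; expert_centroids: [N, d]; experts: N FFN modules.
    s = torch.sigmoid(u @ expert_centroids.T)            # scores s_{i,t}, [T, N]
    topk_scores, topk_idx = torch.topk(s, k, dim=-1)     # top-K experts per token
    g = torch.zeros_like(s).scatter(-1, topk_idx, topk_scores)  # s_{i,t} or 0
    out = u.clone()                                      # residual connection
    for i, ffn in enumerate(experts):
        mask = g[:, i] > 0                               # tokens routed to expert i
        if mask.any():
            out[mask] += g[mask, i:i+1] * ffn(u[mask])
    return out
```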
2.2 Auxiliary Loss for Load Balance
Auxiliary Loss
Uncontrolled routing strategies are likely to encounter load imbalance, which has two notable drawbacks.
Firstly, there is a risk of routing collapse (Shazeer et al., 2017), where the model consistently selects only a few experts, hindering sufficient training of the other experts.
Secondly, when experts are distributed across multiple devices, load imbalance can exacerbate computation bottlenecks.
To address these issues, an auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2020) is commonly employed to control load balance.
For a sequence of length $T$, the auxiliary loss is defined as:

$$
\mathcal{L}_{\mathrm{Balance}} = \alpha \sum_{i=1}^{N} f_i P_i,
$$
$$
f_i = \frac{N}{K T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ selects Expert } i),
\tag{2}
$$
$$
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t},
$$

where $N$ is the total number of experts, $K$ is the number of experts selected for each token, $s_{i,t}$ is the routing score of Expert $i$ for Token $t$, $f_i$ represents the fraction of tokens routed to Expert $i$, $P_i$ denotes the average gating score of Expert $i$, and $\alpha$ is a hyper-parameter controlling the strength of the auxiliary loss.
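As an illustration, here is a sketch of Equation (2) under the same assumed tensor layout as above (`s` holds the routing scores, `topk_idx` the selected experts per token). Note that the gradient flows only through the average gating scores $P_i$, since the token counts behind $f_i$ are non-differentiable.

```python
# A sketch of the auxiliary balance loss (Equation 2).
# s: [T, N] routing scores; topk_idx: [T, K] selected experts per token.
import torch

def aux_balance_loss(s, topk_idx, alpha):
    T, N = s.shape
    K = topk_idx.shape[1]
    # f_i: fraction of tokens routed to expert i, scaled by N / K.
    counts = torch.zeros(N).scatter_add(0, topk_idx.flatten(), torch.ones(T * K))
    f = counts * N / (K * T)
    # P_i: average gating score of expert i over the sequence.
    P = s.mean(dim=0)
    return alpha * (f * P).sum()
```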
The Dilemma Between Load Balance and Model Performance
The auxiliary loss mentioned above can encourage load balance, but it also interferes with language modeling training as an additional regularization term. The absence of an auxiliary loss or a small auxiliary loss coefficient $\alpha$ can lead to poor balance, while a large $\alpha$ can impair training, resulting in suboptimal performance.
To illustrate this dilemma, we present the relationship between load balance and model performance in Figure 2.
We vary $\alpha$ among 1e-2, 1e-3, 1e-4, and 0, and present the corresponding $\mathrm{MaxVio}_{\mathrm{global}}$, which measures the degree of load balance and whose computation details are described in § 4.1.
As shown in the figure, a small $\alpha$ causes routing collapse, affecting the model efficiency and potentially leading to some experts being insufficiently learned or exploited, while a large $\alpha$ keeps load balance under control but notably degrades the model performance.
In order to break this dilemma, we propose Loss-Free Balancing as a solution, which directly controls the expert load balance, but does not introduce unexpected gradients other than the gradients from the language modeling loss.
3 Auxiliary-Loss-Free Load Balancing Strategy
For a better load-balancing alternative that does not directly interfere with the main gradients from the training objective, we propose Loss-Free Balancing, which directly adjusts the gating scores of each expert according to their balance condition.
As illustrated in Figure 1, we add an expert-wise bias term $b_i$ to the gating scores of each expert, and use the biased scores to determine the top-K selection:

$$
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} + b_i \in \mathrm{Topk}(\{ s_{j,t} + b_j \mid 1 \le j \le N \}, K), \\
0, & \text{otherwise}.
\end{cases}
\tag{3}
$$
Note that the expert bias term $b_i$ is only used to adjust the routing strategy by influencing the top-K selection.
It is not added to the gating score $g_{i,t}$ that weights the output of the selected experts when computing the final output of the MoE layer.
In order to derive proper biases, we adjust each bias $b_i$ iteratively according to the following principle: decreasing it when the corresponding expert has a relatively heavy load, and vice versa.
To be specific, for each $b_i$, we keep monitoring its corresponding expert load on the previous batch.
If an expert has a heavy load on the previous batch, we will reduce its bias.
Otherwise, we will increase it.
Algorithm 1 describes the details of our update algorithm for the expert-wise biases.
It is worth noting that we update the biases based on the historical balance condition, since utilizing the load information of the current sequence will break the causal constraint of language modeling, leading to leakage of the information of future tokens.
Through this dynamic adjustment of the biases, we can achieve a good expert load balance without directly introducing noisy gradients into the model, unlike the auxiliary-loss-controlled method.
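Below is a minimal sketch of one routing-and-update cycle, combining the biased top-K selection of Equation (3) with the sign-based update described above. Since Algorithm 1 is not reproduced here, the class structure and bookkeeping are our own assumptions.

```python
# A sketch of Loss-Free Balancing: biased top-K routing plus a sign-based
# bias update driven by the expert load observed on the previous batch.
import torch

class LossFreeBalancer:
    def __init__(self, num_experts, k, update_rate=1e-3):
        self.b = torch.zeros(num_experts)   # expert-wise biases b_i
        self.k = k
        self.u = update_rate                # update rate u

    def route(self, s):
        # s: [T, N] original routing scores; biases affect selection only.
        _, topk_idx = torch.topk(s + self.b, self.k, dim=-1)
        # Gating weights still use the unbiased scores s_{i,t}.
        g = torch.zeros_like(s).scatter(-1, topk_idx, s.gather(-1, topk_idx))
        return g, topk_idx

    def update(self, topk_idx):
        # Called after each batch: uses only past load, so no future leakage.
        n = self.b.numel()
        load = torch.bincount(topk_idx.flatten(), minlength=n).float()
        err = load.mean() - load            # positive for under-loaded experts
        self.b += self.u * torch.sign(err)  # raise light experts, lower heavy ones
```

The update-rule variant studied in § 4.3 replaces `torch.sign(err)` with `err` itself.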
Comparison with Other Load Balancing Methods.
In order to show the theoretical advantages of Loss-Free Balancing, we compare it with two other mainstream load balancing methods, i.e., the auxiliary-loss-controlled method (Lepikhin et al., 2020; Fedus et al., 2021) and the Expert Choice (EC) method (Zhou et al., 2022).
As described in § 2.2, the auxiliary-loss-controlled method faces the dilemma between load balance and model performance, and a perfect trade-off may not exist.
As for the EC method, it will break the causal constraint of language modeling, since the target experts of each token are conditioned on the future tokens in the same sequence or batch.
This will result in the leakage of information about future tokens, thus destroying the generalization of the model.
Table 1 summarizes the properties of different load balancing methods.
Table 1: Comparison of different load balancing methods. Good properties are shown in green, and bad ones in red.
| Load Balancing Methods | Balanced Expert Load | Interference Gradients | Future Token Leakage |
| --- | --- | --- | --- |
| Loss-Controlled (strong auxiliary loss) | balanced | strong | no leakage |
| Loss-Controlled (weak auxiliary loss) | imbalanced | weak | no leakage |
| Expert Choice | balanced | none | with leakage |
| Loss-Free (Ours) | balanced | none | no leakage |
4 Experiments
4.1 Experimental Setups
Model Architecture.
We employ the DeepSeekMoE (Dai et al., 2024) architecture as the backbone since it outperforms conventional MoE architectures like GShard (Lepikhin et al., 2020) significantly.
Compared with GShard (Lepikhin et al., 2020), it segments experts into finer granularity and isolates some experts as shared ones.
Slightly different from DeepSeekMoE, in our main experiments, we choose sigmoid instead of softmax as the gating function $G$, since we find that the sigmoid baseline performs better than the softmax baseline.
Even so, we still provide the experimental results and discussion for the softmax gate in Appendix C.
Our experiments are based on two model sizes of 1B and 3B total parameters, and we tune the bias update rate $u$ only at the 1B scale.
Experiments under the 3B scale directly inherit the best configuration for the 1B scale.
Due to the page limit, we present more details about our architecture in Appendix A.
Training Settings.
We use a multilingual training corpus created by DeepSeek-AI, sourced from a diverse range of textual materials including web text, mathematical material, coding scripts, and published literature.
We employ the HuggingFace Tokenizer (https://github.com/huggingface/tokenizers) to train a byte pair encoding (BPE) (Sennrich et al., 2015) tokenizer with a vocabulary size of 32K.
In order to draw solid conclusions, we train the 1B model on 100B tokens and the 3B model on 200B tokens to ensure sufficient training.
We apply the cosine learning rate scheduler (Loshchilov & Hutter, 2016) and multi-step learning rate scheduler (Dai et al., 2024) for the 1B and 3B models, respectively.
Due to the page limit, we list more details about our training settings and hyper-parameters in Appendix B.
Baseline.
We compare our Loss-Free Balancing method with the conventional auxiliary-loss-controlled method.
For the baseline, we set the auxiliary loss coefficient $\alpha$ to 0.001 to achieve a reasonable trade-off between model performance and load balance (see Figure 2).
We do not take the EC method into comparison due to its issue of future token leakage, which we will discuss in depth in § 5.2.
Metrics.
We reserve a validation set from the training corpus to evaluate model performance and load balance.
For model performance, we take perplexity as the metric.
For load balance, we introduce a metric called maximal violation (MaxVio) to quantify the degree of load balance of an MoE layer:
$$
\mathrm{MaxVio} = \frac{\max_i \mathrm{Load}_i - \overline{\mathrm{Load}}_i}{\overline{\mathrm{Load}}_i},
\tag{4}
$$

where $\mathrm{Load}_i$ represents the number of tokens assigned to the $i$-th expert, and $\overline{\mathrm{Load}}_i$ denotes the expected expert load under perfect load balance.
MaxVio has two variants: $\mathrm{MaxVio}_{\mathrm{global}}$ and $\mathrm{MaxVio}_{\mathrm{batch}}$.
For $\mathrm{MaxVio}_{\mathrm{global}}$, we count $\mathrm{Load}_i$ on the whole validation set, so it reflects the degree of balanced expert utilization and the efficiency upper bound when the batch size approaches the limit.
For $\mathrm{MaxVio}_{\mathrm{batch}}$, we count $\mathrm{Load}_i$ on each training batch, so it is more related to the training efficiency.
For simplicity, in the rest of this paper, we report the MaxVio averaged across all layers as a load balance measurement of the whole model.
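For concreteness, here is a short sketch of Equation (4); the same function serves both variants, depending on whether the routing assignments come from the whole validation set or from a single batch.

```python
# A sketch of MaxVio (Equation 4) over a set of routing assignments.
# topk_idx: selected expert indices for every token in the chosen scope
# (whole validation set for MaxVio_global, one batch for MaxVio_batch).
import torch

def max_violation(topk_idx, num_experts):
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    expected = load.sum() / num_experts    # expert load under perfect balance
    return (load.max() - expected) / expected
```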
4.2 Main Results
Table 2 shows the validation perplexity and $\mathrm{MaxVio}_{\mathrm{global}}$ for the 1B and 3B MoE models trained with the auxiliary loss or with our auxiliary-loss-free load balancing strategy.
As shown in the table, compared with the auxiliary-loss-controlled method, our Loss-Free Balancing achieves better perplexity and much better global load balance for both 1B and 3B models.
In addition, to present the load balance condition during training, we provide a load balancing curve depicting $\mathrm{MaxVio}_{\mathrm{batch}}$ over training steps in Figure 3, which demonstrates the persistent advantage of Loss-Free Balancing on load balance.
In summary, our Loss-Free Balancing method avoids interfering gradients during training and effectively controls the load balance, breaking the dilemma between load balance and model performance in MoE training.
Table 2: Loss-Free Balancing achieves lower perplexity and better load balance on both the 1B and 3B models. The validation set is used to compute these metrics (see Appendix B for details).
| Model Size | Load Balancing Methods | Validation Perplexity | $\mathrm{MaxVio}_{\mathrm{global}}$ |
| --- | --- | --- | --- |
| 1B | Loss-Controlled | 9.56 | 0.72 |
| 1B | Loss-Free | 9.50 | 0.04 |
| 3B | Loss-Controlled | 7.97 | 0.52 |
| 3B | Loss-Free | 7.92 | 0.04 |
4.3 Empirical Studies on Bias Update Algorithm
We conduct empirical studies on the update rate and variants of the bias update algorithm to validate the optimal configuration used in our main experiments.
Update rate.
The update rate $u$ in Algorithm 1 controls the speed at which the expert biases $b_i$ converge to the “suitable bias”. Figure 4 illustrates that an overly low update rate may lead to slow convergence, while an unnecessarily high update rate can cause undesirable fluctuations of the expert biases during the later stage of training, deteriorating load balance in this stage. Both situations can impair performance. An appropriate choice is $u = 0.001$, which shows good training balance and validation perplexity.
Update rule.
We investigate a different update rule for the expert-wise biases.
To be specific, we attempt to change the update rule from $b_i \leftarrow b_i + u \cdot \mathrm{sign}(e_i)$ to $b_i \leftarrow b_i + u \cdot e_i$, where $e_i$ is the load violation error of Expert $i$; this encourages the biases of experts with high violation errors to change faster. Although this variant slightly improves load balance, it does not lead to better performance, as shown in Table 3.
Therefore, we maintain the $\mathrm{sign}$ version.
Table 3: The variant $b_i \leftarrow b_i + u \cdot e_i$ slightly improves load balance but does not show improved model performance.
| Method | Perplexity | $\mathrm{MaxVio}_{\mathrm{global}}$ |
| --- | --- | --- |
| $u = 0.001$, $b_i \leftarrow b_i + u \cdot \mathrm{sign}(e_i)$ | 9.50 | 0.044 |
| $u = 0.001$, $b_i \leftarrow b_i + u \cdot e_i$ | 9.53 | 0.028 |
| $u = 0.01$, $b_i \leftarrow b_i + u \cdot e_i$ | 9.51 | 0.036 |
| $u = 0.1$, $b_i \leftarrow b_i + u \cdot e_i$ | 9.51 | 0.040 |
Multiplicative bias.
In addition to adding the expert-wise biases to the gating scores, using multiplicative biases is also a potential variant:
$$
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \cdot b_i \in \mathrm{Topk}(\{ s_{j,t} \cdot b_j \mid 1 \le j \le N \}, K), \\
0, & \text{otherwise}.
\end{cases}
\tag{5}
$$
These multiplicative biases $b_i$ can be updated using a similar procedure to Algorithm 1, except that they should be initialized as 1 instead of 0.
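For comparison with the additive sketch in § 3, the selection step under this multiplicative variant would look as follows (again an illustrative sketch, not the reference implementation):

```python
# A sketch of biased selection with multiplicative biases (Equation 5).
# As before, b only influences which experts are selected; the gating
# weights still come from the raw scores s.
import torch

def route_multiplicative(s, b, k):
    # s: [T, N] routing scores; b: [N] multiplicative biases, initialized to 1.
    _, topk_idx = torch.topk(s * b, k, dim=-1)
    g = torch.zeros_like(s).scatter(-1, topk_idx, s.gather(-1, topk_idx))
    return g, topk_idx
```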
Table 4 shows that using multiplicative biases results in slightly worse model performance compared to using additive biases, without significant improvements in load balance.
Based on these findings, we conclude that additive biases are a more suitable choice for our method.
Table 4: Multiplicative biases show a similar load balance but slightly worse performance than additive biases.
| Method | Perplexity | $\mathrm{MaxVio}_{\mathrm{global}}$ |
| --- | --- | --- |
| Additive Bias, $u = 0.001$ | 9.50 | 0.044 |
| Multiplicative Bias, $u = 0.001$ | 9.52 | 0.041 |
| Multiplicative Bias, $u = 0.01$ | 9.52 | 0.036 |
| Multiplicative Bias, $u = 0.1$ | 9.54 | 0.048 |
5 Discussion
5.1 Loss-Free Balancing Is Compatible with Expert Parallelism
Extremely large-scale MoE models often employ expert parallelism (Lepikhin et al., 2020) for training or inference, which distributes experts across different devices to reduce memory requirements.
In such scenarios, load balance on the data in a single computation step is crucial for efficiency.
Due to expert parallelism, each computation step involves micro_batch_size * ep_data_parallel_size samples, which we refer to as a computation batch.
Here, micro_batch_size denotes the number of samples processed in one gradient accumulation step on a single device.
Loss-Free Balancing can achieve nearly optimal global load balance, and the load balance in each computation step will get closer to the global load balance as the computation batch size increases.
In Figure 5, we examine the computation-batch-level load balance with the $\mathrm{MaxVio}$ metric.
The results show that the load balance of our Loss-Free Balancing always keeps improving as the computation batch size increases, but the load balance of the auxiliary-loss-controlled method approximately maintains a constant level when the computation batch is large.
Since expert parallelism will significantly increase the computation batch size by ep_data_parallel_size times, Loss-Free Balancing is naturally compatible with large-scale MoE training, and its advantage on the load balance will be further enhanced as the size of expert parallelism increases.
5.2 Load Balancing and Future Token Leakage
For causal language models, load balancing methods must adhere to the causal constraint of language modeling to avoid future token leakage. While conventional auxiliary-loss-controlled balancing and our Loss-Free Balancing obey this constraint, Expert Choice (EC) (Zhou et al., 2022) violates it. EC ensures perfect load balance by assigning exactly the same number of tokens to each expert. However, this approach inherently leads to a severe issue of future token leakage.
In EC, future tokens can influence the expert assignment of previous tokens. Figure 6 illustrates how information can be easily transmitted within a sequence via such influence. Theoretically, the token assignment of an MoE layer with sparsity $S$ (the average number of experts activated per token $\bar{K}$ divided by the total number of experts $N$) can leak more than $N S \log_2 \frac{1}{S}$ bits per token (proof in Appendix D.1). For a 9-layer MoE model with 16 experts and an average of 2 experts per token, this amounts to more than 50 bits, sufficient for each token to determine its successor’s identity.
We designed experiments to demonstrate the existence of future token leakage in realistic model training. (1) We reduced the chunk size, within which top-K selection is performed, from 8192 tokens (4 sentences) to 512 (1/4 sentence), with the expectation of exposing such leakage. We observed an abnormal loss drop (about 10%), confirming the presence of leakage. (2) We made leakage more difficult by shuffling tokens across chunks in the top-K selection step, and observed that the abnormal loss drop was mitigated. Detailed experimental results on EC’s information leakage are provided in Appendix D.2.
Future token leakage is fatal since it destroys the generalization of a model and prevents reliable evaluation of the model performance.
Therefore, compared with EC, scaling up an MoE model with our Loss-Free Balancing is safer.
6 Conclusion
In this work, we introduced Loss-Free Balancing, a novel MoE load balance control method that does not introduce auxiliary-loss gradients. Loss-Free Balancing addresses the issue of traditional auxiliary-loss load balance control, which introduces additional gradients during training and potentially impairs model performance when enforcing load balance. Experiments conducted on 1B and 3B MoE models, trained on 100B and 200B tokens respectively, demonstrate that Loss-Free Balancing achieves better model performance and load balance compared to traditional auxiliary-loss training.
References

- Damai Dai, Chengqi Deng, Chenggang Zhao, Runxin Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. ArXiv, abs/2401.06066, 2024. URL https://api.semanticscholar.org/CorpusID:266933338.
- DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bing-Li Wang, Jun-Mei Song, Deli Chen, Xin Xie, Kang Guan, Yu mei You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, and Wenfeng Liang. DeepSeek-Coder-V2: Breaking the barrier of closed-source models in code intelligence. ArXiv, abs/2406.11931, 2024. URL https://api.semanticscholar.org/CorpusID:270562723.
- William Fedus, Barret Zoph, and Noam M. Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23:120:1–120:39, 2021. URL https://api.semanticscholar.org/CorpusID:231573431.
- Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. GShard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2020. URL https://api.semanticscholar.org/CorpusID:220265858.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv: Learning, 2016. URL https://api.semanticscholar.org/CorpusID:14337532.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. ArXiv, abs/1508.07909, 2015. URL https://api.semanticscholar.org/CorpusID:1114678.
- Zhihong Shao, Damai Dai, Daya Guo, Bo Liu, and Zihan Wang. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. ArXiv, abs/2405.04434, 2024. URL https://api.semanticscholar.org/CorpusID:269613809.
- Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ArXiv, abs/1701.06538, 2017. URL https://api.semanticscholar.org/CorpusID:12462234.
- Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:13756489.
- Yan-Quan Zhou, Tao Lei, Han-Chu Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon. Mixture-of-experts with expert choice routing. ArXiv, abs/2202.09368, 2022. URL https://api.semanticscholar.org/CorpusID:247011948.
Appendix A Model Architecture
We employ the DeepSeekMoE (Dai et al., 2024) architecture as the backbone, which introduces shared experts to mitigate knowledge redundancy among routed experts:
$$
\mathbf{h}_t = \mathbf{u}_t + \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t} \, \mathrm{FFN}^{(r)}_i(\mathbf{u}_t),
\tag{6}
$$

where $\mathrm{FFN}^{(r)}_i$ denotes the $N_r$ routed experts, and $\mathrm{FFN}^{(s)}_i$ the $N_s$ shared experts. DeepSeekMoE replaces all FFN layers with MoE layers, except the dense FFN layer just after the input embedding layer.
The detailed architecture hyper-parameters are listed in Table 5.
Table 5: Model architecture.
| Hyper-parameters | 1B | 3B |
| --- | --- | --- |
| Vocab size | 32064 | 32064 |
| Hidden size | 1024 | 1280 |
| Attention heads | 8 | 10 |
| MoE layers | 9 | 11 |
| Granularity | 4 | 4 |
| Shared experts | 2 | 2 |
| Routed experts | 64 | 64 |
| Activated routed experts | 6 | 6 |
Appendix B Training Settings
Following the work of Dai et al. (2024), we initialize all learnable parameters with a standard deviation of 0.006, and set the maximum training sequence length to 2048.
For the 1B model, we employ a cosine learning rate scheduler with warmup, setting the learning rate to 1e-3, the minimum learning rate to 1e-4, and the warmup steps to 1000. The training batch size for the 1B model is set to 1152, resulting in a total of 40000 training steps (100B tokens).
For the 3B model, we use a multistep learning rate scheduler with stage steps = [45211, 50862, 56514] and corresponding stage learning rates of [7.8e-4, 2.47e-4, 7.8e-5]. The warmup steps for the 3B model are set to 2000. We use a training batch size of 1728 for the 3B model, resulting in a total of 56514 training steps (200B tokens).
For validation, we leave around 70M tokens from the training corpus as the validation set (30 * 1B_batch_size * max_seq_len = 20 * 3B_batch_size * max_seq_len = 71M tokens).
Appendix C Experiments with Softmax Gate
C.1 Comparison of Sigmoid Gate Baseline and Softmax Gate Baseline
We compare the sigmoid gate baseline and the softmax gate baseline with varying auxiliary loss coefficients $\alpha$ on a 1B-sized model. As shown in Figure 7, the softmax gate exhibits higher perplexity under similar load balance conditions, and its performance is more sensitive to load imbalance compared to the sigmoid gate.
C.2 Loss-Free Load Balancing with Softmax Gate
Adjusting the per-expert bias for the softmax gate is more challenging due to the normalization property of softmax, which makes the score gap between two experts sensitive to the scores of other experts. In such a situation, we choose the $b_i \leftarrow b_i + u \cdot e_i$ variant to maintain load balance, where $u$ is set to 1e-3. For the baseline, we choose $\alpha$ = 0.0003, which yields the lowest perplexity for the softmax gate. The results are presented in Table 6, showing that Loss-Free Balancing achieves a slightly lower perplexity while maintaining significantly better load balance compared to the auxiliary-loss training method. Figure 8 confirms that Loss-Free Balancing maintains a superior load balance throughout most of the training process.
Table 6: For the softmax gate, Loss-Free Balancing achieves slightly lower perplexity while maintaining significantly better load balance than the auxiliary-loss training method.
| Load Balancing | Perplexity | $\mathrm{MaxVio}_{\mathrm{global}}$ |
| --- | --- | --- |
| Loss-Controlled | 9.604 | 0.937 |
| Loss-Free | 9.599 | 0.027 |
Appendix D Future Token Leakage in Expert Choice
D.1 Proof for Theoretical Leakage Amount
Let $S = \bar{K}/N$ denote the MoE sparsity. Here $\bar{K}$ denotes the average number of experts activated per token, and $N$ is the total number of experts.
For an MoE layer in Expert Choice, the maximum information leakage $I$ (in bits per token), i.e., the information that the combinations of routing allocation can carry, is:

$$
I = \frac{1}{T} \log_2 \binom{T}{S T}^{N} = \frac{N}{T} \log_2 \binom{T}{S T} \ge \frac{N}{T} \log_2 \left(\frac{1}{S}\right)^{S T} = N S \log_2 \frac{1}{S},
\tag{7}
$$

where $T$ is the number of tokens within which the top-$K$ selection is performed, and the inequality uses $\binom{n}{k} \ge (n/k)^k$.
For a model with a sparsity $S = 1/8$ (16 experts, an average of 2 activated per token) and 9 MoE layers, the total leakage is $9 \times 16 \times \frac{1}{8} \times \log_2 8 = 54$ bits, i.e., more than 50 bits per token.
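As a quick numerical sanity check of this bound (a sketch under the setting above; the chunk size $T = 8192$ is borrowed from Appendix D.2):

```python
# Check the leakage lower bound N * S * log2(1/S) against the exact
# per-token information (N / T) * log2(C(T, S*T)) for a finite chunk.
import math

N, K_avg, layers, T = 16, 2, 9, 8192
S = K_avg / N                                  # sparsity, 1/8 here
exact = N / T * math.log2(math.comb(T, int(S * T)))
bound = N * S * math.log2(1 / S)
print(f"per layer: exact {exact:.2f} bits, bound {bound:.2f} bits")
print(f"{layers} layers: more than {layers * bound:.0f} bits per token")
# per layer: exact ~8.7 bits, bound 6.0 bits; 9 layers: more than 54 bits
```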
D.2 Experimental Evidence
We investigate the potential future token leakage of Expert Choice by varying the chunk size used for the experts' top-$K$ selection, ranging from 512 tokens to 8192 tokens. (A chunk size of 2048 tokens means performing the top-$K$ selection inside one sentence, while 512 tokens correspond to a quarter of a sentence and 8192 tokens to four sentences.) We train a 2B MoE model on 100B tokens. The results, shown in Table 9, reveal two key findings:
1. Using a small chunk size of 512 leads to an abnormal loss drop, which can be attributed to significant future token leakage. A smaller chunk size allows the model to more easily exploit information from future tokens within the chunk during training.
2. Shuffling tokens within a batch before chunking and selecting mitigates the observed loss drop. Such shuffling makes it more challenging for the model to utilize information leakage, as the future tokens are no longer in their original context. This finding supports the hypothesis that the loss drop originates from the model's accessing and exploiting future token information.