
Outrageously Large Neural Networks:
The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer (Google Brain), Azalia Mirhoseini*† (Google Brain), Krzysztof Maziarz* (Jagiellonian University, Cracow), Andy Davis (Google Brain), Quoc Le (Google Brain), Geoffrey Hinton (Google Brain), Jeff Dean (Google Brain)

*Equally major contributors. †Work done as a member of the Google Brain Residency program (g.co/brainresidency).

Google Brain: {noam,azalia,andydavis,qvl,geoffhinton,jeff}@google.com
Jagiellonian University: krzysztof.maziarz@student.uj.edu.pl
Abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.

1 Introduction and Related Work

1.1 Conditional Computation

Exploiting scale in both training data and model size has been central to the success of deep learning. When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy. This has been shown in domains such as text (Sutskever et al., 2014; Bahdanau et al., 2014; Jozefowicz et al., 2016; Wu et al., 2016), images (Krizhevsky et al., 2012; Le et al., 2012), and audio (Hinton et al., 2012; Amodei et al., 2015). For typical deep learning models, where the entire model is activated for every example, this leads to a roughly quadratic blow-up in training costs, as both the model size and the number of training examples increase. Unfortunately, the advances in computing power and distributed computation fall short of meeting such demand.

Various forms of conditional computation have been proposed as a way to increase model capacity without a proportional increase in computational costs (Davis & Arel, 2013; Bengio et al., 2013; Eigen et al., 2013; Ludovic Denoyer, 2014; Cho & Bengio, 2014; Bengio et al., 2015; Almahairi et al., 2015). In these schemes, large parts of a network are active or inactive on a per-example basis. The gating decisions may be binary or sparse and continuous, stochastic or deterministic. Various forms of reinforcement learning and back-propagation are proposed for training the gating decisions.

While these ideas are promising in theory, no work to date has yet demonstrated massive improvements in model capacity, training time, or model quality. We blame this on a combination of the following challenges:

  • Modern computing devices, especially GPUs, are much faster at arithmetic than at branching. Most of the works above recognize this and propose turning on/off large chunks of the network with each gating decision.


  • Large batch sizes are critical for performance, as they amortize the costs of parameter transfers and updates. Conditional computation reduces the batch sizes for the conditionally active chunks of the network.


  • Network bandwidth can be a bottleneck. A cluster of GPUs may have computational power thousands of times greater than the aggregate inter-device network bandwidth. To be computationally efficient, the relative computational versus network demands of an algorithm must exceed this ratio. Embedding layers, which can be seen as a form of conditional computation, are handicapped by this very problem. Since the embeddings generally need to be sent across the network, the number of (example, parameter) interactions is limited by network bandwidth instead of computational capacity.


  • Depending on the scheme, loss terms may be necessary to achieve the desired level of sparsity per-chunk and/or per example. Bengio et al. (2015) use three such terms. These issues can affect both model quality and load-balancing.


  • Model capacity is most critical for very large data sets. The existing literature on conditional computation deals with relatively small image recognition data sets consisting of up to 600,000 images. It is hard to imagine that the labels of these images provide a sufficient signal to adequately train a model with millions, let alone billions of parameters.



In this work, we for the first time address all of the above challenges and finally realize the promise of conditional computation. We obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets.

1.2 Our Approach: The Sparsely-Gated Mixture-of-Experts Layer

Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input (see Figure 1). All parts of the network are trained jointly by back-propagation.

Figure 1: A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.

While the introduced technique is generic, in this paper we focus on language modeling and machine translation tasks, which are known to benefit from very large models. In particular, we apply a MoE convolutionally between stacked LSTM layers (Hochreiter & Schmidhuber, 1997), as in Figure 1. The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position. The different experts tend to become highly specialized based on syntax and semantics (see Appendix E Table 9). On both language modeling and machine translation benchmarks, we improve on best published results at a fraction of the computational cost.

1.3 Related work on Mixtures of Experts

Since its introduction more than two decades ago (Jacobs et al., 1991; Jordan & Jacobs, 1994), the mixture-of-experts approach has been the subject of much research. Different types of expert architectures have been proposed such as SVMs (Collobert et al., 2002), Gaussian Processes (Tresp, 2001; Theis & Bethge, 2015; Deisenroth & Ng, 2015), Dirichlet Processes (Shahbaba & Neal, 2009), and deep networks. Other work has focused on different expert configurations such as a hierarchical structure (Yao et al., 2009), infinite numbers of experts (Rasmussen & Ghahramani, 2002), and adding experts sequentially (Aljundi et al., 2016). Garmash & Monz (2016) suggest an ensemble model in the format of mixture of experts for machine translation. The gating network is trained on a pre-trained ensemble NMT model.

The works above concern top-level mixtures of experts. The mixture of experts is the whole model. Eigen et al. (2013) introduce the idea of using multiple MoEs with their own gating networks as parts of a deep model. It is intuitive that the latter approach is more powerful, since complex problems may contain many sub-problems each requiring different experts. They also allude in their conclusion to the potential to introduce sparsity, turning MoEs into a vehicle for conditional computation.

Our work builds on this use of MoEs as a general purpose neural network component. While Eigen et al. (2013) use two stacked MoEs allowing for two sets of gating decisions, our convolutional application of the MoE allows for different gating decisions at each position in the text. We also realize sparse gating and demonstrate its use as a practical way to massively increase model capacity.

2 The Structure of the Mixture-of-Experts layer

The Mixture-of-Experts (MoE) layer consists of a set of $n$ “expert networks” $E_1, \cdots, E_n$, and a “gating network” $G$ whose output is a sparse $n$-dimensional vector. Figure 1 shows an overview of the MoE module. The experts are themselves neural networks, each with their own parameters. Although in principle we only require that the experts accept the same sized inputs and produce the same-sized outputs, in our initial investigations in this paper, we restrict ourselves to the case where the models are feed-forward networks with identical architectures, but with separate parameters.

Let us denote by $G(x)$ and $E_i(x)$ the output of the gating network and the output of the $i$-th expert network for a given input $x$. The output $y$ of the MoE module can be written as follows:

$$y = \sum_{i=1}^{n} G(x)_i E_i(x) \qquad (1)$$

We save computation based on the sparsity of the output of $G(x)$. Wherever $G(x)_i = 0$, we need not compute $E_i(x)$. In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them for every example. If the number of experts is very large, we can reduce the branching factor by using a two-level hierarchical MoE. In a hierarchical MoE, a primary gating network chooses a sparse weighted combination of “experts”, each of which is itself a secondary mixture-of-experts with its own gating network. In the following we focus on ordinary MoEs. We provide more details on hierarchical MoEs in Appendix B.
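To make the sparse computation concrete, the following is a minimal sketch (plain Python/NumPy, not the paper's TensorFlow implementation) of Equation 1 with a sparse gate vector; the expert callables and shapes are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, gates, experts):
    """Compute y = sum_i G(x)_i * E_i(x), evaluating E_i only where G(x)_i != 0.

    x:       input vector of shape [input_size]
    gates:   sparse gate vector G(x) of shape [n]; most entries are exactly 0
    experts: list of n callables, each mapping [input_size] -> [output_size]
    """
    y = None
    for i, g in enumerate(gates):
        if g == 0.0:
            continue  # sparsity: this expert is never evaluated
        contribution = g * experts[i](x)
        y = contribution if y is None else y + contribution
    return y
```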

Our implementation is related to other models of conditional computation. A MoE whose experts are simple weight matrices is similar to the parameterized weight matrix proposed in (Cho & Bengio, 2014). A MoE whose experts have one hidden layer is similar to the block-wise dropout described in (Bengio et al., 2015), where the dropped-out layer is sandwiched between fully-activated layers.

2.1 Gating Network

Softmax Gating:

A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix $W_g$ and then apply the $Softmax$ function.

$$G_\sigma(x) = Softmax(x \cdot W_g) \qquad (2)$$

Noisy Top-K Gating:

We add two components to the Softmax gating network: sparsity and noise. Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to $-\infty$ (which causes the corresponding gate values to equal $0$). The sparsity serves to save computation, as described above. While this form of sparsity creates some theoretically scary discontinuities in the output of the gating function, we have not yet observed this to be a problem in practice. The noise term helps with load balancing, as will be discussed in Appendix A. The amount of noise per component is controlled by a second trainable weight matrix $W_{noise}$.

$$G(x) = Softmax(KeepTopK(H(x), k)) \qquad (3)$$

$$H(x)_i = (x \cdot W_g)_i + StandardNormal() \cdot Softplus((x \cdot W_{noise})_i) \qquad (4)$$

$$KeepTopK(v, k)_i = \begin{cases} v_i & \text{if $v_i$ is in the top $k$ elements of $v$.} \\ -\infty & \text{otherwise.} \end{cases} \qquad (5)$$
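As a concrete illustration, here is a minimal NumPy sketch of noisy top-k gating (Equations 3-5). It is not the authors' TensorFlow code; the matrix shapes and the `train` flag are assumptions made for readability.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # exp(-inf) = 0, so masked logits get gate value exactly 0
    return e / e.sum()

def noisy_top_k_gating(x, W_g, W_noise, k, train=True):
    """x: [input_size]; W_g, W_noise: [input_size, n]. Returns a sparse gate vector of length n."""
    clean_logits = x @ W_g                                  # (x . W_g)_i
    noise_stddev = np.log1p(np.exp(x @ W_noise))            # Softplus((x . W_noise)_i)
    noise = np.random.randn(clean_logits.shape[0]) if train else 0.0
    h = clean_logits + noise * noise_stddev                 # Equation (4)
    masked = np.full_like(h, -np.inf)
    top_k = np.argsort(h)[-k:]                              # indices of the k largest logits
    masked[top_k] = h[top_k]                                # KeepTopK, Equation (5)
    return softmax(masked)                                  # Equation (3)
```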
Training the Gating Network

We train the gating network by simple back-propagation, along with the rest of the model. If we choose $k > 1$, the gate values for the top k experts have nonzero derivatives with respect to the weights of the gating network. This type of occasionally-sensitive behavior is described in (Bengio et al., 2013) with respect to noisy rectifiers. Gradients also back-propagate through the gating network to its inputs. Our method differs here from (Bengio et al., 2015) who use boolean gates and a REINFORCE-style approach to train the gating network.

3 Addressing Performance Challenges

3.1 The Shrinking Batch Problem

On modern CPUs and GPUs, large batch sizes are necessary for computational efficiency, so as to amortize the overhead of parameter loads and updates. If the gating network chooses $k$ out of $n$ experts for each example, then for a batch of $b$ examples, each expert receives a much smaller batch of approximately $\frac{kb}{n} \ll b$ examples. This causes a naive MoE implementation to become very inefficient as the number of experts increases. The solution to this shrinking batch problem is to make the original batch size as large as possible. However, batch size tends to be limited by the memory necessary to store activations between the forwards and backwards passes. We propose the following techniques for increasing the batch size:

Mixing Data Parallelism and Model Parallelism:

In a conventional distributed training setting, multiple copies of the model on different devices asynchronously process distinct batches of data, and parameters are synchronized through a set of parameter servers. In our technique, these different batches run synchronously so that they can be combined for the MoE layer. We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over $d$ devices, and each device processes a batch of size $b$, each expert receives a batch of approximately $\frac{kbd}{n}$ examples. Thus, we achieve a factor of $d$ improvement in expert batch size.
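A toy calculation (the numbers are illustrative, not the paper's configuration) shows the effect on the per-expert batch size:

```python
# Per-expert batch size: naive data parallelism vs. the combined scheme above.
k, n, b, d = 4, 256, 1024, 32     # experts per example, experts, per-device batch, devices
naive = k * b / n                 # kb/n examples per expert when each device holds its own expert copies
combined = k * b * d / n          # kbd/n examples per expert with one shared copy of each expert
print(naive, combined)            # 16.0 512.0
```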

In the case of a hierarchical MoE (Section B), the primary gating network employs data parallelism, and the secondary MoEs employ model parallelism. Each secondary MoE resides on one device.

This technique allows us to increase the number of experts (and hence the number of parameters) by proportionally increasing the number of devices in the training cluster. The total batch size increases, keeping the batch size per expert constant. The memory and bandwidth requirements per device also remain constant, as do the step times, as does the amount of time necessary to process a number of training examples equal to the number of parameters in the model. It is our goal to train a trillion-parameter model on a trillion-word corpus. We have not scaled our systems this far as of the writing of this paper, but it should be possible by adding more hardware.

Taking Advantage of Convolutionality:

In our language models, we apply the same MoE to each time step of the previous layer. If we wait for the previous layer to finish, we can apply the MoE to all the time steps together as one big batch. Doing so increases the size of the input batch to the MoE layer by a factor of the number of unrolled time steps.
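For illustration, a minimal sketch of this reshaping (the shapes are assumptions, not the paper's configuration):

```python
import numpy as np

batch, timesteps, dim = 32, 20, 512
lstm_outputs = np.random.randn(batch, timesteps, dim)      # previous layer's outputs for all timesteps
moe_inputs = lstm_outputs.reshape(batch * timesteps, dim)  # one big MoE batch: grows by a factor of `timesteps`
```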

Increasing Batch Size for a Recurrent MoE:

We suspect that even more powerful models may involve applying a MoE recurrently. For example, the weight matrices of a LSTM or other RNN could be replaced by a MoE. Sadly, such models break the convolutional trick from the last paragraph, since the input to the MoE at one timestep depends on the output of the MoE at the previous timestep. Gruslys et al. (2016) describe a technique for drastically reducing the number of stored activations in an unrolled RNN, at the cost of recomputing forward activations. This would allow for a large increase in batch size.

3.2 Network Bandwidth

Another major performance concern in distributed computing is network bandwidth. Since the experts are stationary (see above) and the number of gating parameters is small, most of the communication involves sending the inputs and outputs of the experts across the network. To maintain computational efficiency, the ratio of an expert’s computation to the size of its input and output must exceed the ratio of computational to network capacity of the computing device. For GPUs, this may be thousands to one. In our experiments, we use experts with one hidden layer containing thousands of RELU-activated units. Since the weight matrices in the expert have sizes $input\_size \times hidden\_size$ and $hidden\_size \times output\_size$, the ratio of computation to input and output is equal to the size of the hidden layer. Conveniently, we can increase computational efficiency simply by using a larger hidden layer, or more hidden layers.
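A back-of-the-envelope check of this ratio, with illustrative sizes (not the paper's exact configuration):

```python
# For a one-hidden-layer expert, multiply-adds per example scale with
# hidden_size * (input_size + output_size), while the values sent over the
# network scale with input_size + output_size, so the ratio is hidden_size.
input_size, hidden_size, output_size = 1024, 4096, 1024
madds_per_example = input_size * hidden_size + hidden_size * output_size
io_values_per_example = input_size + output_size
print(madds_per_example // io_values_per_example)  # 4096 == hidden_size
```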

4 Balancing Expert Utilization

We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network. Eigen et al. (2013) describe the same phenomenon, and use a hard constraint at the beginning of training to avoid this local minimum. Bengio et al. (2015) include a soft constraint on the batch-wise average of each gate.¹

¹Bengio et al. (2015) also include two additional losses. One controls per-example sparsity, which we do not need since it is enforced by the fixed value of $k$. A third loss encourages diversity of gate values. In our experiments, we find that the gate values naturally diversify as the experts specialize (in a virtuous cycle), and we do not need to enforce diversity of gate values.

We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert. We define an additional loss $L_{importance}$, which is added to the overall loss function for the model. This loss is equal to the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor $w_{importance}$. This additional loss encourages all experts to have equal importance.

$$Importance(X) = \sum_{x \in X} G(x) \qquad (6)$$

$$L_{importance}(X) = w_{importance} \cdot CV(Importance(X))^2 \qquad (7)$$
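A minimal NumPy sketch of this loss (Equations 6 and 7); the `eps` term and the gate-matrix layout are assumptions added for numerical stability and readability:

```python
import numpy as np

def cv_squared(v, eps=1e-10):
    """Squared coefficient of variation: Var(v) / Mean(v)^2."""
    return np.var(v) / (np.mean(v) ** 2 + eps)

def importance_loss(gates, w_importance):
    """gates: [batch, n] matrix whose rows are the gate vectors G(x) for a batch X."""
    importance = gates.sum(axis=0)                 # Importance(X), Equation (6)
    return w_importance * cv_squared(importance)   # L_importance(X), Equation (7)
```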

While this loss function can ensure equal importance, experts may still receive very different numbers of examples. For example, one expert may receive a few examples with large weights, and another may receive many examples with small weights. This can cause memory and performance problems on distributed hardware. To solve this problem, we introduce a second loss function, $L_{load}$, which ensures balanced loads. Appendix A contains the definition of this function, along with experimental results.

5 Experiments

5.1 1 Billion Word Language Modeling Benchmark

Dataset:

This dataset, introduced by (Chelba et al., 2013), consists of shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words.

Previous State-of-the-Art:

The best previously published results (Jozefowicz et al., 2016) use models consisting of one or more stacked Long Short-Term Memory (LSTM) layers (Hochreiter & Schmidhuber, 1997; Gers et al., 2000). The number of parameters in the LSTM layers of these models vary from 2 million to 151 million. Quality increases greatly with parameter count, as do computational costs. Results for these models form the top line of Figure 2-right.

MoE Models:

Our models consist of two stacked LSTM layers with a MoE layer between them (see Figure 1). We vary the sizes of the layers and the number of experts. For full details on model architecture, training regimen, additional baselines and results, see Appendix C.

Low Computation, Varied Capacity:

To investigate the effects of adding capacity, we trained a series of MoE models all with roughly equal computational costs: about 8 million multiply-and-adds per training example per timestep in the forwards pass, excluding the softmax layer. We call this metric (ops/timestep). We trained models with flat MoEs containing 4, 32, and 256 experts, and models with hierarchical MoEs containing 256, 1024, and 4096 experts. Each expert had about 1 million parameters. For all the MoE layers, 4 experts were active per input.

The results of these models are shown in Figure 2-left. The model with 4 always-active experts performed (unsurprisingly) similarly to the computationally-matched baseline models, while the largest of the models (4096 experts) achieved an impressive 24% lower perplexity on the test set.

Figure 2: Model comparison on 1-Billion-Word Language-Modeling Benchmark. On the left, we plot test perplexity as a function of model capacity for models with similar computational budgets of approximately 8-million-ops-per-timestep. On the right, we plot test perplexity as a function of computational budget. The top line represents the LSTM models from (Jozefowicz et al., 2016). The bottom line represents 4-billion parameter MoE models with different computational budgets.
Table 1: Summary of high-capacity MoE-augmented models with varying computational budgets, vs. best previously published results (Jozefowicz et al., 2016). Details in Appendix C.

| Model | Test Perplexity (10 epochs) | Test Perplexity (100 epochs) | #Parameters (excluding embedding and softmax layers) | ops/timestep | Training Time (10 epochs) | TFLOPS/GPU |
|---|---|---|---|---|---|---|
| Best Published Results | 34.7 | 30.6 | 151 million | 151 million | 59 hours, 32 k40s | 1.09 |
| Low-Budget MoE Model | 34.1 | | 4303 million | 8.9 million | 15 hours, 16 k40s | 0.74 |
| Medium-Budget MoE Model | 31.3 | | 4313 million | 33.8 million | 17 hours, 32 k40s | 1.22 |
| High-Budget MoE Model | 28.0 | | 4371 million | 142.7 million | 47 hours, 32 k40s | 1.56 |
Varied Computation, High Capacity:

In addition to the largest model from the previous section, we trained two more MoE models with similarly high capacity (4 billion parameters), but higher computation budgets. These models had larger LSTMs, and fewer but larger experts. Details can be found in Appendix C.2. Results of these three models form the bottom line of Figure 2-right. Table 1 compares the results of these models to the best previously-published result on this dataset. Even the fastest of these models beats the best published result (when controlling for the number of training epochs), despite requiring only 6% of the computation.

Computational Efficiency:

We trained our models using TensorFlow (Abadi et al., 2016) on clusters containing 16-32 Tesla K40 GPUs. For each of our models, we determine computational efficiency in TFLOPS/GPU by dividing the number of floating point operations required to process one training batch by the observed step time and the number of GPUs in the cluster. The operation counts used here are higher than the ones we report in our ops/timestep numbers in that we include the backwards pass, we include the importance-sampling-based training of the softmax layer, and we count a multiply-and-add as two separate operations. For all of our MoE models, the floating point operations involved in the experts represent between 37% and 46% of the total.

For our baseline models with no MoE, observed computational efficiency ranged from 1.07-1.29 TFLOPS/GPU. For our low-computation MoE models, computation efficiency ranged from 0.74-0.90 TFLOPS/GPU, except for the 4-expert model which did not make full use of the available parallelism. Our highest-computation MoE model was more efficient at 1.56 TFLOPS/GPU, likely due to the larger matrices. These numbers represent a significant fraction of the theoretical maximum of 4.29 TFLOPS/GPU claimed by NVIDIA. Detailed results are in Appendix C, Table 7.

5.2 100 Billion Word Google News Corpus

Figure 3: Language modeling on a 100 billion word corpus. Models have similar computational budgets (8 million ops/timestep).

On the 1-billion-word corpus, adding additional capacity seems to produce diminishing returns as the number of parameters in the MoE layer exceeds 1 billion, as can be seen in Figure 2-left. We hypothesized that for a larger training set, even higher capacities would produce significant quality improvements.

We constructed a similar training set consisting of shuffled unique sentences from Google’s internal news corpus, totalling roughly 100 billion words. Similarly to the previous section, we tested a series of models with similar computational costs of about 8 million ops/timestep. In addition to a baseline LSTM model, we trained models augmented with MoE layers containing 32, 256, 1024, 4096, 16384, 65536, and 131072 experts. This corresponds to up to 137 billion parameters in the MoE layer. Details on architecture, training, and results are given in Appendix D.

Results:

Figure 3 shows test perplexity as a function of capacity after training on 10 billion words (top line) and 100 billion words (bottom line). When training over the full 100 billion words, test perplexity improves significantly up to 65536 experts (68 billion parameters), dropping 39% lower than the computationally matched baseline, but degrades at 131072 experts, possibly a result of too much sparsity. The widening gap between the two lines demonstrates (unsurprisingly) that increased model capacity helps more on larger training sets.

Even at 65536 experts (99.994% layer sparsity), computational efficiency for the model stays at a respectable 0.72 TFLOPS/GPU.

5.3 Machine Translation (Single Language Pair)

Model Architecture:

Our model was a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decreased the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We inserted MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). Each MoE layer contained up to 2048 experts each with about two million parameters, adding a total of about 8 billion parameters to the models. Further details on model architecture, testing procedure and results can be found in Appendix E.

Datasets:

We benchmarked our method on the WMT’14 En→Fr and En→De corpora, whose training sets have 36M sentence pairs and 5M sentence pairs, respectively. The experimental protocols were also similar to those in (Wu et al., 2016): newstest2014 was used as the test set to compare against previous work (Luong et al., 2015a; Zhou et al., 2016; Wu et al., 2016), while the combination of newstest2012 and newstest2013 was used as the development set. We also tested the same model on Google’s Production English to French data.

Table 2: Results on WMT’14 En→Fr newstest2014 (bold values represent best results).

| Model | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time |
|---|---|---|---|---|---|
| MoE with 2048 Experts | 2.69 | 40.35 | 85M | 8.7B | 3 days/64 k40s |
| MoE with 2048 Experts (longer training) | 2.63 | 40.56 | 85M | 8.7B | 6 days/64 k40s |
| GNMT (Wu et al., 2016) | 2.79 | 39.22 | 214M | 278M | 6 days/96 k80s |
| GNMT+RL (Wu et al., 2016) | 2.96 | 39.92 | 214M | 278M | 6 days/96 k80s |
| PBMT (Durrani et al., 2014) | | 37.0 | | | |
| LSTM (6-layer) (Luong et al., 2015b) | | 31.5 | | | |
| LSTM (6-layer+PosUnk) (Luong et al., 2015b) | | 33.1 | | | |
| DeepAtt (Zhou et al., 2016) | | 37.7 | | | |
| DeepAtt+PosUnk (Zhou et al., 2016) | | 39.2 | | | |
Table 3: Results on WMT’14 En→De newstest2014 (bold values represent best results).

| Model | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time |
|---|---|---|---|---|---|
| MoE with 2048 Experts | 4.64 | 26.03 | 85M | 8.7B | 1 day/64 k40s |
| GNMT (Wu et al., 2016) | 5.25 | 24.91 | 214M | 278M | 1 day/96 k80s |
| GNMT+RL (Wu et al., 2016) | 8.08 | 24.66 | 214M | 278M | 1 day/96 k80s |
| PBMT (Durrani et al., 2014) | | 20.7 | | | |
| DeepAtt (Zhou et al., 2016) | | 20.6 | | | |
Table 4: Results on the Google Production En→Fr dataset (bold values represent best results).

| Model | Eval Perplexity | Eval BLEU | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time |
|---|---|---|---|---|---|---|---|
| MoE with 2048 Experts | 2.60 | 37.27 | 2.69 | 36.57 | 85M | 8.7B | 1 day/64 k40s |
| GNMT (Wu et al., 2016) | 2.78 | 35.80 | 2.87 | 35.56 | 214M | 278M | 6 days/96 k80s |
Results:

Tables 2, 3, and 4 show the results of our largest models, compared with published results. Our approach achieved BLEU scores of 40.56 and 26.03 on the WMT’14 En→Fr and En→De benchmarks. As our models did not use RL refinement, these results constitute significant gains of 1.34 and 1.12 BLEU score on top of the strong baselines in (Wu et al., 2016). The perplexity scores are also better.²

²Reported perplexities are relative to the tokenization used by both our models and GNMT.

On the Google Production dataset, our model achieved a 1.01 higher test BLEU score even after training for only one sixth of the time.

5.4 Multilingual Machine Translation

Dataset:

(Johnson et al., 2016) train a single GNMT (Wu et al., 2016) model on a very large combined dataset of twelve language pairs. Results are somewhat worse than those for 12 separately trained single-pair GNMT models. This is not surprising, given that the twelve models have 12 times the capacity and twelve times the aggregate training of the one model. We repeat this experiment with a single MoE-augmented model. See Appendix E for details on model architecture. We train our model on the same dataset as (Johnson et al., 2016) and process the same number of training examples (about 3 billion sentence pairs). Our training time was shorter due to the lower computational budget of our model.

Results:

Results for the single-pair GNMT models, the multilingual GNMT model and the multilingual MoE model are given in Table 5. The MoE model achieves 19% lower perplexity on the dev set than the multilingual GNMT model. On BLEU score, the MoE model significantly beats the multilingual GNMT model on 11 of the 12 language pairs (by as much as 5.84 points), and even beats the monolingual GNMT models on 8 of 12 language pairs. The poor performance on English→Korean seems to be a result of severe overtraining, as for the rarer language pairs a small number of real examples were highly oversampled in the training corpus.

Table 5: Multilingual Machine Translation (bold values represent best results).

| | GNMT-Mono | GNMT-Multi | MoE-Multi | MoE-Multi vs. GNMT-Multi |
|---|---|---|---|---|
| Parameters | 278M / model | 278M | 8.7B | |
| ops/timestep | 212M | 212M | 102M | |
| training time, hardware | various | 21 days, 96 k20s | 12 days, 64 k40s | |
| Perplexity (dev) | | 4.14 | 3.35 | -19% |
| French→English Test BLEU | 36.47 | 34.40 | 37.46 | +3.06 |
| German→English Test BLEU | 31.77 | 31.17 | 34.80 | +3.63 |
| Japanese→English Test BLEU | 23.41 | 21.62 | 25.91 | +4.29 |
| Korean→English Test BLEU | 25.42 | 22.87 | 28.71 | +5.84 |
| Portuguese→English Test BLEU | 44.40 | 42.53 | 46.13 | +3.60 |
| Spanish→English Test BLEU | 38.00 | 36.04 | 39.39 | +3.35 |
| English→French Test BLEU | 35.37 | 34.00 | 36.59 | +2.59 |
| English→German Test BLEU | 26.43 | 23.15 | 24.53 | +1.38 |
| English→Japanese Test BLEU | 23.66 | 21.10 | 22.78 | +1.68 |
| English→Korean Test BLEU | 19.75 | 18.41 | 16.62 | -1.79 |
| English→Portuguese Test BLEU | 38.40 | 37.35 | 37.90 | +0.55 |
| English→Spanish Test BLEU | 34.50 | 34.25 | 36.21 | +1.96 |

6 Conclusion

This work is the first to demonstrate major wins from conditional computation in deep networks. We carefully identified the design considerations and challenges of conditional computing and addressed them with a combination of algorithmic and engineering solutions. While we focused on text, conditional computation may help in other domains as well, provided sufficiently large training sets. We look forward to seeing many novel implementations and applications of conditional computation in the years to come.

Acknowledgments

We would like to thank all of the members of the Google Brain and Google Translate teams who helped us with this project, in particular Zhifeng Chen, Yonghui Wu, and Melvin Johnson. Thanks also to our anonymous ICLR reviewers for the helpful suggestions on making this paper better.

References

  • Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016. URL http://arxiv.org/abs/1603.04467.
  • Aljundi et al. (2016) Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. CoRR, abs/1611.06194, 2016. URL http://arxiv.org/abs/1611.06194.
  • Almahairi et al. (2015) A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. Courville. Dynamic Capacity Networks. ArXiv e-prints, November 2015.
  • Amodei et al. (2015) Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. arXiv preprint arXiv:1512.02595, 2015.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Bengio et al. (2015) Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
  • Cho & Bengio (2014) K. Cho and Y. Bengio. Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning. ArXiv e-prints, June 2014.
  • Collobert et al. (2002) Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computing, 2002.
  • Davis & Arel (2013) Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013.
  • Deisenroth & Ng (2015) Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In ICML, 2015.
  • Duchi et al. (2010) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization, 2010.
  • Durrani et al. (2014) Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. Edinburgh’s phrase-based machine translation systems for wmt-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
  • Eigen et al. (2013) David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
  • Garmash & Monz (2016) Ekaterina Garmash and Christof Monz. Ensemble learning for multi-source neural machine translation. In staff.science.uva.nl/c.monz, 2016.
  • Gers et al. (2000) Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with lstm. Neural Computation, 2000.
  • Gruslys et al. (2016) Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. CoRR, abs/1606.03401, 2016. URL http://arxiv.org/abs/1606.03401.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Jacobs et al. (1991) Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computing, 1991.
  • Johnson et al. (2016) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558, 2016. URL http://arxiv.org/abs/1611.04558.
  • Jordan & Jacobs (1994) Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computing, 1994.
  • Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kneser & Ney (1995) Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling, 1995.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • Le et al. (2012) Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
  • Ludovic Denoyer (2014) Patrick Gallinari Ludovic Denoyer. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014.
  • Luong et al. (2015a) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. EMNLP, 2015a.
  • Luong et al. (2015b) Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. ACL, 2015b.
  • Rasmussen & Ghahramani (2002) Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. NIPS, 2002.
  • Sak et al. (2014) Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, pp.  338–342, 2014.
  • Schuster & Nakajima (2012) Mike Schuster and Kaisuke Nakajima. Japanese and Korean voice search. ICASSP, 2012.
  • Shahbaba & Neal (2009) Babak Shahbaba and Radford Neal. Nonlinear models using dirichlet process mixtures. JMLR, 2009.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • Theis & Bethge (2015) Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In NIPS, 2015.
  • Tresp (2001) Volker Tresp. Mixtures of Gaussian Processes. In NIPS, 2001.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Yao et al. (2009) Bangpeng Yao, Dirk Walther, Diane Beck, and Li Fei-fei. Hierarchical mixture of classification experts uncovers interactions between brain regions. In NIPS. 2009.
  • Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
  • Zhou et al. (2016) Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199, 2016.

Appendices

A Load-Balancing Loss

As discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples. Unfortunately, the number of examples received by an expert is a discrete quantity, so it can not be used in back-propagation. Instead, we define a smooth estimator $Load(X)$ of the number of examples assigned to each expert for a batch $X$ of inputs. The smoothness allows us to back-propagate gradients through the estimator. This is the purpose of the noise term in the gating function. We define $P(x, i)$ as the probability that $G(x)_i$ is nonzero, given a new random choice of noise on element $i$, but keeping the already-sampled choices of noise on the other elements. To compute $P(x, i)$, we note that $G(x)_i$ is nonzero if and only if $H(x)_i$ is greater than the $k^{th}$-greatest element of $H(x)$ excluding itself. The probability works out to be:

P(x,i)=Pr((xWg)i+StandardNormal()Softplus((xWnoise)i)\displaystyle P(x,i)=Pr\Big{(}(x\cdot W_{g})_{i}+StandardNormal()\cdot Softplus((x\cdot W_{noise})_{i}) (8)
>kth_excluding(H(x),k,i))\displaystyle>kth\_excluding(H(x),k,i)\Big{)}

Where $kth\_excluding(v, k, i)$ means the kth highest component of $v$, excluding component $i$. Simplifying, we get:

$$P(x, i) = \Phi\Big( \frac{(x \cdot W_g)_i - kth\_excluding(H(x), k, i)}{Softplus((x \cdot W_{noise})_i)} \Big) \qquad (9)$$

Where $\Phi$ is the CDF of the standard normal distribution.

$Load(X)_i = \sum_{x \in X} P(x, i)$   (10)

We can now define the load loss to be the square of the coefficient of variation of the load vector, multiplied by a hand-tuned scaling factor $w_{load}$.

$L_{load}(X) = w_{load} \cdot CV(Load(X))^2$   (11)
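For concreteness, the following NumPy sketch (ours, not the paper's code; the function and argument names are illustrative) computes $P(x,i)$, $Load(X)$ and $L_{load}$ for a batch. It assumes the noisy gating of Section 2.1, i.e. $H(x)_i = (x \cdot W_g)_i + StandardNormal() \cdot Softplus((x \cdot W_{noise})_i)$, with the noise samples for the batch already drawn:

import numpy as np
from scipy.stats import norm

def softplus(z):
    return np.log1p(np.exp(z))

def load_loss(x, w_g, w_noise, noise, k, w_load):
    """Smooth load estimator and load loss (Eqs. 8-11); illustrative sketch.
    x: [batch, d] inputs; w_g, w_noise: [d, n] gating parameters;
    noise: [batch, n] standard-normal samples drawn for this batch."""
    clean = x @ w_g                                  # (x . W_g)
    noise_stddev = softplus(x @ w_noise)             # Softplus(x . W_noise)
    noisy = clean + noise * noise_stddev             # H(x)
    batch, n = noisy.shape

    p = np.empty((batch, n))
    for j in range(batch):
        for i in range(n):
            # kth_excluding(H(x), k, i): k-th greatest of H(x) with element i removed.
            others = np.delete(noisy[j], i)
            threshold = np.sort(others)[-k]
            # Eq. 9: P(x, i) = Phi((clean_i - threshold) / stddev_i)
            p[j, i] = norm.cdf((clean[j, i] - threshold) / noise_stddev[j, i])

    load = p.sum(axis=0)                             # Load(X)_i, Eq. 10
    cv = load.std() / load.mean()                    # coefficient of variation
    return w_load * cv ** 2                          # L_load(X), Eq. 11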
Initial Load Imbalance:

To avoid out-of-memory errors, we need to initialize the network in a state of approximately equal expert load (since the soft constraints need some time to work). To accomplish this, we initialize the matrices $W_g$ and $W_{noise}$ to all zeros, which yields no signal and some noise.

Experiments:

We trained a set of models with identical architecture (the MoE-256 model described in Appendix C), using different values of $w_{importance}$ and $w_{load}$. We trained each model for 10 epochs, then measured perplexity on the test set. We also measured the coefficients of variation in $Importance$ and $Load$, as well as the ratio of the load on the most overloaded expert to the average load. This last value is significant for load balancing purposes on distributed hardware. All of these metrics were averaged over several training batches.

Table 6: Experiments with different combinations of losses.
$w_{importance}$ | $w_{load}$ | Test Perplexity | $CV(Importance(X))$ | $CV(Load(X))$ | $max(Load(X)) / mean(Load(X))$
0.0 | 0.0 | 39.8 | 3.04 | 3.01 | 17.80
0.2 | 0.0 | 35.6 | 0.06 | 0.17 | 1.47
0.0 | 0.2 | 35.7 | 0.22 | 0.04 | 1.15
0.1 | 0.1 | 35.6 | 0.06 | 0.05 | 1.14
0.01 | 0.01 | 35.7 | 0.48 | 0.11 | 1.37
1.0 | 1.0 | 35.7 | 0.03 | 0.02 | 1.07
Results:

Results are reported in Table 6. All the combinations containing at least one of the two losses led to very similar model quality, whereas having no loss was much worse. Models with higher values of $w_{load}$ had lower loads on the most overloaded expert.

B Hierarchical Mixture of Experts

If the number of experts is very large, we can reduce the branching factor by using a two-level hierarchical MoE. In a hierarchical MoE, a primary gating network chooses a sparse weighted combination of “experts", each of which is itself a secondary mixture-of-experts with its own gating network. (We have not found the need for deeper hierarchies.)

If the hierarchical MoE consists of $a$ groups of $b$ experts each, we denote the primary gating network by $G_{primary}$, the secondary gating networks by $(G_1, G_2 .. G_a)$, and the expert networks by $(E_{0,0}, E_{0,1} .. E_{a,b})$. The output of the MoE is given by:

$y_H = \sum_{i=1}^{a} \sum_{j=1}^{b} G_{primary}(x)_i \cdot G_i(x)_j \cdot E_{i,j}(x)$   (12)

Our metrics of expert utilization change to the following:

$Importance_H(X)_{i,j} = \sum_{x \in X} G_{primary}(x)_i \cdot G_i(x)_j$   (13)
$Load_H(X)_{i,j} = \frac{Load_{primary}(X)_i \cdot Load_i(X^{(i)})_j}{|X^{(i)}|}$   (14)

$Load_{primary}$ and $Load_i$ denote the $Load$ functions for the primary gating network and the $i^{th}$ secondary gating network respectively. $X^{(i)}$ denotes the subset of $X$ for which $G_{primary}(x)_i > 0$.

It would seem simpler to let $Load_H(X)_{i,j} = Load_i(X_i)_j$, but this would not have a gradient with respect to the primary gating network, so we use the formulation above.
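As a concrete reading of Equation 12, the sketch below (ours) evaluates a hierarchical MoE for a single input, skipping any group or expert whose gate value is zero; the gating networks and experts are passed in as callables:

def hierarchical_moe(x, primary_gate, secondary_gates, experts):
    """y_H = sum_i sum_j G_primary(x)_i * G_i(x)_j * E_ij(x)   (Eq. 12)
    primary_gate(x)    -> gate values over the a groups
    secondary_gates[i] -> gate values over the b experts of group i
    experts[i][j]      -> expert network E_ij"""
    g_primary = primary_gate(x)                 # shape [a], mostly zeros
    y = 0.0
    for i, g_i in enumerate(g_primary):
        if g_i == 0.0:                          # sparse: unused groups are never evaluated
            continue
        g_secondary = secondary_gates[i](x)     # shape [b], mostly zeros
        for j, g_ij in enumerate(g_secondary):
            if g_ij == 0.0:
                continue
            y = y + g_i * g_ij * experts[i][j](x)
    return y

The same structure makes the definition of $Load_H$ natural: only the examples routed to group $i$ (the set $X^{(i)}$) ever reach the secondary gating network $G_i$.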

C 1 Billion Word Language Modeling Benchmark - Experimental Details

C.1 8-Million-Operations-per-Timestep Models

Model Architecture:

Our model consists of five layers: a word embedding layer, a recurrent Long Short-Term Memory (LSTM) layer (Hochreiter & Schmidhuber, 1997; Gers et al., 2000), a MoE layer, a second LSTM layer, and a softmax layer. The dimensionality of the embedding layer, the number of units in each LSTM layer, and the input and output dimensionality of the MoE layer are all equal to 512. For every layer other than the softmax, we apply dropout (Zaremba et al., 2014) to the layer output, dropping each activation with probability $DropProb$, otherwise dividing by $(1 - DropProb)$. After dropout, the output of the previous layer is added to the layer output. This residual connection encourages gradient flow (He et al., 2015).

MoE Layer Architecture:

Each expert in the MoE layer is a feed forward network with one ReLU-activated hidden layer of size 1024 and an output layer of size 512. Thus, each expert contains $[512*1024] + [1024*512] = 1M$ parameters. The output of the MoE layer is passed through a sigmoid function before dropout. We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096-h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with $k=4$ for the ordinary MoE layers and $k=2$ at each level of the hierarchical MoE layers. Thus, each example is processed by exactly 4 experts for a total of 4M ops/timestep. The two LSTM layers contribute 2M ops/timestep each for the desired total of 8M.
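The arithmetic behind these counts can be checked directly; the sketch below is our bookkeeping, counting one op per weight multiply-add, which is what the quoted totals imply:

# Back-of-the-envelope check of the parameter and op counts quoted above.
d_model, d_hidden, k = 512, 1024, 4

params_per_expert = d_model * d_hidden + d_hidden * d_model   # ~1M parameters per expert
moe_ops_per_timestep = k * params_per_expert                  # 4 active experts -> ~4M ops
lstm_ops_per_timestep = 2 * 2_000_000                         # two LSTM layers at ~2M ops each
total_ops = moe_ops_per_timestep + lstm_ops_per_timestep      # ~8M ops/timestep

def moe_layer_params(num_experts):
    # Expert parameters only; the gating network adds a comparatively small amount.
    return num_experts * params_per_expert

print(moe_layer_params(256) / 1e6)   # ~268M expert parameters for MoE-256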

Computationally-Matched Baselines:

The MoE-4 model does not employ sparsity, since all 4 experts are always used. In addition, we trained four more computationally-matched baseline models with no sparsity:

  • MoE-1-Wide: The MoE layer consists of a single "expert" containing one ReLU-activated hidden layer of size 4096.


  • MoE-1-Deep: The MoE layer consists of a single "expert" containing four ReLU-activated hidden layers, each with size 1024.


  • 4xLSTM-512: We replace the MoE layer with two additional 512-unit LSTM layers.


  • LSTM-2048-512: The model contains one 2048-unit LSTM layer (and no MoE). The output of the LSTM is projected down to 512 dimensions (Sak et al., 2014). The next timestep of the LSTM receives the projected output. This is identical to one of the models published in (Jozefowicz et al., 2016). We re-ran it to account for differences in training regimen, and obtained results very similar to the published ones.


Training:

The models were trained on a cluster of 16 K40 GPUs using the synchronous method described in Section 3. Each batch consisted of a set of sentences totaling roughly 300,000 words. In the interest of time, we limited training to 10 epochs (27,000 steps). Training took 12-16 hours for all models, except for MoE-4, which took 18 hours (since all the expert computation was performed on only 4 of 16 GPUs). We used the Adam optimizer (Kingma & Ba, 2015). The base learning rate was increased linearly for the first 1000 training steps, and decreased after that so as to be proportional to the inverse square root of the step number. The Softmax output layer was trained efficiently using importance sampling similarly to the models in (Jozefowicz et al., 2016). For each model, we performed a hyper-parameter search to find the best dropout probability, in increments of 0.1.
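The learning-rate schedule can be written as a single function. The following is our reading of the description above (the decay constant is not given in the text, so we choose it so that the two phases meet at the end of warmup):

def learning_rate(step, base_lr, warmup_steps=1000):
    """Linear warmup for warmup_steps, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)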

To ensure balanced expert utilization we set $w_{importance} = 0.1$ and $w_{load} = 0.1$, as described in Section 4 and Appendix A.

Results:

We evaluate our model using perplexity on the holdout dataset, used by (Chelba et al., 2013; Jozefowicz et al., 2016). We follow the standard procedure and sum over all the words including the end of sentence symbol. Results are reported in Table 7. For each model, we report the test perplexity, the computational budget, the parameter counts, the value of $DropProb$, and the computational efficiency.

Table 7: Model comparison on 1 Billion Word Language Modeling Benchmark. Models marked with * are from (Jozefowicz et al., 2016).
Model | Test Perplexity (10 epochs) | Test Perplexity (final) | ops/timestep (millions) | #Params excluding embed. & softmax (millions) | Total #Params (billions) | DropProb | TFLOPS per GPU (observed)
Kneser-Ney 5-gram* | 67.6 | | 0.00001 | | 1.8 | |
LSTM-512-512* | 54.1 | | 2.4 | 2.4 | 0.8 | 0.1 |
LSTM-1024-512* | 48.2 | | 4.7 | 4.7 | 0.8 | 0.1 |
LSTM-2048-512* | 45.0 | 43.7 | 9.4 | 9.4 | 0.8 | 0.1 | 0.61
LSTM-2048-512 | 44.7 | | 9.4 | 9.4 | 0.8 | 0.1 | 1.21
4xLSTM-512 | 46.0 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.07
MoE-1-Wide | 46.1 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.29
MoE-1-Deep | 45.7 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.29
MoE-4 | 45.0 | | 8.4 | 8.4 | 0.8 | 0.1 | 0.52
MoE-32 | 39.7 | | 8.4 | 37.8 | 0.9 | 0.1 | 0.87
MoE-256 | 35.7 | | 8.6 | 272.9 | 1.1 | 0.1 | 0.81
MoE-256-h | 36.0 | | 8.4 | 272.9 | 1.1 | 0.1 | 0.89
MoE-1024-h | 34.6 | | 8.5 | 1079.0 | 1.9 | 0.2 | 0.90
MoE-4096-h | 34.1 | | 8.9 | 4303.4 | 5.1 | 0.2 | 0.74
2xLSTM-8192-1024* | 34.7 | 30.6 | 151.0 | 151.0 | 1.8 | 0.25 | 1.09
MoE-34M | 31.3 | | 33.8 | 4313.9 | 6.0 | 0.3 | 1.22
MoE-143M | 28.0 | | 142.7 | 4371.1 | 6.0 | 0.4 | 1.56

C.2 More Expensive Models

We ran two additional models (MoE-34M and MoE-143M) to investigate the effects of adding more computation in the presence of a large MoE layer. These models have computation budgets of 34M and 143M ops/timestep. Similar to the models above, these models use a MoE layer between two LSTM layers. The dimensionality of the embedding layer, and the input and output dimensionality of the MoE layer are set to 1024 instead of 512. For MoE-34M, the LSTM layers have 1024 units. For MoE-143M, the LSTM layers have 4096 units and an output projection of size 1024 (Sak et al., 2014). MoE-34M uses a hierarchical MoE layer with 1024 experts, each with a hidden layer of size 2048. MoE-143M uses a hierarchical MoE layer with 256 experts, each with a hidden layer of size 8192. Both models have 4B parameters in the MoE layers. We searched for the best $DropProb$ for each model, and trained each model for 10 epochs.

The two models achieved test perplexity of 31.3 and 28.0 respectively, showing that even in the presence of a large MoE, more computation is still useful. Results are reported at the bottom of Table 7. The larger of the two models has a similar computational budget to the best published model from the literature, and training times are similar. Comparing after 10 epochs, our model has a lower test perplexity by 18%.

D 100 Billion Word Google News Corpus - Experimental Details

Model Architecture:

The models are similar in structure to the 8-million-operations-per-timestep models described in the previous section. We vary the number of experts between models, using an ordinary MoE layer with 32 experts and hierarchical MoE layers with 256, 1024, 4096, 16384, 65536 and 131072 experts. For the hierarchical MoE layers, the first level branching factors are 32, 32, 64, 128, 256 and 256, respectively.

Training:

Models are trained on a cluster of 32 Tesla K40 GPUs, except for the last two models, which are trained on clusters of 64 and 128 GPUs so as to have enough memory for all the parameters. For all models, training batch sizes are approximately 2.5 million words. Models are trained once-through over about 100 billion words.

We implement several memory optimizations in order to fit up to 1 billion parameters per GPU. First, we do not store the activations of the hidden layers of the experts, but instead recompute them on the backwards pass. Secondly, we modify the optimizer on the expert parameters to require less auxiliary storage:

The Adam optimizer (Kingma & Ba, 2015) keeps first and second moment estimates of the per-parameter gradients. This triples the required memory. To avoid keeping a first-moment estimator, we set $\beta_1 = 0$. To reduce the size of the second moment estimator, we replace it with a factored approximation. For a matrix of parameters, instead of maintaining a full matrix of second-moment estimators, we maintain vectors of row-wise and column-wise averages of that matrix. At each step, the matrix of estimators is taken to be the outer product of those two vectors divided by the mean of either one. This technique could similarly be applied to Adagrad (Duchi et al., 2010).
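A rough sketch of the factored second-moment estimator follows (ours; the per-step update rule for the row and column accumulators is not spelled out in the text, so the exponential moving average below is an assumption). For an $r \times c$ weight matrix, only $r + c$ accumulator values are stored instead of $r \cdot c$:

import numpy as np

class FactoredSecondMoment:
    """Factored approximation to per-parameter second moments of the gradient
    for a matrix of parameters, using row-wise and column-wise averages."""

    def __init__(self, shape, decay=0.999, eps=1e-30):
        rows, cols = shape
        self.row_avg = np.zeros(rows)   # row-wise averages of grad**2
        self.col_avg = np.zeros(cols)   # column-wise averages of grad**2
        self.decay = decay
        self.eps = eps

    def update(self, grad):
        sq = grad ** 2
        self.row_avg = self.decay * self.row_avg + (1 - self.decay) * sq.mean(axis=1)
        self.col_avg = self.decay * self.col_avg + (1 - self.decay) * sq.mean(axis=0)

    def estimate(self):
        # Outer product of the two vectors divided by the mean of either one
        # (both vectors have the same mean, namely the overall mean of grad**2).
        return np.outer(self.row_avg, self.col_avg) / (self.row_avg.mean() + self.eps)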

Table 8: Model comparison on 100 Billion Word Google News Dataset
Model | Test Perplexity (0.1 epochs) | Test Perplexity (1 epoch) | ops/timestep (millions) | #Params excluding embed. & softmax (millions) | Total #Params (billions) | TFLOPS per GPU (observed)
Kneser-Ney 5-gram | 67.1 | 45.3 | 0.00001 | | 76.0 |
4xLSTM-512 | 54.5 | 47.0 | 8.4 | 8.4 | 0.1 | 1.23
MoE-32 | 48.5 | 40.4 | 8.4 | 37.8 | 0.1 | 0.83
MoE-256-h | 42.8 | 35.3 | 8.4 | 272.9 | 0.4 | 1.11
MoE-1024-h | 40.3 | 32.7 | 8.5 | 1079.0 | 1.2 | 1.14
MoE-4096-h | 38.9 | 30.9 | 8.6 | 4303.4 | 4.4 | 1.07
MoE-16384-h | 38.2 | 29.7 | 8.8 | 17201.0 | 17.3 | 0.96
MoE-65536-h | 38.2 | 28.9 | 9.2 | 68791.0 | 68.9 | 0.72
MoE-131072-h | 39.8 | 29.2 | 9.7 | 137577.6 | 137.7 | 0.30
Results:

We evaluate our model using perplexity on a holdout dataset. Results are reported in Table 8. Perplexity after 100 billion training words is 39% lower for the 68-billion-parameter MoE model than for the baseline model. It is notable that the measured computational efficiency of the largest model (0.30 TFLOPS/GPU) is very low compared to the other models. This is likely a result of the fact that, for purposes of comparison to the other models, we did not increase the training batch size proportionally to the number of GPUs. For comparison, we include results for a computationally matched baseline model consisting of 4 LSTMs, and for an unpruned 5-gram model with Kneser-Ney smoothing (Kneser & Ney, 1995). (While the original size of the corpus was 130 billion words, the neural models were trained for a maximum of 100 billion words. The reported Kneser-Ney 5-gram models were trained over 13 billion and 130 billion words respectively, giving them a slight advantage over the other reported results.)


E Machine Translation - Experimental Details

Model Architecture for Single Language Pair MoE Models:

Our model is a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decrease the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We insert MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). We use an attention mechanism between the encoder and decoder, with the first decoder LSTM receiving output from and providing input for the attention. (For performance reasons, we use a slightly different attention function from the one described in (Wu et al., 2016); see Appendix G.) All of the layers in our model have input and output dimensionality of 512. Our LSTM layers have 2048 hidden units, with a 512-dimensional output projection. We add residual connections around all LSTM and MoE layers to encourage gradient flow (He et al., 2015). Similar to GNMT, to effectively deal with rare words, we used sub-word units (also known as “wordpieces") (Schuster & Nakajima, 2012) for inputs and outputs in our system.

We use a shared source and target vocabulary of 32K wordpieces. We also used the same beam search technique as proposed in (Wu et al., 2016).

We train models with different numbers of experts in the MoE layers. In addition to a baseline model with no MoE layers, we train models with flat MoE layers containing 32 experts, and models with hierarchical MoE layers containing 512 and 2048 experts. The flat MoE layers use $k=4$ and the hierarchical MoE models use $k=2$ at each level of the gating network. Thus, each input is processed by exactly 4 experts in each MoE layer. Each expert in the MoE layer is a feed forward network with one hidden layer of size 2048 and ReLU activation. Thus, each expert contains $[512*2048] + [2048*512] = 2M$ parameters. The output of the MoE layer is passed through a sigmoid function. We use the strictly-balanced gating function described in Appendix F.

Model Architecture for Multilingual MoE Model:

We used the same model architecture as for the single-language-pair models, with the following exceptions: We used noisy-top-k gating as described in Section 2.1, not the scheme from Appendix F. The MoE layers in the encoder and decoder are non-hierarchical MoEs with $n = 512$ experts, and $k = 2$. Each expert has a larger hidden layer of size 8192. This doubles the amount of computation in the MoE layers, raising the computational budget of the entire model from 85M to 102M ops/timestep.

Training:

We trained our networks using the Adam optimizer (Kingma & Ba, 2015). The base learning rate was increased linearly for the first 2000 training steps, held constant for an additional 8000 steps, and decreased after that so as to be proportional to the inverse square root of the step number. For the single-language-pair models, similarly to (Wu et al., 2016), we applied dropout (Zaremba et al., 2014) to the output of all embedding, LSTM and MoE layers, using $DropProb = 0.4$. Training was done synchronously on a cluster of up to 64 GPUs as described in Section 3. Each training batch consisted of a set of sentence pairs containing roughly 16000 words per GPU.

To ensure balanced expert utilization we set $w_{importance} = 0.01$ and $w_{load} = 0.01$, as described in Section 4 and Appendix A.

Metrics:

We evaluated our models using the perplexity and the standard BLEU score metric. We reported tokenized BLEU score as computed by the multi-bleu.pl script, downloaded from the public implementation of Moses (on Github), which was also used in (Luong et al., 2015a).

Results:

Tables 2, 3 and 4 in Section 5.3 show comparisons of our results to other published methods. Figure 4 shows test perplexity as a function of number of words in the (training data’s) source sentences processed for models with different numbers of experts. As can be seen from the Figure, as we increased the number of experts to approach 2048, the test perplexity of our model continued to improve.

Figure 4: Perplexity on WMT’14 En→Fr (left) and Google Production En→Fr (right) datasets as a function of number of words processed. The large differences between models at the beginning of training are due to different batch sizes. All models incur the same computational budget (85M ops/timestep) except the one with no experts.

We found that the experts indeed become highly specialized by syntax and/or semantics, as can be seen in Table 9. For example, one expert is used when the indefinite article “a" introduces the direct object in a verb phrase indicating importance or leadership.

Table 9: Contexts corresponding to a few of the 2048 experts in the MoE layer in the encoder portion of the WMT’14 En→Fr translation model. For each expert $i$, we sort the inputs in a training batch in decreasing order of $G(x)_i$, and show the words surrounding the corresponding positions in the input sentences.
Expert 381 | Expert 752 | Expert 2004
… with researchers , … | … plays a core … | … with rapidly growing …
… to innovation . | … plays a critical … | … under static conditions …
… tics researchers . | … provides a legislative … | … to swift ly …
… the generation of … | … play a leading … | … to dras tically …
… technology innovations is … | … assume a leadership … | … the rapid and …
… technological innovations , … | … plays a central … | … the fast est …
… support innovation throughout … | … taken a leading … | … the Quick Method …
… role innovation will … | … established a reconciliation … | … rec urrent ) …
… research scienti st … | … played a vital … | … provides quick access …
… promoting innovation where … | … have a central … | … of volatile organic …

F Strictly Balanced Gating

Due to some peculiarities in our infrastructure which have since been fixed, at the time we ran some of the machine translation experiments, our models ran faster if every expert received exactly the same batch size. To accommodate this, we used a different gating function which we describe below.

Recall that we define the softmax gating function to be:

$G_\sigma(x) = Softmax(x \cdot W_g)$   (15)
Sparse Gating (alternate formulation):

To obtain a sparse gating vector, we multiply $G_\sigma(x)$ component-wise with a sparse mask $M(G_\sigma(x))$ and normalize the output. The mask itself is a function of $G_\sigma(x)$ and specifies which experts are assigned to each input example:

$G(x)_i = \frac{G_\sigma(x)_i \, M(G_\sigma(x))_i}{\sum_{j=1}^{n} G_\sigma(x)_j \, M(G_\sigma(x))_j}$   (16)
Top-K Mask:

To implement top-k gating in this formulation, we would let $M(v) = TopK(v, k)$, where:

$TopK(v,k)_i = 1$ if $v_i$ is in the top $k$ elements of $v$, and $0$ otherwise.   (17)
Batchwise Mask:

To force each expert to receive the exact same number of examples, we introduce an alternative mask function, $M_{batchwise}(X, m)$, which operates over batches of input vectors. Instead of keeping the top $k$ values per example, we keep the top $m$ values per expert across the training batch, where $m = \frac{k|X|}{n}$, so that each example is sent to an average of $k$ experts.

$M_{batchwise}(X,m)_{j,i} = 1$ if $X_{j,i}$ is in the top $m$ values for expert $i$, and $0$ otherwise.   (18)

As our experiments suggest, and as was also observed in (Ioffe & Szegedy, 2015), using a batchwise function during training (such as $M_{batchwise}$) requires modifications at inference time, when we may not have a large batch of examples. Our solution to this is to train a vector $T$ of per-expert threshold values to approximate the effects of the batchwise mask. We use the following mask at inference time:

$M_{threshold}(x,T)_i = 1$ if $x_i > T_i$, and $0$ otherwise.   (19)

To learn the threshold values, we apply an additional loss at training time which is minimized when the batchwise mask and the threshold mask are identical.

$L_{batchwise}(X,T,m) = \sum_{j=1}^{|X|} \sum_{i=1}^{n} \big( M_{threshold}(X_j,T)_i - M_{batchwise}(X,m)_{j,i} \big) \big( X_{j,i} - T_i \big)$   (20)
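Putting Equations 15-20 together, the NumPy sketch below (ours; the function names and the handling of examples that happen to receive no expert are our choices, not the paper's) shows the top-k, batchwise and threshold masks side by side:

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_mask(gates, k):
    """TopK(v, k) (Eq. 17): per example, keep its k largest gate values."""
    mask = np.zeros_like(gates)
    cols = np.argsort(gates, axis=1)[:, -k:]            # [batch, k] column indices
    mask[np.arange(gates.shape[0])[:, None], cols] = 1.0
    return mask

def batchwise_mask(gates, m):
    """M_batchwise (Eq. 18): per expert (column), keep its top-m gate values
    across the batch, so every expert receives exactly m examples."""
    mask = np.zeros_like(gates)
    rows = np.argsort(gates, axis=0)[-m:, :]            # [m, n] row indices
    mask[rows, np.arange(gates.shape[1])] = 1.0
    return mask

def threshold_mask(gates, thresholds):
    """M_threshold (Eq. 19): per-expert thresholds, usable at inference time."""
    return (gates > thresholds).astype(gates.dtype)

def threshold_loss(gates, thresholds, m):
    """L_batchwise (Eq. 20): minimized when the two masks agree."""
    return ((threshold_mask(gates, thresholds) - batchwise_mask(gates, m))
            * (gates - thresholds)).sum()

def balanced_gating(x, w_g, k):
    """Eqs. 15-16 with the batchwise mask: normalized sparse gate values."""
    gates = softmax(x @ w_g)                            # G_sigma(x), [batch, n]
    m = k * gates.shape[0] // gates.shape[1]            # m = k|X| / n
    masked = gates * batchwise_mask(gates, m)
    # An example may occasionally receive no expert under the batchwise mask;
    # the epsilon only guards against dividing by zero in that case.
    return masked / (masked.sum(axis=1, keepdims=True) + 1e-9)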

G Attention Function

The attention mechanism described in GNMT (Wu et al., 2016) involves a learned “Attention Function" $A(x_i, y_j)$ which takes a “source vector" $x_i$ and a “target vector" $y_j$, and must be computed for every source time step $i$ and target time step $j$. In GNMT, the attention function is implemented as a feed forward neural network with a hidden layer of size $n$. It can be expressed as:

$A_{GNMT}(x_i, y_j) = \sum_{d=1}^{n} V_d \tanh\big( (x_i U)_d + (y_j W)_d \big)$   (21)

Where $U$ and $W$ are trainable weight matrices and $V$ is a trainable weight vector.

For performance reasons, in our models, we used a slightly different attention function:

$A(x_i, y_j) = \sum_{d=1}^{n} V_d \tanh\big( (x_i U)_d \big) \tanh\big( (y_j W)_d \big)$   (22)

With our attention function, we can simultaneously compute the attention function on multiple source time steps and multiple target time steps using optimized matrix multiplications. We found little difference in quality between the two functions.
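The reason the modified attention function is cheaper is that the product of the two tanh terms factorizes over source and target positions, so the full matrix of attention logits reduces to two matrix multiplications. A small sketch of both forms (ours; the shapes are illustrative):

import numpy as np

def attention_gnmt(X, Y, U, W, V):
    """Additive GNMT attention (Eq. 21) for all source/target pairs.
    X: [S, d_x] source vectors, Y: [T, d_y] target vectors,
    U: [d_x, n], W: [d_y, n], V: [n]. Returns [S, T] attention logits."""
    # tanh of a sum does not factorize, so we broadcast over all (i, j) pairs.
    h = np.tanh((X @ U)[:, None, :] + (Y @ W)[None, :, :])   # [S, T, n]
    return h @ V

def attention_ours(X, Y, U, W, V):
    """Multiplicative variant (Eq. 22): the product of tanh terms factorizes,
    so the whole [S, T] matrix is two matmuls and an elementwise product."""
    a = np.tanh(X @ U) * V          # fold V into the source side, [S, n]
    b = np.tanh(Y @ W)              # [T, n]
    return a @ b.T                  # [S, T] attention logits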