
FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

Xiaonan Nie (xiaonan.nie@pku.edu.cn), School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University, Beijing, China
Xupeng Miao (xupeng@cmu.edu), Carnegie Mellon University, USA
Zilong Wang (zilongwang@microsoft.com), Microsoft, China
Zichao Yang (yangtze2301@gmail.com), Carnegie Mellon University, USA
Jilong Xue (jxue@microsoft.com), Microsoft Research, China
Lingxiao Ma (lingxiao.ma@microsoft.com), Microsoft Research, China
Gang Cao (caogang@baai.ac.cn), Beijing Academy of Artificial Intelligence, China
and Bin Cui (bin.cui@pku.edu.cn), School of CS & Key Lab of High Confidence Software Technologies (MOE), Institute of Computational Social Science, Peking University (Qingdao), Peking University, Beijing, China
(July 2022; October 2022; November 2022)
Abstract.

With increasing data volumes, there is a trend of using large-scale pre-trained models to store knowledge in an enormous number of model parameters. Training these models involves large amounts of dense algebraic computation and therefore requires huge hardware resources. Recently, sparsely-gated Mixture-of-Experts (MoE) models have become popular and have demonstrated impressive pre-training scalability on various downstream tasks. However, such sparse conditional computation may not be as effective as expected in practical systems, due to routing imbalance and routing fluctuation. More broadly, MoEs are becoming a new data analytics paradigm in the data life cycle and face unique challenges at scales, complexities, and granularities never before possible.

In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently addresses the inefficiency caused by dynamic dataflow. We first present an empirical analysis of the problems and opportunities in training MoE models, which motivates us to overcome the routing imbalance and fluctuation problems through a dynamic expert management and device placement mechanism. We then introduce a novel scheduling module on top of the existing DNN runtime that monitors the data flow, makes scheduling plans, and dynamically adjusts the model-to-hardware mapping guided by real-time data traffic. A simple but efficient heuristic algorithm is exploited to dynamically optimize the device placement during training. We have conducted experiments on both NLP models (e.g., BERT and GPT) and vision models (e.g., Swin), and the results show that FlexMoE achieves superior performance over existing systems on real-world workloads: FlexMoE outperforms DeepSpeed by 1.70× on average (up to 2.10×) and FasterMoE by 1.30× on average (up to 1.45×).

Deep Learning System, Distributed Computing, Sparse Model
copyright: acmlicensed
journal: PACMMOD
journalyear: 2023
journalvolume: 1
journalnumber: 1
article: 110
publicationmonth: 5
price: 15.00
doi: 10.1145/3588964
ccs: Computing methodologies Self-organization
ccs: Computing methodologies Massively parallel algorithms

1. Introduction

Large-scale pre-trained models (PTMs), such as Transformer models, have promoted deep learning (DL) development on various complicated tasks, including natural language processing, e.g., BERT (Devlin et al., 2019), GPT (Brown et al., 2020), T5 (Raffel et al., 2020), computer vision, e.g., ViT (Dosovitskiy et al., 2021), Swin (Liu et al., 2021), advertising recommendation, e.g., M6 (Lin et al., 2021), and so on. These models are also known as foundation models, since they are trained on hundreds of gigabytes of data and can be adapted, e.g., via task-specific fine-tuning, to a wide range of downstream tasks. However, recent studies (Kaplan et al., 2020; Brown et al., 2020) have demonstrated that model quality scales as a power law with data size, parameter size, and computation budget. Current state-of-the-art foundation models have grown to trillions of parameters and have therefore become extremely expensive and time-consuming to train.

To avoid hitting the trainability wall, one well-known line of studies utilizes the sparsely-gated Mixture-of-Experts (MoE) (Shazeer et al., 2017) structure to make pre-trained models cost-effective. This sparse MoE architecture enlarges the model parameter size by expanding expert networks while keeping the computational budget stable. Specifically, each MoE layer consists of a gate network and a series of experts (e.g., tens of them), where the gate network routes the input tokens (i.e., the inputs of the layer) to only a small number of experts rather than all of them, to be computation-efficient. MoE models have been successfully used to boost the performance of various PTMs, including language models (e.g., GLaM (Du et al., 2022) achieves state-of-the-art performance on the SuperGLUE NLP benchmark (Wang et al., 2019)) and vision models (e.g., V-MoE (Riquelme et al., 2021) matches the most powerful vision Transformers with only half the computation time).

Due to these superior behaviors, large-scale MoE models have recently attracted wide interest from industrial companies building real-world big data applications, including Google (Du et al., 2022), Meta (Roller et al., 2021), Microsoft (Rajbhandari et al., 2022), and Alibaba (Lin et al., 2021). Such data-intensive workloads are becoming a new data analytics paradigm for the data science research community. With the rapid expansion of MoE models, unprecedented opportunities are brought into the data life cycle (especially in model development, deployment, and management procedures), as well as challenges. For example, Google created the ST-MoE (Zoph et al., 2022) model to improve the question answering quality of its search engine, which achieves state-of-the-art results on the AIRC leaderboard (qa_, 2021), e.g., increasing accuracy from 81.40% to 86.52% by using MoE. However, such a large MoE model not only needs a lot of data to train on, e.g., 1.5T tokens for ST-MoE, but also consumes large amounts of expensive computing resources for training. Therefore, it is necessary to address the data management problems hidden in existing systems and conduct optimizations to overcome these challenges.

Since a single MoE layer can easily exceed the limited GPU memory, previous distributed training techniques (e.g., data parallelism (Li et al., 2020) and model parallelism (Narayanan et al., 2021, 2019; Miao et al., 2022b; Nie et al., 2022a)) have become unsatisfactory. To train large MoE models, Lepikhin et al. (2021) proposed the expert parallelism technique, which distributes experts across multiple GPUs to fit within the limited GPU memory. After the gate network has determined the routing, each training token is sent to the GPUs where its target experts reside. Since it adapts to the special characteristics of MoE structures, expert parallelism has become one of the most widely used techniques for training MoE-based models in several well-known frameworks, including DeepSpeed-MoE (Rajbhandari et al., 2022), Tutel (Hwang et al., 2022), and HetuMoE (Nie et al., 2022b).

The Challenge of Workload Imbalance. Despite the widespread adoption of expert parallelism (Lepikhin et al., 2021), it suffers from severe workload imbalance caused by the sparse and conditional computing nature of MoE layers. Specifically, the input tokens are organized at a fine-grained level and routed to individual experts by the gating network. We found that the input tokens show highly uneven preferences over the experts, and the routing results keep varying over training (Nie et al., 2021). As reported by (Lewis et al., 2021; He et al., 2022; Ma et al., 2022) and verified by our empirical studies, such workload imbalance can lead to a significant efficiency slowdown and severe resource under-utilization.

The Current Landscape. To tackle the workload imbalance problem, most existing works add an auxiliary balance loss to each MoE layer to enforce balanced routing. To control the trade-off between system efficiency (workload balance) and statistical efficiency (model quality), the balance loss is further associated with a tunable penalty weight. In addition, Fedus et al. (2022) introduce the expert capacity to avoid excessive input token assignments to the experts; in other words, tokens beyond the expert capacity are dropped by the experts. Nevertheless, these methods can only mitigate the workload imbalance problem rather than fundamentally guarantee balance. Moreover, they leave users in a dilemma: whether, and to what degree, shall we sacrifice model quality for system efficiency? For instance, applying a large penalty weight to the balance loss (or a low capacity to the experts) helps load balancing among experts but harms model quality, since a substantial amount of tokens would be routed to less-desired experts (or dropped), and vice versa.

Intuitively speaking, the aforementioned methods try to make the data flow of tokens as even as possible, i.e., forcing all experts to process almost the same number of tokens. They are easily adopted in existing expert parallelism systems, because users only need to change their model definitions (e.g., balance loss and capacity) without any modifications to the underlying DL systems. In other words, such approaches are friendly to users unfamiliar with DL systems, but force them to sacrifice model quality for training efficiency. Unlike these previous approaches, which modify the model and may degrade model quality, we optimize the workload imbalance from a system perspective:

How can we design a distributed MoE training system that achieves maximum system efficiency without affecting model quality?

Summary of Our Approach. In this work, we propose FlexMoE, a novel distributed training system for large-scale sparse MoE models. We overcome the routing imbalance problem from the brand-new perspective of expert placement: instead of placing each expert on a single GPU device as in traditional expert parallelism, or duplicating all experts on all GPU devices as in traditional data parallelism, we introduce a fine-grained replicated expert parallelism that selects specific heavy experts, duplicates them over multiple devices, and spreads the input tokens across the replicas. This is non-trivial in sparse MoE models because the device placement of these experts changes the original input data assignment (i.e., all-to-all communication across different experts) and involves additional synchronization overheads (i.e., all-reduce communication among replicated experts). By precisely modeling the actual execution procedure of the MoE layer, FlexMoE estimates the costs that determine the concrete expert replication scheme.

We further investigate the uneven distribution of the sparsely-activated experts in various MoE training workloads. Our key finding is that MoE models exhibit routing fluctuation during the iterative training process: the imbalanced expert preference distribution changes continuously and smoothly during training (described in depth in Section 2.4). The training of gate networks is highly unstable (Dai et al., 2022) because gradient-based optimization algorithms reinforce the previous routing decisions until a certain point is reached and the wrong learning direction is escaped. To handle this obstacle, we design a dynamic expert management mechanism in FlexMoE to adaptively adjust the expert-to-device mapping and the assignment of tokens during training. Specifically, we adopt a data-driven approach that adaptively changes the expert placement during training by monitoring traffic trends. For instance, FlexMoE gradually expands the resources of experts with increasing workloads for faster computation and shrinks those of experts with decreasing workloads to reduce replica synchronization overheads.

FlexMoE is designed as a novel scheduling module on top of existing DNN frameworks, which monitors data traffic, makes scheduling plans, and dynamically adjusts the mapping between the data-flow graph and the distributed devices. In particular, three placement adjustment primitives (i.e., expand, shrink, migrate) are provided to flexibly govern the expert placement, and a gate flow-control mechanism is introduced to enable autonomous global traffic optimization. Based on these building blocks, a simple yet effective algorithm estimates the benefit or overhead of each adjustment to determine the optimal scheduling plan.
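To make the three primitives concrete, here is a minimal sketch (our illustration, not FlexMoE's actual implementation, which additionally moves parameters and rebuilds communication groups) of how expand, shrink, and migrate act on an expert-to-device mapping represented as a set of (expert, GPU) pairs:

```python
def expand(P, expert, gpu):
    """Add a replica of `expert` on `gpu`."""
    return P | {(expert, gpu)}

def shrink(P, expert, gpu):
    """Remove the replica of `expert` on `gpu`; at least one replica must remain."""
    assert sum(1 for e, _ in P if e == expert) > 1, "cannot remove the last replica"
    return P - {(expert, gpu)}

def migrate(P, expert, src, dst):
    """Move the replica of `expert` from GPU `src` to GPU `dst`."""
    return (P - {(expert, src)}) | {(expert, dst)}

P = {("e1", "g1"), ("e2", "g2")}
P = expand(P, "e1", "g2")         # e1 now replicated on g1 and g2
P = shrink(P, "e1", "g1")         # drop the g1 replica of e1
P = migrate(P, "e2", "g2", "g3")  # move e2 from g2 to g3
```

Representing the placement as a plain set makes each adjustment a cheap set operation whose cost model can be evaluated before committing to the change.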

Last but not least, we implement FlexMoE on top of PyTorch (Paszke et al., 2019) and verify the superiority of our work through comprehensive experiments. When training several popular MoE-based PTMs (e.g., BERT, GPT, Swin) on various tasks on a 64-GPU A100 cluster, FlexMoE achieves up to 2.1× speedup over existing state-of-the-art MoE training systems. Furthermore, FlexMoE consistently outperforms them in terms of model quality, since we do not compromise token routing to accommodate the expert placement.

Paper Organization. In Section 2, we discuss the inefficiency of training sparse MoE models with expert parallelism and explore opportunities for addressing it through system-level optimizations. In Section 3, we introduce the system design of FlexMoE, which leverages the vExpert abstraction to facilitate dynamic expert management and device placement. We provide details of the system implementation in Section 4 and evaluate the effectiveness of FlexMoE in Section 5.

2. Background and Motivations

We first present the notations used in our paper.

  • $\mathcal{E}$: a series of experts $\{e_1, \ldots, e_N\}$.
  • $\mathcal{G}$: a set of GPUs, where $g \in \mathcal{G}$.
  • $\mathcal{I}$: the assignment of tokens, where $I_{e,g}$ represents the number of tokens sent to expert $e$ on GPU $g$.
  • $\mathcal{P}$: the expert-to-device mapping, where $(e,g) \in \mathcal{P}$ represents that GPU $g$ hosts expert $e$.
  • $TPS$: tokens-per-second processed by an expert.
  • $Bw_{g,g'}$: the bandwidth between GPU $g$ and GPU $g'$.
  • $BPS(\mathcal{G}')$: bytes-per-second for AllReduce on a set of GPUs $\mathcal{G}'$.

2.1. Transformer with Mixture-of-Experts

The Transformer architecture (Vaswani et al., 2017) has demonstrated superior performance on various tasks (Devlin et al., 2019; Brown et al., 2020; Raffel et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2021; Wei et al., 2022; Maruta and Kato, 2022). It mainly consists of attention networks and feed-forward networks. Each attention network first linearly transforms the input tokens into corresponding queries ($Q$), keys ($K$), and values ($V$) and then performs the scaled dot-product on them as in Equation 1, where $d$ is the dimension of the queries and keys. The attention network captures the dependencies between tokens within a sequence, and is thus effective in sequence transduction tasks.

(1) $\texttt{Attention}(Q,K,V)=\texttt{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$

Each feed-forward network (FFN) is composed of two fully connected layers and an activation function, e.g., ReLU, as formulated in Equation 2. Different from the attention network, the FFN models the relations among feature dimensions within a single token by scaling it into a larger dimension space. For example, the output dimension of $W_{1}$ is usually set to $4\times$ its input dimension. Well-known pre-trained models are usually stacked from a series of Transformer layers to improve model capacity as well as model quality.

(2) $\texttt{FFN}(x)=W_{2}\cdot\texttt{ReLU}(W_{1}\cdot x+b_{1})+b_{2}$
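As a concrete illustration of Equation 2, the following is a dependency-free Python sketch of the FFN with the conventional 4× hidden dimension; the weights are random placeholders, purely for illustration:

```python
import random

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: W2 · ReLU(W1 · x + b1) + b2 (Eq. 2)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]          # ReLU(W1 · x + b1)
    return [sum(w * hi for w, hi in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]            # W2 · hidden + b2

d_model, d_hidden = 4, 16                         # hidden dim = 4× input dim
random.seed(0)
W1 = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
W2 = [[random.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
b2 = [0.0] * d_model
y = ffn([1.0, -0.5, 0.3, 0.2], W1, b1, W2, b2)
```

Note that the FFN is applied independently to every token, which is exactly what makes it natural to replace with a routed MoE layer.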

Recently, scaling with more data and more parameters has driven significant model quality improvement (Dosovitskiy et al., 2021; Brown et al., 2020), but requires large amounts of computing resources for training. To scale models efficiently, researchers have adopted the sparsely-gated Mixture-of-Experts (MoE) paradigm (Shazeer et al., 2017; Fedus et al., 2022), replacing the FFN with an MoE layer in which each input token activates only a subset of the model parameters, thus introducing model sparsity.

The key components of an MoE layer include a data-dependent sparse gate network $g(x)$ and a series of experts $\mathcal{E}$, as shown in Figure 1(a). For each input token $x$, the gate network first produces the probability of $x$ with respect to each expert and then routes $x$ to its corresponding experts. The Top-K gate (Shazeer et al., 2017) is formulated in Equation 3, which keeps only the top $k$ values before the softmax function.

(3) $g(x)=\texttt{softmax}(\texttt{TopK}(x\cdot W_{g}))$

As soon as each expert $e_{i} \in \mathcal{E}$ receives its input token $x$, it produces its corresponding output $e_{i}(x)$, and the final output $y$ is obtained as the linear combination of the $e_{i}(x)$ weighted by the gate network's outputs $g(x)_{i}$ as follows:

(4) $y=\sum_{i=1}^{N}g(x)_{i}\cdot e_{i}(x).$
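Equations 3 and 4 can be sketched in a few lines of plain Python; the experts and gate weights below are toy values of our own choosing, purely illustrative:

```python
import math

def top_k_gate(x, Wg, k):
    """Top-K gate (Eq. 3): softmax over the k largest logits, zero elsewhere."""
    logits = [sum(w * xi for w, xi in zip(col, x)) for col in Wg]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exp = {i: math.exp(logits[i]) for i in top}
    z = sum(exp.values())
    return {i: e / z for i, e in exp.items()}      # expert index -> gate weight

def moe_forward(x, Wg, experts, k=2):
    """Eq. 4: y = sum_i g(x)_i * e_i(x), summed over the selected experts only."""
    gates = top_k_gate(x, Wg, k)
    y = [0.0] * len(x)
    for i, g in gates.items():
        out = experts[i](x)                        # e_i(x)
        y = [yi + g * oi for yi, oi in zip(y, out)]
    return y

# Toy setup: 4 experts that scale a 1-d token by different factors.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
Wg = [[0.1], [0.3], [0.2], [0.4]]                  # one gate column per expert
y = moe_forward([1.0], Wg, experts, k=2)
```

With k fixed, only k expert evaluations are performed per token regardless of the total number of experts, which is the source of the constant computational budget described above.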
Figure 1. Illustrations of Mixture-of-Experts (MoE). (a) The workflow of a typical MoE layer (detailed in Section 2.1). (b) The distributed training of an MoE layer with a Top-1 gate as expert parallelism, where experts are partitioned while non-MoE layers (e.g., self-attention, gate) are duplicated across devices. For example, according to the result of the gate, GPU 1 routes 6 input tokens to the 1st expert and the other 3 tokens to the Nth expert (detailed in Section 2.2).

When the gate network’s sparsity is static, e.g., Top-1 (Fedus et al., 2022) or Top-2 (Lepikhin et al., 2021), the computation and communication costs of given inputs nearly remain constant as the number of experts increases, allowing models to be scaled effectively with MoE.

To help readers better understand the routine of MoE layers, we also provide an illustration in Figure 1(b), where each GPU is associated with one expert and 9 tokens are fed to the gate network and routed to their target experts. For example, GPU 1 routes 6 and 3 tokens to the 1st and Nth experts, while GPU N routes 4 and 5 tokens to these two experts. This assignment causes a workload imbalance under the current expert placement (i.e., 10 tokens for GPU 1 vs. 8 tokens for GPU N). Since there is a reverse routing step after the expert computation, all GPUs have to wait for each other before executing the following layers. Such an imbalanced workload inevitably results in GPU under-utilization and low training efficiency.

2.2. Distributed Training

As models are sparsely scaled with MoE, multiple GPUs are involved for model management and training acceleration (Narayanan et al., 2021; Rajbhandari et al., 2022; Nie et al., 2023). We analyze the existing popular parallelism strategies in the following:

Data Parallelism. Data parallelism (DP) is usually used to scale training when the model fits in each device's available GPU memory, as the communication primitives of DP (e.g., AllReduce) achieve good scalability. In DP, the training samples are partitioned while the model parameters are duplicated on each device. Given a batch of training samples, each worker executes the forward and backward computation, synchronizes gradients globally (i.e., averages them), and updates its local parameters based on the synchronized gradients. However, each device has to maintain a full replica of the model, so DP cannot scale up to large models.

Model Parallelism. If the memory requirement of the given model exceeds the GPU memory, the model should be partitioned across multiple devices. In tensor parallelism (TP), every single tensor can be partitioned over multiple devices to reduce the memory footprint of each GPU. For example, Megatron-LM (Narayanan et al., 2021) proposed to partition the attention network by exploiting the inherent parallelism of the multi-head attention operation, where the query ($Q$), key ($K$), and value ($V$) matrices can be partitioned in a column-parallel fashion. Pipeline parallelism (PP) is an alternative partitioning method, where different groups of layers are placed on different devices. For example, GPipe (Huang et al., 2019) partitions the input batch into a number of micro-batches and pipelines each device's computation across micro-batches to improve the resource utilization of the devices.

Expert Parallelism. GShard (Lepikhin et al., 2021) first proposed expert parallelism (EP) as a model parallelism method specific to MoE models, where the experts within an MoE layer are placed on different devices and the non-MoE layers are replicated across devices as in DP. The workflow of distributed training of a Top-1 gate MoE layer is illustrated in Figure 1(b). After the target expert for each token is determined, an All-to-All communication sends the tokens to their target experts for processing, and another All-to-All communication sends the results back for the execution of the data-parallel non-MoE layers. As MoE models often have numerous experts, e.g., 1024 experts in Switch Transformer (Fedus et al., 2022), EP scales with model size better than model parallelism.
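A hedged sketch of the dispatch step that precedes the first All-to-All: each rank groups its local tokens into per-device send buffers keyed by the GPU hosting the target expert (the names and the grouping helper are illustrative, not GShard's actual API):

```python
def build_send_buffers(tokens, routes, expert_to_gpu):
    """Group local tokens by destination GPU before the All-to-All.

    tokens: list of token ids; routes[i]: target expert of tokens[i].
    """
    buffers = {}
    for tok, expert in zip(tokens, routes):
        gpu = expert_to_gpu[expert]
        buffers.setdefault(gpu, []).append(tok)
    return buffers

expert_to_gpu = {"e1": "gpu_1", "eN": "gpu_N"}
# GPU 1's 9 tokens from Figure 1(b): 6 routed to e1, 3 routed to eN.
buffers = build_send_buffers(list(range(9)), ["e1"] * 6 + ["eN"] * 3,
                             expert_to_gpu)
```

In a real system the buffer sizes are first exchanged so that each rank knows how much to receive, and the reverse All-to-All uses the same grouping to return expert outputs to their original ranks.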

2.3. Problem Formulation

To model the training cost, we consider a single MoE layer, which consists of a gate network and a set of experts $\mathcal{E}$ ($e \in \mathcal{E}$). Given the current expert-to-device mapping $\mathcal{P}$ (where $(e,g) \in \mathcal{P}$ means GPU $g$ hosts expert $e$) and the assignment of tokens $\mathcal{I}$ (where $I_{e,g}$ is the number of tokens sent to expert $e$ on GPU $g$) at the current step, the training cost can be formulated as $T(\mathcal{I},\mathcal{P})$. Our objective is to minimize this training cost, which can be expressed as follows:

(5) $\min T(\mathcal{I},\mathcal{P})=\min\left(\max_{g\in\mathcal{G}}\sum_{e:(e,g)\in\mathcal{P}}\left\{T_{C}(I_{e,g})+T_{A2A}(I_{e,g})+T_{Sync}(\mathcal{P},e)\right\}\right)$

The first two terms represent the cost of expert computation and All-to-All communication respectively, which are determined by the assignment of tokens, i.e., Ie,gsubscript𝐼𝑒𝑔I_{e,g}. And the third term represents the cost of synchronization (TSyncsubscript𝑇𝑆𝑦𝑛𝑐T_{Sync}) because one expert may exist on multiple GPUs as data parallelism and needs to synchronize their gradients to maintain consistent. Each GPU sums up all of its local experts with regard to these three terms to obtain its execution time and the global training time T(,𝒫)𝑇𝒫T(\mathcal{I},\mathcal{P}) is the maximum execution time among these GPUs. In order to minimize the training cost while not modify the definitions of models, we design a dynamic expert management mechanism to adaptively adjust the expert-to-device mapping 𝒫𝒫\mathcal{P} and the assignment of tokens \mathcal{I} during training.

Figure 2. The top-5 accuracy of Swin-MoE under different balance loss coefficients, where we do not restrict the capacity of each expert, ensuring that no token is dropped.

Existing System-Friendly Optimizations. As discussed in Section 1, existing works address the workload imbalance problem by modifying the model definition (e.g., balance loss and capacity), which requires no changes to the underlying DL systems and is therefore system-friendly.

  • The balance loss depicts the imbalance level of $\mathcal{I}$ and is widely used in MoE training (Fedus et al., 2022; Lepikhin et al., 2021). Once the gate network produces an imbalanced token assignment, it is penalized by a large balance loss. In practice, the training process minimizes the weighted sum of the balance loss and the training loss. Thus, there exists a trade-off: emphasizing the balance loss drives the assignment $\mathcal{I}$ to be even but harms model quality since the training loss becomes higher, and vice versa.
  • The capacity is a threshold that limits the number of input tokens for each expert (i.e., $I_{e,g}$). Tokens that exceed the capacity are skipped by the expert and directly forwarded to the next layer through the residual connection. In short, the capacity upper-bounds the training cost of the heaviest expert to improve overall efficiency, but at the price of degraded model quality, since a certain amount of tokens cannot be fully trained.

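The capacity mechanism can be illustrated with a minimal sketch; the `expert_fn` below is a hypothetical expert function, not the paper's code:

```python
# A hedged sketch of capacity-based token dropping: tokens beyond an
# expert's capacity skip the expert computation and pass through
# unchanged via the residual connection.

def apply_expert_with_capacity(tokens, capacity, expert_fn):
    kept, dropped = tokens[:capacity], tokens[capacity:]
    # processed tokens: expert output added onto the residual path;
    # overflow tokens: forwarded as-is (residual only)
    return [x + expert_fn(x) for x in kept] + list(dropped)

out = apply_expert_with_capacity([1.0, 2.0, 3.0], capacity=2,
                                 expert_fn=lambda x: 10 * x)
# first two tokens are processed, the third is forwarded unchanged
```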
2.4. Observations and Opportunities

Observation 1: Limitations of the existing techniques. We first studied the effect of the existing techniques empirically, taking Swin-MoE (Liu et al., 2021) as an example, where we varied the balance loss coefficient and did not restrict the capacity of each expert so that no token was dropped. The results are summarized in Figure 2. By increasing the balance loss coefficient from 0 to 0.05, the GPU utilization improves from 18.77% to 63.30%, while the top-5 accuracy decreases from 94.588% to 93.981%. This demonstrates that enforcing workload balance by adjusting the assignment of tokens $\mathcal{I}$ inevitably leads to a trade-off between system efficiency and model quality.

Observation 2: Dynamic imbalanced workloads. We also recorded the trace of training GPT-MoE models (64 experts per MoE layer) and studied the loads of experts during training. The results are summarized in Figure 3, and we identified two key characteristics:

  • Skewness. As shown in Figure 3(a), we visualize the computational loads of experts as a cumulative distribution function (CDF), where we observe that the top-10 experts (10 out of 64) receive almost 75% of the tokens, leading to routing imbalance in MoE training. If the experts are evenly distributed among GPUs as in expert parallelism (Lepikhin et al., 2021), such imbalanced workloads result in severe resource under-utilization, as most experts need to wait for the slowest one.
  • Smoothness and continuousness. In Figure 3(b), we present the evolution of expert loads throughout the entire training process, where different intervals represent different experts. We observe that the load of each expert changes continuously during training (for example, from light to heavy, from heavy to light, or back and forth), which causes routing fluctuation when training MoE models. Fortunately, the load of each expert does not change dramatically within a short period of time, i.e., the change is smooth and continuous.
(a) CDF of expert loads
(b) The evolution of expert loads throughout the entire training process
Figure 3. Illustrations of expert loads. For a single step, we sort experts within each MoE layer based on their computational loads and visualize the corresponding cumulative distribution function (CDF) in Figure 3(a), where different colors indicate different MoE layers. For the entire training process, Figure 3(b) illustrates the dynamic changes in the load of each expert, where the different color areas represent the different experts.

Challenges. Motivated by these observations, our work explores how to develop a system that achieves workload balance by adjusting the expert-to-device mapping (i.e., $\mathcal{P}$) for better system efficiency while maintaining model quality. Moreover, the system should dynamically adapt to the varying routing distribution during training. However, since both token routing and expert loads are decided by the data-dependent gating mechanism, we cannot determine the optimal mapping ahead of execution. The main challenge lies in designing and modifying the expert-to-device mapping efficiently in the face of fast GPU computation and rapidly changing workloads. Another challenge is how to implement these irregular MoE operations efficiently.

Opportunities. Fortunately, as previously mentioned, the distribution changes smoothly and continuously. This means that the optimal expert-to-device mapping would not shift significantly in a short period of time. Therefore, it is feasible to refine the mapping based on the ad-hoc routing determination, without the need to predict the optimum for the next few training steps. Furthermore, the cost of computation and communication can be estimated before the actual execution. Based on the cost models, we can predict the benefits and overheads of different mapping candidates to find the best one.

Figure 4. FlexMoE System Architecture.

3. FlexMoE Design

3.1. Overview

The workflow of our FlexMoE is illustrated in Figure 4, where (1) input tokens are first processed by the gate network to determine their corresponding target experts and then (2) the router leverages a greedy algorithm to distribute tokens to different replicas of each expert. Finally, (3) each replica performs computations on its assigned tokens and delivers the output results back.

To handle the dynamic workload imbalance, we have designed a dynamic expert management mechanism that adaptively adjusts the expert-to-device mapping $\mathcal{P}$ and the assignment of tokens $\mathcal{I}$ during training. In addition to the regular workflow, we also introduce two new components: Scheduler and Policy Maker. (4) Scheduler monitors the real-time loads of experts and sends them to Policy Maker if the current imbalance metric (i.e., the balance ratio in Equation 6) exceeds the predetermined threshold. Based on the received loads and the current placement of experts, Policy Maker produces modification instructions and sends them back to Scheduler. Scheduler then interacts with the executor of current DL frameworks and triggers modifications of expert placement at runtime.

To cope with the large decision space of flexible expert-to-device mapping and its dynamic modifications, FlexMoE proposes the abstraction of vExpert to represent the minimum unit for scheduling, as detailed in Section 3.2. Moreover, FlexMoE decouples the expert placement modifications using three placement modification primitives, including Expand, Shrink and Migrate, as explained in Section 3.3. Finally, we design a greedy heuristic algorithm to generate a sequential combination of these modification primitives to produce the adjustment plan of expert-to-device mapping, as described in Section 3.4.

3.2. Dynamic Expert Management and vExpert

To tackle the optimization problem related to expert-to-device mapping and its corresponding token routing, we introduce a novel abstraction called vExpert, which defines the minimum unit for scheduling GPU computations for an expert and enables dynamic expert management. The vExpert abstraction helps in determining how to duplicate experts, when to increase or decrease replicas, and how to partition tokens between replicas.

vExpert has the following characteristics:

  • Each vExpert can be assigned as the replica of exactly one expert and processes part of the tokens for its master expert.
  • Each vExpert shares weights with the other vExperts of the same expert on the same GPU, i.e., identical vExperts on one GPU are packed together.
  • Workloads (i.e., tokens) are partitioned evenly among all vExperts of the same expert.

With this abstraction, we can significantly reduce the large search space of the optimization problem by making decisions at the vExpert level. Besides, we have designed three placement modification primitives to manage the dynamic expert management at the vExpert level, including Expand, Shrink and Migrate (see Section 3.3 for details). These primitives allow arbitrary modifications of expert placement by composing them in different ways. Moreover, because the amount of data in each iteration is fixed ($\sum B_{i}=B$), we can calculate the ideal capacity of each vExpert ($B/(G*E)$) as a reference for decision-making, where $B$ is the total number of input tokens, $G$ is the number of GPUs and $E$ is the number of vExperts on each device.
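As a quick sanity check, the ideal per-vExpert capacity $B/(G*E)$ is straightforward to compute:

```python
# The ideal per-vExpert capacity from Section 3.2: with B tokens per
# iteration, G GPUs, and E vExperts per device, a perfectly balanced
# schedule gives each vExpert B / (G * E) tokens.

def ideal_vexpert_capacity(total_tokens, num_gpus, vexperts_per_gpu):
    return total_tokens / (num_gpus * vexperts_per_gpu)

cap = ideal_vexpert_capacity(total_tokens=8192, num_gpus=8, vexperts_per_gpu=4)
# 8192 tokens over 8 * 4 = 32 vExperts -> 256 tokens each
```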

Input: \mathcal{I}: the assignment of tokens;
       \mathcal{P}: the expert-to-device mapping;
1  while is training do
2      balance_ratio ← balance(\mathcal{I}, \mathcal{P})   // Equation 6
3      while balance_ratio > threshold do
4          plan ← MakeSchedulingPlan(\mathcal{I}, \mathcal{P})   // Algorithm 2
5          if plan is empty then
6              break
7          \mathcal{P} ← Modify Expert Placement \mathcal{P} with plan
8          balance_ratio ← balance(\mathcal{I}, \mathcal{P})
9      \mathcal{P} ← Modify Expert Placement \mathcal{P} with Migrate
Algorithm 1 Scheduler: Monitor Real-Time Workloads

3.3. Scheduler

Figure 4 illustrates the role of the Scheduler in connecting the Runtime Executor and the Policy Maker in FlexMoE. Our design includes a metric, the balance ratio, which measures real-time workloads and determines when to trigger the Policy Maker for adjustment decisions. Additionally, we propose three atomic primitives for the Runtime Executor to use in scheduling experts. These primitives will be further explained in the following section.

Balance ratio. Owing to the synchronous execution mode of the MoE layer, the slowest GPU dominates the finish time of the all-to-all communication as well as the global training step. Thus, we design the balance ratio as in Equation 6, which sums the loads of all vExperts on each GPU and takes the maximum over all GPUs. While other metrics could be used to measure load imbalance, we compare the balance ratio in Equation 6 with variance as a metric and provide a detailed analysis of our findings in Section 5.3.

(6) $\texttt{balance}(\mathcal{I},\mathcal{P})=\dfrac{\max_{g\in\mathcal{G}}\sum_{e:(e,g)\in\mathcal{P}}I_{e,g}}{\sum_{(e,g)\in\mathcal{P}}I_{e,g}\,/\,|\mathcal{G}|}$
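Equation 6 translates directly into a small helper; the dictionary-based token bookkeeping below is a simplification for illustration:

```python
# Equation 6 as a function: the maximum per-GPU token load divided by
# the average per-GPU load. A value of 1.0 means perfectly balanced.

def balance(tokens, placement, num_gpus):
    """tokens[(e, g)] -> tokens for expert e on GPU g; placement is a
    set of (expert, gpu) pairs with GPUs numbered 0..num_gpus-1."""
    loads = [0] * num_gpus
    for (e, g) in placement:
        loads[g] += tokens.get((e, g), 0)
    avg = sum(loads) / num_gpus
    return max(loads) / avg

# GPU 0 carries 90 tokens, GPU 1 carries 30: max 90 over average 60.
ratio = balance({(0, 0): 90, (1, 1): 30}, {(0, 0), (1, 1)}, num_gpus=2)
```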

Expand. When an expert receives increasing workloads, its previously allocated vExpert resources may process them more slowly than others and thus become stragglers (over-utilized). In this scenario, FlexMoE calls the Expand primitive to allocate one extra vExpert resource for this expert at a time. The Expand primitive copies the expert parameters as well as the corresponding optimizer states from the source vExpert to the target vExpert, via parameter sharing for intra-GPU transfers and point-to-point communication for inter-GPU transfers.

Shrink. Conversely, when previously allocated vExpert resources are not fully utilized due to decreasing workloads (under-utilized), FlexMoE calls the Shrink primitive to release one vExpert resource at a time. The Shrink primitive is executed by marking a tag without any communication, and thus introduces no overhead.

Migrate. If an expert retains multiple replicas on different GPUs, training is slowed down by the extra communication (i.e., gradient all-reduce) needed for synchronization. The efficiency of this communication depends on the number of GPUs involved and their locality. To reduce the number of GPUs holding replicas of an expert and lower the synchronization cost, FlexMoE calls the Migrate primitive, which exchanges the model states between two vExperts.

We present the workflow of Scheduler in Algorithm 1, which monitors the real-time workloads, i.e., the assignment of tokens $\mathcal{I}$, and measures the balance ratio under the current placement $\mathcal{P}$ (Lines 1-2). If the current balance ratio exceeds the pre-defined threshold, Scheduler repeatedly asks the Policy Maker for modification primitives until no beneficial modification exists (Lines 3-8). Then, Scheduler turns to the Migrate operation to reduce the synchronization cost and continuously optimizes it in the background (Line 9). Intuitively, the vExpert-based Expand and Shrink primitives tackle the complicated expert-device mapping problem under dynamically changing workloads, and the Migrate primitive continuously optimizes the placement of replicas for each expert.

Input: \mathcal{I}: the assignment of tokens;
       \mathcal{P}: the expert-to-device mapping;
1   Function MakeSchedulingPlan(\mathcal{I}, \mathcal{P}):
2       Estimate time cost t_0 with \mathcal{I}, \mathcal{P} by Equation 5
3       for e ∈ \mathcal{E} do
4           n_e ← number of vExperts allocated for e
5           cap_e ← \mathcal{I}_e / n_e   // capacity of vExpert for e
6       e_0, e_1 ← argmax_{e∈\mathcal{E}}{cap_e}, argmin_{e∈\mathcal{E}}{cap_e}
7       Estimate time cost t_1 after expanding e_0 and shrinking e_1
8       if t_1 < t_0 then
9           return {(Expand, e_0), (Shrink, e_1)}
10      return { }
Algorithm 2 Policy Maker: vExpert-Based Scheduling

3.4. Policy Maker

Using the modification primitives described above, FlexMoE employs an efficient vExpert-based scheduling algorithm for generating sequential modification operations, as shown in Algorithm 2. Specifically, we leverage a cost-model driven search planning approach, which makes decisions based on feedback from simulating the training time.

To model the training cost of an MoE layer, we decompose it into three parts as in Equation 5: the computation cost $T_{C}$, the All-To-All communication cost $T_{A2A}$ and the expert synchronization cost $T_{Sync}$. For each part, we build cost models that take into account input variables (e.g., $\mathcal{I}$: the assignment of tokens and $\mathcal{P}$: the expert-to-device mapping) as well as environmental variables (e.g., TPS: tokens-per-second for an expert). Leveraging a profiling-based approach, we first profile each function's running time under different input sizes and then estimate the corresponding environmental variables. Besides, we also consider the cost of expert adjustments, which can be executed concurrently with model training.

Computation Cost. The computation cost refers to the time taken for an expert to perform its forward and backward computation during training, which can be formulated as:

(7) $T_{C}(I_{e,g}) = \dfrac{I_{e,g}}{TPS}$

where $I_{e,g}$ represents the number of input tokens received by expert $e$ on GPU $g$, and $TPS$ represents the throughput (tokens per second) of the given GPU when computing an expert, which is obtained from profiling. The computation cost is proportional to the number of input tokens received by an expert, and inversely proportional to the throughput of the GPU. In other words, the more input tokens an expert receives, the longer the forward and backward passes take, and the slower the GPU, the longer it takes to process each token.

All-To-All Cost. The All-To-All operation is used to send tokens to their target experts and to send back the results after processing, and is therefore invoked four times in each training step. We predict the All-To-All cost with a topology-aware model as in Equation 8, where the expert $e$ on GPU $g$ has received $I_{e,g}.count(g')$ tokens from GPU $g'$ and $Bw_{g,g'}$ is the profiled bandwidth between GPU $g$ and GPU $g'$. Our cost model considers the intra-node bandwidth (e.g., PCIe, NVLink) and the inter-node bandwidth (e.g., IB, NIC) separately.

(8) $T_{A2A}(I_{e,g})=4*\sum_{g'\in\mathcal{G}}\dfrac{I_{e,g}.count(g')}{Bw_{g,g'}}$

Synchronization Cost. When expert $e$ holds multiple replicas on different GPUs, their gradients must be synchronized via AllReduce communication, whose cost is determined by the message size, the number of involved devices and the network connecting them. We enumerate different device groups and profile them before training to get their BPS (bytes-per-second). We predict the synchronization cost for expert $e$ by finding the device group that holds $e$ (i.e., $\mathcal{P}.index(e)$) and then obtaining its corresponding $BPS$ from the profiling data. $size(e.gradients)$ represents the size of the gradients of an expert.

(9) $T_{Sync}(\mathcal{P},e)=\dfrac{size(e.gradients)}{BPS(\mathcal{P}.index(e))}$

Expert Adjustment Cost. FlexMoE proposes three modification primitives: Expand, Shrink and Migrate. The Expand and Migrate primitives involve transferring model states from the source GPU to the target GPU via NCCL point-to-point communication. In contrast, the Shrink primitive is executed with no overhead by marking a tag. We predict the cost of transferring model states as $\dfrac{size(e.model\_states)}{Bw_{g,g'}}$.
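The four cost terms can be sketched as simple functions; the TPS, bandwidth, and BPS values below are hypothetical profiling results used only for illustration:

```python
# A sketch of the profiled cost models (Equations 7-9) plus the expert
# adjustment cost. All constants are hypothetical profiling outputs.

TPS = 1.0e6                     # tokens per second for one expert

def t_compute(n_tokens):        # Equation 7
    return n_tokens / TPS

def t_all_to_all(per_src_tokens, bw):   # Equation 8: 4 calls per step
    """per_src_tokens[g'] -> tokens received from GPU g';
    bw[g'] -> profiled bandwidth to g'."""
    return 4 * sum(n / bw[src] for src, n in per_src_tokens.items())

def t_sync(grad_bytes, bps):    # Equation 9: AllReduce over replica group
    return grad_bytes / bps

def t_adjust(state_bytes, bw):  # Expand/Migrate transfer; Shrink is free
    return state_bytes / bw

comp = t_compute(2000)
a2a = t_all_to_all({"g1": 1000, "g2": 500}, {"g1": 1e6, "g2": 5e5})
```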

The scheduling policy of Policy Maker is summarized in Algorithm 2. First, the algorithm computes the time cost $t_0$ of the current placement as the baseline (Line 2), and then finds the expert with the maximum per-vExpert workload and the expert with the minimum per-vExpert workload (Lines 3-6). After that, Policy Maker estimates the time cost $t_1$ after applying the Expand and Shrink primitives (Line 7) and decides whether to return the modification by comparing $t_0$ and $t_1$ (Lines 8-10).
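The greedy step of Algorithm 2 can be sketched as follows, with a toy `estimate_cost` standing in for the Equation 5 cost model:

```python
# A hedged sketch of the Policy Maker's greedy step: pick the expert with
# the largest per-vExpert load to Expand and the one with the smallest to
# Shrink, and return the pair only if the estimated cost drops.

def make_scheduling_plan(loads, n_vexperts, estimate_cost):
    """loads[e]: tokens for expert e; n_vexperts[e]: replicas of e."""
    t0 = estimate_cost(loads, n_vexperts)
    cap = {e: loads[e] / n_vexperts[e] for e in loads}
    e0 = max(cap, key=cap.get)            # most overloaded -> Expand
    e1 = min(cap, key=cap.get)            # most underloaded -> Shrink
    trial = dict(n_vexperts)
    trial[e0] += 1
    trial[e1] -= 1
    if trial[e1] >= 1 and estimate_cost(loads, trial) < t0:
        return [("Expand", e0), ("Shrink", e1)]
    return []

# Toy cost model: the busiest vExpert bounds the step time.
cost = lambda loads, nv: max(loads[e] / nv[e] for e in loads)
plan = make_scheduling_plan({"a": 900, "b": 100}, {"a": 1, "b": 2}, cost)
```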

Input: \mathcal{I}: Expert Workloads; \mathcal{P}: Expert Placement; g: Current GPU;
Output: r: Routing plans
1   /* iterate over all experts */
2   for e ∈ \mathcal{E} do
3       n_e ← number of vExperts allocated for e
4       cap_e ← \mathcal{I}_e / n_e   // capacity of vExpert for e
5       r_{e,g} ← min(cap_e × n_{e,g}, \mathcal{I}_{e,g})   // locality first
6       s_e ← \mathcal{I}_{e,g} − r_{e,g}   // tokens for other GPUs
7       c_{e,g} ← r_{e,g} − \mathcal{I}_{e,g}   // local available capacity
8       for g' ∈ \mathcal{G} − {g} do
9           c_{e,g'} ← min(cap_e × n_{e,g'}, \mathcal{I}_{e,g'}) − \mathcal{I}_{e,g'}
10          r_{e,g'} ← s_e × c_{e,g'} / ∑ c_e   // proportional to availability
11  return r ← {r_e : e ∈ \mathcal{E}}
Algorithm 3 Flexible Token Routing

4. Implementation

We have implemented the proposed mechanisms and algorithms on top of PyTorch (Paszke et al., 2019) by adding new customized operators and CUDA kernels. FlexMoE is also part of a novel distributed DL system, Hetu (Miao et al., 2023, 2021, 2022a). To schedule dynamic dataflow more efficiently, FlexMoE incorporates the following system-level optimizations:

Flexible Token Routing. The router must efficiently transfer input tokens from multiple devices (e.g., GPUs) to their target experts according to the complicated expert-to-device mapping $\mathcal{P}$. To accomplish this, FlexMoE designs a greedy policy (Algorithm 3), which prefers to route tokens to the local GPU and then scatters the remaining tokens to other GPUs in proportion to their available capacity. FlexMoE also implements an efficient expert-wise layout transformation to arrange the inputs in a contiguous space for each expert.
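A simplified reading of this locality-first policy for a single expert might look like the following sketch; the availability computation is condensed relative to Algorithm 3, so treat it as an illustration rather than the paper's router:

```python
# Locality-first routing for one expert: the local GPU keeps tokens up to
# its allocated capacity, and the overflow is split across other GPUs in
# proportion to their spare capacity for this expert.

def route_expert_tokens(local_gpu, demand, cap_per_vexpert, n_vexperts):
    """demand[g]: tokens produced on GPU g for this expert;
    n_vexperts[g]: replicas of this expert on GPU g."""
    alloc = {g: cap_per_vexpert * n_vexperts[g] for g in n_vexperts}
    routed = {g: 0 for g in n_vexperts}
    routed[local_gpu] = min(alloc[local_gpu], demand[local_gpu])  # locality first
    spill = demand[local_gpu] - routed[local_gpu]
    spare = {g: max(alloc[g] - demand[g], 0)
             for g in n_vexperts if g != local_gpu}
    total_spare = sum(spare.values()) or 1
    for g in spare:                      # proportional to availability
        routed[g] = spill * spare[g] / total_spare
    return routed

# GPU 0 produces 120 tokens but only has capacity for 100; the 20-token
# spill is shared evenly by GPUs 1 and 2, which have equal spare capacity.
r = route_expert_tokens(0, {0: 120, 1: 20, 2: 20},
                        cap_per_vexpert=50, n_vexperts={0: 2, 1: 1, 2: 1})
```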

Paralleled Operation Modification. FlexMoE employs a queue to sequentially insert the modification primitives triggered by the scheduler, such as Expand, Shrink and Migrate. To reduce the time cost of adjustments and kernel launches, we merge several consecutive and parallelizable operations and run them concurrently. For example, if two operations share the same source and destination, they can be merged to increase the message size and improve bandwidth utilization. Meanwhile, if they share neither source nor destination, they can be executed concurrently to improve cluster utilization.
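The two merging rules can be sketched as follows; representing an operation as a `(src, dst, nbytes)` tuple is an assumption for illustration:

```python
# A hedged sketch of merging placement-modification operations: transfers
# with the same (src, dst) pair are coalesced into one larger message, and
# transfers touching disjoint GPUs are batched to run concurrently.

def merge_ops(queue):
    """queue: list of (src, dst, nbytes) transfers, in trigger order."""
    merged, order = {}, []
    for src, dst, nbytes in queue:
        key = (src, dst)
        if key not in merged:
            merged[key] = 0
            order.append(key)
        merged[key] += nbytes           # same route -> one bigger message
    return [(s, d, merged[(s, d)]) for s, d in order]

def concurrent_groups(ops):
    """Greedily batch ops whose GPUs are disjoint so they can overlap."""
    groups = []
    for src, dst, nbytes in ops:
        for group, busy in groups:
            if src not in busy and dst not in busy:
                group.append((src, dst, nbytes))
                busy.update((src, dst))
                break
        else:
            groups.append(([(src, dst, nbytes)], {src, dst}))
    return [g for g, _ in groups]

ops = merge_ops([(0, 1, 4), (0, 1, 4), (2, 3, 8)])
groups = concurrent_groups(ops)
```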

AllReduce Coordination. When the vExperts of a single GPU are assigned to different experts, it must invoke the synchronization separately for each expert, which may cause deadlock since the order of calls can be inconsistent across GPUs (jea, 2017). To avoid this deadlock, we assign a logical id to each expert; the logical id of each replica (vExpert) is the same as that of its main expert. Each GPU then invokes synchronizations in ascending order of the experts' logical ids.
AllReduce 协调。当单个 GPU 的 vExperts 被分配给不同的专家时,它应该为每个专家单独调用同步,这可能会导致死锁,因为不同 GPU 的调用顺序不一致(jea, 2017)。为了避免这个死锁问题,我们为每个专家分配一个逻辑 ID,每个副本(vExpert)的逻辑 ID 与其主要专家相同。然后,每个 GPU 按照专家逻辑 ID 的升序调用同步。
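The ordering rule is simple but worth making concrete. In this minimal sketch (the dict layout is illustrative), two GPUs hold the same pair of experts in different local orders, yet after sorting by logical id both issue their synchronizations in the same global order, which is what prevents the deadlock:

```python
def sync_order(vexperts):
    """Deadlock-avoidance sketch: every replica carries the logical id of its
    main expert, and each GPU issues its gradient synchronizations in
    ascending logical-id order, so the call sequence agrees across GPUs."""
    return sorted(vexperts, key=lambda v: v["logical_id"])

# Two GPUs hold the same experts but in different local orders.
gpu0 = [{"name": "v0", "logical_id": 3}, {"name": "v1", "logical_id": 1}]
gpu1 = [{"name": "v2", "logical_id": 1}, {"name": "v3", "logical_id": 3}]
# After sorting, both GPUs synchronize expert 1 first, then expert 3.
order0 = [v["logical_id"] for v in sync_order(gpu0)]
order1 = [v["logical_id"] for v in sync_order(gpu1)]
```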

Best-Effort Adjustment. Since placement modifications may block the training process, and the workloads are dynamic, it is not always clear whether executing the current modification will be beneficial. To address this problem, FlexMoE leverages a separate CUDA stream to conduct adjustments concurrently using the available network bandwidth, and adopts best-effort adjustment to avoid hindering the training process: the Scheduler interacts with the DL executor to determine whether to execute the first operation in the candidate queue.
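One way to read this interaction, sketched below under assumptions not spelled out in the text (the slack-based admission test and both parameter names are illustrative), is that the scheduler only dequeues the head operation when its estimated cost can be hidden behind the current iteration; otherwise the operation simply waits for a later chance:

```python
def maybe_adjust(candidates, iter_slack_ms, est_cost_ms):
    """Best-effort sketch (illustrative): peek at the head of the candidate
    queue and execute it only if its estimated cost fits within the idle
    time (`iter_slack_ms`) that can be overlapped with the current training
    iteration on the side CUDA stream; otherwise defer it."""
    if candidates and est_cost_ms(candidates[0]) <= iter_slack_ms:
        return candidates.pop(0)  # run this op concurrently with training
    return None  # defer: never block the training process
```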

NCCL Group Management. FlexMoE adopts the NCCL (jea, 2017) library to perform collective communication among GPUs, and multiple NCCL groups are required to execute the complicated synchronization of experts. However, the number of live NCCL groups is limited, and it is inefficient to destroy a group immediately after each use. FlexMoE therefore employs a Least Recently Used (LRU) cache to maintain NCCL groups, reducing the cost of group creation and destruction.
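The LRU policy can be sketched with an `OrderedDict`; the `create`/`destroy` callbacks stand in for NCCL communicator setup and teardown, which are not shown here. Keeping recently used groups alive amortizes the expensive creation while respecting the live-group limit:

```python
from collections import OrderedDict

class GroupCache:
    """LRU sketch for communicator groups (illustrative stand-in for
    FlexMoE's NCCL group cache). A group is keyed by its participant ranks;
    the least recently used group is destroyed when the capacity is hit."""
    def __init__(self, capacity, create, destroy):
        self.capacity, self.create, self.destroy = capacity, create, destroy
        self.groups = OrderedDict()  # frozenset of ranks -> group handle

    def get(self, ranks):
        key = frozenset(ranks)
        if key in self.groups:
            self.groups.move_to_end(key)  # mark as recently used
        else:
            if len(self.groups) >= self.capacity:
                _, victim = self.groups.popitem(last=False)  # evict LRU
                self.destroy(victim)
            self.groups[key] = self.create(ranks)
        return self.groups[key]
```

A cache hit costs only a dictionary lookup; a miss pays the creation cost once and possibly one destruction, instead of paying both on every synchronization.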

Table 1. Models for Evaluation.

Model        Params.   #Layer   d_Model   d_FFN   #Expert
BERT-MoE-S   0.988B    12       768       3072    32
BERT-MoE-L   6.69B     24       1024      4096    64
GPT-MoE-S    0.988B    12       768       3072    32
GPT-MoE-L    39B       24       2048      8192    64
Swin-MoE-S   946M      24       -         -       32
Swin-MoE-L   1.83B     24       -         -       64

5. Evaluation

In this section, we present the detailed evaluation results to demonstrate the effectiveness and scalability of FlexMoE.

5.1. Experimental Setup

Machine environment. We conduct experiments on Azure VMs (azu, 2014), each equipped with 192-core AMD CPUs and 8 NVIDIA Ampere A100 GPUs. The GPUs within a node are connected via NVLink 3.0, and the servers are connected via 8 InfiniBand NICs (8×200 Gbps in total). RDMA is used by default, and the PyTorch version is 1.11.

Baselines. We compare FlexMoE with other popular frameworks, including DeepSpeed (Rajbhandari et al., 2022) and FasterMoE (He et al., 2022). DeepSpeed uses expert parallelism, which was first proposed by GShard (Lepikhin et al., 2021). FasterMoE proposes a shadowing strategy that replicates popular experts across all GPUs.

Benchmarks and datasets. Our evaluations are conducted by scaling representative transformer models in different application domains with the MoE architecture, including BERT (Devlin et al., 2019) and GPT (Radford et al., 2019) in NLP and Swin (Liu et al., 2021) in CV, as shown in Table 1. We adopt the Top-2 gate for each model, as in widely used sparse MoE models (GShard (Lepikhin et al., 2021), GLaM (Du et al., 2022), V-MoE (Riquelme et al., 2021)), and set the capacity factor to 1.0 for each expert and the balance loss weight to 0.001. We pretrain BERT-MoE with the masked language modeling (MLM) and next-sentence-prediction (NSP) tasks (Devlin et al., 2019), GPT-MoE with the language modeling (LM) task (Radford et al., 2019), and Swin-MoE with the image classification task (Liu et al., 2021; Hwang et al., 2022). The NLP models are trained on Wikipedia (wik, 2014) and the vision models on ImageNet-1K (ima, 2014). All hyper-parameters (e.g., learning rate) are kept fixed for the same model.

Table 2. Comparison on model quality.

            Masked LM (PPL ↓)        Language Modeling (PPL ↓)   Image Classification (acc@1 / acc@5 ↑)
            BERT-MoE-S  BERT-MoE-L   GPT-MoE-S  GPT-MoE-L        Swin-MoE-S        Swin-MoE-L
DeepSpeed   3.53        3.31         12.2       10.71            77.316 / 93.838   77.022 / 93.642
FlexMoE     3.14        3.07         11.72      10.47            77.754 / 94.042   77.109 / 93.663
Figure 5. Comparison on system efficiency: (a) 32 GPUs; (b) 64 GPUs.

5.2. Overall Performance

To evaluate the end-to-end efficiency of FlexMoE, we compare it with DeepSpeed and FasterMoE. For DeepSpeed, we use the traditional expert parallelism approach and set the number of experts per GPU to 1 in every MoE layer. We evaluate two model widths, X-MoE-S with 32 experts and X-MoE-L with 64 experts, on 32 GPUs and 64 GPUs, respectively.

Model Quality. To demonstrate the importance of not dropping tokens, we compare the model quality of FlexMoE with DeepSpeed on various models and tasks. We use validation perplexity (lower is better) for the language models (e.g., BERT-MoE and GPT-MoE) and top-1/top-5 ImageNet-1K accuracy for the vision models (e.g., Swin-MoE). As shown in Table 2, FlexMoE outperforms DeepSpeed in almost all tasks, indicating that the limited capacity in existing MoE systems (e.g., DeepSpeed) can lead to model quality degradation. Considering the training cost, the Swin-MoE models (scaled from the Swin-B model) are trained for 100 epochs (fewer than the 300 epochs in the standard configuration), so the accuracy is slightly below the benchmark (94.04% top-5 accuracy for FlexMoE v.s. 96.5% for the benchmark); we believe this suffices to show the benefits of not dropping tokens. Moreover, Swin-MoE-L performs slightly worse than Swin-MoE-S because the training dataset is small, which may lead to overfitting.

System Efficiency. We also evaluate FlexMoE against other SOTA systems on efficiency. To measure efficiency, we record the training time required to achieve the target model quality; the results are illustrated in Figure 5. Our experiments show that FlexMoE achieves the best training efficiency, outperforming DeepSpeed by 1.70× on average and up to 2.10×, and FasterMoE by 1.30× on average and up to 1.45×. As mentioned above, although DeepSpeed obtains the smallest iteration time thanks to its limited capacity, it drops tokens to skip the expert network and thus requires more iterations to converge. FasterMoE proposes a dynamic shadowing strategy for load balance, which replicates the popular expert on all GPUs. However, due to its coarse-grained expert management (i.e., an expert resides on either 1 GPU or all GPUs), it falls back to a sub-optimal solution. As the number of GPUs increases, FasterMoE also suffers from the global synchronization of expert replicas.

In addition to the expert networks, model training time also consists of other parts, such as non-expert computation, optimizer updates, and communication. As FlexMoE only optimizes the execution of the expert networks, we mainly focus on analyzing the MoE layer in the following sections.

5.3. Ablation Study

To verify the effectiveness of FlexMoE, we conduct several ablation studies, as demonstrated below.

Figure 6. Study of different metrics, scheduling policies, and cost models: (a) different metrics; (b) different scheduling policies; (c) cost models.

Different Metrics. FlexMoE utilizes the balance ratio as the key metric to trigger adjustments. To investigate the impact of alternative metrics, we also study using the Variance, formulated as $\sum_{g\in\mathcal{G}}(I_{g}-\overline{I})^{2}/|\mathcal{G}|$, in addition to the Max (ours) ratio given by Equation 6. The results are presented in Figure 6(a) and show that Max (ours) outperforms Variance by 1.03× on average and up to 1.13× on Swin-MoE-L. Meanwhile, Variance also performs well on BERT-MoE-S and GPT-MoE-S. Because the iteration time is dominated by the slowest expert, Max (ours) is a simple but effective metric. Variance takes the global workload distribution into consideration, which triggers adjustment more frequently but often receives empty operations from the Policy Maker, as it is not always correlated with the actual training time.
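The two candidate metrics can be sketched directly from the definitions above. The Variance follows the formula in the text; Equation 6 is not reproduced in this excerpt, so the peak-over-average form of the Max metric below is an assumption, motivated by the observation that the slowest (most loaded) GPU bounds the iteration time:

```python
def variance_metric(loads):
    """Variance metric from the text: mean squared deviation of each GPU's
    workload I_g from the average workload."""
    mean = sum(loads) / len(loads)
    return sum((x - mean) ** 2 for x in loads) / len(loads)

def max_metric(loads):
    """Assumed form of the Max (balance-ratio) metric: peak load over average
    load. A perfectly balanced placement yields 1.0; larger values mean the
    slowest GPU is further from the mean and adjustment is more urgent."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean
```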

Different Scheduling Policies. FlexMoE dynamically triggers adjustments according to the pre-defined balance ratio. We also conduct experiments with static scheduling policies, which trigger adjustments at fixed intervals and execute them completely before training continues. As shown in Figure 6(b), our dynamic scheduling outperforms both the large and small intervals because of the dynamically changing workloads. Specifically, a small interval triggers adjustments frequently and thus introduces more adjustment costs, while a large interval cannot handle the dynamic workloads well because it cannot make adjustments in time. Our dynamic policy decides on adjustments based on the current workloads, which suits the dynamic workloads.

Evaluation on Cost Models. Figure 6(c) demonstrates the effectiveness of our cost models, where we compare the estimated cost to the real cost over different input sizes for computation, all-to-all, and all-reduce, respectively. Our estimates closely match the real execution costs for all experimental models, with an average prediction error below 3%.
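A common way to build such per-operator cost models is to fit a cost curve from profiled (input size, measured time) samples; the linear alpha-beta form below is an assumption for illustration, as the excerpt does not specify the paper's exact model family:

```python
def fit_linear_cost_model(samples):
    """Illustrative alpha-beta cost model: cost(n) ~= alpha + beta * n,
    fitted by ordinary least squares over profiled (input_size, time_ms)
    samples. One such model would be fitted per operator class
    (compute / all-to-all / all-reduce)."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    alpha = (sy - beta * sx) / n
    return lambda size: alpha + beta * size
```

The fitted model can then estimate the cost of a placement modification before deciding whether to execute it.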

5.4. Token Efficiency and Expert Efficiency

In this section, we analyze both token efficiency and expert efficiency for different MoE training methods over the whole training process. Token efficiency refers to the fraction of input tokens that are processed by the expert network, and expert efficiency refers to the fraction of GPU computation that is meaningful. 100% on both metrics is the ideal status of an MoE layer, shown as the red flag in Figure 7(a). The traditional expert-parallel method (e.g., DeepSpeed) obtains low token efficiency and expert efficiency, as it drops tokens beyond capacity for load balance. SWIPE, proposed by BaGuaLu (Ma et al., 2022), improves expert efficiency by modifying the gating algorithm to re-assign inputs to other experts for strict load balance. However, this approach changes the relations between tokens and experts and thus leads to low token efficiency. FasterMoE (He et al., 2022) replicates hot experts on each GPU and guarantees no token dropping in its implementation. However, it does not take load balance as a design goal and obtains low expert efficiency. Our system, FlexMoE, guarantees 100% token efficiency and optimizes the allocation of computation resources to balance the workloads among GPUs, and is the closest to the ideal. As training progresses, the imbalanced workloads improve due to the penalty of the balance loss, and all methods move toward better efficiency.
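The two metrics can be made concrete with per-expert token counts. These formulas are inferred from the definitions above rather than taken verbatim from the paper: tokens beyond an expert's capacity are dropped (hurting token efficiency), while unfilled capacity slots are padded and wasted (hurting expert efficiency).

```python
def token_efficiency(routed, capacity):
    """Fraction of routed tokens actually processed rather than dropped;
    `routed` and `capacity` are per-expert token counts."""
    processed = sum(min(r, c) for r, c in zip(routed, capacity))
    return processed / sum(routed)

def expert_efficiency(routed, capacity):
    """Fraction of provisioned expert capacity doing meaningful work
    (unfilled slots are padded computation)."""
    processed = sum(min(r, c) for r, c in zip(routed, capacity))
    return processed / sum(capacity)
```

For instance, with two experts of capacity 4 each and an imbalanced routing of 6 and 2 tokens, both efficiencies drop to 0.75; a balanced routing of 4 and 4 restores both to 1.0.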

Figure 7. (a) Trends of token efficiency and expert efficiency for different methods during training. (b) Scalability of different methods.

5.5. Scalability

We also evaluate the scalability of FlexMoE on 8, 16, 32 and 64 GPUs, using a single MoE layer with 64 experts. The results are presented in Figure 7(b), normalized to the throughput of DeepSpeed on 8 GPUs. FlexMoE significantly outperforms DeepSpeed and FasterMoE. Because the experiments are conducted on a high-speed interconnected cluster (8×300 Gbps intra-node and 8×200 Gbps inter-node), balanced computation among GPUs plays a key role, which is exactly what FlexMoE targets.

6. Conclusion

In this paper, we presented FlexMoE, a novel solution to address the dynamic imbalance challenges encountered during the training of large-scale MoE models. By integrating a scheduling module on top of existing DNN frameworks, FlexMoE monitors data traffic, creates scheduling plans, and dynamically adjusts the expert-to-device mapping during training. Our empirical results on six popular MoE models demonstrate that FlexMoE outperforms DeepSpeed by an average of 1.70× and up to 2.10×, while also surpassing FasterMoE by an average of 1.30× and up to 1.45×.

7. Acknowledgments

This work is supported by the National Key Research and Development Program of China (No. 2020AAA0105200), the National Natural Science Foundation of China (No. 61832001 and U22B2037) and PKU-Tencent joint research Lab. Bin Cui is the corresponding author.

References

  • (1)
  • azu (2014) 2014. Azure A100 VMs. https://docs.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series.
    2014 年。Azure A100 虚拟机。https://docs.microsoft.com/zh-cn/azure/virtual-machines/nda100-v4-series。
  • ima (2014) 2014. ImageNet-1K Dataset. https://huggingface.co/datasets/imagenet-1k.
    2014 年。ImageNet-1K 数据集。https://huggingface.co/datasets/imagenet-1k。
  • wik (2014) wik(2014) 2014. Wikipedia Dataset. https://huggingface.co/datasets/wikipedia.
    2014 年。维基百科数据集。https://huggingface.co/datasets/wikipedia。
  • jea (2017) jea(2017) 2017. NCCL 2.0. https://github.com/NVIDIA/nccl.
    2017 年。NCCL 2.0。https://github.com/NVIDIA/nccl。
  • qa_(2021) 2021 年,AI2 推理挑战(ARC)。https://leaderboard.allenai.org/arc/submissions/public 2021. The AI2 Reasoning Challenge (ARC). https://leaderboard.allenai.org/arc/submissions/public.
    布朗等人(2020 年)
  • Brown et al. (2020) 汤姆·布朗、本杰明·曼恩、尼克·赖德、梅兰妮·苏比亚、贾里德·D·卡普兰、普拉富拉·达里瓦尔、阿尔温德·尼拉坎坦、普拉纳夫·夏姆、吉里什·萨斯特里、阿曼达·阿斯克尔等。2020 年。语言模型是少数次学习者。在 2020 年神经信息处理系统大会(卷 33)上的进展,1877-1901。 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Dai et al. (2022) 戴等人(2022) Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. StableMoE: Stable Routing Strategy for Mixture of Experts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7085–7095.
    戴大麦,董立,马顺明,郑波,隋志芳,常宝宝,魏福如。2022。StableMoE:专家混合的稳定路由策略。在第 60 届计算语言学协会年会论文集中(第 1 卷:长篇论文)。7085-7095。
  • Davoudian et al. (2021) 达沃迪安等人(2021) Ali Davoudian, Liu Chen, Hongwei Tu, and Mengchi Liu. 2021. A workload-adaptive streaming partitioner for distributed graph stores. Data Science and Engineering 6 (2021), 163–179.
    阿里·达沃迪安,陈柳,涂宏伟,刘梦驰。2021。一种适应工作负载的分布式图存储流式分区器。数据科学与工程 6(2021),163-179。
  • Devlin et al. (2019) Devlin 等人(2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, 和 Kristina Toutanova. 2019. BERT:深度双向变换器的预训练用于语言理解。在 2019 年北美计算语言学会人类语言技术会议论文集中,第 1 卷(长篇和短篇论文)。4171-4186。
  • Dosovitskiy et al. (2021)
    Dosovitskiy 等人(2021)
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, 和 Neil Houlsby. 2021. 一张图片值得 16x16 个词:在大规模图像识别中的变换器。在国际学习表示会议中。
  • Du et al. (2022) 杜等人(2022) Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning. PMLR, 5547–5569.
    杜楠、黄艳萍、戴安德鲁·M、童思敏、列皮欣·德米特里、徐远中、克里昆·马克西姆、周延琦、余伟亚当斯、费拉特·奥尔汗等。2022 年。Glam: 使用专家混合模型高效扩展语言模型。在国际机器学习会议上。PMLR,5547-5569。
  • Fedus et al. (2022) 费杜斯等人(2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.
    威廉·费杜斯、巴雷特·佐夫和诺姆·沙泽尔。2022 年。Switch Transformers:通过简单高效的稀疏性扩展至万亿参数模型。机器学习研究杂志第 23 卷,第 120 期(2022 年),1-39。
  • Fu et al. (2020) 傅方程等人(2020) Fangcheng Fu, Yuzheng Hu, Yihan He, Jiawei Jiang, Yingxia Shao, Ce Zhang, and Bin Cui. 2020. Don’t waste your bits! squeeze activations and gradients for deep neural networks via tinyscript. In International Conference on Machine Learning. PMLR, 3304–3314.
    傅方程、胡宇正、何一涵、蒋佳伟、邵颖夏、张策、崔斌。2020。别浪费你的比特!通过 TinyScript 压缩深度神经网络的激活值和梯度。在国际机器学习会议上。PMLR,3304-3314。
  • Fu et al. (2019) 傅方程等人(2019) Fangcheng Fu, Jiawei Jiang, Yingxia Shao, and Bin Cui. 2019. An Experimental Evaluation of Large Scale GBDT Systems. Proc. VLDB Endow. 12, 11 (2019), 1357–1370.
    傅方程、蒋佳伟、邵颖夏、崔斌。2019。大规模 GBDT 系统的实验评估。Proc. VLDB Endow. 12, 11(2019),1357-1370。
  • Fu et al. (2022) 傅等人(2022) Fangcheng Fu, Xupeng Miao, Jiawei Jiang, Huanran Xue, and Bin Cui. 2022. Towards Communication-efficient Vertical Federated Learning Training via Cache-enabled Local Update. Proc. VLDB Endow. 15, 10 (2022), 2111–2120.
    傅方程、苗旭鹏、蒋佳伟、薛焕然、崔斌。2022。通过缓存支持的本地更新实现通信高效的垂直联邦学习训练。《VLDB 终端处理进程》第 15 卷,第 10 期(2022 年),2111-2120 页。
  • Ge et al. (2021) 葛等人(2021) Jia-Ke Ge, Yan-Feng Chai, and Yun-Peng Chai. 2021. WATuning: a workload-aware tuning system with attention-based deep reinforcement learning. Journal of Computer Science and Technology 36, 4 (2021), 741–761.
    葛家珂、柴彦峰、柴云鹏。2021。WATuning:一种基于注意力的深度强化学习的工作负载感知调优系统。《计算机科学与技术杂志》第 36 卷,第 4 期(2021 年),741-761 页。
  • He et al. (2022) 何佳澳, 翟继东, Tiago Antunes, 王浩杰, 罗富文, 施尚峰, 李钦. 2022. FasterMoE: 大规模动态预训练模型的建模与优化训练. 在第 27 届 ACM SIGPLAN 并行编程原理与实践研讨会论文集中. 120-134. Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. 2022. FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 120–134.
    黄延平, 程有龙, Ankur Bapna, Orhan Firat, 陈德豪, 陈妙, 李厚勇, Ngiam Jiquan, Quoc V Le, 吴永辉, 等. 2019. Gpipe: 使用管道并行性高效训练巨型神经网络. 神经信息处理系统进展 32 (2019), 103-112.
  • Huang et al. (2019) Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019), 103–112.
  • Hwang et al. (2022) 黄等人(2022) Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. 2022. Tutel: Adaptive Mixture-of-Experts at Scale. CoRR abs/2206.03382 (2022).
    黄昌浩,崔巍,熊一帆,杨子越,刘泽,胡涵,王子龙,拉斐尔·萨拉斯,吉辛·乔斯,普拉巴特·拉姆,乔·周,程鹏,杨帆,杨茂,熊永强。2022。Tutel:大规模自适应专家混合模型。CoRR abs/2206.03382(2022)。
  • Kaplan et al. (2020) 卡普兰等人(2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. CoRR abs/2001.08361 (2020).
    贾里德·卡普兰,萨姆·麦坎德利什,汤姆·海尼根,汤姆·B·布朗,本杰明·切斯,雷文·查尔德,斯科特·格雷,亚历克·拉德福德,吴杰弗里,达里奥·阿莫迪。2020。神经语言模型的规模化定律。CoRR abs/2001.08361(2020)。
  • Lepikhin et al. (2021) Lepikhin 等人(2021) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations.
    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, 和 Zhifeng Chen. 2021. GShard:通过条件计算和自动分片扩展巨型模型。在国际学习表示会议上。
  • Lewis et al. (2021) Lewis 等人(2021) Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning. PMLR, 6265–6274.
    Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, 和 Luke Zettlemoyer. 2021. 基础层:简化大型稀疏模型的训练。在国际机器学习会议上。PMLR, 6265–6274。
  • Li et al. (2020) 李等(2020) Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proceedings of the VLDB Endowment 13, 12 (2020), 3005–3018.
    李慎、赵艳丽、罗汉·瓦尔玛、奥姆卡尔·萨尔佩卡、皮特·诺德豪斯、李腾、亚当·帕斯克、杰夫·史密斯、布莱恩·沃恩、普里坦姆·达马尼亚等。2020。PyTorch Distributed:加速数据并行训练的经验。《VLDB 终端出版物》第 13 卷,第 12 期(2020),3005-3018。
  • Lin et al. (2021) 林等(2021) Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, et al. 2021. M6: A chinese multimodal pretrainer. CoRR abs/2103.00823 (2021).
    林俊阳、门睿、杨安、周畅、丁明、张义昌、王鹏、王昂、蒋乐、贾先岩等。2021。M6:一种中文多模态预训练器。CoRR abs/2103.00823(2021)。
  • Liu et al. (2021) 刘等人(2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
    刘泽、林雨桐、曹越、胡涵、魏一轩、张正、林斯蒂芬、郭宾宁。2021。Swin Transformer:使用移位窗口的分层视觉 Transformer。在 IEEE/CVF 国际计算机视觉会议论文集中。10012-10022。
  • Ma et al. (2022) 马等人(2022) Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi Tang, Tianyu Zheng, et al. 2022. BaGuaLu: targeting brain scale pretrained models with over 37 million cores. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 192–204.
    马子轩、何佳澳、邱杰忠、曹焕琪、王远威、孙振波、郑丽燕、王浩杰、唐士志、郑天宇等。2022。BaGuaLu:以超过 3700 万核心为目标的大脑规模预训练模型。在第 27 届 ACM SIGPLAN 并行编程原理与实践研讨会论文集中。192-204。
  • Maruta and Kato (2022)
    丸田敦树与加藤诚. 2022. 意图感知的数据可视化推荐. 数据科学与工程 7, 4 (2022), 301–315
    Atsuki Maruta and Makoto P Kato. 2022. Intent-Aware Data Visualization Recommendation. Data Science and Engineering 7, 4 (2022), 301–315.
    苗旭鹏, 聂晓南, 张海林, 赵彤, 崔斌. 2023. 河图:一个高效的自动并行分布式深度学习系统. 中国科学: 信息科学 66, 1 (2023), 1–2.
  • Miao et al. (2023) Xupeng Miao, Xiaonan Nie, Hailin Zhang, Tong Zhao, and Bin Cui. 2023. Hetu: A highly efficient automatic parallel distributed deep learning system. Science China Information Sciences 66, 1 (2023), 1–2.
  • Miao et al. (2022a) 缪等(2022a) Xupeng Miao, Yining Shi, Hailin Zhang, Xin Zhang, Xiaonan Nie, Zhi Yang, and Bin Cui. 2022a. HET-GMP: A Graph-based System Approach to Scaling Large Embedding Model Training. In Proceedings of the 2022 International Conference on Management of Data. 470–480.
    缪旭鹏、施一宁、张海林、张鑫、聂晓楠、杨志、崔斌。2022a。HET-GMP:一种基于图的系统方法,用于扩展大型嵌入模型训练。在 2022 年数据管理国际会议论文集中。470-480。
  • Miao et al. (2022b) 缪等(2022b) Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022b. Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. Proceedings of the VLDB Endowment 16, 3 (2022), 470–479.
    缪旭鹏、王玉洁、江友鹤、史楚楠、聂晓楠、张海林、崔斌。2022b。Galvatron:使用自动并行性在多个 GPU 上高效训练 Transformer。VLDB 终端出版物第 16 卷,第 3 期(2022),470-479。
  • Miao et al. (2021) 缪翔鹏、张海林、史一宁、聂晓楠、杨志、陶阳宇、崔斌. 2021. HET: 通过缓存启用的分布式框架实现巨型嵌入模型训练的扩展. 数据库学报 15, 2 (2021), 312–320. Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao, and Bin Cui. 2021. HET: scaling out huge embedding model training via cache-enabled distributed framework. Proceedings of the VLDB Endowment 15, 2 (2021), 312–320.
    纳拉亚南等人 (2019)
  • Narayanan et al. (2019) Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, 和 Matei Zaharia. 2019. PipeDream: 用于 DNN 训练的通用管道并行性. 在第 27 届 ACM 操作系统原理研讨会论文集中. 1–15. Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
  • Narayanan et al. (2021) 纳拉亚南等人(2021) Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro 等。2021 年。在 GPU 集群上使用 Megatron-LM 进行高效的大规模语言模型训练。在高性能计算、网络、存储和分析国际会议论文集中。1-15 页。
  • Nie et al. (2021) 聂等人(2021) Xiaonan Nie, Shijie Cao, Xupeng Miao, Lingxiao Ma, Jilong Xue, Youshan Miao, Zichao Yang, Zhi Yang, and Bin Cui. 2021. Dense-to-sparse gate for mixture-of-experts. CoRR abs/2112.14397 (2021).
    聂晓南,曹世杰,苗旭鹏,马凌霄,薛继龙,苗友山,杨子超,杨志,崔斌。2021 年。专家混合的密集到稀疏门控。CoRR abs/2112.14397(2021)。
  • Nie et al. (2023) 聂晓楠、刘毅、付方程、薛金宝、焦点、苗旭鹏、陶阳宇、崔斌。2023。Angel-PTM:腾讯中可扩展且经济的大规模预训练系统。arXiv 预印本 arXiv:2303.02868(2023)。 Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, and Bin Cui. 2023. Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent. arXiv preprint arXiv:2303.02868 (2023).
    聂晓楠、苗旭鹏、杨志、崔斌。2022a。TSPLIT:通过张量分割实现高效 DNN 训练的细粒度 GPU 内存管理。在国际数据工程会议上。IEEE,2615-2628。
  • Nie et al. (2022a) Xiaonan Nie, Xupeng Miao, Zhi Yang, and Bin Cui. 2022a. TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting. In International Conference on Data Engineering. IEEE, 2615–2628.
  • Nie et al. (2022b) 聂晓南,赵品学,苗旭鹏,赵彤,崔斌. 2022b. HetuMoE: 一个高效的万亿规模专家混合分布式训练系统. CoRR abs/2203.14685 (2022). Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, and Bin Cui. 2022b. HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System. CoRR abs/2203.14685 (2022).
    帕兹克等人(2019)
  • Paszke et al. (2019) 亚当·帕兹克,萨姆·格罗斯,弗朗西斯科·马萨,亚当·莱勒,詹姆斯·布拉德伯里,格雷戈里·查南,特雷弗·基林,林泽明,娜塔莉娅·吉梅尔谢因,卢卡·安蒂加等. 2019. Pytorch: 一种命令式风格的高性能深度学习库. 神经信息处理系统进展 32 (2019), 8026–8037. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026–8037.
  • Peng et al. (2021) 彭等人(2021) Yun Peng, Byron Choi, and Jianliang Xu. 2021. Graph learning for combinatorial optimization: a survey of state-of-the-art. Data Science and Engineering 6, 2 (2021), 119–141.
    彭云、蔡伯伦、徐建良。2021。图学习在组合优化中的应用:最新进展综述。数据科学与工程 6, 2 (2021), 119–141。
  • Radford et al. (2019) 拉德福德等人(2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI blog 1, 8 (2019), 9.
    亚历克·拉德福德、吴杰弗里、雷文·查尔德、卢安·大卫、达里奥·阿莫迪、伊利亚·苏茨克沃。2019。语言模型是无监督的多任务学习者。OpenAI 博客 1, 8 (2019), 9。
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21 (2020), 1–67.
  • Rajbhandari et al. (2022) Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. In International Conference on Machine Learning. PMLR, 18332–18346.
  • Riquelme et al. (2021) Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 34 (2021), 8583–8595.
  • Roller et al. (2021) Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. 2021. Hash layers for large sparse models. Advances in Neural Information Processing Systems 34 (2021), 17555–17566.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017), 6000–6010.
  • Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems 32 (2019), 3266–3280.
  • Wang et al. (2023) Zhuping Wang, Jingcheng Liu, Chao Huang, Hao Zhang, and Huaicheng Yan. 2023. Global output feedback adaptive stabilization for systems with long uncertain input delay. Science China Information Sciences 66, 1 (2023), 119201.
  • Wei et al. (2022) Hua-Peng Wei, Ying-Ying Deng, Fan Tang, Xing-Jia Pan, and Wei-Ming Dong. 2022. A Comparative Study of CNN- and Transformer-Based Visual Style Transfer. Journal of Computer Science and Technology 37, 3 (2022), 601–614.
  • Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. ST-MoE: Designing Stable and Transferable Sparse Expert Models. CoRR abs/2202.08906 (2022).