License: arXiv.org perpetual non-exclusive license
arXiv:2411.11217v1 [cs.DC] 18 Nov 2024

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Shiyi Cao shicao@berkeley.edu UC Berkeley, Berkeley, CA, USA; Shu Liu lshu@berkeley.edu UC Berkeley, Berkeley, CA, USA; Tyler Griggs tgriggs@berkeley.edu UC Berkeley, Berkeley, CA, USA; Peter Schafhalter pschafhalter@berkeley.edu UC Berkeley, Berkeley, CA, USA; Xiaoxuan Liu xiaoxuan_liu@berkeley.edu UC Berkeley, Berkeley, CA, USA; Ying Sheng ying1123@stanford.edu Stanford, Palo Alto, CA, USA; Joseph E. Gonzalez jegonzal@berkeley.edu UC Berkeley, Berkeley, CA, USA; Matei Zaharia matei@berkeley.edu UC Berkeley, Berkeley, CA, USA; and Ion Stoica istoica@berkeley.edu UC Berkeley, Berkeley, CA, USA
Abstract.

Efficient deployment of large language models, particularly Mixture of Experts (MoE) models, on resource-constrained platforms presents significant challenges in terms of computational efficiency and memory utilization. The MoE architecture, renowned for its ability to increase model capacity without a proportional increase in inference cost, greatly reduces the token generation latency compared with dense models. However, the large model size makes MoE models inaccessible to individuals without high-end GPUs. In this paper, we propose a high-throughput MoE batch inference system, MoE-Lightning, that significantly outperforms past work. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization, and a performance model, HRM, based on a Hierarchical Roofline Model we introduce to help find policies with higher throughput than existing systems. MoE-Lightning can achieve up to $10.3\times$ higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB). When the theoretical system throughput is bounded by the GPU memory, MoE-Lightning can reach the throughput upper bound with 2–3$\times$ less CPU memory, significantly increasing resource utilization. MoE-Lightning also supports efficient batch inference for much larger MoEs (e.g., Mixtral 8x22B and DBRX) on multiple low-cost GPUs (e.g., 2–4 T4s).

1. Introduction

Mixture of Experts (MoE) (shazeer2017outrageously; dbrx; deepseek; mixtral) is a paradigm shift in the architecture of Large Language Models (LLMs) that leverages sparsely-activated expert sub-networks to enhance model performance without significantly increasing the number of operations required for inference. Unlike dense models (llama; opt; scao2022bloom), where all model parameters are activated for each input, MoE models activate only a subset of experts, thereby improving computational efficiency.

While the MoE models achieve strong performance in many tasks (deepseek; mixtral), unfortunately, their deployment is challenging due to the significantly increased memory demand for the same number of active parameters. For example, the Mixtral 8x22B model (mixtral22b) requires over 256 GB of memory for the parameters of the expert feed-forward network (FFN), which is $4$–$5\times$ higher than the memory requirements of dense models that require similar FLOPs for inference.

In this paper, we study how to achieve high-throughput MoE inference with limited GPU memory. We are focusing on off-line, batch-processing workloads such as model evaluation (liang2022holistic), synthetic data generation (dubey2024llama), data wrangling (narayan2022can), form processing (chen2021spreadsheetcoder), and LLM for relational analytics (liu2024optimizing), where higher inference throughput translates into lower total completion time.

The common approach for memory-constrained batch inference is to offload model weights (aminabadi2022deepspeed; huggingfaceAccelerate) and the key-value tensors of earlier tokens (KV cache) (sheng2023flexgen), which are needed for generating the next token, to CPU memory or disk. Then, they are loaded layer-by-layer to the GPU for computation.

Figure 1. MoE-Lightning achieves higher throughput with far less CPU memory, enabled by CGOPipe and HRM.

Unfortunately, existing solutions fall short of effectively overlapping computations with data transfers between CPU and GPU. For instance, the GPU may remain idle as it awaits a small yet crucial piece of data such as intermediate results for the upcoming batch. At the same time, transferring the weights for subsequent layers may take a long time and potentially block both the GPU and CPU from processing further tasks, leading to under-utilization of all the resources.

As a result, efficient MoE inference for throughput-oriented workloads using limited GPU memory remains challenging. We find that increasing I/O utilization and other resource utilization is critical in achieving high throughput. For example, Fig. 1 illustrates the relationship between CPU memory and achievable token generation throughput for different systems with fixed GPU memory (less than the model size) and CPU-to-GPU memory bandwidth. When a layer's weights are loaded onto the GPU, a common strategy to increase throughput is to process as many requests as possible to amortize the I/O overhead of the weights' transfer (sheng2023flexgen). However, this increases CPU memory usage as additional space is required to store the KV cache for all requests. Consequently, lower I/O utilization means higher I/O overhead of the weights' transfer, requiring greater CPU memory to reach peak generation performance; otherwise, the GPU will be under-utilized, as suggested by the blue line in Fig. 1.
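To make the amortization argument concrete, the following sketch estimates the decode throughput bound when streaming the model weights from CPU to GPU is the bottleneck, together with the CPU memory the KV cache then occupies. The function and all numbers are illustrative assumptions for a Mixtral-8x7B-scale model, not measurements from MoE-Lightning.

def io_bound_decode_estimate(batch_size, model_weight_bytes,
                             kv_bytes_per_request, cpu_gpu_bw):
    """Decode-throughput bound when weight streaming from CPU dominates.

    One decode step for the whole batch must stream every layer's weights
    once, so a larger batch amortizes that fixed cost over more generated
    tokens, but each extra request also adds KV cache that must fit in CPU
    memory.  Returns (tokens_per_second_bound, total_kv_cache_bytes).
    """
    step_time_s = model_weight_bytes / cpu_gpu_bw
    return batch_size / step_time_s, batch_size * kv_bytes_per_request

# Hypothetical numbers: ~87 GB of fp16 weights, a 16 GB/s CPU-GPU link,
# and ~64 MB of KV cache per request (512-token context).
for n in (64, 512, 4096):
    tps, kv = io_bound_decode_estimate(n, 87e9, 64e6, 16e9)
    print(f"batch={n}: <= {tps:.1f} tokens/s, KV cache = {kv / 1e9:.1f} GB")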

While improving resource utilization is crucial for achieving high-throughput inference with limited GPU memory, doing so raises several challenges. First, we need to effectively schedule the computation tasks running on the CPU and GPU, together with the transfers of various inputs (e.g., expert weights, hidden states, and KV cache), so as to avoid computation tasks waiting for transfers or vice versa. Second, as indicated by the orange line in Fig. 1, the existing solutions (sheng2023flexgen) tend to generate sub-optimal policies with smaller GPU batch sizes, which lead to resource under-utilization. Fundamentally, these solutions fail to take into account that changes in the workload can lead to changes in the bottleneck resource.

To address these two challenges, we developed a new inference system, MoE-Lightning, which consists of two new components. The first component is CGOPipe, a pipeline scheduling strategy that efficiently overlaps GPU computation, CPU computation, and various I/O events so that computation is not blocked by I/O events and different I/O events do not block each other. In this way, CGOPipe can significantly improve system utilization. The second component is the Hierarchical Roofline Model (HRM), which accurately models how different components in an inference system interact and affect application performance under various operational conditions.

In summary, this paper makes the following contributions:

  • CGOPipe, a pipeline scheduling strategy that efficiently schedules various I/O events and overlaps CPU and GPU computation with I/O events. By deploying weight paging, CGOPipe reduces pipeline bubbles, significantly enhancing throughput and I/O efficiency compared with existing systems (Section 4.1).

  • HRM, a general performance model for LLM inference that extends the Roofline Model (WilliamsWP09). HRM can easily support different models, hardware, and workloads, and has near-zero overhead in real deployments, without the need for the extensive data fitting (which might take hours or days) required by FlexGen (Section 4.2).

  • An in-depth performance analysis for MoE models based on our extended HRM, which identifies various performance regions where a specific resource becomes the bottleneck (Section 3).

We evaluate MoE-Lightning on various recent popular MoE models (e.g., Mixtral 8x7B, Mixtral 8x22B, and DBRX) on different hardware settings (e.g., L4, T4, 2xT4, and 4xT4 GPUs) using three real workloads. When compared to the best of the existing systems, MoE-Lightning can improve the generation throughput by up to $10.3\times$ (without request padding) and $3.5\times$ (with request padding) on a single GPU. When tensor parallelism is enabled, MoE-Lightning demonstrates super-linear scaling in generation throughput (Section 5).

2. Background

2.1. Mixture of Experts

Large Language Models (LLMs) have significantly improved in performance due to the advancements in architecture and scalable training methods. In particular, Mixture of Experts (MoE) models have shown remarkable improvements in model capacity, training time, and model quality (shazeer2017outrageously; glam; deepseek; dbrx; mixtral; lepikhin2020gshard), revitalizing an idea that dates back to the early 1990s (JacobsJNH91; jordan1994hierarchical) where ensembles of specialized models are used in conjunction with a gating mechanism to dynamically select the appropriate “expert” for a given task.

The key idea behind MoE is a gating function that routes inputs to specific experts within a larger neural network. Each expert is specialized in handling particular types of inputs. The gating function selects only a subset of experts to process an input, which allows LLMs to scale the number of parameters without increasing inference operations.
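As a concrete illustration of the gating mechanism described above, here is a minimal top-k MoE FFN sketch in PyTorch. The shapes, the per-expert loop, and the absence of expert-weight renormalization are simplifications for clarity, not the exact formulation of any particular model.

import torch

def moe_ffn(x, gate_w, experts, top_k=2):
    """Minimal MoE FFN sketch: route each token to its top_k experts.

    x: [tokens, d_model]; gate_w: [d_model, num_experts];
    experts: list of callables mapping [n, d_model] -> [n, d_model].
    Only the selected experts run for each token, which is why FLOPs stay
    roughly constant as the total number of experts (parameters) grows.
    """
    scores = torch.softmax(x @ gate_w, dim=-1)          # [tokens, num_experts]
    weights, idx = torch.topk(scores, top_k, dim=-1)    # both [tokens, top_k]
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel() > 0:
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
    return out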

Figure 2. Architecture of a Mixture of Experts in Large Language Models.

MoE models adopt a conventional LLM architecture, which uses learned embeddings for tokens and stacked transformer layers. MoE LLMs typically modify the Feed-Forward Network (FFN) within a transformer layer by adding a gating network that selects expert FFNs, usually implemented as multi-layer perceptrons, to process the input token (glam; zhou2022mixture; chen2023lifelong). These designs can surpass traditional dense models (chatbot-arena; deepseek; mixtral) in effectiveness while being more parameter-efficient and cost-effective during training and inference.

Despite their advantages, the widespread use of MoE models faces challenges due to the difficulties in managing and deploying models with extremely high parameter counts that demand substantial memory. Thus, our work aims to make MoE models more accessible to those lacking extensive high-end GPU resources.

2.2. LLM Inference

LLMs are trained to predict the conditional probability distribution for the next token, $P(x_{n+1} \mid x_1, \ldots, x_n)$, given a list of input tokens $(x_1, \ldots, x_n)$. When deployed as a service, the LLM takes in a list of tokens from a user request and generates an output sequence $(x_{n+1}, \ldots, x_{n+T})$. The generation process involves sequentially evaluating the probability and sampling the token at each position for $T$ iterations. The stage where the model generates the first token $x_{n+1}$ given the initial list of tokens $(x_1, \ldots, x_n)$ is defined as the prefill stage. In the prefill stage, at each layer, the input hidden states to the attention block will be projected into the query, key, and value vectors. The key and value vectors will be stored in the KV cache. Following the prefill stage is the decode stage, where the model generates the remaining tokens $(x_{n+2}, \ldots, x_{n+T})$ sequentially. When generating token $x_{n+2}$, the KV cache of all the previous tokens $(x_1, \ldots, x_{n+1})$ will be needed, and token $x_{n+2}$'s key and value at each layer will be appended to the KV cache.
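The prefill/decode split and the role of the KV cache can be summarized with a minimal generation loop; model_step below is a hypothetical stand-in for one forward pass of the model, not an actual API.

def generate(model_step, prompt_tokens, num_new_tokens):
    """Minimal auto-regressive loop: prefill once, then decode step by step.

    model_step(tokens, kv_cache) is assumed to return (next_token, kv_cache),
    where the KV cache holds every layer's keys and values for all tokens
    seen so far and grows by one entry per layer per decode step.
    """
    # Prefill: process the whole prompt and build the KV cache for x_1..x_n.
    next_token, kv_cache = model_step(prompt_tokens, None)
    output = [next_token]
    # Decode: each step feeds only the newest token but attends over the
    # entire KV cache, which is why the cache dominates memory at large batches.
    for _ in range(num_new_tokens - 1):
        next_token, kv_cache = model_step([next_token], kv_cache)
        output.append(next_token)
    return output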

The auto-regressive nature of LLM generation, where tokens are generated sequentially, can lead to sub-optimal device utilization and decreased serving throughput (pope2023efficiently). Batching is a critical strategy for improving GPU utilization: (yu2022orca) proposed continuous batching, which increases the serving throughput by orders of magnitude. Numerous studies have developed methods to tackle associated challenges such as memory fragmentation (kwon2023efficient) and the heavy memory pressure imposed by the KV cache (he2024fastdecode; sheng2023flexgen; juravsky2024hydragen). The scenario of limited GPU memory introduces further challenges, especially for large MoE models, as it requires transferring large amounts of data between the GPU and CPU for various computational tasks with distinct characteristics. Naive scheduling of the computation tasks and data transfers can result in poor resource utilization. This paper explores how each resource in a heterogeneous system affects LLM inference performance and proposes efficient scheduling strategies and system optimizations to enhance resource utilization.

3. Performance Analysis

In this section, we introduce a Hierarchical Roofline Model (HRM) (Section 3.2) extended from the classical Roofline Model (WilliamsWP09), which we use to conduct a theoretical performance analysis for MoE inference (Section 3.3). It also serves as the basis of our performance model used in the scheduling policy search, which will be discussed in Section 4.2. The Hierarchical Roofline Model extends the original Roofline Model for multicore architectures (WilliamsWP09) to provide a stronger model of heterogeneous computing devices and memory bandwidth. We further identify two additional turning points that define settings where the computation is best done on the CPU instead of the GPU and where the application is GPU memory-bound or CPU memory-bound, providing explicit explanations for how LLM inference performance is affected by different resource limits in the system.

3.1. Roofline Model

We will start with the original Roofline Model (WilliamsWP09), which provides a visual performance model to estimate the performance of a given application by showing inherent hardware limitations and potential opportunities for optimizations. It correlates a system's peak performance and memory bandwidth with the operational intensity of a given computation, where Operational Intensity ($I$) denotes the ratio of the number of operations in FLOPs performed to the number of bytes accessed from memory, expressed in FLOPs/Byte.


The fundamental representation in the Roofline Model is a performance graph, where the x-axis represents operational intensity $I$ in FLOPs/byte and the y-axis represents performance $P$ in FLOPs/sec. The model is graphically depicted by two main components:

Memory Roof: It serves as the upper performance limit indicated by memory bandwidth. It is determined by the product of the peak memory bandwidth ($B_{\text{peak}}$ in Bytes/sec) and the operational intensity ($I$). Intuitively, if the data needed for the computation is supplied slower than the computation itself, the processor will idly wait for data, making memory bandwidth the primary bottleneck. The memory-bound region (in blue) of the roofline is then represented by:

(1) $P \leq B_{\text{peak}} \times I$

where $P$ is the achievable performance.

Compute Roof: This represents the maximum performance achievable, limited by the machine's peak computational capability ($P_{\text{peak}}$). It is a horizontal line on the graph (top edge of the yellow region), independent of the operational intensity, indicating that when data transfer is not the bottleneck, the maximum achievable performance is determined by the processor's computation capability. The compute-bound part (yellow region) is then defined by:

(2) $P \leq P_{\text{peak}}$

The turning point is the intersection of the compute and memory roofs, given by the equation:

(3) $\bar{I} = \frac{P_{\text{peak}}}{B_{\text{peak}}}$

which defines the critical operational intensity $\bar{I}$. Applications with $I \geq \bar{I}$ are typically compute-bound, while those with $I < \bar{I}$ are memory-bound.
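Eqs. 1–3 translate directly into a few lines of code; the device numbers below are purely illustrative.

def roofline(peak_flops, peak_bw, intensity):
    """Classical roofline: attainable P = min(P_peak, B_peak * I)."""
    return min(peak_flops, peak_bw * intensity)

def critical_intensity(peak_flops, peak_bw):
    """I_bar = P_peak / B_peak; below it memory-bound, above it compute-bound."""
    return peak_flops / peak_bw

# Illustrative device: 100 TFLOP/s peak compute, 300 GB/s memory bandwidth.
print(critical_intensity(100e12, 300e9))     # ~333 FLOPs/byte
print(roofline(100e12, 300e9, 10) / 1e12)    # I=10: memory-bound, 3 TFLOP/s
print(roofline(100e12, 300e9, 1000) / 1e12)  # I=1000: compute-bound, 100 TFLOP/s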

In practice, analyzing an application's placement on the Roofline Model helps identify the critical bottleneck for performance improvements. Recent works (yuan2024llm) analyze different computations (e.g., softmax and linear projection) in LLMs using the Roofline Model.

3.2. Hierarchical Roofline Model

While the original Roofline Model demonstrates great power for application performance analysis, it is not enough for analyzing applications such as LLM inference that utilize diverse computing resources (e.g., CPU and GPU) and move data across multiple memory hierarchies (e.g., GPU HBM, CPU DRAM, and Disk storage).

Consider a system with $n$ levels of memory hierarchies. Each level $i$ in this hierarchy is coupled with a computing processor. The peak bandwidth at which the processor at level $i$ can access the memory at the same level is denoted by $B_{\text{peak}}^{i}$. Additionally, the peak performance of the processor is denoted by $P_{\text{peak}}^{i}$ (in this paper, we assume that when $i < j$, $P_{\text{peak}}^{i} \geq P_{\text{peak}}^{j}$ and $B_{\text{peak}}^{i} \geq B_{\text{peak}}^{j}$).

Definition 3.1 (General Operational Intensity).

To consider different memory hierarchies, we define the general operational intensity $I_{x}^{i}$ of the computation task $x$ as the ratio of the number of operations in FLOPs performed by $x$ to the number of bytes accessed from memory at level $i$.

For computation $x$ executed at level $i$ in the HRM, we can define its compute and memory roofs similarly to the original Roofline Model:

  • Compute Roof at level $i$:

    (4) $P_{x}^{i} \leq P_{\text{peak}}^{i}$

    This represents the maximum computational capability at level $i$, independent of the operational intensity.

  • Memory Roof at level $i$:

    (5) $P_{x}^{i} \leq B_{\text{peak}}^{i} \times I_{x}^{i}$

More importantly, in the HRM there is also the memory bandwidth from level $j$ to level $i$, denoted as $B_{\text{peak}}^{j,i}$, which defines another memory roof for a computation $x$ that is executed on level $i$ and transfers data from level $j$:

  • Memory Roof from level $j$ to $i$:

    (6) $P_{x}^{i} \leq B_{\text{peak}}^{j,i} \times I_{x}^{j}$

Therefore, if computation $x$ is executed on level $i$ and data from level $j$ needs to be fetched, the peak performance will be bounded by the three roofs listed above (Eqs. 4, 5 and 6):

(7) $P_{x}^{i} = \min(P_{\text{peak}}^{i},\, B_{\text{peak}}^{i} \times I_{x}^{i},\, B_{\text{peak}}^{j,i} \times I_{x}^{j})$

If operator $x$ is executed on level $i$ without fetching data from other levels, it reduces to the traditional Roofline Model and can achieve:

(8) $P_{x}^{i} = \min(P_{\text{peak}}^{i},\, B_{\text{peak}}^{i} \times I_{x}^{i})$
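As a helper for later sections, Eq. 7 and Eq. 8 can be transcribed directly; this sketch simply mirrors the definitions above for a computation at level $i$ that may also stream data from level $j$.

def hrm_bound(p_peak_i, b_peak_i, i_x_i, b_peak_ji=None, i_x_j=None):
    """Attainable performance of computation x executed at level i.

    Without cross-level traffic this is Eq. 8: min(P_peak^i, B_peak^i * I_x^i).
    When x also fetches data from level j, the extra roof of Eq. 6,
    B_peak^{j,i} * I_x^j, is included, giving Eq. 7.
    """
    roofs = [p_peak_i, b_peak_i * i_x_i]
    if b_peak_ji is not None and i_x_j is not None:
        roofs.append(b_peak_ji * i_x_j)
    return min(roofs)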

Turning Points

Intuitively, our HRM introduces more memory roofs that consider cross-level memory bandwidth and compute roofs for diverse processors. This results in more “turning points” than in the original Roofline Model, which define various performance regions where different resources are the bottleneck. Analyzing these turning points is crucial for understanding the performance upper bound of an application under different hardware setups and computational characteristics.

For example, consider a computation task $x$ that has data stored on level $j$. According to Eq. 6 and Eq. 8, when $P_{x}^{j} = \min(P_{\text{peak}}^{j}, B_{\text{peak}}^{j} \times I_{x}^{j}) \geq B_{\text{peak}}^{j,i} \times I_{x}^{j}$, we have $P_{x}^{i} \leq B_{\text{peak}}^{j,i} \times I_{x}^{j} \leq P_{x}^{j}$. Therefore, the first turning point $P_1$ is at:

(9) $\bar{I}_{x}^{j} = \frac{\min(P_{\text{peak}}^{j},\, B_{\text{peak}}^{j} \times I_{x}^{j})}{B_{\text{peak}}^{j,i}}$

This gives the critical operational intensity $\bar{I}_{x}^{j}$, indicating the threshold below which it is not beneficial to transfer data from level $j$ to level $i$ for the computation of $x$.

Now if we continue increasing $I_{x}^{j}$ such that $P_{x}^{j} < B_{\text{peak}}^{j,i} \times I_{x}^{j} \leq \min(P_{\text{peak}}^{i}, B_{\text{peak}}^{i} \times I_{x}^{i})$, then we obtain another turning point $P_2$:

(10) $\bar{I}_{x}^{j} = \frac{\min(P_{\text{peak}}^{i},\, B_{\text{peak}}^{i} \times I_{x}^{i})}{B_{\text{peak}}^{j,i}}$

which denotes the critical operational intensity $\bar{I}_{x}^{j}$ below which computation $x$ is bounded by the memory bandwidth from memory at level $j$ to memory at level $i$.
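Both critical intensities can be computed from the same quantities; the helper below simply restates Eq. 9 and Eq. 10.

def turning_points(p_peak_i, b_peak_i, i_x_i,
                   p_peak_j, b_peak_j, i_x_j, b_peak_ji):
    """Critical values of I_x^j at P1 (Eq. 9) and P2 (Eq. 10).

    Below P1 it is not worth moving x's data from level j to level i;
    between P1 and P2 performance is limited by the j-to-i link;
    above P2 the roofs of level i itself take over.
    """
    p1 = min(p_peak_j, b_peak_j * i_x_j) / b_peak_ji
    p2 = min(p_peak_i, b_peak_i * i_x_i) / b_peak_ji
    return p1, p2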

Balance Point

Further, if $B_{\text{peak}}^{i} \times I_{x}^{i} < B_{\text{peak}}^{j,i} \times I_{x}^{j} < P_{\text{peak}}^{i}$, the computation $x$ on level $i$ is memory-bound (refer to Eq. 3). In this situation, further increasing $I_{x}^{j}$ cannot improve the system's performance. Instead, we need to increase $I_{x}^{i}$, and a balance point will be reached if:

(11) $B_{\text{peak}}^{i} \times I_{x}^{i} = B_{\text{peak}}^{j,i} \times I_{x}^{j}$

Our performance model and policy optimizer (see Section 4.2) are designed to find the maximum balance point under the device memory constraints.

Figure 3. Hardware Configurations for the L4 Instance.

3.3. Case Study

To visualize the turning points and balance points discussed in the preceding sections, we conduct a case study with real HRM plots for computations in a single layer of the Mixtral 8x7B model on a Google Cloud Platform L4 instance (we only discuss the attention and feed-forward blocks since they account for the majority of computation time and represent quite different computation characteristics). The hardware setting is detailed in Fig. 3. Specifically, we let levels $i$ and $j$ represent GPU and CPU, respectively. Then, we define the following:

Definition 3.2 (Batch Size $N$).

Batch size is the total number of tokens processed by one pass of the whole model.

Definition 3.3 (Micro-Batch Size $\mu$).

Since GPU memory is limited, a batch of size $N$ often needs to be split into several micro-batches of size $\mu$, each processed by a single kernel execution on the GPU.

Figure 4. Hierarchical Roofline Model for Mixtral 8x7B’s Grouped Query Attention Block in Decode Stage on L4 Instance. (Context Length = 512)

Attention Block

Fig. 4 demonstrates the HRM plot for Mixtral 8x7B's attention computation (not including the QKVO projection), assuming all the KV cache is stored on the CPU (for analysis purposes, we use the calculated theoretical operational intensity instead of numbers from real profiling). On the plot, we have horizontal lines as the compute roofs defined by CPU and GPU peak performance. There are also the memory roofs defined by CPU memory bandwidth, GPU memory bandwidth, and CPU-to-GPU memory bandwidth, respectively. We then draw vertical lines representing different operational intensities for the attention computation with different KV cache data types. Theoretically, attention's operational intensity is independent of the batch size since its FLOPs and bytes are both proportional to the batch size. To increase the attention computation's operational intensity, we need methods such as quantization (lin2024qserve; kv_int4), Grouped Query Attention (GQA) (ainslie2023gqa), or sparse attention (child2019generating). All these methods try to reduce the memory access needed to perform the attention computation, and GQA is used by most of the existing MoE models; however, as denoted in the plot, for both float16 and int4 KV cache (the computation is still done in float32), the operational intensity is quite low and is smaller than $P_1$'s corresponding operational intensity, which suggests it may be better to perform attention on the CPU.
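A back-of-the-envelope version of this observation, counting only the dot products against the KV cache and ignoring the softmax itself (the head counts below match a Mixtral-8x7B-style GQA configuration; treat the exact values as illustrative):

def decode_attention_intensity(n_q_heads, n_kv_heads, kv_bytes_per_elem):
    """Rough FLOPs/byte of decode attention relative to the KV cache it reads.

    Per cached position and per query head, Q.K^T and A.V together cost about
    4*head_dim FLOPs, while that position's K and V occupy about
    2*head_dim*kv_bytes_per_elem bytes per KV head.  head_dim, context length,
    and batch size all cancel, so the intensity is (roughly) a model constant.
    """
    return (4 * n_q_heads) / (2 * n_kv_heads * kv_bytes_per_elem)

print(decode_attention_intensity(32, 8, 2))    # float16 KV cache: ~4 FLOPs/byte
print(decode_attention_intensity(32, 8, 0.5))  # int4 KV cache:   ~16 FLOPs/byte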

MoE Feed-Forward Network (FFN)

Fig. 5 is an HRM plot for Mixtral 8x7B's MoE Feed-Forward module on the L4 instance. The orange line represents the MoE FFN kernel performance achieved at a micro-batch size of 128. Vertical lines intersecting with the CPU roofs and the CPU-GPU memory roofs represent different batch sizes. The FFN's operational intensity will increase as the batch size or micro-batch size increases since, intuitively, a larger batch size means more computation per weight access. As shown in the plot, supposing the computation kernel for the MoE FFN can run at a maximum of $\mu = 128$, we can identify the turning point in Eq. 10 to be $P_2$ and the turning point in Eq. 9 to be $P_1$.
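An analogous estimate for the MoE FFN, this time relative to the expert weights it reads, makes the linear dependence on (micro-)batch size explicit; it assumes tokens are spread evenly across experts and that every parameter is read once per pass.

def moe_ffn_intensity(tokens, num_experts, top_k, bytes_per_param):
    """Approximate FLOPs/byte of one MoE FFN pass w.r.t. its expert weights.

    Each expert sees about tokens*top_k/num_experts tokens, each token costs
    ~2 FLOPs per parameter of the experts it visits, and every expert
    parameter is read once, so intensity grows linearly with the batch,
    unlike attention.
    """
    tokens_per_expert = tokens * top_k / num_experts
    return 2 * tokens_per_expert / bytes_per_param

# Mixtral-8x7B-style routing: 8 experts, top-2, float16 weights.
for n in (8, 128, 1024):
    print(n, moe_ffn_intensity(n, 8, 2, 2))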

Figure 5. Hierarchical Roofline Model for Mixtral 8x7B’s MoE Feed-Forward Block in Decode Stage on L4 Instance.

When $I$ is less than $P_1$'s corresponding $I$, there is no benefit in swapping the data to the GPU for computation since it will be bounded by the memory roof from CPU to GPU. This is normally the case for many latency-oriented applications where users may only have one or two prompts to be processed. In such scenarios, it is more beneficial to have a static weights placement strategy (e.g., putting $m$ out of $n$ layers on the GPU) and perform the computation where the data is located instead of swapping the weights back and forth.
IIitalic_I 小于 P1subscript1P_{1}italic_P start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT 的相应 IIitalic_I 时,将数据交换到 GPU 进行计算无益,因为它将受到 CPU 到 GPU 的内存屋顶限制。这通常是许多面向延迟的应用的情况,在这些应用中,用户可能只有一个或两个提示需要处理。在这种情况下,更有利的是采用静态权重放置策略(例如,将 nnitalic_n 层中的 mmitalic_m 层放在 GPU 上)并在数据所在的位置执行计算,而不是来回交换权重。

Next, we show that the peak performance will finally be reached at a balance point (Eq. 11). When $I$ is less than $P_2$'s corresponding $I$, the computation is bounded by the CPU-to-GPU memory bandwidth, and it cannot achieve the performance at $P_2$. Depending on whether there is enough CPU memory to hold a larger batch, we can either increase the batch size or put some of the weights on the GPU statically, since both strategies can increase the operational intensity of the MoE FFN computation with respect to the data on the CPU.

If the batch size can be continually increased, then when $I$ equals $P_2$'s corresponding $I$, the maximum performance that can be achieved is bounded by the operator's operational intensity on the GPU, which depends on the $\mu$ for the MoE FFN kernels. Then, there is no need to increase $N$ anymore, and the maximum performance reached at a balance point equals $P_2$. On the other hand, if we put more weights onto the GPU, $\mu$ will decrease since a larger $\mu$ results in higher peak memory consumption. The maximum performance will then be achieved at a balance point smaller than $P_2$.
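Under the simplifying assumptions that all expert weights are streamed from the CPU on every layer pass and that tokens are spread evenly across experts, both intensities in Eq. 11 take the same linear form derived above, and the balance point collapses to a simple ratio. This is only a sketch of the kind of calculation the policy optimizer performs, not its implementation.

def balance_batch_size(micro_batch_size, gpu_hbm_bw, cpu_gpu_bw):
    """Batch size N that balances Eq. 11 for the MoE FFN.

    With I_FFN on the GPU ~ c*mu and I_FFN w.r.t. CPU data ~ c*N for the
    same constant c, Eq. 11 gives B_gpu*c*mu = B_cpu_gpu*c*N, i.e.
    N = mu * B_gpu / B_cpu_gpu.
    """
    return micro_batch_size * gpu_hbm_bw / cpu_gpu_bw

# Illustrative L4-like numbers: ~300 GB/s HBM vs. a ~16 GB/s CPU-GPU link.
print(balance_batch_size(128, 300e9, 16e9))  # N ~ 2400 tokens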

In conclusion, to achieve high throughput for batched MoE inference, we hope to place computations on the proper computing devices and find the best combination of $N$ and $\mu$ so that we can fully utilize all of the system's components.

4. Method

In general, we adopt the zigzag computation order proposed in FlexGen (sheng2023flexgen): loading the weights from the CPU (we do not consider disk offloading in this work) and performing the computation layer by layer. For the prefill stage, we perform all the computation on the GPU and offload the KV cache to the CPU for all the micro-batches (since the prefill stage is normally compute-bound and the computation can be easily overlapped with I/O, we do not perform further optimization for the prefill stage). For the decode stage, within each layer, we propose a fine-grained GPU-CPU-I/O pipeline schedule (Section 4.1) to increase the utilization of GPU, CPU, and I/O in the decode stage. We also build a performance model (Section 4.2) based on the HRM we extended from the Roofline Model to help search for the best hyper-parameters for the pipeline schedule, including the assignment of devices to perform different computations, the batch size, the micro-batch size, and the ratio of weights to be placed on the GPU statically. Note that for the memory-constrained scenarios we target in this paper, CPU attention is consistently better than GPU attention according to our performance model. We also conduct an ablation study in Section 6.3 to show how the best policy changes under different hardware configurations.
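The resulting search space can be pictured as a small policy record plus the memory constraint it must satisfy; the names below are illustrative stand-ins, not MoE-Lightning's actual interfaces.

from dataclasses import dataclass

@dataclass
class Policy:
    """Knobs the policy search tunes (hypothetical field names)."""
    batch_size: int          # N: tokens processed per pass of the whole model
    micro_batch_size: int    # mu: tokens per GPU kernel launch
    gpu_weight_ratio: float  # fraction of weights kept resident on the GPU
    attention_device: str    # "cpu" or "gpu"

def cpu_kv_cache_bytes(p: Policy, num_layers, avg_context, kv_bytes_per_token_layer):
    """CPU-memory footprint of the KV cache implied by a candidate policy."""
    return p.batch_size * avg_context * num_layers * kv_bytes_per_token_layer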

Figure 6. Different Scheduling Strategies: Square sizes vary with workloads and policies. For example, a larger $\mu$ or longer sequences lengthen the orange (attention) and the green (KV cache transfer from CPU to GPU) squares. Squares with red zigzag lines indicate unnecessary GPU idle times. *FastDecode (he2024fastdecode) does not consider weight offloading.

4.1. GPU-CPU-I/O Pipeline Schedule

Algorithm 1 CGOPipe
1:  for d = 1, 2, ..., gen_len do
2:      // Prologue
3:      for j = 1, 2 do
4:          PreAttn(1, j)
5:          OffloadQKV(1, j)
6:          CPUAttn(1, j)
7:          W_CtoPin(2, j)
8:      for i = 1, 2, ..., num_layers do
9:          for j = 1, 2, ..., num_ubs do
10:             LoadH(i, j)
11:             W_PintoG(i+1, j)
12:             PostAttn(i, j)
13:             // Launch CPUAttn two batches ahead
14:             PreAttn(i, j+2)
15:             OffloadQKV(i, j+2)
16:             CPUAttn(i, j+2)
17:             W_CtoPin(i+1, j+2)

Pipeline scheduling is a common approach to maximize compute and I/O resource utilization. Yet, pipelining across GPU, CPU, and I/O is not trivial. In traditional pipeline parallelism for deep learning training (huang2019gpipe; narayanan2019pipedream; fan2021dapple), models are divided into stages that are assigned to different devices. Therefore, only output activations are transferred between stages, resulting in a single type of data transfer in each direction at a time. In our scenario, both weights and intermediate results need to be transferred between GPU and CPU. Intermediate results are required immediately after computation to avoid blocking subsequent operations, whereas weights for the next layer are needed only after all micro-batches for the current layer are processed. Additionally, weight transfers typically take significantly longer than intermediate-result transfers. Consequently, naive scheduling of I/O events can lead to low I/O utilization, which also hinders computation.

CGOPipe. Fig. 6 demonstrates our proposed CGOPipe and the other three scheduling strategies adopted in existing systems. CGOPipe employs CPU attention as analyzed in Section 3.3, alongside a weight paging scheme that interleaves the transfer of intermediate results for upcoming micro-batches with paged weight transfers to optimize computation and communication overlap. The GPU sequentially processes the post-attention tasks (primarily O projection and MoE FFN) for the current micro-batch, followed by the pre-attention tasks (mainly layer norm and QKV projection) for the next micro-batch. Concurrently, the CPU handles attention (specifically the softmax part) for the next batch, and a page of weights for the subsequent layer are transferred to the GPU.

FlexGen (sheng2023flexgen) primarily employs the fourth schedule ($\mathcal{S}_4$), where attention is performed on GPU and the KV cache for the next micro-batch is prefetched during the current computation. This approach results in higher KV cache transfer latency than performing attention directly on the CPU (Section 3.3) and consumes I/O bandwidth that could otherwise be used for weight transfers, reducing resource utilization compared to CGOPipe. FlexGen also supports CPU attention and adopts the third schedule ($\mathcal{S}_3$), which is the least optimized and may even perform worse than $\mathcal{S}_4$ if the KV cache transfer latency is less than the sum of pre-attention, post-attention, and CPU attention latencies, as later shown by our evaluation results (Section 5). FastDecode (he2024fastdecode) suggests overlapping CPU attention with GPU computation, similar to the second schedule ($\mathcal{S}_2$). However, it does not target memory-constrained settings, so weight transfer scheduling is not considered.

Weights Paging and Data Transfer Scheduling. To fully utilize the I/O, we propose a weights paging scheme to interleave the data transfer for different tasks, reducing bubbles in the I/O. There are mainly four kinds of data transfer:

  • $\mathcal{D}_1$ (QKV DtoH): the intermediate results transferred from GPU to CPU after the QKV projection.
  • $\mathcal{D}_2$ (Hidden HtoD): the hidden states transferred from CPU to GPU after CPU attention.
  • $\mathcal{D}_3$ (Weights Transfer): the weights for the next layer transferred from CPU to GPU.
  • $\mathcal{D}_4$ (KV cache Transfer): the KV cache for the next micro-batch transferred from CPU to GPU.

Due to independent data paths, data transfers in opposite directions can happen simultaneously, while transfers in the same direction are performed sequentially. The challenge then mainly lies in the scheduling of $\mathcal{D}_2$, $\mathcal{D}_3$ and $\mathcal{D}_4$, which all go from CPU to GPU. For the case without CPU attention ($\mathcal{S}_4$), $\mathcal{D}_4$ usually takes a similar or longer time compared with a layer's computation, so the I/O bandwidth is almost fully utilized, leaving little room for more efficient scheduling of data transfers. As shown in the diagrams of $\mathcal{S}_2$ and $\mathcal{S}_3$, transferring a layer's weights as a whole blocks the next layer's first $\mathcal{D}_2$ for a long time, resulting in poor overall system efficiency. Instead, we chunk the weights to be transferred into $n$ pages, where $n$ equals the number of micro-batches in the pipeline, and the performance model and optimizer (Section 4.2) select the proper micro-batch size, batch size, and the proportion of weights to be transferred from CPU to GPU.

Algorithm 1 provides the order in which the main CPU task launcher thread launches the tasks to enable CGOPipe. All the tasks are executed asynchronously, and necessary synchronization primitives are added to each task to enforce the correct data dependency.
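To make the launch order in Algorithm 1 concrete, below is a minimal, runnable sketch (not MoE-Lightning's actual code) of the stream/event pattern such a launcher thread could use: GPU compute, H2D copies, D2H copies, and CPU attention run concurrently, with CUDA events enforcing the data dependencies. Plain matmuls and a softmax stand in for the real pre-/post-attention and CPU GQA kernels, and all shapes and buffers are illustrative assumptions.

```python
# Minimal sketch of asynchronous CGOPipe-style task launching with CUDA streams.
import torch
from concurrent.futures import ThreadPoolExecutor

compute = torch.cuda.Stream()   # GPU pre-/post-attention work
h2d = torch.cuda.Stream()       # CPU->GPU copies (weight pages, hidden states)
d2h = torch.cuda.Stream()       # GPU->CPU copies (offloaded QKV)
cpu_pool = ThreadPoolExecutor(max_workers=1)   # ordered CPU attention tasks

hidden = torch.randn(64, 4096, device="cuda")          # one micro-batch (illustrative)
w_qkv = torch.randn(4096, 4096, device="cuda")         # resident weights
w_page_cpu = torch.randn(4096, 4096).pin_memory()      # one page of the next layer's weights
w_page_gpu = torch.empty_like(w_page_cpu, device="cuda")
qkv_pin = torch.empty(64, 4096).pin_memory()           # pinned buffer for offloaded QKV

with torch.cuda.stream(compute):                       # GPU: pre-attention (QKV projection)
    qkv = hidden @ w_qkv

qkv_ready = torch.cuda.Event()
with torch.cuda.stream(d2h):                           # offload QKV to pinned CPU memory
    d2h.wait_stream(compute)
    qkv_pin.copy_(qkv, non_blocking=True)
    qkv_ready.record()

def cpu_attention_task():                              # CPU: attention for this micro-batch
    qkv_ready.synchronize()                            # wait until the D2H copy has landed
    return torch.softmax(qkv_pin, dim=-1)              # placeholder for the CPU GQA kernel

fut = cpu_pool.submit(cpu_attention_task)

with torch.cuda.stream(h2d):                           # I/O: prefetch a weight page meanwhile
    w_page_gpu.copy_(w_page_cpu, non_blocking=True)

attn_out = fut.result()                                # CPU attention finished
torch.cuda.synchronize()                               # weight page now resident on GPU
```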

4.2. Search Space and Performance Model

Table 1. Notations for the Performance Model Configuration

Notation | Description
Hardware Configurations, $\mathcal{H}$
$m_g$, $m_c$ | GPU, CPU memory
$b_g$, $b_c$, $b_{cg}$ | GPU, CPU, CPU-GPU bandwidth
$p_g$, $p_c$ | GPU, CPU FLOPS
Model Configurations, $\mathcal{M}$
$l$ | Number of layers
$h_1$, $h_2$ | Model, Intermediate hidden dimensions
$n_q$, $n_{kv}$ | Query, Key/Value heads in attention
$n_e$, $k$ | Number of experts, Top-k routing
$dt$ | Data type (e.g., float32)
Workload Configurations, $\mathcal{W}$
$s$ | Average prompt length
$n$ | Generation length
Policy, $\mathcal{P}$
$N$, $\mu$ | Batch, Micro-batch size
$F_g$, $A_g$ | GPU MoE FFN, Attention indicators
$r_w$, $r_c$ | Ratio of weights, KV cache stored on GPU

Given a hardware configuration $\mathcal{H}$, a model configuration $\mathcal{M}$, and a workload configuration $\mathcal{W}$, we search for the optimal policy $\mathcal{P}$ that minimizes the per-layer latency $T(\mathcal{M},\mathcal{H},\mathcal{W},\mathcal{P})$ for the pipeline schedule in Section 4.1, without violating the CPU and GPU memory constraints, in order to reach the optimal balance point (Eq. 11). Compared with FlexGen, we exclude disk-related variables from the search space and add two binary indicators for whether to perform attention or the MoE FFN on GPU.

The search space (Tab. 1) covers two integer values, the micro-batch size ($\mu$) and the batch size ($N$), and two binary indicators, $A_g$ for whether to perform attention on GPU and $F_g$ for whether to perform the MoE FFN on GPU. When $F_g=1$, we also need to decide the fraction of weights $r_w$ that can be statically stored on GPU and the fraction $1-r_w$ that needs to be transferred to GPU. Similarly, for $A_g=1$, we need to decide $r_c$. The generated policy is a 6-tuple $(N, \mu, A_g, F_g, r_w, r_c)$. For our major setting, we always get $A_g=0$ and $F_g=1$; however, we discuss in Section 6.3 different policies for various hardware settings. Notably, CGOPipe is primarily designed for $A_g=0$; when $A_g=1$, MoE-Lightning adopts $\mathcal{S}_4$.

We then build the performance model based on Eq. 7 and Eq. 8 in HRM to estimate the per-layer decode latency $T$ by:

(12)  $T(\mathcal{M},\mathcal{H},\mathcal{W},\mathcal{P}) = \max(comm^{cpu\_to\_gpu},\ T_{cpu},\ T_{gpu})$

where $comm^{cpu\_to\_gpu}$ can be computed as the number of bytes that need to be transferred from CPU to GPU for a layer's computation divided by the CPU-to-GPU memory bandwidth $b_{cg}$. Here, for simplicity, we only consider the attention computation and the MoE FFN computation in a transformer block, and therefore we have:

(13)  $T_{gpu} = T_{attn}^{g} + T_{ffn}^{g}, \quad T_{cpu} = T_{attn}^{c} + T_{ffn}^{c}$

To estimate the time to perform a computation $x$ on GPU or CPU, we can use $T_x = \max(comm_x, comp_x)$ according to Eq. 8 in HRM, resulting in:

(14)  $T_{ffn}^{g} = \max(comm_{ffn}^{g},\ comp_{ffn}^{g})$

and similarly for $T_{attn}^{g}$, $T_{attn}^{c}$ and $T_{ffn}^{c}$.

For a given computation $x$, we can calculate its theoretical FLOPs and data transfer based on $\mathcal{M}$, and then we have $comm_x^g = bytes_x / b_g$ and $comp_x^g = flops_x / p_g$ (and analogously for CPU). While there are discrepancies between the theoretical performance estimate and the kernel's real performance, such modeling provides a reasonable estimate of the relative effectiveness of any two policies. In this paper, all evaluation results of MoE-Lightning follow policies generated by a performance model with theoretically calculated computation FLOPs and bytes, using profiled peak performance and memory bandwidth for the hardware.
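As a concrete illustration, the following Python sketch evaluates Eqs. 12-14 for a candidate policy. The Hardware fields mirror Tab. 1; the counter names and all numeric values are illustrative assumptions rather than the paper's profiled numbers, and the actual optimizer would wrap this estimate in a search over $(N, \mu, A_g, F_g, r_w, r_c)$ subject to the memory constraints.

```python
# Minimal sketch of the per-layer decode latency estimate in Eqs. 12-14.
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class Hardware:
    b_g: float   # GPU memory bandwidth (bytes/s)
    b_c: float   # CPU memory bandwidth (bytes/s)
    b_cg: float  # CPU-to-GPU bandwidth (bytes/s)
    p_g: float   # GPU FLOP/s
    p_c: float   # CPU FLOP/s

def t_kernel(flops, mem_bytes, p, b):
    # Eq. 14 style bound: a kernel is limited by compute or by memory traffic.
    return max(mem_bytes / b, flops / p)

def per_layer_decode_latency(hw, policy, c):
    # c carries per-layer theoretical flops/bytes for attention and the MoE FFN,
    # plus the bytes that must cross PCIe under this policy (weight pages, QKV, hidden).
    t_attn = t_kernel(c.attn_flops, c.attn_bytes, hw.p_g, hw.b_g) if policy.A_g \
        else t_kernel(c.attn_flops, c.attn_bytes, hw.p_c, hw.b_c)
    t_ffn = t_kernel(c.ffn_flops, c.ffn_bytes, hw.p_g, hw.b_g) if policy.F_g \
        else t_kernel(c.ffn_flops, c.ffn_bytes, hw.p_c, hw.b_c)
    t_gpu = (t_attn if policy.A_g else 0.0) + (t_ffn if policy.F_g else 0.0)   # Eq. 13
    t_cpu = (0.0 if policy.A_g else t_attn) + (0.0 if policy.F_g else t_ffn)
    comm = c.cpu_to_gpu_bytes / hw.b_cg
    return max(comm, t_cpu, t_gpu)                                             # Eq. 12

# Illustrative usage: CPU attention (A_g=0), GPU MoE FFN (F_g=1) on a T4-class box.
hw = Hardware(b_g=300e9, b_c=100e9, b_cg=12e9, p_g=65e12, p_c=1.6e12)
policy = SimpleNamespace(A_g=0, F_g=1)
c = SimpleNamespace(attn_flops=5e9, attn_bytes=2.5e9,     # KV-cache-bound attention
                    ffn_flops=1.5e12, ffn_bytes=3e9,      # expert FFN for one micro-batch
                    cpu_to_gpu_bytes=2.4e9)               # weight pages + hidden states
print(per_layer_decode_latency(hw, policy, c))
```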

4.3. Tensor Parallelism

In existing works (sheng2023flexgen), pipeline parallelism is used for scaling beyond a single GPU, which requires the number of devices to scale with the model depth instead of the layer size. However, according to our analysis of MoE models in Section 3.3, the total GPU memory capacity determines the upper-bound throughput the system can achieve. Therefore, MoE-Lightning implements tensor parallelism (narayanan2021efficient) within a single node to obtain a higher throughput upper bound. In this case, we have $tp\_size$ times more GPU memory capacity and GPU memory bandwidth, and we can then search for the policy in the same way as for a single GPU.
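Under this assumption, the only change to the performance model's inputs is a scaling of the GPU-side terms, as in the rough sketch below (the numbers are illustrative, not measured):

```python
# Rough sketch: tensor parallelism within a node multiplies GPU memory capacity and
# aggregate GPU memory bandwidth by tp_size; the single-GPU policy search is then reused.
def scale_gpu_for_tp(m_g_bytes, b_g_bytes_per_s, tp_size):
    return m_g_bytes * tp_size, b_g_bytes_per_s * tp_size

m_g, b_g = scale_gpu_for_tp(16e9, 300e9, tp_size=4)   # e.g., 4xT4 (illustrative values)
```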

5. Evaluation

5.1. Setup

Table 2. Model and Hardware Configurations.

Setting | Model | GPU | CPU (Intel Xeon)
S1 | Mixtral 8x7B | 1xT4 (16G) | 2.30GHz, 24-core, 192GB
S2 | Mixtral 8x7B | 1xL4 (24G) | 2.20GHz, 24-core, 192GB
S6 | Mixtral 8x22B | 2xT4 (32G) | 2.30GHz, 32-core, 416GB
S7 | Mixtral 8x22B | 4xT4 (64G) | 2.30GHz, 32-core, 416GB
S8 | DBRX | 2xT4 (32G) | 2.30GHz, 32-core, 416GB
S9 | DBRX | 4xT4 (64G) | 2.30GHz, 32-core, 416GB
Table 3. Workload Configurations.

Dataset | $s_{avg}$ | $s_{max}$ | $l$
MTBench (zheng2024judging) | 77 | 418 | 32, 64, 128, 256
Synthetic Reasoning (liang2022holistic) | 242 | 256 | 50
Summarization (liang2022holistic) | 1693 | 1984 | 64
Figure 7. End-to-end Results for MTBench on Different Model-Hardware Configurations. Normally, MoE-Lightning's performance will be much higher than MoE-Lightning (p), since padding leads to higher memory consumption and attention computation overhead (we only show MoE-Lightning's results for S1 and S2 and omit them for S6 and S7 to focus on the comparison of the main optimizations proposed in this paper). Here MoE-Lightning achieves up to 10.3× higher throughput than FlexGen under S1 and S2.
Table 4. Performance for HELM tasks under S1 & S2. Each cell lists Throughput (token/s), $\mu$, $N/\mu$.

System | Synthetic Reasoning, S1 | Synthetic Reasoning, S2 | Summarization, S1 | Summarization, S2
FlexGen(c) | 16.903, 32, 61 | 20.015, 64, 33 | 2.614, 3, 92 | 4.307, 8, 36
FlexGen | 22.691, 32, 61 | 50.138, 64, 33 | 3.868, 3, 92 | 7.14, 8, 36
DeepSpeed | 11.832, 102, 1 | 18.589, 156, 1 | 0.965, 8, 1 | 1.447, 12, 1
MoE-Lightning (p) | 26.349, 36, 26 | 105.29, 100, 15 | 4.52, 4, 19 | 12.393, 8, 36

Implementation. We build MoE-Lightning on top of PyTorch (paszke2019pytorch), vLLM (kwon2023efficient), and SGLang (zheng2024sglang), written in Python and C++. We implement customized CPU Grouped Query Attention (GQA) kernels based on Intel's MKL library (mkl).

Models. We evaluate three popular MoE models: Mixtral 8x7B (mixtral), Mixtral 8x22B (mixtral22b), and DBRX (132B, 16 experts) (dbrx). Although not evaluated, MoE-Lightning also supports other models compatible with vLLM (kwon2023efficient)'s model classes.

Hardware. We conduct tests on various hardware settings, including a single NVIDIA T4 GPU (16GB), a single NVIDIA L4 GPU (24GB) and multiple T4 GPUs. We evaluate 6 different model and hardware settings as shown in Tab. 2.

Workloads. We use popular LLM benchmarks with different prompt length distributions to evaluate our system, as shown in Tab. 3. MTBench (zheng2024judging) includes 80 high-quality multi-turn questions across various categories such as writing and reasoning. We replicate it into thousands of questions for our batch inference use case. We test various output token lengths for MTBench: 32, 64, 128, and 256 tokens. We also pick two tasks (i.e., synthetic reasoning and summarization) from the HELM benchmarks (liang2022holistic) to test our system with longer prompt lengths.

Baselines. We evaluate MoE-Lightning and its variant MoE-Lightning (p), comparing them against two baseline systems that support running LLMs without enough GPU memory: FlexGen (sheng2023flexgen) and DeepSpeed Zero-Inference (aminabadi2022deepspeed).

  • FlexGen (sheng2023flexgen) is the state-of-the-art offloading system that targets high-throughput batch inference for OPT (opt) models. It does not support variable prompt lengths in a batch and needs to pad all requests to the maximum prompt length in the batch.
  • FlexGen(c) is FlexGen with CPU attention enabled.
  • DeepSpeed Zero-Inference (aminabadi2022deepspeed) is an offloading system that pins model weights to CPU memory and streams them layer-by-layer to GPU for computation. We use version 0.14.3 in the evaluation.
  • MoE-Lightning represents our system with all the optimizations enabled.
  • MoE-Lightning (p) represents our system running with requests padded to the maximum prompt length in the batch, to compare with FlexGen.

Metrics. We measure the generation throughput for each workload, which is calculated as the number of tokens generated divided by total generation time (i.e., prefill time + decode time).

5.2. End-to-end Results on Real Workloads

We evaluate the maximum generation throughput of all baseline systems on three workloads under the S1, S2, S6, and S7 settings. As shown in Fig. 7 and Tab. 4, MoE-Lightning (p) outperforms all baselines in all settings, and MoE-Lightning achieves up to 10.3× better throughput compared with the best of the baselines on the MTBench and HELM benchmarks. In the following sections, we analyze how MoE-Lightning (p) outperforms our baselines by integrating the key methods from Section 4.2.

Generation Length. While longer generation lengths allow for better amortization of the prefill time, which increases throughput, they also lead to higher CPU memory usage and additional attention computation or KV cache transfer overheads. This increased memory demand can limit the maximum batch size, reducing throughput. Moreover, the increase in computation or KV cache transfers can make attention the main bottleneck. Typically, throughput first increases with longer generation length and then decreases.

We observe this pattern for FlexGen and FlexGen(c) in all settings. However, MoE-Lightning (p) avoids a decrease in throughput under S1 and S6, which feature similar ratios of GPU to CPU memory. We attribute this performance improvement to CGOPipe, which significantly improves the resource utilization and renders the system GPU memory capacity bound in these settings.

On a single GPU (S1 and S2), MoE-Lightning (p) achieves up to 3.5×, 5×, and 6.7× improvement over FlexGen, FlexGen(c), and DeepSpeed, respectively.

Prompt Length. In the HELM tasks, we examine the impact of varying prompt lengths on generation throughput. Increasing the prompt length not only raises CPU memory consumption and attention overhead, but also leads to greater GPU peak memory usage during the prefill stage. Consequently, systems handling the summarization task with a 2k prompt length are bottlenecked by either GPU memory capacity or attention processes (see the ablation study in Section 6.3 for a detailed discussion of bottlenecks). Under S1, MoE-Lightning (p) achieves a 1.16× and 1.73× higher throughput than FlexGen and FlexGen(c), respectively, despite using a batch size that is 3.63× smaller, enabled by CGOPipe. DeepSpeed, utilizing a larger micro-batch size but the smallest batch size, is primarily constrained by the overhead of weight transfers. Under S2, with increased GPU memory, MoE-Lightning (p) adjusts to use a larger $\mu$ and $N$, reaching a new balance point (Eq. 11), while FlexGen and FlexGen(c) are unable to increase $N$ from their S1 settings due to CPU memory limitations. As a result, MoE-Lightning (p) achieves an even higher throughput improvement: 1.74× and 2.88× higher than FlexGen and FlexGen(c), respectively. This superior performance is attributed to MoE-Lightning (p)'s efficient resource utilization.

The synthetic reasoning task enables all systems to use a larger micro-batch size due to the shorter prompt length. Under S1, MoE-Lightning (p) achieves a 1.16×, 1.56×, and 2.22× higher throughput than FlexGen, FlexGen(c), and DeepSpeed, respectively. Under S2, MoE-Lightning (p) finds a better balance point and uses a smaller batch size than FlexGen, achieving 2.1× and 5.26× higher throughput compared to FlexGen and FlexGen(c), demonstrating the efficiency of CGOPipe and HRM.

5.3. Tensor Parallelism

This section evaluates MoE-Lightning’s ability to run on multiple GPUs with tensor parallelism. As shown in S1 and S2, due to our efficient resource utilization, MoE-Lightning’s throughput is predominantly bounded by GPU memory capacity. This shows that increasing GPU memory can raise the system’s throughput upper bound.

Figure 8. MoE-Lightning with Tensor-Parallelism for MTBench @ S8 & S9.

S6 and S7 in Fig. 7 show the end-to-end throughput results on Mixtral 8x22B for MoE-Lightning (p), FlexGen, and DeepSpeed on MTBench with multiple T4 GPUs. Notably, MoE-Lightning (p) achieves 2.77-3.38× higher throughput with 4xT4 GPUs than with 2xT4 GPUs, demonstrating super-linear scaling. DeepSpeed demonstrates linear scaling but uses a small batch size of 32, resulting in low throughput. FlexGen fails to scale under settings S6 and S7, largely due to the pipeline parallelism approach it employs. In this method, when using 4 GPUs, four layers are simultaneously active across the four GPUs during the saturated phase, increasing CPU peak memory consumption. As a result, FlexGen is bottlenecked by the CPU-to-GPU memory bandwidth and fails to take advantage of the added GPUs. Note that pipeline parallelism is more effective across multiple GPU nodes: in such configurations, doubling the number of GPUs also doubles the CPU-to-GPU bandwidth, the CPU memory capacity, and the CPU memory bandwidth (in this paper, we focus on the cases within one node).

Fig. 8 shows MoE-Lightning's generation throughput on DBRX, showcasing the performance when all optimizations are enabled (CGOPipe, HRM, and variable-length prompts). For the DBRX model and without request padding (i.e., shorter prompt lengths), the system becomes less GPU memory capacity bound. We see a 2.1-2.8× improvement when scaling from 2 GPUs to 4 GPUs.

6. Ablation Study

6.1. Optimizer Policy

In this section, we compare MoE-Lightning (p), FlexGen with its own policy, and FlexGen with our policy. For this experiment, we do not turn on CPU attention for FlexGen, as it is consistently worse than FlexGen without CPU attention. We use the workload from MTBench on the S1 setting with a generation length of 128. The results are displayed in Tab. 5. By deploying our policy, we see a 1.77× improvement for FlexGen. We also increase the batch size to better amortize the weight transfer overhead, which gives a 2.17× speedup. However, it still cannot match MoE-Lightning's throughput under the same policy, as KV cache swapping becomes the bottleneck for FlexGen in this case.

Table 5. Generation throughput for MoE-Lightning and different variants of FlexGen. (MTBench@S1, Generation length=128)

System | $\mu$ | $N$ | Throughput (token/s)
FlexGen w/ their policy | 8 | 1112 | 9.5
FlexGen w/ our policy | 36 | 504 | 16.816 (1.77×)
FlexGen w/ our policy + larger $N$ | 36 | 1116 | 20.654 (2.17×)
MoE-Lightning (p) | 36 | 504 | 30.12 (3.17×)

6.2. CPU Attention vs. Experts FFN vs. KV Transfer

In this section, we study when CPU attention becomes the bottleneck in the decode stage. For different micro-batch sizes (from 32 to 256), we test the latency of the MoE FFN kernel on an L4 GPU and compare it with the latency of the CPU attention kernel on a 24-core Intel(R) Xeon(R) CPU @ 2.20GHz with various context lengths (from 128 to 2048). Additionally, we measure the latency of swapping the KV cache needed for attention from CPU pinned memory to GPU to validate the efficiency of our CPU GQA kernel.

Figure 9. Latency comparison of a single layer's KV cache transfer, the CPU attention kernel, and the MoE FFN kernel with respect to $\mu$ and context length in the decode stage.

As shown in Fig. 9, our CPU attention kernel is 3-4× faster than the KV cache transfer, which is close to the ratio of the CPU memory bandwidth to the CPU-to-GPU memory bandwidth. The MoE FFN's latency does not change much across different micro-batch sizes, which is expected since the kernel is memory-bound in the decode stage. As the micro-batch size and context length increase, CPU attention will eventually become the bottleneck, which calls for higher CPU memory bandwidth.
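A back-of-envelope estimate makes the 3-4× gap plausible: during decode, both CPU attention and KV cache swapping are dominated by moving the same KV bytes, so their latency ratio approaches the bandwidth ratio $b_c / b_{cg}$. The sketch below uses assumed bandwidth values and an illustrative KV cache shape, not measured numbers.

```python
# Back-of-envelope comparison of CPU attention vs. swapping the KV cache to GPU.
def kv_bytes(mu, context_len, n_kv_heads, head_dim, dtype_bytes=2):
    return 2 * mu * context_len * n_kv_heads * head_dim * dtype_bytes  # K and V

kv = kv_bytes(mu=128, context_len=1024, n_kv_heads=8, head_dim=128)
b_c, b_cg = 100e9, 25e9          # assumed CPU memory bw vs. CPU-to-GPU bw (bytes/s)
t_cpu_attn = kv / b_c            # decode attention is memory-bound on the KV read
t_kv_swap = kv / b_cg            # transferring the KV cache over PCIe instead
print(t_kv_swap / t_cpu_attn)    # ~4x, in line with the 3-4x gap measured in Fig. 9
```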

6.3. Case Study on Different Hardware Settings

In this section, we study how the best policy changes under different hardware settings. As shown in the previous ablation study, CPU attention can become the bottleneck for large batch sizes and context lengths, which means that with more powerful GPUs, at some point CPU attention may no longer be worth it. Moreover, higher CPU-to-GPU memory bandwidth also changes the trade-offs. The question then becomes: when we have enough GPU memory (e.g., 2xA100-80G) to hold the model weights (e.g., Mixtral 8x7B), is it still beneficial to perform CPU computation or to offload weights/KV cache to the CPU? To conduct the analysis, we use 2xA100-80G for the GPU specification and vary the CPU-to-GPU memory bandwidth from 100 to 500 GB/s alongside different CPU capabilities. We set the base CPU specifications at $m_c=200$GB, $b_c=100$GB/s, and $p_c=1.6$TFLOPS, scaling these values by the CPU scaling ratio for the various configurations. (Note that this setup does not reflect real-world hardware scaling; rather, it simplifies the number of variables to offer a rough idea of how these hardware configurations might impact performance.)

Figure 10. Policy changes with different hardware configurations (prompt length=512, generation length=32). Red points denote performing attention on the CPU.

We can see that when running Mixtral 8x7B on two A100 GPUs, as the CPU-to-GPU memory bandwidth increases, more weights are offloaded to the CPU. KV cache offloading is highly related to the CPU scaling ratio in this setup: when the CPU scaling ratio is low (i.e., low CPU memory bandwidth), even with the highest CPU-to-GPU memory bandwidth tested here, it is not beneficial to offload the KV cache.

7. Related Work

Memory-constrained LLM Inference. LLM inference requires substantial memory to store model parameters and computation outputs, making it typically memory-capacity-bound. A line of research is dedicated to memory-constrained LLM inference, which is particularly crucial for inference hardware such as desktop computers or low-end cloud instances with limited computational power and memory capacity. To facilitate inference on such constrained systems, some work leverages sparsity or neuron activation patterns for intelligent offloading between CPU and GPU (xue2024moeinfinity; song2023powerinfer; eliseev2023fast; sheng2023flexgen). Some approaches utilize not only DRAM but also flash memory to expand the available memory resources (alizadeh2024llm). Additionally, since the CPU often remains underutilized during inference, it can be harnessed to perform complementary computations (song2023powerinfer; xuanlei2024hetegen; kamahori2024fiddler).

LLM Inference Throughput Optimization. To enhance inference throughput, some research focuses on maximizing the sharing of computation between sequences to minimize redundant processing of identical tokens (juravsky2024hydragen; zheng2024sglang). Another approach batches requests (yu2022orca) to optimize hardware utilization. Additionally, some studies develop paged memory methods for managing the key-value (KV) cache to reduce memory waste, thereby increasing the effective batch size and further improving throughput (kwon2023efficient). FastDecode (he2024fastdecode) proposes aggregating the memory and computing power of CPUs across multiple nodes to process the attention part and boost GPU throughput. Compared with FastDecode, we target the memory-constrained case where the model weights also need to be transferred between CPU and GPU, making the optimization and scheduling problem far more challenging.

LLM Inference Latency Optimization. To reduce LLM inference latency, some work addresses the inherent slowness caused by the autoregressive nature of LLM inference by developing fast decoding methods, such as speculative decoding (chen2023accelerating; stern2018blockwise; leviathan2023fast) and parallel decoding (Santilli_2023), which generate multiple tokens simultaneously. Another approach aims to decrease inference latency by implementing efficient computational kernels (dao2022flashattention; dao2023flashattention2; flashinfer2024) designed to minimize memory access and maximize GPU utilization.

8. Conclusion

We present MoE-Lightning, a high-throughput MoE inference system for GPU-constrained scenarios. MoE-Lightning achieves up to 10.3× (without request padding) and 3.5× (with request padding) higher throughput than state-of-the-art systems on a single GPU and demonstrates super-linear scaling on multiple GPUs, enabled by CGOPipe and HRM. CGOPipe is a novel pipeline scheduling strategy that improves resource utilization, and HRM is a performance model based on a Hierarchical Roofline Model, which we extend from the classical Roofline Model, to find policies with a higher throughput upper bound.

Appendix A System Implementation Details

In this section, we explain two system-level designs and their implementation details: 1. Section A.1 introduces how GPU and CPU memory are used and weights paging is implemented in MoE-Lightning, and 2. Section A.2 presents the batching algorithm employed in MoE-Lightning to support dynamic-length requests in a batch.

A.1. Memory Management

Since attention is performed on the CPU, the KV cache for all micro-batches is transferred to and stored on the CPU after the corresponding computation completes. To enable CGOPipe, we allocate a weight buffer of size $2 \times sizeof(W_L)$, where $W_L$ denotes the size of the portion of a layer's weights stored in CPU memory. This buffer enables overlapping weight prefetching: as the current layer's weights are being used, the next layer's weights are simultaneously transferred to GPU memory.

Weights are transferred in a paged manner. For example in Fig. 11, each expert in the MoE FFN kernel requires two pages, and the kernel accesses the appropriate pages using a page table. To accelerate transfers from CPU to GPU, weights are first moved from CPU memory to pinned memory, and then from pinned memory to GPU. These transfers are overlapped to hide latency. As illustrated in Fig. 11, while transferring Weights 2 for Layer 2 from pinned memory to GPU, Weights 4 for the same layer can be transferred concurrently from CPU to pinned memory.
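The following runnable sketch illustrates the staging path described above: weights move from CPU memory into a pinned buffer and then to GPU page by page, with the CPU-side staging of page k+1 overlapped with the H2D transfer of page k. The buffer sizes, page count, and double-buffered pinned staging area are illustrative choices, not MoE-Lightning's exact layout.

```python
# Sketch of paged, double-buffered weight transfer: CPU -> pinned staging -> GPU.
import torch

n_pages, page_elems = 4, 1 << 20
cpu_pages = [torch.randn(page_elems) for _ in range(n_pages)]        # weights in CPU memory
pin_buf = [torch.empty(page_elems).pin_memory() for _ in range(2)]   # double-buffered staging
gpu_buf = torch.empty(n_pages, page_elems, device="cuda")            # one layer's weight buffer
h2d = torch.cuda.Stream()

pin_buf[0].copy_(cpu_pages[0])                      # stage page 0
for k in range(n_pages):
    with torch.cuda.stream(h2d):                    # async H2D of page k (pinned source)
        gpu_buf[k].copy_(pin_buf[k % 2], non_blocking=True)
    if k + 1 < n_pages:                             # meanwhile, stage page k+1 on the CPU
        pin_buf[(k + 1) % 2].copy_(cpu_pages[k + 1])
    h2d.synchronize()                               # page k resident before reusing its slot
```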

Figure 11. Simplified Demonstration of MoE-Lightning's Memory Management.

A.2. Request Batching

For a given workload, the optimizer introduced in Section 4.2 uses the average prompt length to search for an optimal policy. However, maintaining a consistent micro-batch size becomes challenging due to varying input lengths across requests. To address this, we employ the strategy outlined in Algorithm 2 to achieve a balanced token distribution. In essence, requests are sorted by input length in descending order and assigned to micro-batches by iteratively placing the longest remaining request into the micro-batch with the fewest tokens. This approach ensures that all micro-batches have a size close to the $ubs$ specified by the generated policy.

Algorithm 2 Request Batching
算法 2 请求批处理
1:req_queuereq\_queueitalic_r italic_e italic_q _ italic_q italic_u italic_e italic_u italic_e: Queue of requests
1: req_queuereq\_queueitalic_r italic_e italic_q _ italic_q italic_u italic_e italic_u italic_e : 请求队列
2:n_ubn\_ubitalic_n _ italic_u italic_b: Number of micro-batches
2: n_ubn\_ubitalic_n _ italic_u italic_b : 微批次数量
3:ubsubsitalic_u italic_b italic_s: Maximum number of requests per micro-batches
3: ubsubsitalic_u italic_b italic_s : 每个微批次的最大请求数
4:gen_lengen\_lenitalic_g italic_e italic_n _ italic_l italic_e italic_n: Generation length per request
4: gen_lengen\_lenitalic_g italic_e italic_n _ italic_l italic_e italic_n : 每个请求的生成长度
5:cache_sizecache\_sizeitalic_c italic_a italic_c italic_h italic_e _ italic_s italic_i italic_z italic_e: Maximum cache size per micro-batches
5: cache_sizecache\_sizeitalic_c italic_a italic_c italic_h italic_e _ italic_s italic_i italic_z italic_e : 每个微批次的最大缓存大小
6:micro_batchesmicro\_batchesitalic_m italic_i italic_c italic_r italic_o _ italic_b italic_a italic_t italic_c italic_h italic_e italic_s: List of micro-batches
6: micro_batchesmicro\_batchesitalic_m italic_i italic_c italic_r italic_o _ italic_b italic_a italic_t italic_c italic_h italic_e italic_s : 微批次列表
7:aborted_requestsaborted\_requestsitalic_a italic_b italic_o italic_r italic_t italic_e italic_d _ italic_r italic_e italic_q italic_u italic_e italic_s italic_t italic_s: List of aborted requests to be added to the next batch
7: aborted_requestsaborted\_requestsitalic_a italic_b italic_o italic_r italic_t italic_e italic_d _ italic_r italic_e italic_q italic_u italic_e italic_s italic_t italic_s : 需要添加到下一个批次的中止请求列表
8:for i1i\leftarrow 1italic_i ← 1 to n_ubn\_ubitalic_n _ italic_u italic_b do
8:对 i11i\leftarrow 1italic_i ← 1n_ubn\_ubitalic_n _ italic_u italic_b 进行
9:     partitions[i]partitions[i]\leftarrow\emptysetitalic_p italic_a italic_r italic_t italic_i italic_t italic_i italic_o italic_n italic_s [ italic_i ] ← ∅
10:     partitions_sums[i]0partitions\_sums[i]\leftarrow 0italic_p italic_a italic_r italic_t italic_i italic_t italic_i italic_o italic_n italic_s _ italic_s italic_u italic_m italic_s [ italic_i ] ← 0
11: Sort(req_queue, key = λx. x.input_len, reverse = True)
12: for all req ∈ req_queue do
13:     if partitions == ∅ then
14:         aborted_requests += req
15:     idx ← arg min(partitions_sums)
16:     if (partitions_sums[idx] + req.input_len) + (1 + |partitions[idx]|) × gen_len > cache_size then
17:         aborted_requests += req
18:     else
19:         partitions[idx] += req
20:         partitions_sums[idx] += req.input_len
21:         if |partitions[idx]| == ubs then
22:             new_batch ← NewBatch(partitions[idx])
23:             micro_batches += new_batch
24:             partitions.pop(idx)
25:             partitions_sums.pop(idx)
26: return micro_batches, aborted_requests
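For concreteness, here is a minimal, self-contained Python sketch of the greedy packing in lines 11–26 (the initialization performed in the algorithm's earlier lines is folded into the function's parameters); Request, num_partitions, and the other names are illustrative stand-ins rather than MoE-Lightning's actual API.

from dataclasses import dataclass

@dataclass
class Request:                      # illustrative stand-in for a decoding request
    input_len: int

def partition_micro_batches(req_queue, num_partitions, ubs, gen_len, cache_size):
    """Greedily pack requests into micro-batches of size `ubs`, keeping the
    projected per-micro-batch KV cache footprint (prompt tokens plus generated
    tokens) under `cache_size`. Mirrors lines 11-26 of the algorithm above."""
    partitions = [[] for _ in range(num_partitions)]
    partition_sums = [0] * num_partitions
    micro_batches, aborted_requests = [], []

    # Line 11: longest prompts first, so large requests are placed early.
    req_queue = sorted(req_queue, key=lambda x: x.input_len, reverse=True)

    for req in req_queue:
        if not partitions:                       # line 13: no open partition left
            aborted_requests.append(req)
            continue
        idx = min(range(len(partitions)), key=lambda i: partition_sums[i])  # line 15
        projected = (partition_sums[idx] + req.input_len) \
            + (1 + len(partitions[idx])) * gen_len          # line 16: KV cache estimate
        if projected > cache_size:
            aborted_requests.append(req)         # line 17: would overflow the cache
        else:
            partitions[idx].append(req)          # lines 19-20
            partition_sums[idx] += req.input_len
            if len(partitions[idx]) == ubs:      # lines 21-25: micro-batch is full
                micro_batches.append(partitions.pop(idx))
                partition_sums.pop(idx)
    return micro_batches, aborted_requests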

Appendix B Further Discussion

B.1. MoE vs. Dense Models

The performance model and system optimizations proposed in this work are fully applicable to dense models. As discussed in Section 3.3, MoE models present greater challenges because of their higher memory-to-FLOPS ratio, so they benefit more from the system optimizations, which specifically aim to improve I/O efficiency and reduce pipeline bubbles. Dense models can benefit from these optimizations as well; however, they are more likely to be bottlenecked by CPU memory bandwidth during attention (depending on sequence length), where methods like sparse attention (child2019generating; zhang2024h2o; tang2024quest) and a quantized KV cache may offer larger gains.
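As a rough illustration of this memory-to-FLOPS gap (assuming Mixtral-style top-2 routing over 8 experts, weights streamed from CPU memory, and a batch large enough that every expert is activated in each layer), the MoE FFN moves about

\[
\frac{\text{bytes per FLOP (MoE FFN)}}{\text{bytes per FLOP (dense FFN of equal total size)}} \approx \frac{E_{\text{total}}}{E_{\text{active}}} = \frac{8}{2} = 4
\]

times more weight bytes per unit of compute, since all expert weights must be transferred while each token only computes with its top-2 experts; this is why the I/O-centric optimizations pay off more for MoE models.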

B.2. Optimizer Overhead

In Section 4.2, we introduced the optimization target (Eq. 12) and the search space. For a given workload, model, and hardware specification, the optimal policy can be generated offline via mixed-integer linear programming (MILP), which takes less than a minute.
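As a minimal sketch of what such an offline search can look like (not the actual formulation: the variables, bounds, constants, and the linear proxy objective below are illustrative assumptions, whereas the real optimizer maximizes Eq. 12 under the constraints of Section 4.2), an off-the-shelf MILP solver such as CBC via PuLP handles problems of this size in well under a minute:

import pulp

# Hypothetical policy variables: micro-batch size and number of in-flight
# micro-batches; the real search space also covers cache placement, etc.
prob = pulp.LpProblem("offload_policy_search", pulp.LpMaximize)
ubs = pulp.LpVariable("micro_batch_size", lowBound=1, upBound=256, cat="Integer")
n_ub = pulp.LpVariable("num_micro_batches", lowBound=1, upBound=16, cat="Integer")

kv_bytes_per_seq = 64 * 1024        # assumed per-sequence KV bytes per layer
gpu_kv_budget = 8 * 1024 ** 2       # assumed per-layer GPU KV-cache budget
expert_buf_bytes = 128 * 1024 ** 2  # assumed paged-weight buffer per micro-batch
gpu_weight_budget = 1024 ** 3       # assumed GPU memory left for weight buffers

prob += 4 * ubs + n_ub                               # linear proxy for the throughput objective
prob += ubs * kv_bytes_per_seq <= gpu_kv_budget      # micro-batch KV cache must fit on GPU
prob += n_ub * expert_buf_bytes <= gpu_weight_budget # weight-paging buffers must fit on GPU

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], int(ubs.value()), int(n_ub.value()))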

Appendix C Future Work

Advanced performance model

The HRM presented in this work is limited to hardware within a single node and does not account for GPU-GPU or multi-node communication, both of which are critical for more comprehensive distributed performance modeling. Additionally, with recent advances in leveraging KV cache sparsity for long-context inference (tang2024quest), it becomes essential to incorporate these optimizations into the performance model. For example, when CPU attention emerges as the bottleneck, the KV cache budget can be adjusted to better balance CPU and GPU computation, improving overall system efficiency.
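As a concrete sketch of the KV-cache-budget adjustment mentioned above (the inputs are purely illustrative; the real estimates would come from HRM):

def cpu_kv_budget_fraction(kv_bytes_total, cpu_mem_bw_bytes_s, gpu_compute_time_s):
    """Fraction of the KV cache to keep (and attend to) on the CPU so that the
    estimated CPU attention time -- reading that fraction once per decoding
    step at cpu_mem_bw_bytes_s -- does not exceed the GPU compute time it
    overlaps with. All arguments are illustrative assumptions."""
    full_scan_time = kv_bytes_total / cpu_mem_bw_bytes_s
    return min(1.0, gpu_compute_time_s / full_scan_time)

# e.g. 40 GB of KV cache, 50 GB/s CPU memory bandwidth, 0.5 s of GPU work per step
print(cpu_kv_budget_fraction(40e9, 50e9, 0.5))   # -> 0.625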

Disk and other hardware support

MoE-Lightning currently focuses on scenarios where GPU memory is limited but there is sufficient CPU memory to hold the model, which highlights the effectiveness of both CGOPipe and HRM. When CPU memory is insufficient to hold the entire model, however, disk offloading becomes essential. Moreover, supporting other accelerators such as TPUs would extend the versatility of MoE-Lightning to more diverse computing environments.

References

  • [1] Flashinfer AI. Flashinfer: Kernel library for llm serving. https://github.com/flashinfer-ai/flashinfer, 2024. Accessed: 2024-05-20.
  • [2] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
  • [3] Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory, 2024.
  • [4] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15. IEEE, 2022.
  • [5] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023.
  • [6] Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, and Claire Cui. Lifelong language pretraining with distribution-specialized experts. In International Conference on Machine Learning, pages 5383–5395. PMLR, 2023.
  • [7] Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, and Denny Zhou. Spreadsheetcoder: Formula prediction from semi-structured context. In International Conference on Machine Learning, pages 1661–1672. PMLR, 2021.
  • [8] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024.
  • [9] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.
  • [10] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024.
  • [11] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.
  • [12] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [13] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022.
  • [14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • [15] Artyom Eliseev and Denis Mazur. Fast inference of mixture-of-experts language models with offloading, 2023.
  • [16] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. Dapple: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
  • [17] Jiaao He and Jidong Zhai. Fastdecode: High-throughput gpu-efficient llm serving using heterogeneous pipelines, 2024.
  • [18] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
  • [19] HuggingFace. Hugging face accelerate. https://huggingface.co/docs/accelerate/index, 2022.
  • [20] Intel. Intel(r) oneapi math kernel library (onemkl). https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html, 2024.
  • [21] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Comput., 3(1):79–87, 1991.
  • [22] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts. CoRR, abs/2401.04088, 2024.
  • [23] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181–214, 1994.
  • [24] Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, and Azalia Mirhoseini. Hydragen: High-throughput llm inference with shared prefixes, 2024.
  • [25] Keisuke Kamahori, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models, 2024.
  • [26] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  • [27] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
  • [28] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023.
  • [29] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
  • [30] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving, 2024.
  • [31] Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E. Gonzalez, Ion Stoica, and Matei Zaharia. Optimizing llm queries in relational workloads, 2024.
  • [32] MistralAI. https://mistral.ai/news/mixtral-8x22b/, April 2024.
  • [33] Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911, 2022.
  • [34] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM symposium on operating systems principles, pages 1–15, 2019.
  • [35] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
  • [36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [37] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.
  • [38] Sarunya Pumma, Jongsoo Park, Jianyu Huang, Amy Yang, Jaewon Lee, Daniel Haziza, Grigory Sizov, Jeremy Reizenstein, Jeff Johnson, and Ying Zhang. Int4 decoding gqa cuda optimizations for llm inference. https://pytorch.org/blog/int4-decoding/, 2024.
  • [39] Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodola. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023.
  • [40] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  • [41] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • [42] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.
  • [43] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu, 2023.
  • [44] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models, 2018.
  • [45] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774, 2024.
  • [46] Mosaic Research Team. Introducing dbrx: A new state-of-the-art open llm, 2024. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm, March 2024. Accessed 2024-06-20.
  • [47] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [48] Samuel Williams, Andrew Waterman, and David A. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, 2009.
  • [49] Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, and Yang You. Hetegen: Efficient heterogeneous parallel inference for large language models on resource-constrained devices. Proceedings of Machine Learning and Systems, 6:162–172, 2024.
  • [50] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Moe-infinity: Activation-aware expert offloading for efficient moe serving, 2024.
  • [51] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
  • [52] Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, and Kurt Keutzer. Llm inference unveiled: Survey and roofline model insights, 2024.
  • [53] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  • [54] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [55] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
  • [56] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024.
  • [57] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.