Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for
Large Language Model Training
Abstract
Training large language models (LLMs) heavily relies on distributed training strategies, among which pipeline parallelism (PP) plays a crucial role. As training sequences extend to 32K or even 128K tokens, current PP methods face severe bottlenecks, including substantial pipeline bubbles and high memory footprint, greatly hindering training throughput and model scalability.
This paper introduces a sequence-level one-forward-one-backward (1F1B) PP method, named Seq1F1B, tailored for training LLMs on long sequences with high training throughput and memory efficiency.
Unlike typical PP methods, which adopt a batch-level pipeline schedule, Seq1F1B schedules the pipeline of training LLMs at the sequence level. It uses a computation-wise strategy to partition sequences appropriately, significantly reducing pipeline bubbles and memory footprint.
Compared to competitive PP baselines such as Megatron's 1F1B, Seq1F1B achieves 1.14× training throughput with about half the memory footprint.
Notably, Seq1F1B trains an LLM with 30B parameters on sequences up to 64K tokens using 64 NVIDIA A100 GPUs without using recomputation strategies, a feat unachievable with existing methods.
We will release our code to facilitate further research and development in LLM training on long sequences.
Ao Sun1,∗ Weilin Zhao2,∗ Xu Han2,†
Cheng Yang1,†
Xinrong Zhang 2
Zhiyuan Liu2 Chuan Shi1 Maosong Sun2
1 Beijing University of Posts and Telecommunications, Beijing, China.
2 NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China.
{maydomine, yangcheng}@bupt.edu.cn
{zwl23,zxr19}@mails.tsinghua.edu.cn
han-xu@tsinghua.edu.cn
1 Introduction
Efficient distributed strategies (Shoeybi et al., 2019; Li et al., 2020; Narayanan et al., 2021b) play a crucial role in training large language models (LLMs), and these LLMs have revolutionized various NLP tasks in recent years (Touvron et al., 2023; Reid et al., 2024; Jiang et al., 2024; Anil et al., 2023).
Among these strategies, pipeline parallelism (PP) (Huang et al., 2019; Narayanan et al., 2021b) stands out due to its low communication bandwidth requirement and great computing resource scalability, and it can be easily integrated with other strategies such as data parallelism (DP) (Li et al., 2020; Rasley et al., 2020) and tensor parallelism (TP) (Shoeybi et al., 2019; Korthikanti et al., 2023).
PP involves partitioning a model into multiple stages, with each computing device processing a stage consisting of consecutive layers.
This paradigm inherently leads to “bubbles”—the idle time caused by the execution topology between the sharded layers.
Several ingenious pipeline schedule strategies have been proposed to address this bubble problem.
GPipe (Huang et al., 2019) reduces bubbles by splitting each batch of training sequences into micro-batches, coming at the cost of increased memory usage, as each pipeline stage must store the intermediate states of all micro-batches generated during forward passes until backward passes are completed.
To address the high memory demand of GPipe, one-forward-one-backward (1F1B) methods are proposed (Harlap et al., 2018; Fan et al., 2021; Narayanan et al., 2021b). 1F1B methods give backward passes higher execution priority than forward passes and schedule them earlier without affecting the final results. Owing to this, the memory demand for storing intermediate states can be reduced without adding extra bubbles.
Generally, optimizing PP relies on handling the trade-off between bubble ratio and memory footprint.
Recently, some efforts (Buckman and Gelada; Reid et al., 2024) have noticed that long-sequence training benefits LLMs in many aspects. However, training LLMs on long sequences is challenging due to the quadratic time and memory complexity of Transformer attention modules with respect to sequence length (Vaswani et al., 2017). In distributed training scenarios, long sequences also cause various parallelism methods to fail. GPipe and existing 1F1B methods, whose minimal schedulable unit is the micro-batch, inevitably face memory overflow caused by even a single micro-batch as training sequences extend to extremely long lengths. Long sequences thus make balancing the bubble ratio and memory footprint more challenging for PP methods.
In this paper, we introduce a Sequence-Level 1F1B (Seq1F1B) PP method. This method capitalizes on the causal self-attention mechanism of LLMs to schedule pipeline stages at the sequence level. In contrast to existing 1F1B methods (Narayanan et al., 2021b; Qi et al., 2024), Seq1F1B offers significant efficiency and memory benefits.
Specifically, splitting sequences into sub-sequences allows for a significant reduction in memory footprint since only the intermediate states of sub-sequences rather than micro-batches need to be retained. Scheduling the pipeline at the sequence level yields more stages and thus reduces the bubble ratio.
However, the causal nature of LLMs also causes a dependency between the forward and backward passes of different sub-sequences: the forward pass of a later sub-sequence relies on earlier ones, while the backward pass of an earlier sub-sequence relies on later ones, which challenges the pipeline schedule. To this end, we introduce a partially ordered queue in Seq1F1B to replace the first-in-first-out (FIFO) queue used in existing 1F1B methods and reorganize the pipeline schedule, so that we can preserve the exact execution dependencies between forward and backward passes while providing synchronous parallelism.
To further improve Seq1F1B, we propose a strategy for balancing the workload across sub-sequences rather than simply splitting sequences evenly along the sequence dimension.
Extensive experiments demonstrate that Seq1F1B significantly outperforms recent popular 1F1B methods (Narayanan et al., 2021b; Fan et al., 2021) in terms of memory efficiency and training throughput for training LLMs, with sequence lengths ranging from 16K to 128K and model sizes ranging from 2.7B to 32B.
As the sequence length increases, the efficiency gain of Seq1F1B becomes more pronounced. Seq1F1B supports efficiently training an LLM with 30B parameters on sequences up to 64K tokens using 64 NVIDIA A100 GPUs without using recomputation strategies, which is unachievable with existing PP methods.
2 Related Work
Training LLMs requires using a mixture of parallelism strategies, the most important of which are DP, TP, and PP.
PP focuses on partitioning a model into multiple stages, with each device processing a stage consisting of consecutive layers, which is the core strategy to scale more devices to train LLMs. For PP, pipeline schedules can be broadly categorized into two main types: synchronous schedules and asynchronous schedules. Asynchronous schedules such as asynchronous PipeDream (Harlap et al., 2018) and PipeMare (Yang et al., 2021)
can achieve bubble-free results but suffer from the performance degradation of final trained models because they use outdated parameters to compute gradient updates.
As for synchronous schedules, GPipe (Huang et al., 2019; Li et al., 2021) and 1F1B (Fan et al., 2021; Narayanan et al., 2021b, a) are the most commonly used pipeline schedules following synchronous settings. They achieve far fewer bubbles as the number of micro-batches increases and guarantee mathematical equivalence to the original training process.
Given this, our work focuses on improving synchronous pipeline schedules as they ensure consistent semantics across different parallelism strategies.
The original GPipe (Huang et al., 2019) simply divides a batch into several micro-batches, and its scheduling process has only two phases: the forward phase and the backward phase. The backward passes are executed only after the forward passes of all micro-batches within a batch are completed.
During the forward phase, the intermediate states of each micro-batch are enqueued into a FIFO queue $Q$. During the backward phase, these intermediate states are dequeued for their corresponding backward passes. Since the backward phase happens only after all intermediate states are enqueued, GPipe exhibits an $O(m)$ memory consumption for intermediate states, where $m$ represents the number of micro-batches.
TeraPipe (Li et al., 2021) relies on the observation that, in causal language modeling, the computation of a given input token only depends on its previous tokens. It divides each of GPipe's micro-batches into multiple token spans and replaces the FIFO queue with a last-in-first-out (LIFO) queue to ensure the correct computation of gradients in the backward passes.
By using finer schedulable units (token spans), TeraPipe reduces the bubble ratio while being more memory-efficient than GPipe. Chimera (Li and Hoefler, 2021) adopts a bidirectional schedule, where each device is responsible for processing multiple stages. While Chimera reduces the bubble ratio, each device has to store redundant parameters (as stages are not evenly distributed across devices), leading to increased memory usage.
Different from GPipe, 1F1B (Narayanan et al., 2021b; Fan et al., 2021) alternates between forward and backward passes (adopting a 1F1B pattern) to keep the number of intermediate states in the FIFO queue $Q$ constant. Regardless of the number of micro-batches, 1F1B thus mitigates excessive memory usage. Based on 1F1B, 1F1B with interleaved stages (1F1B-I) (Narayanan et al., 2021b) enlarges the number of pipeline stages and assigns each device multiple stages. By interleaving stages among devices, 1F1B-I reduces the bubble ratio at the cost of adding more communication operators and slightly increasing memory consumption. Zero-bubble-pipeline (ZB1P) (Qi et al., 2024) splits each backward pass into obtaining the weight gradients and the input gradients separately, which can achieve higher pipeline efficiency by delaying weight gradient computation and using dynamic programming to optimize the schedule. ZB1P nearly achieves zero-bubble pipeline efficiency but incurs a larger memory footprint because memory release is delayed. 1F1B methods are the most popular for training LLMs, yet they still suffer from difficulties in balancing the bubble ratio and memory footprint, which is the issue we want to solve.

Figure 1: Execution timelines of the 1F1B and Seq1F1B schedules. Blank areas represent idle time, i.e., bubbles. The upper part shows the original 1F1B schedule, where each micro-batch has an ID and the dashed line at the bottom marks the schedule phases of the last device. The lower part shows our Seq1F1B schedule, where the input is split into two sub-sequences for better illustration. In the Seq1F1B illustration, light areas represent the first sub-sequence and dark areas represent the second; note that the forward passes of the dark sub-sequence follow those of the light one, while its backward passes precede those of the light one.
3 Methodology
In this section, we first give a preliminary overview of the characteristics of the 1F1B schedule and language modeling. Then, we show why it is feasible to schedule the pipeline of training LLMs at the sequence level within the micro-batches of 1F1B. Finally, we explain how Seq1F1B works in detail and how it meets the exact semantics of original language modeling.
3.1 Preliminary
As shown in Figure 1, 1F1B includes three phases to train a batch of sequences: warm-up phase, steady phase, and cooling-down phase.
Given $p$ devices (e.g., GPUs) to perform a 1F1B schedule to train $m$ micro-batches, with each device responsible for one pipeline stage, the size of PP is $p$.
After splitting the batch into $m$ micro-batches, during the warm-up phase, each device executes the forward passes of the first few micro-batches, and the number of forward passes $w_i$ executed by the $i$-th device is
$w_i = p - i + 1. \quad (1)$
When $m \le p$, 1F1B degrades to the behavior of GPipe and does not process the steady phase. Otherwise, during the warm-up phase, a device responsible for an earlier stage performs one more forward pass than the device responsible for its subsequent stage. Each forward pass results in intermediate states enqueued in the FIFO queue $Q$ to be used later for the gradient computation of backward passes.
During the steady phase, each device performs one forward pass and enqueues the resulting intermediate states into $Q$. After a device executes a forward pass, the device dequeues the corresponding intermediate states from $Q$ and immediately executes a backward pass for gradient computation, which is where the "1F1B" name comes from.
Note that the bubble ratio is minimal during the steady phase, and the number of 1F1B passes in the steady phase is given by $m - w_i$.
As $m$ increases, the proportion of the steady phase in the entire pipeline increases, which reduces the bubble ratio.
After the steady phase, the 1F1B schedule enters the cooling-down phase, which is symmetric to the warm-up phase and involves executing the same number of backward passes as in the warm-up phase.
The primary optimization of 1F1B is to ensure that the memory consumption of intermediate states is independent of $m$. The peak memory consumption for intermediate states is determined by the number of items in the queue $Q$ at the end of the warm-up phase, where the $i$-th device holds $w_i$ intermediate states. Assuming the total memory consumption of the intermediate states of one micro-batch across all stages is $M$, the peak memory consumption of the $i$-th device is $\frac{w_i}{p} M$. During the steady and cooling-down phases, this consumption does not increase since each backward pass frees the storage space for its associated intermediate states.
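To make the three phases concrete, the following minimal sketch (our illustration, not the authors' released implementation) derives the per-device order of forward ("F") and backward ("B") passes from $p$, $m$, and the warm-up count of Eq. (1); the number of items stored in the queue never exceeds $w_i$, independent of $m$.

```python
# A minimal sketch (ours) of the per-device pass order in a 1F1B schedule,
# assuming the warm-up count w_i = p - i + 1 of Eq. (1).
from collections import deque

def one_f_one_b_order(p: int, m: int, i: int):
    """Events ('F', mb) / ('B', mb) executed by the i-th device (1-indexed)."""
    w = min(p - i + 1, m)      # warm-up forward passes of device i
    queue = deque()            # FIFO queue Q of in-flight micro-batches
    events, next_fwd = [], 1
    for _ in range(w):         # warm-up phase: forward passes only
        events.append(("F", next_fwd)); queue.append(next_fwd); next_fwd += 1
    for _ in range(m - w):     # steady phase: one forward, then one backward
        events.append(("F", next_fwd)); queue.append(next_fwd); next_fwd += 1
        events.append(("B", queue.popleft()))
    while queue:               # cooling-down phase: drain the remaining backwards
        events.append(("B", queue.popleft()))
    return events

print(one_f_one_b_order(p=4, m=8, i=1))  # at most w = 4 micro-batches are ever in flight
```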

Figure 2: Execution timelines of 1F1B-I and Seq1F1B-I. The upper part shows the 1F1B-I schedule, where each micro-batch has an ID and different colors distinguish the forward/backward passes of different stages. The lower part shows the Seq1F1B-I schedule, where the input is split into two parts. In Seq1F1B-I, light areas represent the first sub-sequence and dark areas represent the second.
Language modeling is the most common unsupervised objective in training LLMs.
In language modeling, each token is predicted sequentially conditioned on the preceding tokens, embodying the principle of sequential generation, formulated as
$P(x_1, \ldots, x_S) = \prod_{t=1}^{S} P(x_t \mid x_1, \ldots, x_{t-1}), \quad (2)$
where $S$ is the sequence length. In the context of language modeling using Transformers, the causal attention mechanism ensures that each token in a sequence can only attend to itself and its predecessors when processing hidden states.
Given a sequence of input token states $(h_1, h_2, \ldots, h_S)$, the output of the attention mechanism for each token can be computed as follows. Each token state $h_i$ is mapped into a query vector $q_i$, a key vector $k_i$, and a value vector $v_i$, and the output $o_i$ for each token is computed by attending over all previous tokens as follows,
$o_i = \sum_{j=1}^{i} \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d}\right)}{\sum_{t=1}^{i} \exp\!\left(q_i \cdot k_t / \sqrt{d}\right)} \, v_j, \quad (3)$
where $d$ is the vector dimension.
Based on these characteristics, it is clear that partitioning the Transformer computation along the sequence dimension requires retaining the key and value vectors of all preceding tokens.
The forward and backward passes also need to maintain a specific order: the forward pass of each token must follow the completion of its predecessors' computation, while its backward pass requires the gradients from subsequent tokens to complete its computation.
This computational dependency needs to be fully considered in a sequence-level pipeline schedule.
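To illustrate this dependency, the simplified sketch below (ours, not the paper's code) forwards a sequence in sub-sequences: each chunk's attention must see the keys and values of all preceding tokens, which is exactly the state a sequence-level pipeline has to keep available, and the result matches full causal attention.

```python
# A simplified sketch (ours): forwarding a sequence chunk by chunk while
# retaining the keys/values of all preceding tokens, as required by Eq. (3).
import torch

def chunked_causal_attention(q, k, v, chunk_sizes, d):
    """q, k, v: [S, d] tensors; chunk_sizes: lengths l_1..l_k summing to S."""
    outputs, k_cache, v_cache, start = [], [], [], 0
    for l in chunk_sizes:
        q_c = q[start:start + l]
        k_cache.append(k[start:start + l])          # retain K of earlier tokens
        v_cache.append(v[start:start + l])          # retain V of earlier tokens
        K, V = torch.cat(k_cache), torch.cat(v_cache)
        scores = q_c @ K.T / d ** 0.5               # [l, start + l]
        # causal mask: a query may not attend to keys at later positions
        pos_q = torch.arange(start, start + l).unsqueeze(1)
        pos_k = torch.arange(start + l).unsqueeze(0)
        scores = scores.masked_fill(pos_k > pos_q, float("-inf"))
        outputs.append(torch.softmax(scores, dim=-1) @ V)
        start += l
    return torch.cat(outputs)                       # equals full causal attention

S, d = 8, 16
q, k, v = (torch.randn(S, d) for _ in range(3))
out = chunked_causal_attention(q, k, v, chunk_sizes=[3, 5], d=d)
```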
3.2 Framework of Seq1F1B
From Figure 1, we observe that the original 1F1B schedule cannot accommodate the splitting of micro-batches along the sequence dimension because the last stage needs to immediately execute a backward pass after forwarding a micro-batch. A straightforward adaptation is to divide each original 1F1B micro-batch into $k$ segments and then execute a $k$F$k$B pipeline (Li et al., 2021), i.e., forwarding all $k$ segments of a micro-batch before executing their backward passes. Although this schedule can reduce some bubbles in 1F1B, it does not save memory usage.
To achieve a more efficient sequence-level 1F1B pipeline schedule, we propose Seq1F1B.
Similar to 1F1B, the schedule of Seq1F1B is also divided into three phases: warm-up phase, steady phase, and cooling-down phase.
During the warm-up phase, the number of warm-up sub-sequences $w_i$ of the $i$-th device is computed according to
$w_i = p - i + k, \quad (4)$
where $p$ is the size of the PP and $k$ indicates the number of divisions of the sequence.
This equation ensures that the last stage can perform a backward pass on the last sub-sequence of the first micro-batch when entering the steady phase, and the device responsible for each stage performs one more forward pass than the device responsible for the subsequent stage.
Here, we construct a partially ordered queue, where each pop returns the tail sub-sequence of the earliest enqueued intermediate states. This satisfies the FIFO principle in the batch dimension and the first-in-last-out (FILO) principle in the sequence dimension. In each step of the warm-up phase, devices execute one forward pass and enqueue the corresponding intermediate states of the sub-sequences into this queue.
During the steady phase, after each device completes a forward pass, it dequeues intermediate states from the queue and performs a backward pass, following the standard 1F1B process, except that the units for forward and backward passes become sub-sequences.
During the cooling-down phase, devices dequeue the remaining intermediate states from the queue, perform backward passes for the remaining sub-sequences, and accumulate the gradients to ensure mathematical equivalence with other synchronous schedules.
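As an illustration, the partially ordered queue can be sketched as follows (our own simplification; class and variable names are not from the released code): intermediate states are kept FIFO across micro-batches and FILO across the sub-sequences within a micro-batch, so each pop returns the latest sub-sequence of the earliest micro-batch still in flight.

```python
# An illustrative sketch (ours) of the partially ordered queue used by Seq1F1B.
from collections import OrderedDict

class PartiallyOrderedQueue:
    """FIFO across micro-batches, FILO across the sub-sequences of a micro-batch."""

    def __init__(self):
        self._per_micro_batch = OrderedDict()   # micro-batch id -> stack of sub-sequence states

    def push(self, micro_batch_id, sub_seq_id, state):
        self._per_micro_batch.setdefault(micro_batch_id, []).append((sub_seq_id, state))

    def pop(self):
        mb_id, stack = next(iter(self._per_micro_batch.items()))  # earliest micro-batch (FIFO)
        sub_seq_id, state = stack.pop()                           # its latest sub-sequence (FILO)
        if not stack:
            del self._per_micro_batch[mb_id]
        return mb_id, sub_seq_id, state

q = PartiallyOrderedQueue()
for mb in (1, 2):
    for s in (1, 2):                    # two sub-sequences per micro-batch, pushed in forward order
        q.push(mb, s, state=f"act[mb{mb}, seq{s}]")
print(q.pop())   # (1, 2, 'act[mb1, seq2]'): backward starts from the last sub-sequence of micro-batch 1
print(q.pop())   # (1, 1, 'act[mb1, seq1]')
```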
From the timeline shown in Figure 1, it is evident that the
Seq1F1B schedule offers a shorter execution time and significantly fewer bubbles,
compared to the original 1F1B schedule. Meanwhile, it can be seen that each device now has less memory consumption since the sub-sequence is smaller than the micro-batch. Another observation is that optimizations similar to ZB1P can also be applied to Seq1F1B by delaying the gradient computation associated with weights in the backward pass. For more details, we refer to our appendix.
Model Size | Number of Layers | Attention Heads | Hidden Size | Sequence Length | PP Size | TP Size | Number of Micro-batches
2.7B | 32 | 32 | 2560 | 16k / 24k / 32k | 8 | 1 | 32 / 64
7B | 32 | 32 | 4096 | 32k / 64k / 128k | 4 | 8 | 16 / 32
13B | 40 | 40 | 5120 | 32k / 64k / 128k | 4 | 8 | 16 / 32
30B | 64 | 64 | 6144 | 32k / 48k / 64k | 8 | 8 | 32 / 64
Table 1: Settings used for training LLMs in our experiments.
3.3 Framework of Seq1F1B-I
As shown in Figure 2, 1F1B-I (Narayanan et al., 2021b) achieves better efficiency by modifying the 1F1B schedule to support interleaved stages among devices. In 1F1B-I, each device is assigned multiple stages. Suppose we have $p$ devices and $s$ stages in our pipeline, where $s$ is a multiple of $p$. The $i$-th device will handle stages $\{i, i + p, \ldots, i + (v-1)\,p\}$, where $v = s / p$ is the number of stages per device. The number of warm-up micro-batches $w_i$ of each device in 1F1B-I is as follows,
$w_i = (v - 1)\,p + 2\,(p - i) + 1. \quad (5)$
After completing $p$ iterations of forward and backward passes, each device switches its context to the next stage that it is responsible for. The upper part of Figure 2 shows a 1F1B-I pipeline in which each device handles 2 stages ($v = 2$). The 1F1B-I schedule reduces the bubble ratio by interleaving stages among devices. However, this interleaving slightly increases memory consumption, as the number of warm-up micro-batches $w_i$ is greater than that of 1F1B.
Similar to 1F1B-I, Seq1F1B-I further modifies 1F1B-I to achieve a sequence-level schedule. From Figure 2, Seq1F1B-I effectively reduces pipeline bubbles and the memory footprint of intermediate states compared to 1F1B-I. Seq1F1B-I defines the number of warm-up sub-sequences $w_i$ as
$w_i = (v - 1)\,p + 2\,(p - i) + k, \quad (6)$
where $p$ is the size of the PP and $k$ indicates the number of divisions of the sequence.
Using the partially ordered queue, Seq1F1B-I maintains a strict order of forward and backward passes as well as ensures the consistent semantics of gradient updates.
From the perspective of pipeline bubbles, Seq1F1B-I outperforms both Seq1F1B and 1F1B-I. Besides, Seq1F1B-I requires slightly more memory than Seq1F1B but significantly less than 1F1B-I.
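For a rough sense of the memory side of this comparison, the back-of-the-envelope sketch below (ours; it relies on the warm-up counts of Eq. (1) and Eq. (4) as reconstructed above, treats a sub-sequence as roughly $1/k$ of a micro-batch's activations, and leaves out the interleaved variants) contrasts the peak number of stored intermediate-state units per device for 1F1B and Seq1F1B.

```python
# A back-of-the-envelope sketch (ours): peak stored intermediate-state "units"
# per device, where one unit is the activation memory of one full micro-batch.

def peak_units_1f1b(p, i):
    return p - i + 1                  # w_i full micro-batches, Eq. (1)

def peak_units_seq1f1b(p, i, k):
    return (p - i + k) / k            # w_i sub-sequences, each ~1/k of a micro-batch, Eq. (4)

p, k = 8, 4
for i in (1, p):                      # first and last pipeline stage
    print(f"device {i}: 1F1B={peak_units_1f1b(p, i)} units, "
          f"Seq1F1B≈{peak_units_seq1f1b(p, i, k):.2f} units")
# device 1: 1F1B=8 units, Seq1F1B≈2.75 units
# device 8: 1F1B=1 units, Seq1F1B≈1.00 units
```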
3.4 Workload Balance
In this section, we detail the sequence partition strategy and workload balance considerations. Previous works, such as TeraPipe (Li et al., 2021), have discussed strategies for sequence partitioning. To achieve an efficient pipeline schedule, the processing cost of each sub-sequence must be approximately equal to avoid pipeline bubbles. To this end, we design a computation-wise partition strategy that estimates the FLOPs of sequences and constructs a theoretical solution aiming to make the FLOPs of all sub-sequences as close as possible.
For an input sequence $x = (x_1, x_2, \ldots, x_S)$, we divide it into $k$ segments $(s_1, s_2, \ldots, s_k)$.
Each segment $s_j$ has a length of $l_j$, where $\sum_{j=1}^{k} l_j = S$.
We expect the computational amount of each segment to be roughly the same, that is
$\mathrm{FLOPs}(s_1) \approx \mathrm{FLOPs}(s_2) \approx \cdots \approx \mathrm{FLOPs}(s_k). \quad (7)$
Specifically, we use the method proposed in (Hoffmann et al., 2022) to estimate the FLOPs for each subsequence, formulated as
$\mathrm{FLOPs}(s_j) \approx 2\, l_j N + 2\, L\, d\, l_j \sum_{t=1}^{j} l_t, \quad (8)$
in which $L$ is the number of layers, $d$ is the hidden dimension of the model, and $N$ is the total number of parameters in the model.
We have $k$ variables $l_1, \ldots, l_k$ in Eq. (8) and $k - 1$ equations in Eq. (7); together with the length constraint $\sum_{j=1}^{k} l_j = S$, we can solve the resulting system to get the optimal segmentation.
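The sketch below is our illustration of such a computation-wise partition under the cost model reconstructed in Eq. (8), where the cost of the $t$-th token is roughly $2N$ parameter-related FLOPs plus an attention term proportional to its $t$-token prefix; the exact constants and the binary-search formulation are our assumptions, not the released implementation.

```python
# An illustrative partitioner (ours): split S tokens into k segments whose
# cumulative cost under the assumed per-token cost 2N + 2*L*d*t is balanced.

def balanced_partition(S, k, N, L, d):
    def cum_cost(x):
        # cumulative cost of the first x tokens: 2N per token for parameter FLOPs
        # plus an attention term that grows with the prefix length (cf. Eq. (8))
        return 2 * N * x + L * d * x * (x + 1)
    bounds, total = [], cum_cost(S)
    for j in range(1, k + 1):
        target, lo, hi = j * total / k, 0, S
        while lo < hi:                 # smallest boundary b with cum_cost(b) >= target
            mid = (lo + hi) // 2
            if cum_cost(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        bounds.append(lo)
    return [b - a for a, b in zip([0] + bounds[:-1], bounds)]

print(balanced_partition(S=32768, k=4, N=2.7e9, L=32, d=2560))
# earlier segments come out longer than later ones, because later tokens attend
# over longer prefixes and therefore cost more per token
```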

Figure 3: Peak memory consumption for training a series of models under different sequence lengths with a fixed batch setting. "X" indicates that the experiment ran out of memory. We take the maximum memory consumption across all devices for better illustration.
Model Size | | 2.7B
Sequence Length | | 16384 | 16384 | 24576 | 24576 | 32768 | 32768
Micro-batch | | 16 | 32 | 16 | 32 | 16 | 32
Throughput (Thousand Tokens/s) | 1F1B | 32.0±0.0 | 37.1±0.0 | 27.0±0.0 | 31.4±0.0 | OOM | OOM
| 1F1B-I | 36.4±0.0 | 39.7±0.0 | OOM | OOM | OOM | OOM
| Seq1F1B | 37.3±0.0 | 38.9±0.3 | 32.6±0.0 | 34.2±0.0 | 28.8±0.0 | 30.1±0.2
| Seq1F1B-I | 38.0±0.0 | 38.9±0.0 | 33.3±0.0 | 34.3±0.0 | 29.5±0.0 | 30.3±0.0
TFLOPS per device | 1F1B | 96.9±0.0 | 112.3±0.0 | 95.5±0.1 | 111.1±0.1 | OOM | OOM
| 1F1B-I | 110.3±0.1 | 120.2±0.1 | OOM | OOM | OOM | OOM
| Seq1F1B | 113.1±0.0 | 117.8±0.8 | 115.2±0.1 | 120.9±0.1 | 116.5±0.1 | 122.0±1.0
| Seq1F1B-I | 115.2±0.0 | 118.0±0.0 | 118.0±0.1 | 121.3±0.1 | 119.4±0.0 | 122.7±0.0
Table 2: 2.7B GPT training experiments with a PP size of 8 under the 8×A100 setting.
Model Size | | 7B
Sequence Length | | 32768 | 32768 | 65536 | 65536 | 131072 | 131072
Micro-batch | | 8 | 16 | 8 | 16 | 8 | 16
Throughput (Thousand Tokens/s) | 1F1B | 48.2±0.1 | 55.3±0.2 | 37.3±0.0 | 43.1±0.0 | OOM | OOM
| 1F1B-I | 53.0±0.3 | 56.3±0.4 | 41.7±0.1 | 44.7±0.0 | OOM | OOM
| Seq1F1B | 53.5±0.3 | 55.8±0.1 | 43.3±0.0 | 45.0±0.1 | 30.4±0.0 | 31.6±0.0
| Seq1F1B-I | 47.2±0.9 | 46.2±0.8 | 40.9±0.4 | 41.0±0.3 | 30.0±0.0 | 30.4±0.0
TFLOPS per device | 1F1B | 99.7±0.2 | 114.5±0.4 | 107.5±0.0 | 124.0±0.1 | OOM | OOM
| 1F1B-I | 109.5±0.7 | 116.5±0.8 | 120.0±0.2 | 128.7±0.1 | OOM | OOM
| Seq1F1B | 110.6±0.5 | 115.3±0.2 | 124.6±0.1 | 129.7±0.5 | 136.7±0.1 | 142.1±0.0
| Seq1F1B-I | 97.7±1.8 | 95.5±1.6 | 117.8±1.3 | 118.0±0.8 | 135.1±0.2 | 136.6±0.2
Table 3: 7B GPT training experiments with a PP size of 4 and a TP size of 8 under the 32×A100 setting.
4 Experiments
4.1 Experimental Settings
In experiments, we measure Seq1F1B, Seq1F1B-I, 1F1B, and 1F1B-I under variable sequence lengths, different numbers of micro-batches, different numbers of GPUs, and different PP and TP sizes. Compared methods are as follows:
(1) Seq1F1B: Seq1F1B with computation-wise sequence partition strategy.
(2) Seq1F1B-I: Seq1F1B with interleaved stages and computation-wise sequence partition strategy.
(3) 1F1B/1F1B-I: 1F1B and 1F1B with interleaved stages in Megatron implementation.
(4) Seq1F1B w/o cwp: Seq1F1B without computation-wise sequence partition strategy.
(5) Seq1F1B-I w/o cwp: Seq1F1B-I without computation-wise sequence partition strategy.
All assessments are based on the
GPT model and model configurations are listed in Table 1.
All experiments focus on long-sequence training, since much prior work has highlighted its importance.
For hyperparameter configurations, we set the number of sequence splits to four and each device manages two stages in interleaved settings. Our implementation is based on the open-source Megatron-LM project (Narayanan et al., 2021b) and ensures reproducibility. We adopt Megatron-V3 (Korthikanti et al., 2023)’s tensor parallelism in all experiments since it is necessary for long sequence training.
Our experiments include three cluster settings:
1) 1 node with 8 NVIDIA A100 SXM 80G GPUs interconnected by NVLink.
2) 4 nodes interconnected by a RoCE RDMA network, where each node has 8 NVIDIA A100 SXM 80G GPUs interconnected by NVLink.
3) 8 nodes interconnected by a RoCE RDMA network, where each node has 8 NVIDIA A100 SXM 80G GPUs interconnected by NVLink.
Each measurement in the experiment is repeated 100 times, and the standard deviation is recorded.
4.2 Main Results
Model Size | | 13B
Sequence Length | | 32768 | 32768 | 49152 | 49152 | 65536 | 65536
Micro-batch | | 8 | 16 | 8 | 16 | 8 | 16
Throughput (Thousand Tokens/s) | 1F1B | 28.9±0.1 | 33.4±0.1 | 25.3±0.1 | 29.3±0.1 | 22.6±0.1 | 30.0±0.0
| 1F1B-I | 32.2±0.2 | 34.4±0.1 | 28.2±0.2 | 30.6±0.1 | OOM | OOM
| Seq1F1B | 32.9±0.1 | 34.3±0.1 | 29.5±0.1 | 30.8±0.0 | 26.7±0.0 | 27.8±0.0
| Seq1F1B-I | 29.7±0.4 | 29.8±0.3 | 28.0±0.2 | 28.3±0.1 | 26.4±0.1 | 26.8±0.1
TFLOPS per device | 1F1B | 106.7±0.2 | 123.0±0.5 | 109.5±0.5 | 126.2±0.6 | 111.9±0.5 | 135.1±0.2
| 1F1B-I | 118.6±0.6 | 126.9±0.4 | 121.9±0.7 | 132.2±0.4 | OOM | OOM
| Seq1F1B | 121.2±0.2 | 126.6±0.3 | 127.3±0.4 | 133.1±0.2 | 132.5±0.0 | 137.9±0.0
| Seq1F1B-I | 109.7±1.4 | 110.0±1.1 | 121.0±1.1 | 122.1±0.4 | 130.6±0.3 | 132.8±0.3
Table 4: 13B GPT training experiments with a PP size of 4 and a TP size of 8 under the 32×A100 setting.
Model Size | | 30B
Sequence Length | | 32768 | 32768 | 49152 | 49152 | 65536 | 65536
Micro-batch | | 8 | 16 | 8 | 16 | 8 | 16
Throughput (Thousand Tokens/s) | 1F1B | 26.4±0.1 | 31.2±0.2 | OOM | OOM | OOM | OOM
| 1F1B-I | OOM | OOM | OOM | OOM | OOM | OOM
| Seq1F1B | 31.3±0.1 | 33.1±0.2 | 28.2±0.1 | 29.6±0.1 | 25.5±0.0 | 26.8±0.0
| Seq1F1B-I | 28.0±0.4 | 28.4±0.2 | 26.5±0.2 | 27.1±0.2 | 24.8±0.1 | 25.2±0.1
TFLOPS per device | 1F1B | 104.8±0.3 | 123.9±0.7 | OOM | OOM | OOM | OOM
| 1F1B-I | OOM | OOM | OOM | OOM | OOM | OOM
| Seq1F1B | 124.5±0.2 | 131.5±0.6 | 129.4±0.3 | 135.6±0.3 | 132.6±0.0 | 139.2±0.0
| Seq1F1B-I | 111.1±1.6 | 113.0±1.0 | 121.5±1.1 | 124.2±0.8 | 128.6±0.3 | 130.9±0.6
Table 5: 30B GPT training experiments with a PP size of 8 and a TP size of 8 under the 64×A100 setting.
In Figure 3, we compared the memory consumption of our method with that of 1F1B and 1F1B-I. As can be seen, our method consistently requires less memory across all settings. Notably, it can support training a 30B model on a 64×A100 cluster, which is impossible for the traditional combination of PP and TP. Additionally, we recorded TFLOPS (teraFLOPS) per GPU in our experiments to measure the hardware utilization of different methods. From Tables 2, 3, 4, and 5, our method Seq1F1B outperforms 1F1B and 1F1B-I under almost all settings in both training throughput and teraFLOPS.
However, as observed in Tables 3, 4, and 5, Seq1F1B-I may suffer a performance degradation under multi-node settings. This could be due to the overly fine-grained interleaving of stage partitioning and input sequence partitioning, which also implies more communication calls in TP (although the total communication volume remains unchanged) and thus potentially leads to a decrease in performance.
Another observation is that the efficiency of Seq1F1B becomes more pronounced as the sequence length increases. This is because the computation time for each micro-sequence extends with longer sequences, thereby enhancing the benefits derived from sequence partitioning.
4.3 Ablation Results
We also conducted all experiments using Seq1F1B without computation-wise partitioning (Seq1F1B w/o cwp) and Seq1F1B-I without computation-wise partitioning (Seq1F1B-I w/o cwp) to evaluate the effectiveness of our computation-wise partition strategy. Under identical settings, employing the computation-wise partition strategy leads to performance enhancements ranging from approximately 10-30% for Seq1F1B compared to simply splitting the sequence.
Across all experimental scales, Seq1F1B consistently surpassed Seq1F1B w/o cwp in performance. Table 6 highlights the ablation performance for a 2.7B model with a sequence length of 32k, demonstrating a performance boost of approximately 28% due to the computation-wise partitioning.
5 Conclusion
Method | TFLOPS/device | SpeedUp
Seq1F1B w/o cwp | 94.8±0.1 | - |
Seq1F1B | 122.0±1.0 | 1.28 |
Seq1F1B-I w/o cwp | 103.5±0.1 | - |
Seq1F1B-I | 122.7±0.0 | 1.18 |
Table 6: Ablation experiments on the sequence partition strategy based on the 2.7B GPT, where "w/o cwp" denotes the absence of the computation-wise partition strategy.
In this paper, we present Seq1F1B, an efficient 1F1B pipeline parallel schedule oriented toward training Transformer-based LLMs on long sequences, which decomposes the batch-level schedulable units used by typical 1F1B methods into more fine-grained sequence-level units.
To achieve a better workload balance in the sequence-level pipeline, we design a computation-wise sequence partition strategy to partition sequences into balanced sub-sequences. Meanwhile, Seq1F1B can integrate with other pipeline parallel methods such as 1F1B with interleaved stages or the zero-bubble pipeline. Our evaluations demonstrate that Seq1F1B outperforms the 1F1B and 1F1B-I schedules regarding memory efficiency and training throughput under variable sequence lengths and model sizes. Moreover, Seq1F1B can support the efficient training of a 30B GPT model on sequences up to 64K tokens in length using 64 A100 GPUs, without recomputation strategies, which is unachievable with existing pipeline parallel methods.
In the future, we will further combine our method with other distributed methods to achieve better LLM training acceleration. In addition, we will release our code to support the community in training LLMs that process longer sequences more efficiently.
Limitations
The current implementation of Seq1F1B is optimized for long-context training of LLMs, which may result in performance degradation when dealing with short contexts such as 4K/8K. We recommend using Seq1F1B in environments with limited communication bandwidth, as PP incurs lower communication costs compared to other parallel strategies.
References
-
Anil et al. (2023)
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin,
Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen,
et al. 2023.
Palm 2 technical report.
arXiv preprint arXiv:2305.10403.
-
Buckman and Gelada
Jacob Buckman and Carles Gelada.
Compute-optimal Context Size.
-
Fan et al. (2021)
Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu,
Guoping Long, Jun Yang, Lixue Xia, et al. 2021.
Dapple: A pipelined data parallel approach for training large models.
In Proceedings of the 26th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, pages 431–445.
-
Harlap et al. (2018)
Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil
Devanur, Greg Ganger, and Phil Gibbons. 2018.
Pipedream: Fast and efficient pipeline parallel dnn training.
arXiv preprint arXiv:1806.03377.
-
Hoffmann et al. (2022)
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor
Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes
Welbl, Aidan Clark, et al. 2022.
Training compute-optimal large language models.
arXiv preprint arXiv:2203.15556.
-
Huang et al. (2019)
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen,
HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019.
Gpipe: Efficient training of giant neural networks using pipeline
parallelism.
Advances in neural information processing systems, 32.
-
Jiang et al. (2024)
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche
Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou
Hanna, Florian Bressand, et al. 2024.
Mixtral of experts.
arXiv preprint arXiv:2401.04088.
-
Korthikanti et al. (2023)
Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5.
-
Li et al. (2020)
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li,
Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020.
Pytorch distributed: experiences on accelerating data parallel
training.
Proceedings of the VLDB Endowment, 13(12):3005–3018.
-
Li and Hoefler (2021)
Shigang Li and Torsten Hoefler. 2021.
Chimera: efficiently training large-scale neural networks with
bidirectional pipelines.
In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, pages 1–14.
-
Li et al. (2021)
Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and
Ion Stoica. 2021.
Terapipe: Token-level pipeline parallelism for training large-scale
language models.
In International Conference on Machine Learning, pages
6543–6552. PMLR.
-
Narayanan et al. (2021a)
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia.
2021a.
Memory-efficient pipeline-parallel dnn training.
In International Conference on Machine Learning, pages
7937–7947. PMLR.
-
Narayanan et al. (2021b)
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa
Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie
Bernauer, Bryan Catanzaro, et al. 2021b.
Efficient large-scale language model training on gpu clusters using
megatron-lm.
In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, pages 1–15.
-
Qi et al. (2024)
Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2024.
Zero bubble (almost) pipeline parallelism.
In The Twelfth International Conference on Learning
Representations.
-
Rasley et al. (2020)
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020.
Deepspeed: System optimizations enable training deep learning models
with over 100 billion parameters.
In Proceedings of KDD, pages 3505–3506.
-
Reid et al. (2024)
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy
Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan
Firat, Julian Schrittwieser, et al. 2024.
Gemini 1.5: Unlocking multimodal understanding across millions of
tokens of context.
arXiv preprint arXiv:2403.05530.
-
Shoeybi et al. (2019)
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper,
and Bryan Catanzaro. 2019.
Megatron-lm: Training multi-billion parameter language models using
model parallelism.
arXiv preprint arXiv:1909.08053.
-
Touvron et al. (2023)
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine
Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale,
et al. 2023.
Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288.
-
Vaswani et al. (2017)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need.
In Proceedings of NeurIPS.
-
Yang et al. (2021)
Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher Aberger,
and Christopher De Sa. 2021.
Pipemare: Asynchronous pipeline parallel dnn training.
Proceedings of Machine Learning and Systems, 3:269–296.

Figure 4: Execution timelines of ZB1P and of Seq1F1B integrated with the zero-bubble schedule of ZB1P. Each micro-batch is labeled with an ID, and different colors distinguish the forward/backward/weight computations of different stages.
Appendix A Appendix
A.1 Integration with Zero-bubble-pipeline
From Figure 4, we can see that Seq1F1B can integrate with the ZB1P method and further reduce bubbles while reducing memory demands by splitting sequences. Such integration outperforms plain ZB1P in both memory demands and pipeline bubbles, since sequence-level pipelines naturally have fewer bubbles. Furthermore, Seq1F1B can integrate with the ZB2P and ZBV methods too. Theoretically, introducing a zero-bubble pipeline into Seq1F1B should be more efficient. Even so, such a fine-grained handcrafted schedule may suffer performance degradation under some settings. We hope our work inspires future work to solve this problem.
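To make this integration concrete, here is a minimal PyTorch-style sketch (ours, not the ZB1P or Seq1F1B code) of the underlying primitive: splitting the backward pass of a stage into an input-gradient part, which can be sent upstream immediately, and a weight-gradient part, which is deferred.

```python
# A minimal sketch (ours) of the backward split used by zero-bubble schedules:
# compute the input gradient first to unblock the upstream stage, and defer the
# weight gradient, e.g. until after the remaining sub-sequence backwards.
import torch

layer = torch.nn.Linear(16, 16)
x = torch.randn(4, 16, requires_grad=True)
y = layer(x).sum()

# B-pass: only the input gradient, keeping the graph alive for the weight pass
(grad_x,) = torch.autograd.grad(y, x, retain_graph=True)
# ... grad_x would be sent to the previous pipeline stage here ...

# W-pass (deferred): weight/bias gradients are computed later
grad_w, grad_b = torch.autograd.grad(y, (layer.weight, layer.bias))
layer.weight.grad, layer.bias.grad = grad_w, grad_b
```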