This article presents detailed few-principles reasoning about large language model inference performance, with no experiments or difficult math. The amount of understanding that can be acquired this way is really impressive and practical! A very simple model of latency for inference turns out to be a good fit for emprical results. It's helped me make better predictions and form better explanations about transformer inference.

This post assumes some prior knowledge about transformers, say at having understood most of The Illustrated Transformer but not having internalised all of it. Familiarity with this parameter counting post which I developed along with this one may also be useful.

Table of Contents 目录

  • kv cache explains the performance improvement of caching self-attention vectors as a part of inferencing, as well as the possible tradeoffs and capacity costs
    kv 缓存解释了作为推理一部分缓存自注意力向量的性能改进,以及可能的权衡和容量成本
  • capacity takes the storage cost of kv cache and connects it to the storage cost of model weights and what capacity means for performance.
    容量将 kv 缓存的存储成本与模型权重的存储成本联系起来,并解释了容量对性能意味着什么。
  • model parallelism builds up an understanding specifically of tensor parallelism to clearly identify the cost of communication
  • latency calculations pulls understanding from other concepts to create equations that serve as floorlines for inference speed.
  • batch sizes discusses what impact batch size has on performance and what sizes may be optimal.
  • flops counting steps through the transformer blocks and identifies which operations meaningfully contribute to flops speed.
    通过变换器块的 flops 计数步骤,识别哪些操作有意义地贡献于 flops 速度。
  • intermediate memory costs covers how the activations take additional memory and what that memory bandwidth costs looks like from some real benchmarks.
  • comparing against real benchmarks compares what we can calculate to what Nvidia FasterTransformer benchmarks report and identifies the discrepancies.
    与真实基准比较,比较我们能计算的与 Nvidia FasterTransformer 基准报告的内容,并识别出差异。

kv cache kv 缓存

For sampling, transformer inference consists of processing a provided prompt/context (which can happen in parallel), and then sampling additional tokens one by one (this is where the autoregressiveness surfaces). In the sampling, the transformer performs self-attention, which requires the kv values for each item currently in the sequence (whether it was prompt/context or a generated token). These vectors are provided a matrix known as the kv cache, aka past cache (the open source GPT-2 implementation called it past). The past cache would be shaped like [batch, 2, num_heads, seq_len, features].
对于采样,变换器推理包括处理提供的提示/上下文(这可以并行进行),然后逐个采样额外的令牌(这是自回归性显现的地方)。在采样中,变换器执行自注意力,这需要序列中每个项目(无论是提示/上下文还是生成的令牌)的 kv 值。这些向量提供了一个矩阵,称为 kv 缓存,又称为过去缓存(开源的 GPT-2 实现称之为 past )。过去的缓存将呈现 [batch, 2, num_heads, seq_len, features] 的形状。

The purpose of this is to avoid recalculations of those vectors every time we sample a token. With the computed k,vk, v values, we can save quite a bit of computation at the cost of some storage. Per token, the number of bytes we store is
这样做的目的是为了避免每次采样令牌时都重新计算这些向量。通过计算的 k,vk, v 值,我们可以在一定程度上节省大量计算,代价是一些存储。每个令牌,我们存储的字节数是

22nlayersnheadsdhead2\cdot 2 \cdot n_\text{layers} \cdot n_\text{heads} \cdot d_\text{head}

The first factor of 2 is to account for the two vectors, kk and vv. We store that per each layer, and each of those values is a nheads×dhead n_\text{heads}\times d_\text{head} matrix. Then multiply by 2 again for the number of bytes (we'll assume 16-bit formats throughout the post).
2 的第一个因素是为了考虑两个向量, kkvv 。我们将其存储在每一层中,每个值都是一个 nheads×dhead n_\text{heads}\times d_\text{head} 矩阵。然后再次乘以 2,用于字节数(我们将在整篇文章中假设 16 位格式)。

The weights that we multiply by the token embeddings are Wk,WvRdmodel×dmodelW_\text{k}, W_\text{v} \in \mathbb{R}^{d_\text{model}\times d_\text{model}} and then each token embedding is teR1×dmodelt_\text{e}\in \mathbb{R}^{1\times d_\text{model}}. So then the flops to compute kk and vv for all our layers is
我们用来乘以令牌嵌入的权重是 Wk,WvRdmodel×dmodelW_\text{k}, W_\text{v} \in \mathbb{R}^{d_\text{model}\times d_\text{model}} ,然后每个令牌嵌入是 teR1×dmodelt_\text{e}\in \mathbb{R}^{1\times d_\text{model}} 。所以,计算 kkvv 对于我们所有层的 flops 是

22nlayersdmodel22 \cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2

We multiply tet_\text{e} by WkW_\text{k}, which takes 2dmodel22 \cdot {d_\text{model}}^2 flops. We have another factor of 2 as we do that twice, once each for kk and vv and then repeat for nlayersn_\text{layers}.
我们将 tet_\text{e} 乘以 WkW_\text{k} ,这需要 2dmodel22 \cdot {d_\text{model}}^2 浮点运算。我们还有另一个因子 2,因为我们做了两次,每次分别对 kkvv ,然后再重复 nlayersn_\text{layers}

How many flops in a matmul?

The computation for a matrix-vector multiplication is 2mn2mn for ARm×n,bRnA \in \mathbb{R}^{m\times n}, b \in \mathbb{R}^{n}. A matrix-matrix is 2mnp2mnp for ARm×n,BRn×pA \in \mathbb{R}^{m\times n}, B \in \mathbb{R}^{n \times p}. The mnmn factor makes a lot of sense, and the two comes from the fact that a matmuls are composed of multiply(1)-add(2) operations. More in these lecture notes.
矩阵-向量乘法的计算是 2mn2mn 对于 ARm×n,bRnA \in \mathbb{R}^{m\times n}, b \in \mathbb{R}^{n} 。矩阵-矩阵乘法是 2mnp2mnp 对于 ARm×n,BRn×pA \in \mathbb{R}^{m\times n}, B \in \mathbb{R}^{n \times p} 。因子 mnmn 非常有意义,而且两者来自于矩阵乘法是由乘法(1)-加法(2)操作组成的事实。更多内容在这些讲义中。

This means for a 52B parameter model (taking Anthropic's, where dmodel=8192d_\text{model} = 8192 and nlayers=64n_\text{layers} = 64). The flops are
这意味着对于一个 52B 参数模型(以 Anthropic 的为例,其中 dmodel=8192d_\text{model} = 8192nlayers=64n_\text{layers} = 64 )。浮点运算是

226481922=17,179,869,1842 \cdot 2 \cdot 64 \cdot 8192^2 = 17,179,869,184

Say we have an A100 GPU, which does 312e12312\text{e}12 flops per second and 1.5e121.5\text{e}12 bytes per second of memory bandwidth. The following are numbers for just the kv weights and computations.
假设我们有一个 A100 GPU,每秒可以执行 312e12312\text{e}12 次浮点运算并且具有 1.5e121.5\text{e}12 字节每秒的内存带宽。以下是仅针对 kv 权重和计算的数字。

memory=22nlayersdmodel2÷1.5e12compute=22nlayersdmodel2÷312e12\text{memory} = 2 \cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 1.5\text{e}12\\ \text{compute} = 2 \cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12\\

Flops vs Memory Boundedness

Flops vs memory boundedness is something we deal with a lot for transformer inference, but also in deep learning optimisation in general. To do the computations we do, we need to load weights which costs memory bandwidth. We assume (correctly, this has been very well optimised) that we can start the computations while we load the weights. Flop bound would then mean that there is time when nothing is being passed through memory, and memory bound would mean that no floperations are occuring. Nvidia uses the term math bandwidth which I find really cute. Technically, this delineation exist per kernel but can be abstracted to exist for groups of operations.
对于变压器推理,我们经常处理浮点运算与内存限制的问题,但这也普遍存在于深度学习优化中。为了进行计算,我们需要加载权重,这会消耗内存带宽。我们假设(正确地,这已经得到了非常好的优化)我们可以在加载权重的同时开始计算。浮点运算限制意味着有时候没有数据通过内存传输,而内存限制意味着没有浮点运算发生。Nvidia 使用了“数学带宽”这个我觉得非常可爱的术语。技术上,这种划分是按内核存在的,但可以抽象为存在于一组操作中。

None of the model architecture matters anymore — we get a distinct ratio here of 208 given this hardware specification. This means that if we're going to compute kv for one token, it'll take the same amount of time to compute for up to 208 tokens! Anything below, we're memory bandwidth bound. Above, flops bound. If we used the rest of our weights to do a full forwards pass (run the rest of the transformer) on our context, it's also 208 (both the numerator and denominator get a factor of 6 added). This will be reasoned thoroughly in future sections. The intersection of the below diagram is at 208, though in reality the memory line does have a slight slope due to memory cost of intermediate calculations (discussed in the last section).
模型架构不再重要 — 在这个硬件规格下,我们得到了一个明确的比率 208。这意味着,如果我们要为一个令牌计算 kv,计算多达 208 个令牌所需的时间是相同的!任何低于此的,我们受到内存带宽的限制。高于此的,受到浮点运算的限制。如果我们使用剩余的权重来进行完整的前向传递(在我们的上下文中运行变压器的其余部分),它也是 208(分子和分母都增加了一个因子 6)。这将在未来的章节中彻底讨论。下图的交点是在 208,尽管实际上由于中间计算的内存成本(在最后一节中讨论),内存线确实有轻微的斜率。

For a 52B model full forwards pass, that's 122nlayersdmodel2/1.5e126912\cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2 / 1.5\text{e}12 \approx 69 milliseconds for up to 208 tokens (in practice, we'd use four GPUs in parallel so it would actually be ~17 milliseconds, more in following sections). If we had 416 (double) tokens in the context, then it would take twice as long, and 312 tokens would take 1.5 times as long.
对于一个 52B 模型的完整前向传递,这是 122nlayersdmodel2/1.5e126912\cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2 / 1.5\text{e}12 \approx 69 毫秒,最多可处理 208 个令牌(实际上,我们会并行使用四个 GPU,所以实际上大约是 17 毫秒,后续部分会有更多)。如果我们有 416 个(双倍)令牌在上下文中,那么它将花费两倍的时间,312 个令牌将花费 1.5 倍的时间。

Calculating for a kv cache token is exactly 1/6th of the compute of passing the token through the model. In general, these forwards passes (what we experience in getting logits, embeddings and training) are very cheap because of the parallelism that is possible as opposed to sampling where we're forced to read through all the weights for each token and do the autoregression.
计算 kv 缓存令牌的计算量正好是通过模型传递令牌计算量的 1/6。一般来说,这些前向传递(我们在获取 logits、嵌入和训练时的体验)非常便宜,因为与采样相比,可能的并行性。在采样中,我们被迫阅读每个令牌的所有权重并进行自回归。

This doesn't mean that 1/6th of the time is saved! Let's assume we are flops bound. Then at each sample step, we save 22ntokensnlayersdmodel2÷312e122 \cdot 2 \cdot n_\text{tokens} \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12 flops while the decoding steps costs 212nlayersdmodel2÷312e122 \cdot 12 \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12. Thus at each step we save 1/6 of the flops time multiplied by the number of tokens in our sequence (big!) — which increases as we sample tokens. It is the case that without a kv cache, sampling would be quadratic in time complexity as we increase the number of tokens.
这并不意味着节省了 1/6 的时间!假设我们受到 flops 的限制。那么在每个样本步骤中,我们节省了 22ntokensnlayersdmodel2÷312e122 \cdot 2 \cdot n_\text{tokens} \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12 flops,而解码步骤的成本是 212nlayersdmodel2÷312e122 \cdot 12 \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12 。因此,在每一步中,我们节省的 flops 时间乘以我们序列中的令牌数量(大!)——随着我们采样令牌,这个数量会增加。如果没有 kv 缓存,随着令牌数量的增加,采样将在时间复杂度上是二次方的。

This is not the whole story (given overheads and tradeoffs associated with storing this cache). If we're serving small batches we may be memory bandwidth bound rather than flops, in which case we won't even want to use the past cache and will instead happily do recomputations, spending the flops (we'll already be paying the memory cost to do our sampling).
这并不是全部的故事(考虑到与存储此缓存相关的开销和权衡)。如果我们服务的是小批量,我们可能受到内存带宽的限制而不是 flops,在这种情况下,我们甚至不想使用过去的缓存,而是愿意重新计算,花费 flops(我们已经在为我们的采样支付内存成本了)。

capacity 容量

We have a solid idea of the two things we store in our GPUs — kv cache and weights. GPU capacity does come into play for transformer inferencing performance and we have all the understanding we need to evaluate that now!
我们对我们在 GPU 中存储的两样东西——kv 缓存和权重有了清晰的了解。GPU 容量确实对变压器推理性能有影响,我们现在已经拥有了评估它所需的所有理解!

Nvidia A100 GPUs (which are generally speaking, the best GPUs we can get for inference) have a standard of 40GB of capacity. There are ones with 80GB and higher memory bandwidth (2e12 instead of 1.5e12) but they aren't available with any large cloud providers yet which means they aren't real to me!
Nvidia A100 GPU(一般来说,我们可以获得的最好的推理 GPU)的标准容量为 40GB。有些是 80GB 和更高的内存带宽(2e12 而不是 1.5e12),但它们还没有在任何大型云提供商那里可用,这意味着对我来说它们不是真实的!

Given the parameter count, we can multiply by two to get bytes. So to calculate the size of the weights for a 52B model.
鉴于参数计数,我们可以乘以二来获取字节。因此,要计算 52B 模型的权重大小。

52e122=104e12bytes104GB52\text{e}12 \cdot 2 = 104\text{e}12 \text{bytes} \approx 104\text{GB}\\

Oh no! This doesn't fit in one GPU! We'd need at least three GPUs just to have all the weights loaded in (will discuss how to do that sharding later). That leaves us 120104=16GB120-104 = 16GB left for our kv cache. Is that enough? Back to our equation for kv cache memory per token, again with a 52B model;
哦不!这个不能完全装进一个 GPU 里!我们至少需要三个 GPU 才能加载所有的权重(稍后将讨论如何进行这种分片)。这就给我们的 kv 缓存留下了 120104=16GB120-104 = 16GB 。这足够吗?回到我们的 kv 缓存每个令牌的内存方程,再次以一个 52B 模型为例;

2nlayersnheadsdhead24nlayersnheadsdhead=4648192=2,097,152bytes0.002GB2\cdot n_\text{layers} \cdot n_\text{heads} \cdot d_\text{head} \cdot 2\\ 4\cdot n_\text{layers} \cdot n_\text{heads} \cdot d_\text{head} \\ = 4\cdot 64 \cdot 8192\\ = 2,097,152 \text{bytes} \approx 0.002 GB
And then we'd do 16/0.002800016/0.002 \approx 8000 tokens can fit into our kv cache with this GPU set up, or that we could do up to a batch size 4 where each request has up to 2048 tokens (and higher sizes for less tokens).
然后我们会做出 16/0.002800016/0.002 \approx 8000 令牌可以适应我们的 kv 缓存,有了这个 GPU 设置,或者我们可以做到批量大小为 4,其中每个请求最多有 2048 个令牌(对于更少的令牌有更高的大小)。

This sucks because we would like to be able to do higher batch sizes, but are capacity limited! Higher batch sizes are more efficient in terms of how much GPU time it takes to process the same request. On the other hand, at batch sizes this low we're bound to be memory bound, and should forego the kv cache and just pay the flops cost instead.
这很糟糕,因为我们希望能够执行更高的批量大小,但是受到容量限制!更高的批量大小在处理相同请求时,就 GPU 时间而言更有效率。另一方面,在这么低的批量大小下,我们肯定会受到内存限制,应该放弃 kv 缓存,只支付 flops 成本。

For four GPUs, we'd get 56/0.0022300056/0.002 \approx 23000. We definitely want to go for the four GPUs since we'll want to be able to do higher batch sizes, and it's silly to to divide powers of two over three GPUs. But it's not just batch size! If we have high volume, then we'd have multiple instances of our models. We approximately want each instance to be able to do as large as a batch size as possible, as we pay the cost of storing the weights anyway.
对于四个 GPU,我们会得到 56/0.0022300056/0.002 \approx 23000 。我们绝对想要四个 GPU,因为我们希望能够执行更高的批量大小,而且将两个的幂分配到三个 GPU 上是愚蠢的。但这不仅仅是批量大小的问题!如果我们有高容量,那么我们会有多个模型实例。我们大约希望每个实例能够尽可能执行更大的批量大小,因为我们无论如何都要支付存储权重的成本。

There's some extra space used by intermediate calculation steps, but they're negligible.

model parallelism 模型并行性

I'm not going to build up full understanding of model parallelism and all the implementation details, because many have done so. But we will build out the parts of the understanding that are useful to figure to make performance decisions and calculate communication costs!

The outcome of model parallelism, is that the cost of passing all the weights through through memory and the flops are all divided over the degree (number of accelerators we use).
模型并行性的结果是,通过内存传递所有权重的成本和 flops 都分散在我们使用的加速器的数量(加速器的数量)上。

We will assume tensor parallel (model parallel) where we will split down the middle of the model. Each accelerator will execute as much as it can with its shards of the weights and will communicate whenever synchronisation is required. A more naive way is pipeline parallel, where each GPU will hold onto a fraction of the layers. This does successfully even out the weight loading cost, but has the obvious silly that all but one GPU will be idling! In training you could pipeline through it (as the first batch moves onto the next GPU, start on a new batch on the first GPU) but it doesn't work out for a single sample request (though you could still do it for multiple requests). Pipeline also doesn't exhaust the memory bandwidth, which is actually ok if you're flops bound anyway. The only place where pipeline parallel does better is communications. A pipeline parallel model would communicate dmodeld_\text{model} per accelerator, while a model parallel does NdmodelN\cdot d_\text{model} per layer where NN is the number of accelerators.
我们将采用张量并行(模型并行)方式,将模型从中间分开。每个加速器将尽可能多地执行其权重分片,并在需要同步时进行通信。一种更简单的方式是流水线并行,其中每个GPU将保留一部分层。这确实成功地平衡了权重加载成本,但有一个明显的缺点,那就是除了一个GPU外,所有GPU都将处于空闲状态!在训练中,你可以通过它进行流水线处理(当第一批数据移动到下一个GPU时,在第一个GPU上开始处理新的批次),但这对于单个样本请求来说行不通(尽管对于多个请求你仍然可以这样做)。流水线并行也没有耗尽内存带宽,如果你 anyway 受限于浮点运算次数,这实际上是可以接受的。流水线并行唯一表现更好的地方是通信。流水线并行模型会在每个加速器上进行 dmodeld_\text{model} 次通信,而模型并行则在每层进行 NdmodelN\cdot d_\text{model} 次通信,其中 NN 是加速器的数量。

Here we introduce the last constant for our A100 GPUs which is a communication bandwith of 300GB/s. The doc marks it as 600GB/s because Nvidia is adding up 300GB/s into each chip and 300GB/s out simultaneously rather than using a bidirectional number (which will be more intuitive for our calculations).
在这里,我们介绍了我们 A100 GPU 的最后一个常数,即通信带宽为 300GB/s。文档将其标记为 600GB/s,因为 Nvidia 同时将 300GB/s 加入每个芯片并同时输出 300GB/s,而不是使用双向数字(这对我们的计算来说会更直观)。

In this diagram, we start by following the yellow brick road where we insert our token embeddings into the bottom of the model. The purple boxes outline how our weights would be split across the accelerators, and we work with an extremely tiny model so we can draw everything to scale. A general idea is that if we have two matrices XX and YY we can shard both of them and multiply the shards. This doesn't actually complete the matmul of XYX\cdot Y, and an easy way to tell (other than our ability to multiply matrices) is that if we concatenated the result of multiplying the shards, we get too big of a matrix. Instead, we would want to communicate, compute a shard sum, communicate that sum back out and then concatenate for the output of XYX \cdot Y.
在这个图表中,我们首先沿着黄砖路开始,将我们的令牌嵌入插入模型的底部。紫色框架概述了我们的权重如何在加速器之间分割,我们使用一个极小的模型,这样我们就可以按比例绘制一切。一个基本的想法是,如果我们有两个矩阵 XXYY ,我们可以将它们都分片然后乘以分片。这实际上并不完成 XYX\cdot Y 的矩阵乘法,一个简单的判断方法(除了我们能够乘以矩阵的能力之外)是,如果我们连接乘以分片的结果,我们会得到一个太大的矩阵。相反,我们会想要通信,计算一个分片和,将该和通信回去,然后连接以得到 XYX \cdot Y 的输出。

For attention the parallelism is intuitive from the fact that we have multiple heads. We go through most of the attention layer without communication because our attention heads are concatenated to multiply by WoW_o. After we multiply by vv, we multiply the result by our shard of WoW_o to get a shard of osRdmodel×nheads/No_s \in \mathbb{R}^{d_\text{model}\times n_\text{heads}/N}. Then each accelerator will communicate its own shard to all the others, and all the others will communicate their shards back. This is (N1)dmodel/N(N-1)d_\text{model}/N of comms cost. Each accelerator will do an even share of the addition to get the output projection, then do the same communication they did last time and the individual hosts will do the concatenation (approximately instant).
对于注意力并行性来说,直观的是我们有多个头。我们在没有通信的情况下完成了大部分注意力层,因为我们的注意力头被连接起来乘以 WoW_o 。在我们乘以 vv 之后,我们将结果乘以我们的 WoW_o 分片以得到一个 osRdmodel×nheads/No_s \in \mathbb{R}^{d_\text{model}\times n_\text{heads}/N} 分片。然后每个加速器将其自己的分片通信给所有其他加速器,所有其他加速器将它们的分片通信回来。这是 (N1)dmodel/N(N-1)d_\text{model}/N 的通信成本。每个加速器将平均分担加法以得到输出投影,然后进行与上次相同的通信,个别主机将进行连接(大约瞬间完成)。

The MLP layer is by nature very similar! Just like we have WoW_o to project our multi-headed attention results back down to a vector of length dmodeld_\text{model}, we have W1R4×dmodelW_1\in \mathbb{R}^{4\times d_\text{model}} and W2Rdmodel×4W_2\in \mathbb{R}^{d_\text{model}\times 4} to make a dimension 4 times larger and then project it back down. The same two communications are done at the end of the MLP.
MLP 层的本质非常相似!就像我们有 WoW_o 将我们的多头注意力结果投影回一个长度为 dmodeld_\text{model} 的向量,我们有 W1R4×dmodelW_1\in \mathbb{R}^{4\times d_\text{model}}W2Rdmodel×4W_2\in \mathbb{R}^{d_\text{model}\times 4} 使维度变大四倍然后再投影回来。在 MLP 的末尾完成相同的两次通信。

Ultimately we do 4(N1)dmodel/N4 \cdot (N - 1)d_\text{model}/N bytes of communication. The kv cache is split across heads by GPU.
最终我们进行了 4(N1)dmodel/N4 \cdot (N - 1)d_\text{model}/N 字节的通信。kv 缓存由 GPU 在头之间分割。

latency calculations 延迟计算

We've discussed the capacity fairly thoroughly, mapped out comms in the model parallelism section and discussed general compute steps. Now we'll build it into equations that estimate latency!

Our latency calculations are mostly about the flops vs memory boundedness. If we have a small number of multiplies to do per parameter, then maybe we'll be throttled by memory bandwidth. Flops are increased by both batch size and number of parameters, while memory is only increased by number of parameters.
我们的延迟计算主要是关于 flops 与内存限制性的。如果我们每个参数要做的乘法数量很少,那么我们可能会受到内存带宽的限制。Flops 通过批量大小和参数数量增加,而内存仅通过参数数量增加。

For comms, it's not about boundedness, but rather about adding a latency term and a throughput term (the 300GB/s). Something tricky about the latency side of this figure is that it's not reported, so the best I can do is guess "approximately small", which is approximately 8 microseconds per message sent as found in this Citadel paper but it's for V100 NVLink.
对于通信来说,重点不在于有界性,而是要加上一个延迟项和一个吞吐量项(即 300GB/s)。这个图表中关于延迟方面的棘手之处在于它没有被报告,所以我能做的最好猜测是“大约很小”,这大约是每发送一条消息 8 微秒,如同在这篇 Citadel 论文中发现的,但它是针对 V100 NVLink 的。

Because of the compute factors, to calculate the latency of a single token decoding step we'd have two formulas - one for memory bandwidth bound (small batch) and another for flops bound (large batch). For large batch, we'll drop the latency factor for communications.
由于计算因素,要计算单个令牌解码步骤的延迟,我们需要两个公式 - 一个是内存带宽受限(小批量)的,另一个是 flops 受限(大批量)的。对于大批量,我们将忽略通信的延迟因素。

Equations for a small batch (say 1, so we can drop the batch factor) would be; (where NN is the number of accelerators and PP is the number of parameters and bb is "byte" as a unit)
小批量的方程式(比如说 1,所以我们可以忽略批量因素)将是;(其中 NN 是加速器的数量, PP 是参数的数量, bb 是单位为“字节”)

compute=2PbNAbms1comms=4nlayers8μs\text{compute} = \frac{2 \cdot P\cdot\text{b}}{N \cdot A_{\text{bm}} \text{b}\space\text{s}^{-1}}\\ \text{comms} = 4 \cdot n_\text{layers} \cdot 8\mu\text{s}
There is 2P2 \cdot P because we need to pass all the parameters through the memory, and each parameter is two bytes. AbmA_\text{bm} is the accelerator memory bandwidth, and this cost is split across accelerators. For comms, we have 4nlayers 4 \cdot n_\text{layers} communications per layer, and the latency per each request. Comms will usually come out to be relatively small so for the compute bound case we won't need to pay attention to it anyway. There's also a throughput cost in comms which also rounds away.
2P2 \cdot P 是因为我们需要通过内存传递所有参数,每个参数是两个字节。 AbmA_\text{bm} 是加速器的内存带宽,这个成本在加速器之间分摊。对于通信,我们有 4nlayers 4 \cdot n_\text{layers} 每层的通信次数,以及每个请求的延迟。通信通常会相对较小,所以对于计算受限情况,我们无需过多关注。通信中还有一个吞吐量成本,也会被忽略。

There's another sometimes-significant factor here which is the read time for the kv cache, which I'll leave out of the equation now since it depends on number of context tokens, which can even vary within a batch and total number of tokens we want to sample. This would be calculated as memory bandwidth time. Another missing memory bandwidth time is the read of the unembeddings to calculate logits at each sampling step, which is Rdmodel×nvocab \in \mathbb{R}^{d_\text{model}\times n_\text{vocab}}.
这里还有另一个有时候很重要的因素,即 kv 缓存的读取时间,现在我将其从方程中省略,因为它依赖于上下文令牌的数量,这甚至可以在一个批次内变化,以及我们想要抽样的令牌总数。这将被计算为内存带宽时间。另一个缺失的内存带宽时间是在每个抽样步骤计算 logits 时读取 unembeddings 的时间,这是 Rdmodel×nvocab \in \mathbb{R}^{d_\text{model}\times n_\text{vocab}}

As previously mentioned, the memory does not actually stay constant, rather some additional memory is used per batch for intermediate activations. The reason we don't factor this in is simply because it's hard to count as it varies a lot by the software stack, compiler optimisations, etc.

For large batches (say 512), where BB is the batch size;
对于大批量(比如 512),其中 BB 是批量大小;

compute=B2PFLOPNAfFLOP s1comms=B2nlayers4dmodelbAcs1\text{compute} = B \cdot \frac{2 \cdot P \cdot \text{FLOP}}{N \cdot A_f\text{FLOP}\space\text{s}^{-1}}\\ \text{comms} = B \cdot \frac{2\cdot n_\text{layers} \cdot 4 \cdot d_\text{model} \cdot \text{b}}{A_c\text{b}\space\text{s}^{-1}}

Where AfA_f is the flops of the accelerator and AcA_c is the comms bandwidth. We do 2P2\cdot P flops of operations, which can be intuited by the fact that we matmul through all the parameters, and as mentioned earlier, a matrix-vector multiplication is 2mn2mn given ARm×n,bRnA \in \mathbb{R}^{m\times n}, b \in \mathbb{R}^{n}.
其中 AfA_f 是加速器的浮点运算次数, AcA_c 是通信带宽。我们进行了 2P2\cdot P 浮点运算操作,这可以通过我们通过所有参数进行矩阵乘法的事实来直观理解,正如前面提到的,给定 ARm×n,bRnA \in \mathbb{R}^{m\times n}, b \in \mathbb{R}^{n} 时,矩阵-向量乘法是 2mn2mn

For comms, we see the four (I'll round that N1N-1 factor to NN) communications each of a dmodeld_{model} size vector per layer as explained in the model parallelism section. We swapped out the latency calculation for a throughput one. Then it's all divided by the comms bandwidth.
对于通信,我们看到四个(我会将那个 N1N-1 因子四舍五入到 NN )每层的通信,每个通信的向量大小为 dmodeld_{model} ,如模型并行性部分所解释的。我们用吞吐量计算代替了延迟计算。然后它全部被通信带宽除。

Let's play with a larger model, a Gopher sized 260B model on 16 GPUs. For a small batch, it's 22 ms per token generated. The throughput cost for the comms which we can calculate with the equation for large batch is approximately 35 microseconds, assuring us that it was safe to drop.
让我们使用更大的模型,一个在 16 个 GPU 上的 260B 大小的 Gopher 模型。对于一个小批量,每生成一个令牌需要 22 毫秒。我们可以用大批量的方程计算的通信吞吐成本大约为 35 微秒,确保我们放弃是安全的。

compute=2PNAbm=2260e9161.5e120.021722mscomms=24nlayers8μs=4808μs=2560μs3ms\text{compute} = \frac{2 \cdot P}{N \cdot A_\text{bm}} = \frac{2 \cdot 260\text{e}9}{16\cdot 1.5\text{e}12} \approx 0.0217 \approx 22\text{ms} \\ \text{comms} = 2 \cdot 4 \cdot n_\text{layers} \cdot 8\mu\text{s}= 4 \cdot 80\cdot 8\mu\text{s} = 2560\mu\text{s} \approx 3\text{ms}

For a large batch of 512, for a total of 53 ms per token generated (per batch, so in the 62ms 512 tokens are generated). The latency cost on comms here would've also been 3ms (latency is not multiplied by batch as the message can be prepared together) which is somewhat significant to drop but it's fine if we assuming parallel comms and compute.
对于一个大批量的 512,每生成一个令牌总共需要 53 毫秒(每批,所以在 62 毫秒内生成 512 个令牌)。这里通信的延迟成本也将是 3 毫秒(延迟不会随批量增加而乘以批量,因为消息可以一起准备),这在放弃时有些重要,但如果我们假设通信和计算是并行的,那么这是可以接受的。

compute=B2PNAf=5122260e916312e120.05353mscomms=B24nlayersdmodelAc=51288016384300e918ms\text{compute} = B \cdot \frac{2 \cdot P}{N \cdot A_f} = 512 \cdot \frac{2 \cdot 260\text{e}9}{16 \cdot 312\text{e}12} \approx 0.053 \approx 53\text{ms}\\ \text{comms} = B \cdot \frac{2\cdot 4 \cdot n_\text{layers} \cdot d_\text{model}}{A_c} = 512 \cdot \frac{ 8 \cdot 80 \cdot 16384}{300\text{e}9} \approx 18\text{ms}

The higher value between the comms and compute is taken as we're assuming that it's parallel. Thus, we would want to avoid having comms being greater than compute (this is the mechanism that prevents us from approaching latency zero as we insert more chips, eventually the comms will start taking more and more time). It's not guaranteed that all systems will do this in parallel, and certainly not perfectly in parallel.

These numbers are definitely much lower than what we can get with real sand, as it assumes optimal hardware usage, doesn't factor in softmaxes, assumes zero comms latency and ignores many other smaller factors. Nonetheless, all the reasoning behind this math is useful for thinking about where to go optimise performance what deltas incoming optimisations will cause.
这些数字绝对比我们能用真正的沙子得到的要低得多,因为它假设了最佳硬件使用,没有考虑到 softmaxes,假设了零通信延迟,并忽略了许多其他较小的因素。尽管如此,这背后的所有推理对于思考如何去优化性能以及即将到来的优化将引起什么变化都是有用的。

batch sizes 批量大小

Batch size is an important factor of our performance, especially towards understanding performance for specific usages.

In the previous section, we have two calculations for when something memory bandwidth bound versus flops bound. To figure out which is at play we can compare these numbers;

mem bandwidth time=2PNAbmflops time=B2PNAf\text{mem bandwidth time} = \frac{2 \cdot P}{N \cdot A_\text{bm}}\\ \text{flops time} = B \cdot \frac{2 \cdot P}{N \cdot A_f}

We're dealing with the same ratio we found in the kv cache section. The min batch size for memory bandwidth bound is Abw/Af=208A_\text{bw}/A_f = 208. This is a handy ratio! If we have the load to do it, we prefer flops bound as it's more compute efficient. Though it's also the case that if we're flops bound, making the batch size larger doesn't mean anything is getting faster.
我们处理的比例与我们在 kv 缓存部分发现的比例相同。对于内存带宽受限的最小批量大小是 Abw/Af=208A_\text{bw}/A_f = 208 。这是一个方便的比例!如果我们有负载去做,我们更倾向于浮点运算受限,因为它更计算高效。尽管如果我们是浮点运算受限,增大批量大小并不意味着任何东西变得更快。

To calculate when the capacity goes from mostly kv cache to mostly weights is trivial, and also isn't a binary in the same way (nothing special happens when your kv cache starts taking up more memory than your weights). Nothing special really happens with comms either. At some point in increasing the batch size, the throughput starts dwarfing the latency so we dropped that factor. As observed previously, the latency becomes insignificant much later (our 512 batch on 52B communication cost was still 11% latency).
计算容量从主要是 kv 缓存到主要是权重的转变是微不足道的,也不是以同样的方式是二元的(当你的 kv 缓存开始占用比你的权重更多的内存时,没有什么特别的事情发生)。与通信相关的也没有什么特别的事情发生。在增加批量大小的某个点,吞吐量开始使延迟显得微不足道,所以我们放弃了那个因素。如之前观察到的,延迟在很晚时才变得不重要(我们的 512 批量在 52B 通信成本上仍然有 11% 的延迟)。

Something oversimplified about comms is that it happens at four different steps, which means we don't just want our compute time to be longer than our comms time, we want it to be the case at each step (if we can parallelise the compute and comms). For that, we have a weirder ratio: flops per byte of comms. Here's a nice chart of our computations, which will also be useful in the section below.

q,k,vq, k, voow1w_1w2w_2
flops 浮点运算3B(dmodel2)3B({d_\text{model}}^2)B(dmodel2)B({d_\text{model}}^2)4B(dmodel2)4B({d_\text{model}}^2)4B(dmodel2)4B({d_\text{model}}^2)
bytes of comms 通信的字节B(dmodel)B(d_\text{model})B(dmodel)B(d_\text{model})B(dmodel)B(d_\text{model})B(dmodel)B(d_\text{model})
flops/byte 浮点运算/字节3(dmodel)3(d_\text{model})dmodeld_\text{model}4(dmodel)4(d_\text{model})4(dmodel)4(d_\text{model})

312e12÷300e9=1040312\text{e}12 \div 300\text{e}9 = 1040, which is our flops per byte of comms for our A100s. We want the values in the last row to be larger than our hardware flops per byte so that we stay flops bound (assuming we are not memory bound here). For any model with an embedding dimension over 1024 (per chip), we're safe! For 512, it's a little awkward.
312e12÷300e9=1040312\text{e}12 \div 300\text{e}9 = 1040 ,这是我们 A100 的每字节通信的浮点运算次数。我们希望最后一行的值大于我们硬件的每字节浮点运算次数,这样我们就能保持浮点运算受限(假设我们这里不受内存限制)。对于任何嵌入维度超过 1024(每个芯片)的模型,我们都是安全的!对于 512,有点尴尬。

A low-load API may result in smaller batch sizes, leading to reasonable decisions like dropping the kv cache. If an API had the load for large batches it would probably want to serve the lowest batch size that gets flop bound even if there is capacity left so that it could optimise for per-request-latency. In say mass-inferencing jobs like AlphaCode we might want to insert as many chips as we can and then do the largest batch we can do with that capacity. I say "may" a lot here but I actually think these are absolute and all three kinds of cases.
低负载 API 可能导致较小的批量大小,从而导致合理的决策,如放弃 kv 缓存。如果一个 API 有处理大批量的负载,它可能会希望服务于能够达到浮点运算受限的最小批量大小,即使还有剩余容量,这样它就可以为每个请求的延迟进行优化。比如在 AlphaCode 这样的大规模推理任务中,我们可能会想要插入尽可能多的芯片,然后用这些容量做最大的批量处理。我在这里多次使用“可能”这个词,但我实际上认为这些是绝对的,并且涵盖了所有三种情况。

flops counting 浮点运算次数计算

Previously; 之前;

We do 2P2\cdot P flops of operations, which can be intuited by the fact that we matmul through all the parameters.
我们进行了 2P2\cdot P 次浮点运算操作,这可以通过我们通过所有参数进行矩阵乘法的事实来直观理解。

This is correct reasoning, but we can break it down by walking through all the transformer steps and check that we get 2P2P.
这是正确的推理,但我们可以通过走遍所有变压器步骤并检查我们是否得到 2P2P 来分解它。

The following calculations are per token, per layer. I describe Wq,Wk,WvRdmodel×dmodelW_q, W_k, W_v \in \mathbb{R}^{d_\text{model}\times d_\text{model}} where it's more accurate to say we have Wqi,Wki,WviRdmodel×dheadW_q^i, W_k^i, W_v^i \in \mathbb{R}^{d_\text{model}\times d_\text{head}}, where ii goes up to nheadsn_\text{heads}. But for the sake of calculating latency, I simplify Wq,Wk,WvW_q, W_k, W_v to include all the heads.
以下计算是每个令牌,每层进行的。我描述了 Wq,Wk,WvRdmodel×dmodelW_q, W_k, W_v \in \mathbb{R}^{d_\text{model}\times d_\text{model}} ,更准确地说我们有 Wqi,Wki,WviRdmodel×dheadW_q^i, W_k^i, W_v^i \in \mathbb{R}^{d_\text{model}\times d_\text{head}} ,其中 ii 上升到 nheadsn_\text{heads} 。但为了计算延迟,我简化了 Wq,Wk,WvW_q, W_k, W_v 以包括所有的头部。

  • Computing qkv  计算 qkv
    • Multiply teR1×dmodelt_e \in \mathbb{R}^{1\times d_\text{model}} by Wq,Wk,WvRdmodel×dmodelW_q, W_k, W_v \in \mathbb{R}^{d_\text{model}\times d_\text{model}}
      teR1×dmodelt_e \in \mathbb{R}^{1\times d_\text{model}} 乘以 Wq,Wk,WvRdmodel×dmodelW_q, W_k, W_v \in \mathbb{R}^{d_\text{model}\times d_\text{model}}
    • Flop count: 23dmodel2{2 \cdot 3 \cdot d_\text{model}}^2 Flop 计数: 23dmodel2{2 \cdot 3 \cdot d_\text{model}}^2
  • Calculate z  计算 z
    • This is softmax((qk)÷dhead)v=z\text{softmax}((q\cdot k)\div\sqrt{d_\text{head}}) \cdot v = z 这是 softmax((qk)÷dhead)v=z\text{softmax}((q\cdot k)\div\sqrt{d_\text{head}}) \cdot v = z
    • No matrices are multiplied, the number of flops is some factor of dmodeld_\text{model}.
      没有矩阵相乘,浮点运算次数是 dmodeld_\text{model} 的某个因数。
  • Multiply by the output projection matrix
    • Multiply WoRdmodel×dmodelW_o \in \mathbb{R}^{d_\text{model}\times d_\text{model}}, by zRdmodel×1z \in \mathbb{R}^{d_\text{model}\times1}
      WoRdmodel×dmodelW_o \in \mathbb{R}^{d_\text{model}\times d_\text{model}} 乘以 zRdmodel×1z \in \mathbb{R}^{d_\text{model}\times1}
    • Flop count: 2dmodel22 \cdot {d_\text{model}}^2 Flop 计数: 2dmodel22 \cdot {d_\text{model}}^2
  • Feed-forward  前馈
    • We have our MLP weights W1R4×dmodel,W2Rdmodel×4W_1 \in \mathbb{R}^{4\times d_\text{model}}, W_2 \in \mathbb{R}^{d_\text{model}\times 4} for two linear transformations (there's a ReLU in the middle, which small).
      我们有我们的 MLP 权重 W1R4×dmodel,W2Rdmodel×4W_1 \in \mathbb{R}^{4\times d_\text{model}}, W_2 \in \mathbb{R}^{d_\text{model}\times 4} 用于两个线性变换(中间有一个 ReLU,很小)。
    • Flop count: 28dmodel22\cdot 8 \cdot {d_\text{model}}^2  Flop 计数: 28dmodel22\cdot 8 \cdot {d_\text{model}}^2
  • Some other things  一些其他事情
    • There are typically layernorm that happen after each attention, where the weights there are a vector of length dmodeld_\text{model}.
      通常在每次注意之后会发生 layernorm,那里的权重是长度为 dmodeld_\text{model} 的向量。
    • There's another linear layer and then a softmax that sits on top, which is our output (token) embedding or unembedding or de-embedding or embedding1^{-1}.
      还有另一个线性层,然后是一个位于顶部的 softmax,这是我们的输出(令牌)嵌入或去嵌入或去嵌入或嵌入 1^{-1}
    • The original transformer has a cosine absolute positional encoding scheme, which is an addition operation on the token embedding.

Adding up all the flops!
把所有的 flops 加起来!

F=nlayers(23dmodel2+2dmodel2+16dmodel2)=nlayers24dmodel2F = n_\text{layers} \cdot (2 \cdot 3 \cdot {d_\text{model}}^2 + 2\cdot d_\text{model}^2 + 16\cdot d_\text{model}^2 )\\ = n_\text{layers} \cdot 24 \cdot {d_\text{model}}^2
Subbing in our 8192 model, we should get about 100B flops;
使用我们的 8192 模型,我们应该能得到大约 100B 次浮点运算;

F=642481922=103079215104flopsF = 64\cdot 24\cdot 8192^2 = 103079215104 \text{flops}

103079215104 over two is about 51.5B. We're a lil under (we get 51.5B instead of 52B) but that's because token (un)embeddings are nearly a billion parameters. It would be reasonable to do the latency calculations with 212nlayersdmodel22\cdot 12\cdot n_\text{layers} \cdot {d_\text{model}}^2 instead of 2P2\cdot P, but it's less than a 2% difference.
103079215104 除以二大约是 51.5B。我们稍微少一点(我们得到 51.5B 而不是 52B),但这是因为令牌(取消)嵌入几乎有十亿个参数。用 212nlayersdmodel22\cdot 12\cdot n_\text{layers} \cdot {d_\text{model}}^2 而不是 2P2\cdot P 进行延迟计算是合理的,但差距不到 2%。

What about the the calculation of zz and all the other steps I didn't count? Those are all vector-vector (or even vector-scalar) operations, so they are built around a factor of dmodeld_\text{model} rather than dmodel2{d_\text{model}}^2. Even if we had 100 of these operations per layer, it would come out to a hundred million flops, which is 0.1% of the number of flops we counted.
那么 zz 和我没有计算的所有其他步骤的计算呢?这些都是向量-向量(甚至是向量-标量)操作,所以它们是围绕 dmodeld_\text{model} 而不是 dmodel2{d_\text{model}}^2 构建的。即使我们每层有 100 个这样的操作,它也会产生一亿次浮点运算,这是我们计算的次数的 0.1%。

intermediate memory costs

Data Movement Is All You Need (which is mostly about optimising low level data movement for transformers, and isn't a particularly relevant read) has a nice way of classifying operations. We have tensor contractions, which are the big matmuls we've mostly cared about (including the linear layers). Then there are statistical normalisations, the softmax and layernorm. Finally, which this post has completely ignored till now are element-wise operators, which are things like biases, dropouts and activations.
你所需要的就是数据移动(这主要是关于为变换器优化低级数据移动,并不是特别相关的阅读)有一种很好的操作分类方法。我们有张量收缩,这是我们主要关心的大型矩阵乘法(包括线性层)。然后是统计规范化,softmax 和 layernorm。最后,这篇文章直到现在完全忽略的是逐元素操作,这些是像偏差、dropout 和激活这样的东西。

So how do we calculate the latency of those matmuls, layernorms, etc? The reported flops on our hardware is specificially for the multiply-add operations so it would not be right to count it in there even if we could count the flops. Surprise! It's only to cost memory to do the softmax read/writes as that's what the bandwidth to flops ratio favours. This is the latency factor that has been alluded to!
那么我们如何计算这些矩阵乘法、layernorm 等的延迟呢?我们硬件上报告的浮点运算是专门针对乘加操作的,所以即使我们能计算浮点运算,把它算在里面也是不对的。惊喜!只有将内存用于进行 softmax 读/写操作才是成本,因为这是带宽与浮点运算比率所偏好的。这就是一直被暗示的延迟因素!

I'm going to break character on the first-principles aspect of this and discuss Table A.1 from the Data Movement Is All You Need paper. Here we see that the latency for softmax is actually slightly higher than the calculations for qkv (which are a 1/3 of the time). This is a little concerning!
我将在第一原理方面打破角色,并讨论《你所需要的就是数据移动》论文中的表 A.1。这里我们看到 softmax 的延迟实际上略高于 qkv 的计算(这是时间的 1/3)。这有点令人担忧!

For the same reason the softmax is memory bound, so is the multiplication of qk, ReLU and dropout are also quite expensive.
出于同样的原因,softmax 是内存受限的,qk 的乘法、ReLU 和 dropout 也相当昂贵。

GPU Kernel Fusion GPU 内核融合

GPUs execute operations in units of "kernels". Kernel fusion means that something that was usually 2 kernels can become one, and the primary use is to reuse loads into memory and reduce redundant loads and stores. For example, a multiply-add is one kernel. If it were two, then one would load+add+store and the second would load+multiply+store. We could save a lot of trips by doing load+add+multiply+store.
GPU 以“内核”为单位执行操作。内核融合意味着通常是 2 个内核的东西可以变成一个,主要用途是重用加载到内存中的数据并减少冗余的加载和存储。例如,乘加运算是一个内核。如果它是两个,则一个会进行加载+加+存储,第二个会进行加载+乘+存储。通过执行加载+加+乘+存储,我们可以节省很多次访问。

We can also tell the softmax here is not perfectly fused by counting the number of read-writes we should need. In theory it can just be one read and one write (the standard is uh, four so I'm bullying a bit). For qk, it would be two reads and one write (the two reads can probably be saved). The three to one ratio then, indicates that the softmax is doing more memory passes than is optimal. I say this, because this expresses how much this counting is software dependents and needs experiments to estimate, since in theory the cost could be zero.
我们还可以通过计算我们应该需要的读写次数来判断这里的 softmax 没有完美融合。理论上它只需要一次读和一次写(标准是嗯,四次,所以我有点吹毛求疵)。对于 qk,它将是两次读取和一次写入(两次读取可能可以节省)。因此,三比一的比例表明,softmax 进行的内存传递次数多于最佳状态。我之所以这么说,是因为这表明了这种计数有多依赖软件并且需要实验来估计,因为理论上成本可能是零。

It's also worth noting that the percentage of time these operations take gets smaller quickly as model size increases as the memory will increase on the order of dmodeld_\text{model} while the flops increase on the order of dmodel2{d_\text{model}}^2 — per layer. The paper is a 336M param model, dmodel=1024,nlayers=24d_\text{model} = 1024, n_\text{layers} = 24.
同样值得注意的是,随着模型大小的增加,这些操作所占的时间比例会迅速变小,因为内存的增加量是 dmodeld_\text{model} 的数量级,而 flops 的增加量是 dmodel2{d_\text{model}}^2 的数量级——每层。该论文是一个 336M 参数模型, dmodel=1024,nlayers=24d_\text{model} = 1024, n_\text{layers} = 24

I added up the latency of all the values in the "Ours" column that were memory bound, including the element-wise operations. The result is that these intermediate steps take 43% of the time. In a model of size 52B (where dmodeld_\text{model} is 8 times larger, we see these operations become less significant.
我将“我们的”列中所有内存受限的值的延迟加起来,包括逐元素操作。结果是,这些中间步骤占用了 43%的时间。在一个大小为 52B 的模型中(其中 dmodeld_\text{model} 是 8 倍更大,我们看到这些操作变得不那么重要。

The duration of these memory bound intermediate operations will take 8 times longer as the operations are vectors of length dmodeld_\text{model}. However, the number of flops will increase by 64 times, which means the flop time increases by 64 times.
这些内存受限的中间操作的持续时间将会是操作长度为 dmodeld_\text{model} 的向量的 8 倍。然而,flops 的数量将增加 64 倍,这意味着 flop 时间增加了 64 倍。

0.438÷(164)=0.053750.43\cdot 8 \div(1\cdot64) = 0.05375

So using the optimisations in that paper, a 52B model inference latency would be about 5% of these intermediate calculations we didn't factor into latency.
因此,使用该论文中的优化,一个 52B 模型的推理延迟将是我们没有计入延迟的这些中间计算的大约 5%。

comparing against real benchmarks

I work at a language modelling company that has its own infrastructure and existing benchmarks but uh, IP is hard. There is a sadly small number of public benchmarks available for model parallel inferencing? It seems like the only public engines for this are Nvidia FasterTransformer and Microsoft Deepspeed with other benchmarks probably scattered in papers I don't know exist. Anywho, we can verify our calculations against some real benchmarks!
我在一家拥有自己的基础设施和现有基准的语言建模公司工作,但是,知识产权很难处理。可用于模型并行推理的公共基准数量少得可怜?看来唯一的公共引擎是 Nvidia FasterTransformer 和 Microsoft Deepspeed,其他基准可能散布在我不知道存在的论文中。无论如何,我们可以根据一些真实的基准验证我们的计算!

Because I only want to use 2 GPUs, I've run a 13B parameter model with FasterTransformer, which does a bunch of good kernel fusing and provides us with tensor parallelism. 13B is 40 layers, 40 heads, each of dim 128 for a dim size of 5120. I have screenshots of the profiles in here and there are a bunch of interesting things in there that might make another post.
因为我只想使用 2 个 GPU,我已经用 FasterTransformer 运行了一个 13B 参数模型,它进行了大量良好的内核融合,并为我们提供了张量并行性。13B 是 40 层,每层 40 个头,每个头的维度为 128,维度大小为 5120。我这里有一些配置文件的截图,里面有很多有趣的东西,可能会写成另一篇文章。

We'll start with a 512 context length, batch size 1 and 10 tokens outputted. For a small batch for one token on 2 GPUs we expect 8.4ms, and about 1ms of comms. For 1 GPU, that would be 16.8ms and 0 comms. (2x40x12x5120^2/1.5e12)
我们将从 512 的上下文长度,批量大小 1 和输出的 10 个令牌开始。对于在 2 个 GPU 上的一个小批量的一个令牌,我们预计 8.4ms,以及大约 1ms 的通信时间。对于 1 个 GPU,那将是 16.8ms 和 0 通信时间。(2x40x12x5120^2/1.5e12)

Excuse my mangled significant figures, I probably should've kept the mem bandwidth to 1.555 instead of 1.5.
对不起,我的有效数字弄错了,我可能应该将内存带宽保持在 1.555 而不是 1.5。

Our empirical result for 1 GPU is 22.0ms, meaning our guess was 76% there. We can actually safely account for all of this, where we know some percentage will go to intermediate activations, and that we don't actually get 100% of our theoretical memory bandwidth. For these dimensions, a profile tells us we get up to about 90% of our full memory bandwidth (where I compare the expected cost of a matmul to the duration of a single matmul kernel and rounded up as the bandwidth usage varies quite a bit depending on the tensors being loaded). Counting that in, we expect to take 18.5ms. Adding up the cost of intermediate activations (which we can do from a profile) we get another 2.2ms, getting us to 20.7 ms! To account for the last 1.4 ms there are some other sub-millisecond operations like token embeddings, doing top-(k|p), less net bandwidth than 90% (I couldn't be bothered to actually average everything I took the highest bw usage I could find) or even kernal launch times.
我们对 1 个 GPU 的实证结果是 22.0ms,意味着我们的猜测达到了 76%。我们实际上可以安全地解释所有这些,我们知道一些百分比将用于中间激活,而我们实际上并没有得到我们理论内存带宽的 100%。对于这些维度,一个配置文件告诉我们我们获得了大约 90%的完整内存带宽(我比较了一个 matmul 的预期成本与单个 matmul 内核的持续时间,并四舍五入,因为带宽使用根据正在加载的张量而变化很大)。计算在内,我们预计将花费 18.5ms。加上中间激活的成本(我们可以从配置文件中做到这一点),我们得到另外 2.2ms,达到 20.7ms!为了解释最后的 1.4ms,还有一些其他的亚毫秒操作,如令牌嵌入,进行 top-(k|p),净带宽少于 90%(我实在懒得平均我找到的最高带宽使用率)或甚至是内核启动时间。

Our emprical result for 2 GPUs is 13.5. We're farther off this time, for only 62% of the way there. We would check the profile again to see the memory bandwidth (which we expect to be slightly worse, as smaller tensors tend to be able to get less of the bandwidth). This time, it doesn't quite get to 90, more like 87, getting us to 9.5ms. The intermediate activations take a similar amount of time (2ms), getting us 11.7ms. With the remaining 1.5 ms then, we're searching for comms! This is easily covered by our calculated 1ms of comms not being parallelised. From the profile, our comms take 40-50microseconds per layer, for a total of 1.7ish ms of comms time, which accounts for everything pretty well!
我们对 2 个 GPU 的实证结果是 13.5。这次我们偏离得更远了,只有 62%的路程。我们会再次检查配置文件,以查看内存带宽(我们预计会稍差一些,因为较小的张量往往能够获得较少的带宽)。这次,它并没有达到 90,更像是 87,让我们达到了 9.5ms。中间激活花费了类似的时间(2ms),让我们达到了 11.7ms。剩下的 1.5ms,我们在寻找通信时间!这很容易被我们计算的 1ms 通信时间没有并行化所覆盖。从配置文件中,我们的通信时间每层需要 40-50 微秒,总共大约 1.7ms 的通信时间,这几乎解释了一切!

I think for both of those operations, the counting of intermediate activations was a bit higher than it should be, because the profile gave consistently slightly-higher latencies than the raw benchmarking run. The output of the benchmark run was 180.86 ms (context time: 45.45 ms) and 283.60 ms (context time: 63.17 ms).
我认为对于这两种操作,中间激活的计数应该比实际情况要高一些,因为配置文件给出的延迟一致地略高于原始基准运行。基准运行的输出是 180.86 ms (context time: 45.45 ms)283.60 ms (context time: 63.17 ms)

But what about the forwards pass? I expect the forwards pass to take num_tokens/flops_to_bw_ratio times as long as a decoding step. This is because we have to send all the tokens to all the GPUs, and each GPU will do their heads of attention on it and store kv. Let's use the updated memory bandwidth, 312e12/(1.5e12x0.9)=231. Looking at the 1 GPU setup, where 22 is our expected decoding step, we see the 22*(512/231) = 48 which is not quite the claimed 63. For 2 GPUs we get 13.5*(512/231) = 30ms, even worse!
但是前向传递呢?我预计前向传递将花费 num_tokens/flops_to_bw_ratio 倍的时间作为解码步骤。这是因为我们必须将所有令牌发送到所有 GPU,并且每个 GPU 将对其进行注意力头处理并存储 kv。让我们使用更新的内存带宽,312e12/(1.5e12x0.9)=231。看看 1 个 GPU 的设置,其中 22 是我们预期的解码步骤,我们看到 22*(512/231)=48,这并不完全是声称的 63。对于 2 个 GPU,我们得到 13.5*(512/231)=30ms,更糟!

For the one gpu, some of the missing time should just be kv storing. Looking at the profiles, this is 18 microseconds per layer, 0.7ms. There are some Memsets for 0.2ms. We expect the flop time (this is flops bound!) for one of our MLP multiplies to be 512x4x5120^2x2/312e12 = 344 microseconds. In practice, this is 476 at the lowest which means we get 72% of the flops we expect? For the projection in the attention we expect we 512x5120^2x2/312e12 = 86 microseconds. In profiles we find this to be 159 at the lowest, which is 54%. Yikes! I panicked for a bit, but uh this is apparently just the flops we expect? See Figure 14 in this paper where a 512x4000x4000 ends up getting less than 150TFLOPs/s.
对于一个 GPU,一些缺失的时间应该只是 kv 存储。查看性能分析,这是每层 18 微秒,0.7ms。有一些 Memsets 需要 0.2ms。我们预计一个 MLP 乘法的 flop 时间(这是受 flops 限制的!)为 512x4x5120^2x2/312e12 = 344 微秒。实际上,这在最低时为 476,这意味着我们获得了我们预期的 72% 的 flops 吗?对于注意力中的投影,我们预计 512x5120^2x2/312e12 = 86 微秒。在性能分析中,我们发现这在最低时为 159,即 54%。哎呀!我有点慌了,但呃,这显然只是我们预期的 flops 吗?参见本文中的图 14,其中一个 512x4000x4000 的结果获得的 TFLOPs/s 不到 150。

exercises 练习

  1. Given batch size, context length and next_n, how can we calculate the savings of using kv cache?
    给定批量大小、上下文长度和 next_n,我们如何计算使用 kv 缓存的节省?

  2. What overheads does the kv cache add in memory time?

  3. Can we be memory bound on our forwards pass but flops bound at each sampling step?

  4. What tradeoffs and calculations should we consider for using more GPUs than is necessary for capacity? Say for example, a 52B model on 8 or 16 GPUs instead of 4.
    对于使用超出必要容量的 GPU,我们应该考虑哪些权衡和计算?例如,一个 52B 模型使用 8 个或 16 个 GPU 而不是 4 个。

  5. We came up with formulas to calculate time to predict one token. How would we calculate the time to do an entire sample, from doing the forwards pass on the context to predicting all the tokens requested?

  6. In the capacity section, I say the memory of intermediate calculations are negligble. How small are they exactly?

  7. In the batch sizes section, we went a bit off topic and talked about the flops per byte of communication. What are the tradeoffs if we had an embedding dimension size of 512?
    在批量大小部分,我们有点偏题,谈到了每字节通信的浮点运算次数。如果我们有一个嵌入维度大小为 512,会有什么权衡?

  8. We assume GPUs attached to the same host here, but could communicate GPUs between hosts like we do in training. AWS has 400gb/s. What about it!
    我们假设 GPU 连接到同一个主机上,但像我们在训练中做的那样,GPU 之间可以跨主机通信。AWS 有 400gb/s。这意味着什么!

  9. In model parallelism, we could in practice communicate all the shards and then have each accelerator do all the addition, instead of just a share of their addition. What are the latency implications there?

  10. Try calculating the large batch speed for a 52B on 4xGPUs at batch size 256. The compute should be about 21ms and comms should be about 4ms.
    尝试计算一个 52B 模型在 4xGPU 上批量大小为 256 时的大批量速度。计算应该约为 21ms,通信应该约为 4ms。

  11. Consider the operation of taking the vector out of the last layer and multiplying it by the unembedding matrix, storing the logits and then doing top-k or top-p sampling (which requires a sort). How long should this take for a 52B model, and what can we parallelise here?
    考虑从最后一层取出向量并乘以反嵌入矩阵的操作,存储逻辑值,然后进行 top-k 或 top-p 采样(这需要排序)。对于一个 52B 模型,这应该需要多长时间,我们可以在这里并行化什么?

  12. How can we shard the token embeddings? Would shard the input token embeddings differently from the unembeddings? Layernorms? What extra communication does this incur?

acknowledgements 致谢

Would like to extend credit and thanks to people who make a positive impact on this post in varying capacities. James Bradbury, Eric Zhang, Taylor Rogalski, Horace He, Julian Schrittwieser, Reiner Pope, Jim Wu, Mohammad Bavarian, Tudor Brindus and Adrien Morisot with James leading by a long shot.
想要向对这篇文章产生积极影响的人们表示感谢和致敬,包括 James Bradbury, Eric Zhang, Taylor Rogalski, Horace He, Julian Schrittwieser, Reiner Pope, Jim Wu, Mohammad Bavarian, Tudor Brindus 和 Adrien Morisot,其中 James 贡献最大。

citation 引用

Please cite as: 请引用为:

Chen, Carol. "Transformer Inference Arithmetic",, 2022.

hey kipply you should better understand our big model inferencing latency

yes that's a great idea i'll look into it!

cool i'd love to see the profile

if i sit in a dark room by myself long enough i think i can explain all the milliseconds


The architectures and latencies expressed in this post are those of publicly known or theoretical models and benchmarks and do not necessarily reflect the architectures or latencies of my employer's models.