EAGLE: Extrapolation Algorithm for Greater Language-model Efficiency

by Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang, December 8, 2023

Vector Institute, University of Waterloo, Peking University, Microsoft Research

[Code with Apache-2.0] [EAGLE-1 Paper] [EAGLE-2 Paper]

Figure 1: Medusa was tested on the Vicuna benchmark and Lookahead on the LLaMA2-Chat benchmark by their original authors. To make a fair comparison, we ran EAGLE on both benchmarks. Medusa's and Lookahead's numbers are copied from their original technical reports.

TL;DR: We introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a new baseline for fast decoding of Large Language Models (LLMs) that provably maintains the output distribution. The approach extrapolates the second-top-layer contextual feature vectors of the LLM, enabling a significant boost in generation efficiency.

In summary, EAGLE is:


Figure 2:  Generation speed of Vicuna 33B using different methods, with inference conducted on RTX 3090 GPUs at fp16 precision. For an enhanced viewing experience, the animation has been sped up fourfold. 

Introduction 介绍

Large Language Models (LLMs) like ChatGPT demonstrate remarkable capabilities and are increasingly applied in various domains. However, their text generation process is costly and slow. This inefficiency is attributed to the nature of auto-regressive decoding: each token generation necessitates a forward pass, requiring access to the entire parameter set of the LLM, which can amount to several tens or even hundreds of billions of parameters. This results in a memory-bound limitation for auto-regressive decoding.


One approach to accelerating auto-regressive decoding is speculative decoding. This technique employs a smaller draft model to guess the next γ tokens through standard auto-regressive generation. Subsequently, the Original LLM validates these guessed tokens, necessitating only a single forward pass for verification. If the draft model accurately predicts α tokens, a single forward pass of the Original LLM can generate α+1 tokens.
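The γ/α bookkeeping above can be made concrete with a minimal sketch of one chain-style draft-then-verify step. This is an illustration, not any library's API: `draft_next` and `target_forward` are hypothetical callables standing in for the draft model and the Original LLM, and acceptance is shown as greedy token matching for simplicity (the sampling-based acceptance rule appears later under Multi-Round Speculative Sampling).

```python
def speculative_step(tokens, draft_next, target_forward, gamma=4):
    """One draft-then-verify step (greedy-acceptance sketch).

    draft_next(seq)     -> next token id from the small draft model (assumed).
    target_forward(seq) -> for every position i, the token the Original LLM
                           would emit after seq[:i+1]; one forward pass (assumed).
    """
    drafted = []
    for _ in range(gamma):                         # draft gamma tokens cheaply
        drafted.append(draft_next(tokens + drafted))
    verified = target_forward(tokens + drafted)    # single Original-LLM pass
    alpha = 0
    while alpha < gamma and drafted[alpha] == verified[len(tokens) + alpha - 1]:
        alpha += 1                                 # length of the accepted prefix
    # alpha accepted draft tokens plus one token produced by the Original LLM
    return tokens + drafted[:alpha] + [verified[len(tokens) + alpha - 1]]
```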


In speculative decoding, both the draft model and the Original LLM share the same task: predicting the next token based on the current sequence of tokens. Accomplishing this task with a model that has far fewer parameters is extremely challenging and often yields sub-optimal results. Furthermore, the draft model's independent predictions do not leverage the rich semantic information already extracted by the Original LLM, leading to potential inefficiencies.


This limitation inspired the development of EAGLE. Our approach utilizes the contextual features extracted by the Original LLM (i.e., the second-top-layer outputs already produced while predicting the next token, requiring no additional computation). EAGLE builds upon the following First Principle:


The sequence of features is compressible, making the prediction of subsequent feature vectors from previous ones easy.


We train a lightweight plugin, called the Auto-regression Head, in conjunction with the Original LLM's frozen embedding layer, to predict the next feature based on the current feature sequence from the second-top layer of the Original LLM. The token is then derived using the frozen classification head of the Original LLM, which maps features to tokens. Features are more abstract and clearer than token sequences, which makes regressing features considerably simpler than regressing tokens. In summary, EAGLE extrapolates at the feature level using a small auto-regressive head and then employs the frozen classification head to generate the predicted sequence of tokens.
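As a rough illustration of the fused input and single-layer design described above, here is a minimal PyTorch sketch. The class name, the linear fusion layer, and the use of a self-attention-only layer with a causal mask are simplifications for exposition, not the released implementation.

```python
import torch
import torch.nn as nn

class AutoRegressionHeadSketch(nn.Module):
    """Illustrative draft head: fuse each second-top-layer feature with the
    embedding of the token sampled at that position, then apply one
    causally-masked transformer layer to predict the feature at the next
    position."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        # Project [feature ; token embedding] back down to the model width.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )

    def forward(self, features, token_embeds):
        # features, token_embeds: (batch, seq_len, hidden_size)
        x = self.fuse(torch.cat([features, token_embeds], dim=-1))
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        return self.layer(x, src_mask=causal)   # predicted next features

# The frozen pieces of the Original LLM do the rest, e.g.
#   next_token_logits = original_lm_head(predicted_features[:, -1])
```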

Consistent with similar works such as speculative sampling, Medusa, and Lookahead, our focus is on per-prompt inference latency rather than overall system throughput.

EAGLE – Enhancing LLM Generation Efficiency

Figure 3: A comparison of how the fourth and fifth tokens, t4 and t5, are "guessed" by different methods under the guess-verify framework, where t1 and t2 form the prompt. 't' (blue blocks) represents tokens, and 'f' (orange blocks) signifies the features from the second-top layer, with subscripts indicating their positions in the sequence. For simplicity, the "n" in Lookahead's n-gram shown in the figure has been set to 2.
 

The figure below illustrates the workflow of EAGLE. During the forward pass of the Original LLM, we collect the features from the second-top layer. Beginning with these features and the token generated in that forward pass by the Original LLM, the FeatExtrapolator (i.e., the Auto-regression Head) commences its "guessing" process. The FeatExtrapolator integrates token embeddings and generates the next feature in an auto-regressive manner. Subsequently, the distribution of tokens is determined using the frozen LM head, enabling us to sample from this distribution. By repeating this sampling multiple times, we conduct a tree-like generation process, as depicted on the right side of the figure. In this example, three forward passes of the FeatExtrapolator "guess" a tree of 10 tokens.
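The drafting loop can be sketched as follows. This is schematic rather than the released code: `draft_head`, `lm_head`, and `embed` are assumed callables (the Auto-regression Head, the frozen classification head, and the frozen embedding layer), each draft call is shown as conditioning only on the parent's feature rather than the full feature sequence, and every node is expanded breadth-first, whereas the real tree is sparser.

```python
import torch

def draft_token_tree(draft_head, lm_head, embed, prev_feature, new_token,
                     depth=3, branch=2):
    """Schematic tree drafting. prev_feature is the second-top-layer feature
    collected from the Original LLM's last forward pass; new_token is the
    token that pass generated. Returns a flat node list recording, for each
    drafted node, its parent, token, draft probability, and predicted feature."""
    root_feat = draft_head(prev_feature, embed(new_token))
    nodes = [{"parent": -1, "token": new_token, "prob": 1.0, "feat": root_feat}]
    frontier = [0]
    for _ in range(depth):                          # one tree level per round
        next_frontier = []
        for idx in frontier:
            probs = torch.softmax(lm_head(nodes[idx]["feat"]), dim=-1)
            children = torch.multinomial(probs, branch, replacement=False)
            for tok in children.tolist():
                # The child's feature conditions on the parent's predicted
                # feature AND the embedding of the token just sampled.
                feat = draft_head(nodes[idx]["feat"], embed(tok))
                nodes.append({"parent": idx, "token": tok,
                              "prob": probs[tok].item(), "feat": feat})
                next_frontier.append(len(nodes) - 1)
        frontier = next_frontier
    return nodes
```

In practice, all nodes of a level are expanded in one batched forward pass of the draft head, which is how a handful of passes can populate a whole token tree; the recorded draft probabilities are what the verification stage consumes.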

Figure 4: Schematic Diagram of EAGLE. The left side illustrates the computational process, while the right side displays the corresponding generation results for each step. Green blocks represent token embeddings, orange blocks signify the features from the second-top layer of the LLM, red boxes indicate features predicted by the Auto-regression Head, and blue modules with snowflake icons represent the use of original LLM parameters, which are not subject to training.

We employ the lightweight FeatExtrapolator to predict the features of the Original LLM, which may not always be precise (as indicated by the features within red boxes in the figure). To ensure the consistency of the generated text distribution, we subsequently verify the predicted tree. Owing to the properties of causal LMs, this verification process can be completed in a single forward pass, which also simultaneously generates a token. Using this generated token and the collected features, the FeatExtrapolator can make further "guesses". Through this cycle of prediction and verification, the LLM is enabled to generate tokens rapidly.
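Single-pass verification of a whole token tree relies on an attention mask in which each drafted token attends only to itself and its ancestors (this "tree attention" idea is shared with Medusa and related systems). Below is a small sketch of building such a mask from the parent list produced by the drafting sketch above; the helper name is illustrative.

```python
import torch

def tree_attention_mask(parents):
    """Boolean mask for verifying a drafted token tree in one forward pass:
    position i may attend to position j only if j is i itself or an ancestor
    of i. parents[i] is the parent index of node i (-1 for the root)."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:              # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask
```

Combined with position ids that follow each node's depth in the tree, this mask lets the Original LLM score every drafted branch simultaneously in that single verification pass.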


Training the Auto-regression Head is straightforward. For all models, we utilize the ShareGPT dataset, which comprises fewer than 70,000 conversational rounds, for training purposes. The FeatExtrapolator is also characterized by a minimal number of trainable parameters. As indicated by the blue sections in the above figure, the majority of components are frozen. The only requirement is "One Auto-regression Head," which is a single-layer transformer decoder and has 0.24B-0.99B parameters.  Even with GPU-poor setups, training the Auto-regression Head is feasible. For example, the Auto-regression Head for Vicuna 33B can be trained on RTX 3090 GPUs within 24 hours. 
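The training objective can be pictured as a feature-regression term combined with a token-level classification term evaluated through the frozen LM head; the sketch below is illustrative, and the loss weights shown are placeholders rather than the values used in the released training recipe.

```python
import torch.nn.functional as F

def draft_head_loss(pred_feature, target_feature, lm_head, target_token,
                    w_feat=1.0, w_token=0.1):
    """Illustrative objective for the Auto-regression Head: regress the next
    second-top-layer feature, and also match the next-token label through the
    frozen classification head (only the draft head receives gradients)."""
    feat_loss = F.smooth_l1_loss(pred_feature, target_feature)
    token_loss = F.cross_entropy(lm_head(pred_feature), target_token)
    return w_feat * feat_loss + w_token * token_loss
```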

Why Do We Use Token Embeddings?

Unlike Medusa, which predicts tokens at various offsets (e.g., the next token, the one following it, etc.) solely using the second-top-layer's features, FeatExtrapolator additionally incorporates the embedding of the next token into its input for prediction. This additional information helps FeatExtrapolator handle the randomness inherent in the sampling process.


Consider the example in the figure below where the query is "I". The LLM outputs probabilities for "I" being followed by "am" or "always". Medusa, not factoring in whether "am" or "always" is sampled, directly predicts the probability of the token following the next one. Therefore, Medusa's target is the probability of the token following "I am" or "I always". Due to the randomness of the sampling process, Medusa's identical inputs can have different targets, leading to a lack of a consistent mapping between inputs and outputs.


In contrast, EAGLE's input includes the embedding of the sampled result, ensuring a consistent mapping between input and output. This distinction allows FeatExtrapolator to more accurately predict subsequent tokens, accounting for the specific context established by the sampling process.

Figure 5: Demonstrating the Importance of Using Token Embeddings. The figure depicts the LLM's generation process using "I" as the query, where the LLM predicts the next token to be either "am" or "always." The left side assumes the sampling of "am," and the right side assumes "always." For both EAGLE and Medusa, the target is to predict the probability of the token that follows the next token, which means predicting the token after "I am" (on the left side) or after "I always" (on the right side), depending on the random sampling outcome. Medusa's input is the feature of "I," which does not account for the sampled result, leading to different targets for the same input. In contrast, EAGLE's input includes the embedding of the sampled result - either "am" (left side) or "always" (right side), ensuring a unique target for each input.
 

Tree Generation

Differing from other guess-verify frameworks like speculative sampling, Lookahead, and Medusa, our approach, by employing tree-like generation in the "guessing" phase, achieves greater efficiency. As illustrated in the diagram, the generation processes in speculative sampling and Lookahead are linear or chain-like. Medusa's method, hindered by its inability to construct context during the guessing phase, generates a tree through the Cartesian product, leading to a fully connected tree across adjacent layers. This approach frequently results in nonsensical combinations, such as "I am begin." On the other hand, EAGLE creates a sparser tree structure. The sparse tree structure is more selective and context-aware, preventing the formation of nonsensical sequences and focusing computational resources on more plausible token combinations.
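One way to picture the difference is how the set of drafted positions might be specified. The shape below is a made-up example, not EAGLE's actual tree: each entry is a path of child ranks from the root (rank 0 = the most probable child at that node), so deeper expansion is spent only on likely branches.

```python
# Illustrative sparse draft-tree shape, written as paths of child ranks.
sparse_tree_paths = [
    (0,), (1,), (2,),              # three candidates for the first drafted token
    (0, 0), (0, 1), (1, 0),        # go deeper only under the promising branches
    (0, 0, 0), (0, 0, 1),
]

# A Cartesian-product (fully connected) tree with 3 candidates per level and
# the same depth would contain 3 + 9 + 27 = 39 positions; the sparse shape
# above keeps 8, concentrating the verification budget on plausible paths.
```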

Figure 6: Schematic diagram illustrating the structure of text generation under different methods within the guess-verify framework.
 

Multi-Round Speculative Sampling

Speculative sampling, traditionally applied to chain-like guessing processes, preserves the output distribution. To adapt it to tree-like guessing scenarios, we have extended this approach into a multi-round recursive form. The pseudocode of Multi-Round Speculative Sampling is presented below.
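In outline, at each tree node the drafted candidates are tried one round at a time: an accepted candidate hands control to its own subtree, while each rejection moves the target distribution to its residual, so the final sample is always distributed exactly as the Original LLM prescribes. The following is a simplified Python sketch of the recursion at a single node, assuming the candidates were drawn from the recorded draft distribution q; the exact bookkeeping follows the paper's algorithm.

```python
import torch

def multi_round_sample(p, q, candidates):
    """One node of Multi-Round Speculative Sampling (simplified sketch).

    p          : Original-LLM distribution at this position (1-D tensor)
    q          : draft distribution the candidates were sampled from (1-D tensor)
    candidates : drafted child tokens at this node, in sampling order
    Returns (token, accepted): if accepted, verification recurses into that
    child's subtree; otherwise the token is resampled here and drafting stops."""
    for x in candidates:                                   # one round per candidate
        if torch.rand(1).item() < min(1.0, (p[x] / q[x]).item()):
            return x, True
        p = torch.clamp(p - q, min=0.0)                    # move to the residual
        p = p / p.sum()
    return int(torch.multinomial(p, 1)), False             # all rounds rejected
```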

During Tree Generation, we record the probability corresponding to each sampled token. Through Multi-Round Speculative Sampling, we ensure that the distribution of every token generated ultimately aligns with that of the Original LLM. 

Experimental Results

The following figure illustrates the acceleration effects of EAGLE with Vicuna 33B across different types of tasks. The "coding" task, which involves a substantial number of fixed templates, shows the best acceleration performance.

Acknowledgement

This project has been influenced by many excellent projects in the LLM community, such as Medusa, Lookahead, and others. The logo is designed by GPT-4. We also appreciate many valuable discussions with Tianle Cai, Hao Zhang, Ziteng Sun, and others.