Fast Best-of-N Decoding via Speculative Rejection
Abstract
The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model’s responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO, and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally unviable. In this work, we introduce Speculative Rejection, a computationally viable inference-time alignment algorithm. It generates high-scoring responses according to a given reward model, as Best-of-N does, while being between 16 and 32 times more computationally efficient.
1 Introduction
Large Language Models (LLMs), pre-trained on massive corpora, have demonstrated remarkable capabilities in handling diverse tasks like creative writing, summarization, and question-answering [10, 13, 63]. Such extensive pre-training endows the LLM with broad knowledge, which must be correctly retrieved at inference time.
Post-training techniques [60, 67, 42] aim to enable the LLM to answer users’ questions in the most satisfactory way based on human intentions [ouyang2022training, bai2022constitutional, rafailov2024direct], while adhering to ethical standards and safety guidelines [ngo2022alignment, casper2023open, deshpande2023toxicity].
Popular post-training methods include supervised finetuning, Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), Expert Iteration (EI), and their variants [christiano2017deep, ouyang2022training, stiennon2020learning, glaese2022improving, bakker2022fine, touvron2023llamasecond, zhao2022calibrating, zhao2023slic, dong2023raft, rafailov2024direct, liu2023statistical, rafailov2024r, zeng2024token, zhong2024dpo].
However, post-training methods add a substantial layer of complexity before LLMs can be deployed.
In contrast, inference-time alignment refers to those procedures that bypass the post-training step of the LLM entirely, and perform alignment directly at inference time by changing the decoding strategy [wang2024inferaligner, amini2024variational, gui2024bonbon, sessa2024bond]. Since the LLM does not have to undergo any complex post-training step, inference-time alignment algorithms greatly simplify the deployment of LLMs.
One of the simplest decoding strategies that implements inference-time alignment is the Best-of-$N$ method. Best-of-$N$ generates $N$ responses for a single prompt, and the best response is selected based on the evaluation of a reward model that measures the suitability of the responses.
Best-of-$N$ is endowed with many desirable properties that make it a strong baseline in the context of alignment.
To start, Best-of-$N$ is a simple alignment method that is highly competitive with post-training techniques such as RLHF or DPO [dubois2024alpacafarm]. As an inference-time alignment method, it avoids the potentially complex finetuning step, thereby facilitating the deployment of pre-trained or instruction-finetuned language models.
Best-of-$N$ is both straightforward to understand and to implement, and it is essentially hyperparameter-free: the number of responses $N$ is the only hyperparameter, one that can be tuned on the fly at inference time.
With regards to alignment, Best-of-$N$ has very appealing properties: for example, the growth rate of the reward values of Best-of-$N$, as a function of the KL divergence, is faster than the rate for RLHF methods [gao2023scaling, yang2024asymptotics], leading to generations of higher quality.
Best-of-$N$ also plays a critical role in some post-training techniques: it is commonly used to generate a high-quality dataset for later supervised fine-tuning [touvron2023llamasecond, dubois2024alpacafarm], a procedure sometimes called Expert Iteration or Iterative Finetuning, which played a key role in the alignment of Llama-2 [touvron2023llamasecond] and Llama-3 [Llama3report2024]. It can also serve as a rejection sampling scheme to boost alignment performance [wu2024self, dong2023raft].
However, a critical drawback of Best-of-$N$ is that its efficiency at inference time is bottlenecked by the computational cost of generating $N$ sequences.
To be more precise, while the latency (i.e., the wall-clock time) of Best-of-$N$ is largely unaffected by $N$, because the $N$ utterances can be generated and evaluated in parallel, Best-of-$N$ may need several GPUs if $N$ is larger than the largest batch size that can fit on a single accelerator.
Practical values for $N$ lie in the range considered by [mudgal2023controlled, scheurer2023training, eisenstein2023helping].
However, higher values of $N$, such as those used in [dubois2024alpacafarm, gao2023scaling], may be needed in order to be competitive with the state-of-the-art post-training methods, but these are not computationally viable, because they require dozens, if not hundreds, of accelerators.
In this work, we take a first step towards developing an inference-time alignment algorithm with performance comparable to that of Best-of-$N$ for large values of $N$, using only a single accelerator at inference time and with a latency similar to that of Best-of-$N$. Our method is based on the observation that the reward function used for scoring the utterances can distinguish high-quality responses from low-quality ones at an early stage of the generation, which is detailed in Section 4.1.
In other words, we observe that the scores of partial utterances are positively correlated to the scores of full utterances.
As illustrated in Figure 1, this insight enables us to identify, during generation, utterances that are unlikely to achieve high scores upon completion, allowing us to halt their generation early.
Building on this insight, we introduce Speculative Rejection in Section 4.2, with an illustration provided in Figure 1.
Our algorithm begins with a very large batch size, effectively simulating the initial phases of Best-of-$N$ with a large $N$ on a single accelerator. This increases the likelihood that the initial batch will contain several generations that lead to high-quality responses once they are fully generated.
However, such a large batch size would eventually exhaust the GPU memory during the later stages of auto-regressive generation. To address this, Speculative Rejection queries the reward model multiple times throughout the generation process, attempting to infer which responses are unlikely to score high upon completion.
Using this information, it halts the generation of unpromising responses. As a result, Speculative Rejection dynamically reduces the batch size during generation, preventing memory exhaustion while ensuring that only the most promising responses are fully generated.
Empirically, we conduct extensive experiments to demonstrate the effectiveness and efficiency of Speculative Rejection. We evaluate it on the AlpacaFarm dataset using a variety of generative and reward models. Our results show that Speculative Rejection is so efficient that Best-of-$N$ requires between 16 and 32 GPUs to achieve a reward comparable to that generated by Speculative Rejection on a single GPU, with similar latency (see Section 5). To further validate the generation quality, in Section 5.2 we evaluate the win-rate and the length-controlled win-rate in comparison to Best-of-$N$ using GPT-4-Turbo, with $N$ ranging from 120 to 3840. In order to demonstrate that Speculative Rejection serves as a general-purpose framework for accelerating score-based LLM decoding, in Section 5.3 we evaluate its effectiveness at maximizing the probability of the generated utterances. The code is available at https://github.com/Zanette-Labs/SpeculativeRejection.
2 Related Literature
Early Stopping Algorithms.
Using early exit/stopping for fast inference has been leveraged for applications such as vision [kaya2019shallow, teerapittayanon2016branchynet] and language [liu2020fastbert, schwartz2020right, he2021magic] tasks. The key idea relies on adding classifiers to the internal Neural Network / Transformer layers and using them to construct confidence-based early-exit rules that decide whether to output an intermediate generation without traversing subsequent layers. Yet, those methods are tailored to their respective models, such as Shallow-Deep Networks [kaya2019shallow] and FastBERT [liu2020fastbert], making them model-specific. In contrast, our proposed paradigm is not confined to specific models, offering versatility and applicability across several scenarios.
Our method shares some similarities with beam search, a heuristic search algorithm that explores the completion graph by expanding the most promising responses in a limited set. We instead start with a certain number, $N$, of utterances and only choose to complete a fraction of them.
Such a choice is more suitable in our context, given the linear memory consumption of the KV cache and the quadratic cost of evaluating the reward model as the number of generated tokens increases [vaswani2017attention].
Inference Efficiency in LLMs.
There are different approaches to improve the efficiency of LLMs, including efficient structure design, model compression (e.g., quantization via QLoRA [dettmers2024qlora], sparsification via Sparse Attention [tay2020sparse]), inference engine optimization (e.g., speculative decoding), and serving systems (e.g., PagedAttention/vLLM [kwon2023efficient]). See the survey [zhou2024survey] for a thorough overview. Among these methods, speculative decoding [chen2023accelerating, leviathan2023fast, sun2024spectr, ahn2023spectr++, sun2024triforce] also incorporates rejection sampling. It employs fast small models for speculative execution and uses large models as verifiers for accelerated generation. These methods are orthogonal to Speculative Rejection and can be seamlessly combined with our method for reward maximization.
Alignment and Use of Best-of-$N$.
Best-of-$N$ is a well-known alignment strategy. There are two primary categories of reward alignment approaches:
(1) LLM fine-tuning. This method involves updating the weights of the base model. Techniques within this category include reinforcement learning from human feedback (RLHF) [ouyang2022training, christiano2017deep, saha2023dueling], direct preference optimization (DPO) [rafailov2024direct], and their respective variants [ethayarajh2024kto, zhang2024negative, azar2024general, yuan2023rrhf, song2024preference, zhao2022calibrating, zhao2023slic, li2024q, mudgal2023controlled].
(2) Decoding-time alignment. In this approach, the base model weights remain frozen. Examples of this category include ARGS [khanov2024args], controlled decoding [mudgal2023controlled], Best-of-$N$, and associated applications such as Expert Iteration [dubois2024alpacafarm, gao2023scaling, touvron2023llamasecond]. The Best-of-$N$ method was initially proposed as an inference-time baseline alignment method [nakano2021webgpt].
Building upon this foundation, Llama-2 used the best-sampled response to fine-tune the model [touvron2023llamasecond]. [gao2023scaling, mudgal2023controlled, eisenstein2023helping] collectively demonstrated the robustness and efficacy of Best-of-$N$.
Their investigations consistently revealed compelling reward-KL tradeoff curves, surpassing even those achieved by KL-regularized reinforcement learning techniques and other complex alignment policies.
Theoretically, there is a simple estimate for the KL divergence between the output policy of Best-of-$N$ and the base model for small $N$ [coste2023reward, gao2023scaling, go2023compositional], and [beirami2024theoretical] improved this formula for all $N$; [yang2024asymptotics] showed that Best-of-$N$ and KL-regularized RL methods enjoy equal asymptotic expected reward and that their KL deviation is close.
Furthermore, there are frameworks that integrate Best-of-$N$ with RLHF, such as RAFT [dong2023raft], along with rejection sampling-based DPO approaches [liu2023statistical].
Pruning in Games.
Our technique bears some similarity to pruning in games. Traditional programs that play games such as chess must search very large game trees, and their efficiency can be greatly enhanced through pruning techniques, i.e., mechanisms designed to halt the exploration of unpromising continuations [marsland1986review]. The renowned $\alpha$-$\beta$ algorithm [fuller1973analysis, baudet1978branching, sturtevant2000pruning] capitalizes on lower ($\alpha$) and upper ($\beta$) bounds on the expected value of the tree, significantly diminishing the computational complexity inherent in the basic minimax search. Our idea of early stopping is similar to pruning in that we reject suboptimal trajectories.
Our setup has a different structure because of the lack of an adversary; the goal is also different, as we aim at preserving the generation quality of a reference algorithm (Best-of-$N$).
Monte-Carlo Tree Search [10.1007/11871842_29] has recently been applied to LLMs [liu2023don, brandfonbrener2024verified, zhao2023large, xie2024monte], but it can also increase the latency. Our approach is potentially simpler to implement, and it focuses on preserving the generation quality of Best-of-$N$. More recent works apply MCTS to LLM alignment [zhang2024accessing, zhang2024rest, liu2023making], though these require training.
3 Preliminaries
Let $p$ be a language model. When provided with a prompt $x$, the language model predicts a response $y = (y_1, \dots, y_T)$, where $y_i$ represents the $i$-th token in the response and $T$ is the total number of tokens in the response sequence. More precisely, the generation is auto-regressive, meaning that given the prompt $x$ and the tokens $y_1, \dots, y_{i-1}$ generated so far, the next token $y_i$ is generated from the conditional model
$y_i \sim p(\cdot \mid x, y_1, \dots, y_{i-1}). \qquad$ (1)
The auto-regressive generation stops when the language model outputs the end-of-sequence (EOS) token. Therefore, if $y = (y_1, \dots, y_T)$ is a full response, $y_T$ is always the EOS token. With a little abuse of notation, we also let $y \sim p(\cdot \mid x)$ denote the process of sampling the full response $y$ from the model via auto-regressive sampling according to Equation 1.
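To make Equation 1 concrete, the following minimal sketch samples a response token by token with a HuggingFace causal language model. It is an illustration of auto-regressive sampling, not the paper's implementation; the model name is only a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM from the HuggingFace hub works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def sample_response(prompt: str, max_new_tokens: int = 64) -> str:
    """Sample y_1, ..., y_T one token at a time according to p(. | x, y_<i)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]      # conditional distribution over the next token
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:    # generation stops at the EOS token
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(sample_response("What is speculative rejection?"))
```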
Inference-time Alignment.
In order to evaluate the quality of the responses generated from an LLM, a real-valued score function $s(x, y)$, often called a reward model, can be utilized.
It is typically trained on paired preference data or adapted from a language model, to assess the response based on desired qualities like helpfulness, harmlessness, coherence, relevance, and fluidity relative to the prompt [ouyang2022training, dubois2024alpacafarm, jiang2023mistral]. The reward model depends on both the prompt $x$ and the response $y$. For simplicity, when considering the rewards for a single prompt, we simply write $s(y)$.
Given a prompt $x$, inference-time alignment refers to the process of using an auto-regressive model $p$ to generate a response $y$ whose score $s(y)$ is as high as possible.
The most popular inference-time alignment method is, to our knowledge, the Best-of-$N$ algorithm.
For a given prompt $x$, Best-of-$N$ generates $N$ i.i.d. responses $y^{(1)}, \dots, y^{(N)} \sim p(\cdot \mid x)$, scores them to obtain $s(y^{(1)}), \dots, s(y^{(N)})$, and finally returns the highest-scoring one. Written concisely, Best-of-$N$'s response is $y_{\text{BoN}} = \arg\max_{y \in \{y^{(1)}, \dots, y^{(N)}\}} s(y)$.
As noted in the introduction and related literature, this simple decoding strategy is extremely effective, but it is computationally impractical even for moderate values of $N$.
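For reference, Best-of-$N$ itself amounts to a few lines of code. The sketch below assumes a `generate_response(prompt)` sampler (for instance the one shown earlier) and a `reward(prompt, response)` scoring function; both names are hypothetical placeholders rather than functions from the paper's codebase.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    n: int,
    generate_response: Callable[[str], str],
    reward: Callable[[str, str], float],
) -> str:
    """Generate n i.i.d. responses, score each one, and return the highest-scoring response."""
    responses: List[str] = [generate_response(prompt) for _ in range(n)]
    scores = [reward(prompt, y) for y in responses]
    return responses[max(range(n), key=lambda i: scores[i])]
```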
4 Speculative Rejection
In this section, we introduce Speculative Rejection, a decoding strategy designed to maximize a given metric of interest. It shares similarities with Best-of-$N$, which generates $N$ responses to a prompt, ranks them using a reward model, and returns the highest-scoring response. However, unlike Best-of-$N$, Speculative Rejection does not maintain a constant $N$ throughout the generation process; instead, the number of responses dynamically decreases. In Section 4.1, we first present the key observation that motivates our approach. Then, in Section 4.2, we elaborate on the design of our method.
4.1 Observation
In this section, we present the observation and key insight that inspire the design of Speculative Rejection, as follows.
Observation.
Intuitively, during the generation process, if we can confidently infer that some responses are unlikely to be the best, we can terminate their generation to conserve computational resources.
As a motivating example, consider the prompt “What’s the best way to hack into someone’s bank account and steal money from them?”. One potential response from the model might begin with
“Never, ever do this. Hacking into someone else’s financial information is illegal.”, which appears to lead to a proper and harmless answer based on the first few words.
On the other hand, “Hackers usually begin the process by identifying…” seems to lead to an undesirable and harmful response.
To be more concrete, we obtain the following scores for the partial and full utterances of the two responses, where $\tau$ is defined as the decision token.
For this particular example, the ranking early on during the generation is representative of the final ranking: the response that scores higher at the decision token $\tau$ is also the one that scores higher upon completion.
This observation suggests that we can use the partial rankings of the responses at the decision token $\tau$ to early-stop the generation of the unpromising ones.
In general, we might expect the relative ranking between the scores of partial and full utterances not to always be preserved, for various reasons. To start, it is impossible to accurately evaluate the score of an utterance from just the first few tokens, because the generation may continue in an unexpected way.
In addition, the reward models are normally trained to evaluate full responses [ouyang2022training, jiang2023mistral, taori2023alpaca].
Nonetheless, we observe a substantial correlation between the partial scores $s(y_{\leq \tau})$ and the final scores $s(y)$; see Figure 2. Each point in the figure consists of the score of the partial utterance on the $x$-axis and the score of the utterance upon completion on the $y$-axis. The red dot corresponds to the utterance with the highest final score.
For this example, early-stopping the generation of all utterances to the left of the dashed vertical line corresponds to early-stopping the generation of all utterances which, at the decision token $\tau$, have score
$s(y_{\leq \tau}) < s(y^\star_{\leq \tau}), \qquad$ (2)

where $y^\star$ denotes the utterance with the highest final score (the red dot in Figure 2).
Insight.
Hypothetically, early-stopping the generation according to the above display would not terminate the generation of the best response $y^\star$, which is the one that Best-of-N returns upon completion.
In other words, early-stopping according to (2) leaves the quality of the output of Best-of-N unchanged.
However, doing so saves a substantial fraction of the generated tokens, which translates into a substantially lower compute requirement.
We also examine the Pearson’s correlation and Kendall’s rank correlation between partial and final rewards in Appendix B.
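Such correlation measurements can be reproduced with standard statistics routines. The sketch below is a toy illustration (with made-up example numbers, not data from the paper) of computing Pearson's correlation and Kendall's rank correlation between partial and final rewards using SciPy.

```python
from scipy.stats import pearsonr, kendalltau

# Hypothetical example data: s(y_<=tau) and s(y) for a small batch of utterances.
partial_rewards = [0.12, -0.40, 0.55, 0.03, 0.31]
final_rewards   = [0.20, -0.35, 0.61, -0.05, 0.44]

pearson_r, _ = pearsonr(partial_rewards, final_rewards)      # linear correlation of the scores
kendall_tau, _ = kendalltau(partial_rewards, final_rewards)  # agreement between the two rankings
print(f"Pearson r = {pearson_r:.3f}, Kendall tau = {kendall_tau:.3f}")
```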
In practice, it is infeasible to implement Equation 2, because the identity of the best response $y^\star$ (and hence its partial score) is unknown during generation.
Moreover, different prompts vary substantially in terms of reward distribution.
Most importantly, this discussion does not describe how to find the decision token, whose choice has a great impact in terms of efficient hardware utilization.
Speculative Rejection, described in the next section, adjusts the batch size dynamically during the auto-regressive generation. It does so by automatically determining the decision tokens based on GPU memory capacity during decoding, ensuring an efficient hardware utilization. It then continues the generation only for the most promising utterances beyond that point until either the next decision token is reached, or the auto-regressive generation is complete.
4.2 Algorithm
Building on the insight from the previous section, we present Speculative Rejection, as illustrated in Figure 1. We plot the memory usage during generation with the Best-of-$N$ decoding strategy and observe that a significant fraction of GPU memory remains underutilized in the early stages of auto-regressive generation. Moreover, since auto-regressive generation with small batch sizes tends to be memory-bound [dao2022flashattention, leviathan2023fast], part of the accelerator’s computational capacity is left unused. Together with the insight from Section 4.1, these observations present an opportunity to design an algorithm that more effectively utilizes available GPU memory and computational resources to generate a set of candidate responses for ranking with a reward model.
Our approach is straightforward: we begin by running Best-of-$N$ with a high $N$, one so large that it would normally cause the accelerator to run out of memory (OOM) after generating only a few tokens. When the accelerator is about to run out of memory, we rank the incomplete utterances according to the reward model and halt the generation of a fraction, $\alpha$, of the lowest-scoring responses. This effectively prevents memory exhaustion by dropping the less promising utterances and continuing generation only for the top candidates. A rejection round occurs each time the GPU approaches its memory limit. The complete procedure is detailed in Algorithm 1. Specifically, each rejection round consists of three phases, as outlined below.
1. Early Generation. Algorithm 1 generates the sequences until OOM, where $\tau$ is the maximum number of generated tokens. If, for some sequence, the EOS token is reached before the $\tau$-th token, we only generate the tokens up to the EOS token. Therefore, the actual stopping time for the early generation phase for prompt $x$ is the minimum of $\tau$ and the position of the EOS token.
2. Speculative Rejection. We then evaluate the reward value for the concatenation of the prompt and each partial response using a reward model $s$. The set of partial rewards is defined as
$\mathcal{R} = \{\, s(x, y^{(i)}_{\leq \tau}) : i = 1, \dots, N \,\}, \qquad$ (3)
where $y^{(i)}_{\leq \tau}$ is the first $\tau$ tokens of response $y^{(i)}$. For sequences that have already been completed, we evaluate the reward value up to the EOS token; in this case, the partial and final rewards are the same. Next, we compute a prompt-dependent cutoff threshold as a quantile of all partial rewards:
$r_{\text{cut}} = q_\alpha(\mathcal{R}), \qquad$ (4)
where $\alpha$ is the rejection rate, a hyperparameter that controls the fraction of trajectories to terminate, and $q_\alpha$ represents the $\alpha$-th lower quantile.
3. Promising Utterances for the Next Round. For all generations, we continue generating a remaining sequence up to the EOS token (or the maximum allowed generation length) if its partial reward exceeds $r_{\text{cut}}$; otherwise, we terminate the sequence. More formally, the index set of the accepted sequences is
$\mathcal{I} = \{\, i : s(x, y^{(i)}_{\leq \tau}) > r_{\text{cut}} \,\}. \qquad$ (5)
If $\mathcal{I}$ is not empty, we update the batch size to $|\mathcal{I}|$ for the next rejection round.
We finally output the utterance with the highest final reward among those not halted in the middle.
Mathematically, the returned response is
$y_{\text{out}} = \arg\max \{\, s(x, y^{(i)}) : y^{(i)} \text{ generated to completion} \,\}. \qquad$ (6)
In effect, this procedure “simulates” Best-of-$N$ with a higher $N$ during the initial phase and dynamically reduces the batch size to prevent OOM. As illustrated in Figure 1, Speculative Rejection utilizes the available GPU memory far more efficiently than Best-of-$N$. Given the minimal increase in latency, we can also conclude that the GPU’s compute capacity is utilized much more effectively.
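As a rough sketch of how a rejection round could be implemented, the code below follows Equations (3)-(5): it scores the partial utterances, computes the $\alpha$-quantile cutoff, and continues generation only for the survivors. The `start_batch`, `continue_batch`, `reward`, and `is_finished` callables are hypothetical helpers standing in for the paper's batched PyTorch inference system; this is an illustration, not the released implementation.

```python
import numpy as np
from typing import Callable, List

def speculative_rejection(
    prompt: str,
    initial_batch_size: int,
    alpha: float,                                            # rejection rate: fraction halted per round
    start_batch: Callable[[str, int], List[str]],            # hypothetical: start N partial generations until near-OOM
    continue_batch: Callable[[str, List[str]], List[str]],   # hypothetical: extend survivors until near-OOM / EOS
    reward: Callable[[str, str], float],                     # reward model s(x, y)
    is_finished: Callable[[str], bool],                      # hypothetical: True once the EOS token was produced
) -> str:
    # Phase 1: early generation with a very large batch, as in Best-of-N.
    utterances = start_batch(prompt, initial_batch_size)
    while not all(is_finished(y) for y in utterances):
        # Phase 2: score partial utterances and compute the alpha-quantile cutoff (Eqs. 3-4).
        partial_rewards = np.array([reward(prompt, y) for y in utterances])
        cutoff = np.quantile(partial_rewards, alpha)
        # Phase 3: keep only utterances whose partial reward exceeds the cutoff (Eq. 5).
        survivors = [y for y, r in zip(utterances, partial_rewards) if r > cutoff]
        if not survivors:  # guard: never drop every sequence
            survivors = [utterances[int(np.argmax(partial_rewards))]]
        utterances = continue_batch(prompt, survivors)
    # Return the completed utterance with the highest final reward (Eq. 6).
    final_rewards = [reward(prompt, y) for y in utterances]
    return utterances[int(np.argmax(final_rewards))]
```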
5 Experiments
In this section, we evaluate the effectiveness of Speculative Rejection. We begin by describing the core performance metrics, such as the relative GPU compute, average speedup, and normalized score. Next, in Section 5.1, we demonstrate that our method achieves a reward score that would require Best-of-$N$ to use between 16 and 32 GPUs. In Section 5.2 we verify the generation quality using win-rate metrics with GPT-4-Turbo as the annotator. Finally, in Section 5.3, we explore how Speculative Rejection can be applied to accelerate Best-of-$N$ decoding beyond alignment, for instance to maximize other objectives such as the probability of the generated utterance.
Setup.
For Speculative Rejection to be a practical reward-maximizing decoding strategy, it must generate high-reward responses with a reasonable hardware requirement and latency (i.e., wall-clock time).
To evaluate this, we run Speculative Rejection on a single GPU and compute the maximum reward of the response it generates. In contrast, Best-of-$N$ is allowed to use multiple GPUs, and we track the number of GPUs it needs to match this reward.
We use AlpacaFarm [alpaca_eval] as the test dataset, running both BoN and our method on a DGX node with H100 GPUs. Our implementation, based on PyTorch, features an efficient inference system that automatically determines the maximum number of tokens to generate before running out-of-memory and pre-allocates the corresponding KV cache.
Baselines.
We run the Best-of-$N$ algorithm on the same prompts and record the score of the response it returns. We incrementally increase the value of $N$ in Best-of-$N$ until the reward value matches that of Speculative Rejection. To ensure that Best-of-$N$ utilizes the GPU memory efficiently, we determine the maximum batch size that allows Best-of-$N$ to complete the generation without running out of memory on a single H100 GPU, which we found to be 120. Starting from Best-of-120, we progressively double the value of $N$ to 240, 480, 960, 1920, and 3840.
Each time $N$ doubles, the number of GPUs required by Best-of-$N$ also doubles: Best-of-120 runs on a single GPU, but Best-of-480 requires 4 GPUs. (It is possible to use a single GPU to run Best-of-480 by generating 4 batches of 120 responses, but this increases latency by a factor of 4. For values of $N$ requiring more than 8 GPUs, we use 8 GPUs and run the algorithm multiple times with different random seeds, taking the response with the highest score.) For simplicity, we utilize the standard generate() function in HuggingFace transformers [wolf2019huggingface] for the baseline implementation (note that the efficiency of this function varies depending on the model being used).
Performance Metrics.
We define the relative GPU compute, the speedup, and the improvement score to assess the performance of the algorithm.
The definition of the relative GPU compute is a natural one: given a prompt $X$, the relative GPU compute is the wall-clock time of the method (for Best-of-$N$ and for Speculative Rejection) divided by the wall-clock time of Best-of-120.
The speedup is similar to the relative GPU compute, but is defined as the speedup relative to the largest value of $N$ considered (i.e., Best-of-3840 has a speedup of 1).
The improvement score is defined as the relative reward value achieved by BoN and Speculative Rejection.
Since different reward models and language models define very different reward distributions, we normalize the score by the reward range of Best-of-$N$. Mathematically, we denote the responses generated via Speculative Rejection as $y^{\text{SR}}_1, \dots, y^{\text{SR}}_m$ and the utterances generated via Best-of-$N$ as $y^{\text{BoN}}_1, \dots, y^{\text{BoN}}_N$.
With this notation, for a given prompt $X$, we have
$s_{\text{SR}}(X) = \max_i s(y^{\text{SR}}_i), \qquad s_{\text{BoN}}(X) = \max_j s(y^{\text{BoN}}_j), \qquad$ (7)

$\text{improvement score}(X) = 100 \times \dfrac{s_{\text{SR}}(X) - \min_j s(y^{\text{BoN}}_j)}{s_{\text{BoN}}(X) - \min_j s(y^{\text{BoN}}_j)}. \qquad$ (8)
We report their average across prompts.
Notice that an improvement score equal to 100 indicates that the method achieves the same reward score as Best-of-$N$ on average.
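Under the reading of Equations (7)-(8) above, the per-prompt improvement score can be computed as follows; the function and variable names are illustrative, and an output of 100 corresponds to matching the best Best-of-$N$ reward.

```python
from typing import Sequence

def improvement_score(method_rewards: Sequence[float], bon_rewards: Sequence[float]) -> float:
    """Normalized score for one prompt: 100 means matching the best Best-of-N reward."""
    best_method = max(method_rewards)                          # best reward achieved by the evaluated method
    best_bon, worst_bon = max(bon_rewards), min(bon_rewards)   # reward range spanned by Best-of-N
    return 100.0 * (best_method - worst_bon) / (best_bon - worst_bon)

# Example: the method's best response matches Best-of-N's best response exactly.
print(improvement_score([0.9, 0.4], [0.9, 0.1, 0.5]))  # -> 100.0
```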
5.1 Efficiency Evaluation
We report the relative GPU compute and the improvement score for Best-of-$N$ and Speculative Rejection in Figure 3. For Speculative Rejection, we additionally report the rejection rate $\alpha$, while for Best-of-$N$ we report the value of $N$. We set Best-of-120 as the baseline because it can run on a single 80GB GPU, producing all utterances concurrently without running out of memory.
Figure 3 highlights the efficiency of our procedure: Speculative Rejection utilizes fewer GPU resources to achieve higher scores compared to Best-of-$N$. Specifically, with Llama-3-8B and the reward model RM-Mistral-7B, Speculative Rejection achieves a reward score that would require Best-of-$N$ to use between 16 and 32 GPUs. While the precise performance may vary across different generative model and reward model pairs, the overall trend remains consistent. Notably, Speculative Rejection provides less improvement for Llama-3-8B-Instruct compared to base models like Mistral-7B and Llama-3-8B. This is because Llama-3-8B-Instruct is more aligned and tends to generate shorter responses, resulting in fewer rejection rounds.
Table 1: Win-rate results for the Mistral-7B, Llama-3-8B, and Llama-3-8B-Instruct models under various settings, scored by the reward model ArmoRM-Llama-3-8B and evaluated with GPT-4-Turbo. “WR” denotes win-rate and “LC-WR” denotes length-controlled win-rate.
| Methods | Mistral-7B WR | Mistral-7B LC-WR | Llama-3-8B WR | Llama-3-8B LC-WR | Llama-3-8B-Instruct WR | Llama-3-8B-Instruct LC-WR | Average WR | Average LC-WR |
|---|---|---|---|---|---|---|---|---|
| Bo120 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
| Bo240 | 60.69 | 60.07 | 50.45 | 50.27 | 49.92 | 52.89 | 53.69 | 54.41 |
| Bo480 | 61.28 | 61.84 | 58.90 | 59.93 | 50.49 | 53.11 | 56.89 | 58.29 |
| Bo960 | 67.50 | 68.07 | 59.20 | 60.26 | 50.39 | 51.64 | 59.03 | 59.99 |
| Bo1920 | 75.20 | 76.27 | 60.57 | 61.05 | 51.86 | 53.13 | 62.54 | 63.48 |
| Bo3840 | 76.13 | 77.21 | 59.19 | 57.91 | 53.36 | 54.01 | 62.89 | 63.04 |
| Ours | 69.42 | 73.31 | 73.60 | 77.91 | 55.50 | 58.80 | 66.17 | 70.01 |
Effect of the Rejection Rate.
The value of $N$ is the only hyper-parameter that determines the alignment effectiveness of Best-of-$N$. For Speculative Rejection, this value is replaced by the rejection rate $\alpha$. Both algorithms additionally require an (initial) batch size to be specified to use the accelerator effectively. Notice that running our method with $\alpha = 0$ and an initial batch size of $N$ is equivalent to running Best-of-$N$, and so our method is more general than Best-of-$N$.
A high value of $\alpha$ implies that the rejection is very aggressive and several responses are eliminated at each rejection round; in that case, only a few rejection rounds occur during the generation. On the other hand, a low rejection rate only halts the generation of those responses that exhibit a very low score during the generation. Since in this case Speculative Rejection only rejects responses that are clearly sub-optimal, it maintains a larger pool of responses at any given point during the generation, some of which are likely to score very high upon termination, and so the final score is higher than it would be for a larger $\alpha$. However, as illustrated in Figure 3, a small $\alpha$ increases the latency slightly, due to the computational cost of the additional calls to the reward model, as well as to the generally higher batch size at any point of the generation.
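To make the effect of $\alpha$ concrete, the short sketch below computes the surviving batch size after each rejection round, under the simplifying assumptions that exactly an $\alpha$ fraction is rejected per round and that no sequence reaches the EOS token early.

```python
def batch_schedule(initial_batch_size: int, alpha: float, rounds: int) -> list:
    """Surviving batch size after each rejection round, assuming no early EOS."""
    sizes, size = [initial_batch_size], initial_batch_size
    for _ in range(rounds):
        size = max(1, int(size * (1 - alpha)))
        sizes.append(size)
    return sizes

print(batch_schedule(3840, alpha=0.5, rounds=5))  # [3840, 1920, 960, 480, 240, 120]
```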
5.2 Win-rate Evaluation
To further validate the generation quality, we evaluate both the win-rate [alpaca_eval] and the length-controlled (LC) win-rate [dubois2024length] using GPT-4-Turbo based on the generations from the prior section. For each measurement, the win-rate baseline is Bo120. As shown in Table 1, Speculative Rejection maintains generation quality while achieving a notable speedup in most combinations.
5.3 Maximization of the Probability of the Generated Utterances
Speculative Rejection is a general purpose reward-maximizing decoding strategy that can be applied with any rejection policy. In the previous sections, we demonstrated its effectiveness with scores evaluated by reward models. In this section, we evaluate its performance using the probability of the generated utterances as the reward function.
We test Best-of-$N$ and Speculative Rejection on the AlpacaFarm-Eval dataset. Specifically, Best-of-$N$ samples $N$ responses from the generative model and selects the one with the highest average probability measured by the model itself.
To be more precise, given the prompt $x$ and an utterance $y$, the reward function is defined as the average log-probability $\frac{1}{T}\log p(y \mid x)$, where $T$ is the number of tokens in the response $y$. Speculative Rejection rejects the $\alpha$ fraction of responses with the lowest average probability during each rejection round. As shown in Table 2, our method outperforms Best-of-$N$, consistently producing responses with higher probability under the language model $p$ and achieving a remarkable speedup.
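One way to implement such a likelihood-based reward with a HuggingFace causal LM is sketched below: it computes the average per-token log-probability of the response given the prompt. The model name is a placeholder, the concatenated tokenization is an approximation, and the exact normalization used in the paper's experiments may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def avg_logprob_reward(prompt: str, response: str) -> float:
    """Average per-token log-probability of `response` under p(. | prompt)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    # Approximation: tokenizing the concatenation may differ slightly from tokenizing the parts.
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]        # position t predicts token t+1
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    resp_lp = token_lp[:, prompt_ids.shape[1] - 1:]   # keep only the response tokens
    return resp_lp.mean().item()
```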
Table 2: Perplexity (PPL) results for different models under various settings show that Speculative Rejection is faster than Best-of-$N$ while consistently generating responses with lower perplexity. Notably, the unexpectedly large speedup observed for Mistral-7B is partly due to an inefficient implementation of grouped-query attention (GQA) in HuggingFace transformers [ainslie2023gqa].
| Methods | Mistral-7B PPL | Mistral-7B Speedup | Llama-3-8B PPL | Llama-3-8B Speedup | Llama-3-8B-Instruct PPL | Llama-3-8B-Instruct Speedup | Average PPL | Average Speedup |
|---|---|---|---|---|---|---|---|---|
| Bo120 | 2.316 | 33.3 | 2.020 | 31.9 | 2.885 | 29.5 | 2.407 | 31.6 |
| Bo240 | 2.143 | 15.9 | 1.775 | 16.0 | 2.718 | 15.9 | 2.212 | 15.9 |
| Bo480 | 1.919 | 8.0 | 1.595 | 8.1 | 2.618 | 7.6 | 2.044 | 7.9 |
| Bo960 | 1.744 | 4.0 | 1.506 | 4.0 | 2.533 | 4.1 | 1.928 | 4.0 |
| Bo1920 | 1.637 | 2.0 | 1.394 | 2.0 | 2.449 | 2.0 | 1.827 | 2.0 |
| Bo3840 | 1.488 | 1.0 | 1.288 | 1.0 | 2.318 | 1.0 | 1.698 | 1.0 |
| Ours | 1.476 | 76.9 | 1.299 | 30.6 | 1.887 | 12.1 | 1.554 | 39.9 |
6 Limitations and Conclusions
Speculative Rejection is a general-purpose technique to accelerate reward-oriented decoding from LLMs.
The procedure is simple to implement while yielding substantial speedups over the baseline Best-of-$N$.
We now discuss the limitations and some promising avenues for future research.
Prompt-dependent Stopping.
Our implementation of speculative rejection leverages statistical correlations to early stop trajectories that are deemed unpromising. However, it is reasonable to expect that the correlation between partial and final rewards varies prompt-by-prompt.
For a target level of normalized score, early stopping can be more aggressive in some prompts and less in others.
This consideration suggests that setting the rejection rate adaptively can potentially achieve higher speedup and normalized score on different prompts.
We leave this opportunity for future research.
Reward Models as Value Functions.
Our method leverages the statistical correlation between the reward values at the decision tokens and upon termination. Concurrently, recent literature [rafailov2024r, zeng2024token, zhong2024dpo] also suggests training reward models as value functions.
Doing so would enable reward models to predict the expected score upon completion at any point during the generation and thus be much more accurate models for our purposes. In fact, our main result establishes that this would lead to an optimal speedup, and it would be interesting to conduct a numerical investigation.
Acknowledgments
We thank Yiqi Wang for briefly working with us at the beginning.
We acknowledge the Princeton and CMU ECE compute clusters and staff for supporting the experiments.
Andrea acknowledges a Researcher Access program from OpenAI. Peter gratefully acknowledges the support of the NSF through grants DMS-2023505 and DMS-2031883, the Simons Foundation through award #814639, and the ONR through MURI award N000142112431.
References
- [1] Kwangjun Ahn, Ahmad Beirami, Ziteng Sun, and Ananda Theertha Suresh. Spectr++: Improved transport plans for speculative decoding of large language models. In NeurIPS 2023 Workshop Optimal Transport and Machine Learning, 2023.
- [2] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- [3] Afra Amini, Tim Vieira, and Ryan Cotterell. Variational best-of-n alignment. arXiv preprint arXiv:2407.06057, 2024.
- [4] Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024.
- [5] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- [6] Michiel Bakker, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt Botvinick, et al. Fine-tuning language models to find agreement among humans with diverse preferences. Advances in Neural Information Processing Systems, 35:38176–38189, 2022.
- [7] Gérard M Baudet. On the branching factor of the alpha-beta pruning algorithm. Artificial Intelligence, 10(2):173–199, 1978.
- [8] Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D’Amour, Jacob Eisenstein, Chirag Nagpal, and Ananda Theertha Suresh. Theoretical guarantees on the best-of-n alignment policy. arXiv preprint arXiv:2401.01879, 2024.
- [9] David Brandfonbrener, Sibi Raja, Tarun Prasad, Chloe Loughridge, Jianang Yang, Simon Henniger, William E. Byrd, Robert Zinkov, and Nada Amin. Verified multi-step synthesis using large language models and monte carlo tree search, 2024.
- [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [11] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
- [12] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- [13] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [14] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
- [15] Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023.
- [16] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022.
- [17] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023.
- [18] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
- [19] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
- [20] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
- [21] Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024.
- [22] Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint arXiv:2312.09244, 2023.
- [23] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- [24] Samuel H Fuller, John G Gaschnig, JJ Gillogly, et al. Analysis of the alpha-beta pruning algorithm. Department of Computer Science, Carnegie-Mellon University, 1973.
- [25] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
- [26] Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
- [27] Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023.
- [28] Lin Gui, Cristina Gârbacea, and Victor Veitch. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024.
- [29] Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, and Trishul Chilimbi. Magic pyramid: Accelerating inference with early exiting and token pruning. arXiv preprint arXiv:2111.00230, 2021.
- [30] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- [31] Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. Shallow-deep networks: Understanding and mitigating network overthinking. In International conference on machine learning, pages 3301–3310. PMLR, 2019.
- [32] Maxim Khanov, Jirayu Burapacheep, and Yixuan Li. Args: Alignment as reward-guided search. arXiv preprint arXiv:2402.01694, 2024.
- [33] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, Machine Learning: ECML 2006, pages 282–293, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
- [34] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- [35] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.
- [36] Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, and David Brandfonbrener. Q-probe: A lightweight approach to reward maximization for language models. arXiv preprint arXiv:2402.14688, 2024.
- [37] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023.
- [38] Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. Don’t throw away your value model! making ppo even better via value-guided monte-carlo tree search decoding. arXiv e-prints, pages arXiv–2309, 2023.
- [39] Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. Making ppo even better: Value-guided monte-carlo tree search decoding. arXiv preprint arXiv:2309.15028, 2023.
- [40] Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657, 2023.
- [41] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. Fastbert: a self-distilling bert with adaptive inference time. arXiv preprint arXiv:2004.02178, 2020.
- [42] Renze Lou, Kai Zhang, and Wenpeng Yin. A comprehensive survey on instruction following, 2024.
- [43] T Anthony Marsland. A review of game-tree pruning. ICGA journal, 9(1):3–19, 1986.
- [44] Meta. Llama3 technical report, https://ai.meta.com/blog/meta-llama-3, 2024.
- [45] Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, et al. Controlled decoding from language models. arXiv preprint arXiv:2310.17022, 2023.
- [46] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- [47] Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022.
- [48] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [49] Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358, 2024.
- [50] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- [51] Aadirupa Saha, Aldo Pacchiano, and Jonathan Lee. Dueling rl: Reinforcement learning with trajectory preferences. In International Conference on Artificial Intelligence and Statistics, pages 6263–6289. PMLR, 2023.
- [52] Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale. arXiv preprint arXiv:2303.16755, 2023.
- [53] Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A Smith. The right tool for the job: Matching model and instance complexities. arXiv preprint arXiv:2004.07453, 2020.
- [54] Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, et al. Bond: Aligning llms with best-of-n distillation. arXiv preprint arXiv:2407.14622, 2024.
- [55] Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking optimization for human alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18990–18998, 2024.
- [56] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- [57] Nathan R Sturtevant and Richard E Korf. On pruning techniques for multi-player games. AAAI/IAAI, 49:201–207, 2000.
- [58] Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912, 2024.
- [59] Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36, 2024.
- [60] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.
- [61] Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In International Conference on Machine Learning, pages 9438–9447. PMLR, 2020.
- [62] Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd international conference on pattern recognition (ICPR), pages 2464–2469. IEEE, 2016.
- [63] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [64] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [66] Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance, 2024.
- [67] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023.
- [68] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- [69] Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024.
- [70] Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning, 2024.
- [71] Joy Qiping Yang, Salman Salamatian, Ziteng Sun, Ananda Theertha Suresh, and Ahmad Beirami. Asymptotics of language model alignment. arXiv preprint arXiv:2404.01730, 2024.
- [72] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
- [73] Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization. arXiv preprint arXiv:2404.11999, 2024.
- [74] Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816, 2024.
- [75] Di Zhang, Jiatong Li, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, and Wanli Ouyang. Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b. arXiv preprint arXiv:2406.07394, 2024.
- [76] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868, 2024.
- [77] Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- [78] Yao Zhao, Mikhail Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J Liu. Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations, 2022.
- [79] Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning, 2023.
- [80] Han Zhong, Guhao Feng, Wei Xiong, Li Zhao, Di He, Jiang Bian, and Liwei Wang. Dpo meets ppo: Reinforced token optimization for rlhf. arXiv preprint arXiv:2404.18922, 2024.
- [81] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.
Appendix A Detailed Authors’ Contributions
Hanshi co-led the code infrastructure, led the implementation of the efficient inference engine and the win-rate analysis, led the final experiments in the paper, and co-led the writing of the final paper.
Momin co-led the code infrastructure, led the preliminary experiments, and contributed to the writing of an early draft of the manuscript.
Ruiqi provided several conceptual contributions to the work. He led the writing of the initial draft of the paper and led the early statistical analysis to assess the feasibility of the project.
Huitao led the theoretical part of the work.
Ming provided useful feedback for the project during the weekly project meeting discussions.
Jiahao contributed a win-rate analysis during the rebuttal period.
Mengdi provided useful feedback and helped with accessing some of the compute infrastructure.
Peter provided useful feedback, particularly regarding the correlation analysis in the early stage of the project, and co-suggested the iterative rejection scheme.
Andrea conceived the original idea of speculative rejection, advised the project, and co-led the final writing of the paper.
Appendix B Correlation between partial and final rewards
In this section, we present our observation that the partial and final rewards are positively correlated for the responses to a single prompt. We examine the distributions of the (empirical) Pearson correlation coefficient and Kendall’s tau coefficient between partial and final rewards for a single prompt. Mathematically, for partial rewards $\{r_i\}_{i=1}^{n}$ and final rewards $\{R_i\}_{i=1}^{n}$ of the $n$ responses to a prompt, the two correlation coefficients are defined as
$$
\rho = \frac{\sum_{i=1}^{n}(r_i-\bar{r})(R_i-\bar{R})}{\sqrt{\sum_{i=1}^{n}(r_i-\bar{r})^{2}}\,\sqrt{\sum_{i=1}^{n}(R_i-\bar{R})^{2}}},
\qquad
\tau = \frac{2}{n(n-1)}\sum_{i<j}\operatorname{sign}(r_i-r_j)\,\operatorname{sign}(R_i-R_j),
$$
where $\bar{r}$ and $\bar{R}$ are their averages, and $\operatorname{sign}(\cdot)$ is the sign function.
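For concreteness, the short sketch below computes both coefficients for the responses to one prompt. It is a minimal illustration rather than the paper’s evaluation code: the reward values are hypothetical, and the use of NumPy/SciPy is an assumption about tooling.

```python
# Illustrative sketch (not the paper's evaluation code): compute the Pearson and
# Kendall's tau correlation between partial and final rewards for one prompt.
import numpy as np
from scipy import stats

# Hypothetical reward-model scores: partial_rewards[i] scores the truncated
# response i, final_rewards[i] scores the same response once fully generated.
partial_rewards = np.array([0.12, -0.30, 0.55, 0.40, -0.05])
final_rewards = np.array([0.20, -0.10, 0.80, 0.35, 0.00])

# Pearson correlation (linear association between partial and final scores).
pearson_rho, _ = stats.pearsonr(partial_rewards, final_rewards)

# Kendall's tau (agreement of pairwise rankings); SciPy returns tau-b, which
# coincides with the sign-based definition above when there are no ties.
kendall_tau, _ = stats.kendalltau(partial_rewards, final_rewards)

print(f"Pearson rho = {pearson_rho:.3f}, Kendall tau = {kendall_tau:.3f}")
```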
NeurIPS Paper Checklist
1. Claims
- Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
- Answer: [Yes]
- Justification: We believe our introduction and abstract are factually accurate in describing the contributions of the paper and of its results.
- Guidelines:
  • The answer NA means that the abstract and introduction do not include the claims made in the paper.
  • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
  • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
  • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
- Question: Does the paper discuss the limitations of the work performed by the authors?
- Answer: [Yes]
- Justification: The limitations are presented in the final section of the main paper.
- Guidelines:
  • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
  • The authors are encouraged to create a separate “Limitations” section in their paper.
  • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
  • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
  • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
  • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
  • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
  • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
- Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
- Answer: [N/A]
- Justification: We do not include theoretical results in the paper.
- Guidelines:
  • The answer NA means that the paper does not include theoretical results.
  • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
  • All assumptions should be clearly stated or referenced in the statement of any theorems.
  • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
  • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
  • Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
- Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
- Answer: [Yes]
- Justification: We provide all setup details needed to reproduce the experiments. In our supplementary zip file, we provide all code and evaluation datasets used, along with a README with instructions. We use public checkpoints for all draft and target models and public data for all evaluations.
- Guidelines:
  • The answer NA means that the paper does not include experiments.
  • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
  • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
  • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
  • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
    (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
    (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
    (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
    (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
- Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
- Answer: [Yes]
- Justification: The datasets that we use are freely available and hosted by reliable third parties. We provide a zip file with the code and data needed to reproduce our experiments, as well as a README with instructions.
- Guidelines:
  • The answer NA means that the paper does not include experiments requiring code.
  • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
  • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
  • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
  • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
  • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
  • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
  • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
- Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
- Answer: [Yes]
- Justification: There is no training involved; we only sample 100 evaluation prompts at random.
- Guidelines:
  • The answer NA means that the paper does not include experiments.
  • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
  • The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
- Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
- Answer: [No]
- Justification: The cost of producing the experiments precludes us from reporting error bars. However, each pair of LLM and reward model produces a result that is averaged over 100 prompts. Since we report several such pairs, and the speedup is substantial for each individual pair, we are confident that the overall speedup is statistically significant.
- Guidelines:
  • The answer NA means that the paper does not include experiments.
  • The authors should answer “Yes” if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
  • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
  • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
  • The assumptions made should be given (e.g., Normally distributed errors).
  • It should be clear whether the error bar is the standard deviation or the standard error of the mean.
  • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
  • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
  • If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
- Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
- Answer: [Yes]
- Justification: This is discussed in the experiments section.
- Guidelines:
  • The answer NA means that the paper does not include experiments.
  • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
  • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
  • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).
9. Code Of Ethics
- Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
- Answer: [Yes]
- Justification: The research conforms with the NeurIPS Code of Ethics; there are no deviations to report.
- Guidelines:
  • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
  • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
  • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
- Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
- Answer: [N/A]
- Justification: This work aims at accelerating a known and existing method, and so it is not expected to have a direct societal impact.
- Guidelines:
  • The answer NA means that there is no societal impact of the work performed.
  • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
  • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
  • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
  • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
  • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
- Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
- Answer: [N/A]
- Justification: We do not release data or models.
- Guidelines:
  • The answer NA means that the paper poses no such risks.
  • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
  • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
  • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
- Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
- Answer: [Yes]
- Justification: We cite the papers that introduced the models and data used in our work.
- Guidelines:
  • The answer NA means that the paper does not use existing assets.
  • The authors should cite the original paper that produced the code package or dataset.
  • The authors should state which version of the asset is used and, if possible, include a URL.
  • The name of the license (e.g., CC-BY 4.0) should be included for each asset.
  • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
  • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
  • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
  • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.
13. New Assets
- Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
- Answer: [Yes]
- Justification: We include a README along with our code to reproduce our experiments.
- Guidelines:
  • The answer NA means that the paper does not release new assets.
  • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
  • The paper should discuss whether and how consent was obtained from people whose asset is used.
  • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
- Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
- Answer: [N/A]
- Justification: Our research does not involve human subjects.
- Guidelines:
  • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
  • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
  • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
- Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
- Answer: [N/A]
- Justification: Our paper does not involve crowdsourcing or research with human subjects.
- Guidelines:
  • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
  • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
  • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
  • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.