LoRA: Low-Rank Adaptation of Large Language Models
Edward Hu* Yelong Shen* Phillip Wallis Zeyuan Allen-Zhu
Yuanzhi Li Shean Wang Lu Wang Weizhu Chen
Microsoft Corporation
{edwardhu, yeshe, phwallis, zeyuana, yuanzhil, swang, luw, wzchen}@microsoft.com
yuanzhil@andrew.cmu.edu
(Version 2)
Abstract
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
1 Introduction
Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as in the original model. As larger models are trained every few months, this changes from a mere "inconvenience" for GPT-2 (Radford et al., b) or RoBERTa large (Liu et al., 2019) to a critical deployment challenge for GPT-3 (Brown et al., 2020) with 175 billion trainable parameters$^{1}$.
Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed. However, existing techniques often introduce inference latency (Houlsby et al., 2019; Rebuffi et al., 2017) by extending model depth or reduce the model's usable sequence length (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.

Figure 1: Our reparametrization. We only train $A$ and $B$.
We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020), which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low "intrinsic rank", leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers' change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3 175B as an example, we show that a very low rank (i.e., $r$ in Figure 1 can be one or two) suffices even when the full rank (i.e., $d$) is as high as 12,288, making LoRA both storage- and compute-efficient.
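To make the reparametrization concrete, below is a minimal PyTorch sketch of a linear layer augmented with a trainable rank-$r$ update. This is our own illustration rather than the released loralib package; the class name, scaling convention, and initialization choices are assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen pre-trained weight W0 plus a trainable low-rank update B @ A (illustrative sketch)."""

    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 1.0):
        super().__init__()
        # W0: loaded from the pre-trained checkpoint in practice; frozen during adaptation.
        self.weight = nn.Parameter(torch.zeros(d_out, d_in), requires_grad=False)
        # Only A (r x d_in) and B (d_out x r) receive gradients.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))  # zero init so the update BA starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + scaling * (B A) x; the second term is the low-rank adaptation path.
        return F.linear(x, self.weight) + self.scaling * F.linear(F.linear(x, self.lora_A), self.lora_B)
```

In such a setup, only lora_A and lora_B would be passed to the optimizer, which is what keeps the gradient and optimizer-state memory small.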
LoRA possesses several key advantages.
A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices $A$ and $B$ in Figure 1, reducing the storage requirement and task-switching overhead significantly.
LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, we only optimize the injected, much smaller low-rank matrices.
Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction (a sketch of this merge-and-swap procedure appears below, after the last advantage).
LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning. We provide an example in Appendix E.
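To illustrate the task-switching and zero-latency points above, here is a hedged sketch of folding the low-rank factors into the frozen weight for deployment and undoing the merge to swap in a different task; merge_lora and unmerge_lora are hypothetical helper names, not part of the released package.

```python
import torch

@torch.no_grad()
def merge_lora(weight: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor, scaling: float) -> None:
    """Fold the update into the frozen weight in place: W <- W0 + scaling * B @ A."""
    weight += scaling * (lora_B @ lora_A)

@torch.no_grad()
def unmerge_lora(weight: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor, scaling: float) -> None:
    """Recover W0 in place so a different task's (A, B) pair can be merged instead."""
    weight -= scaling * (lora_B @ lora_A)
```

Because the merged weight has the same shape as $W_0$, the deployed model computes $h = (W_0 + BA)x$ with no extra matrix multiplications at inference time; switching tasks amounts to subtracting one $BA$ product and adding another.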
Terminologies and Conventions We make frequent references to the Transformer architecture and use the conventional terminologies for its dimensions. We call the input and output dimension size of a Transformer layer $d_{\text{model}}$. We use $W_q$, $W_k$, $W_v$, and $W_o$ to refer to the query/key/value/output projection matrices in the self-attention module. $W$ or $W_0$ refers to a pre-trained weight matrix and $\Delta W$ its accumulated gradient update during adaptation. We use $r$ to denote the rank of a LoRA module. We follow the conventions set out by (Vaswani et al., 2017; Brown et al., 2020) and use Adam (Loshchilov & Hutter, 2019; Kingma & Ba, 2017) for model optimization and use a Transformer MLP feedforward dimension $d_{ffn} = 4 \times d_{\text{model}}$.
2 Problem Statement
While our proposal is agnostic to training objective, we focus on language modeling as our motivating use case. Below is a brief description of the language modeling problem and, in particular, the maximization of conditional probabilities given a task-specific prompt.
Suppose we are given a pre-trained autoregressive language model $P_\Phi(y \mid x)$ parametrized by $\Phi$. For instance, $P_\Phi(y \mid x)$ can be a generic multi-task learner such as GPT (Radford et al., b; Brown et al., 2020) based on the Transformer architecture (Vaswani et al., 2017). Consider adapting this pre-trained model to downstream conditional text generation tasks, such as summarization, machine reading comprehension (MRC), and natural language to SQL (NL2SQL). Each downstream task is represented by a training dataset of context-target pairs: $\mathcal{Z} = \{(x_i, y_i)\}_{i=1,\ldots,N}$, where both $x_i$ and $y_i$ are sequences of tokens. For example, in NL2SQL, $x_i$ is a natural language query and $y_i$ its corresponding SQL command; for summarization, $x_i$ is the content of an article and $y_i$ its summary.
During full fine-tuning, the model is initialized to pre-trained weights $\Phi_0$ and updated to $\Phi_0 + \Delta\Phi$ by repeatedly following the gradient to maximize the conditional language modeling objective:

$$\max_{\Phi} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi}\left(y_t \mid x, y_{<t}\right)\right) \quad (1)$$
One of the main drawbacks for full fine-tuning is that for each downstream task, we learn a different set of parameters $\Delta\Phi$ whose dimension $|\Delta\Phi|$ equals $|\Phi_0|$. Thus, if the pre-trained model is large (such as GPT-3 with $|\Phi_0| \approx 175$ billion), storing and deploying many independent instances of fine-tuned models can be challenging, if at all feasible.
In this paper, we adopt a more parameter-efficient approach, where the task-specific parameter increment $\Delta\Phi = \Delta\Phi(\Theta)$ is further encoded by a much smaller-sized set of parameters $\Theta$ with $|\Theta| \ll |\Phi_0|$. The task of finding $\Delta\Phi$ thus becomes optimizing over $\Theta$:

$$\max_{\Theta} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(p_{\Phi_0 + \Delta\Phi(\Theta)}\left(y_t \mid x, y_{<t}\right)\right) \quad (2)$$
In the subsequent sections, we propose to use a low-rank representation to encode $\Delta\Phi$ that is both compute- and memory-efficient. When the pre-trained model is GPT-3 175B, the number of trainable parameters $|\Theta|$ can be as small as $0.01\%$ of $|\Phi_0|$.
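As a rough back-of-the-envelope illustration (the specific configuration here is an assumption for exposition, not the exact setting of our experiments): if LoRA with rank $r = 4$ is applied only to the query and value projections of each of GPT-3's 96 Transformer layers, with $d_{\text{model}} = 12{,}288$, then

$$|\Theta| = 96 \times 2 \times (2 \times d_{\text{model}} \times r) = 96 \times 2 \times 2 \times 12{,}288 \times 4 \approx 1.9 \times 10^{7},$$

which is roughly $0.01\%$ of $|\Phi_0| \approx 1.75 \times 10^{11}$.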
3 Aren't Existing Solutions Good Enough?
The problem we set out to tackle is by no means new. Since the inception of transfer learning, dozens of works have sought to make model adaptation more parameter- and compute-efficient. See Section 6 for a survey of some of the well-known works. Using language modeling as an example, there are two prominent strategies when it comes to efficient adaptations: adding adapter layers (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rücklé et al., 2020) or optimizing some forms of the input layer activations (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021). However, both strategies have their limitations, especially in a large-scale and latency-sensitive production scenario.
Adapter Layers Introduce Inference Latency There are many variants of adapters. We focus on the original design by Houlsby et al. (2019), which has two adapter layers per Transformer block, and a more recent one by Lin et al. (2020), which has only one per block but with an additional LayerNorm (Ba et al., 2016). While one can reduce the overall latency by pruning layers or exploiting multi-task settings (Rücklé et al., 2020; Pfeiffer et al., 2021), there is no direct way to bypass the extra compute in adapter layers. This seems like a non-issue since adapter layers are designed to have few parameters (sometimes $<1\%$ of the original model) by having a small bottleneck dimension, which limits the FLOPs they can add. However, large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially. This makes a difference in the online inference setting where the batch size is typically as small as one. In a generic scenario without model parallelism, such as running inference on GPT-2 (Radford et al., b) medium on a single GPU, we see a noticeable increase in latency when using adapters, even with a very small bottleneck dimension (Table 1).
This problem gets worse when we need to shard the model as done in Shoeybi et al. (2020); Lepikhin et al. (2020), because the additional depth requires more synchronous GPU operations such as AllReduce and Broadcast, unless we store the adapter parameters redundantly many times.
Directly Optimizing the Prompt is Hard The other direction, as exemplified by prefix tuning (Li & Liang, 2021), faces a different challenge. We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters, confirming similar observations in the original paper. More fundamentally, reserving a part of the sequence length for adaptation necessarily reduces the sequence length available to process a downstream task, which we suspect makes tuning the prompt less performant compared to other methods. We defer the study on task performance to Section 5.
*Equal contribution.
$^{0}$ Compared to V1, this draft includes better baselines, experiments on GLUE, and more on adapter latency.
$^{1}$ While GPT-3 175B achieves non-trivial performance with few-shot learning, fine-tuning boosts its performance significantly as shown in Appendix A.