这是用户在 2024-5-4 21:53 为 https://ar5iv.labs.arxiv.org/html/2305.14314?_immersive_translate_auto_translate=1 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers &Artidoro Pagnoni &Ari Holtzman &Luke Zettlemoyer

University of Washington
华盛顿大学 {dettmers,artidoro,ahai,lsz}@cs.washington.edu
Equal contribution. 平等贡献。
Abstract 摘要

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.111https://github.com/artidoro/qlora and https://github.com/TimDettmers/bitsandbytes
https://github.com/artidoro/qlora 和 https://github.com/TimDettmers/bitsandbytes

我们提出的 QLoRA 是一种高效的微调方法,它能减少内存使用量,足以在单个 48GB GPU 上对 65B 参数模型进行微调,同时保持完整的 16 位微调任务性能。QLoRA 通过将冻结的 4 位量化预训练语言模型的梯度反向传播到低秩适配器(LoRA)中。我们的最佳模型系列被命名为 Guanaco,它在 Vicuna 基准测试中的表现优于之前所有公开发布的模型,达到了 ChatGPT 99.3% 的性能水平,而在单个 GPU 上只需要 24 小时的微调。QLoRA 引入了一系列创新技术,以在不牺牲性能的前提下节省内存:(a)4 位 NormalFloat(NF4),这是一种新的数据类型,从信息理论上讲是正态分布权重的最优数据类型;(b)双量化,通过量化量化常数来减少平均内存占用;以及(c)分页优化器,用于管理内存峰值。我们使用 QLoRA 对 1,000 多个模型进行了微调,对 8 个指令数据集、多种模型类型(LLaMA、T5)以及常规微调无法运行的模型规模(如 33B 和 65B 参数模型)中的指令跟随和聊天机器人性能进行了详细分析。我们的结果表明,在小型高质量数据集上进行 QLoRA 微调,即使使用的模型比以前的 SoTA 更小,也能获得最先进的结果。我们对基于人类和 GPT-4 评估的聊天机器人性能进行了详细分析,结果表明 GPT-4 评估是人类评估的廉价而合理的替代方案。此外,我们发现当前的聊天机器人基准在准确评估聊天机器人的性能水平方面并不可信。柠檬选取分析表明了 Guanaco 与 ChatGPT 相比的失败之处。我们发布了所有模型和代码,包括用于 4 位训练的 CUDA 内核。 1

1 Introduction 1 引言

Finetuning large language models (LLMs) is a highly effective way to improve their performance, [40, 62, 43, 61, 59, 37] and to add desirable or remove undesirable behaviors [43, 2, 4]. However, finetuning very large models is prohibitively expensive; regular 16-bit finetuning of a LLaMA 65B parameter model [57] requires more than 780 GB of GPU memory. While recent quantization methods can reduce the memory footprint of LLMs [14, 13, 18, 66], such techniques only work for inference and break down during training [65].
对大型语言模型(LLMs)进行微调是提高其性能[40, 62, 43, 61, 59, 37]以及增加理想行为或删除不理想行为[43, 2, 4]的一种非常有效的方法。然而,对超大模型进行微调的成本过高;对一个 LLaMA 65B 参数模型进行常规 16 位微调[ 57] 需要超过 780 GB 的 GPU 内存。虽然最近的量化方法可以减少 LLMs 的内存占用 [ 14, 13, 18, 66],但这些技术仅适用于推理,在训练过程中会出现问题 [ 65]。

We demonstrate for the first time that it is possible to finetune a quantized 4-bit model without any performance degradation. Our method, QLoRA, uses a novel high-precision technique to quantize a pretrained model to 4-bit, then adds a small set of learnable Low-rank Adapter weights [28]
我们首次证明,可以在不降低性能的情况下对量化的 4 位模型进行微调。我们的方法 QLoRA 使用一种新颖的高精度技术将预训练模型量化为 4 位,然后添加一小组可学习的低秩适配器权重[28]。

Table 1: Elo ratings for a competition between models, averaged for 10,000 random initial orderings. The winner of a match is determined by GPT-4 which declares which response is better for a given prompt of the the Vicuna benchmark. 95% confidence intervals are shown (±plus-or-minus\pm). After GPT-4, Guanaco 33B and 65B win the most matches, while Guanaco 13B scores better than Bard.
表 1:10,000 次随机初始排序的平均值。比赛的胜负是由 GPT-4 决定的,GPT-4 宣布在 Vicuna 基准的给定提示下哪种反应更好。图中显示了 95% 的置信区间 ( ±plus-or-minus\pm )。在 GPT-4 之后,Guanaco 33B 和 65B 赢得了最多的比赛,而 Guanaco 13B 的得分高于 Bard。
Model Size Elo
GPT-4 - 1348 ±plus-or-minus\pm 1
Guanaco 65B 41 GB 1022 ±plus-or-minus\pm 1
Guanaco 33B 21 GB   992 ±plus-or-minus\pm 1
Vicuna 13B 维库纳 13B 26 GB   974 ±plus-or-minus\pm 1
ChatGPT -   966 ±plus-or-minus\pm 1
Guanaco 13B 10 GB   916 ±plus-or-minus\pm 1
Bard -   902 ±plus-or-minus\pm 1
Guanaco 7B 6 GB   879 ±plus-or-minus\pm 1

that are tuned by backpropagating gradients through the quantized weights.

QLoRA reduces the average memory requirements of finetuning a 65B parameter model from >>780GB of GPU memory to <<48GB without degrading the runtime or predictive performance compared to a 16-bit fully finetuned baseline. This marks a significant shift in accessibility of LLM finetuning: now the largest publicly available models to date finetunable on a single GPU. Using QLoRA, we train the Guanaco family of models, with the second best model reaching 97.8% of the performance level of ChatGPT on the Vicuna [10] benchmark, while being trainable in less than 12 hours on a single consumer GPU; using a single professional GPU over 24 hours we achieve 99.3% with our largest model, essentially closing the gap to ChatGPT on the Vicuna benchmark. When deployed, our smallest Guanaco model (7B parameters) requires just 5 GB of memory and outperforms a 26 GB Alpaca model by more than 20 percentage points on the Vicuna benchmark (Table 6).
与 16 位完全微调基线相比,QLoRA 将微调 65B 参数模型的平均内存需求从 >> 780GB GPU 内存减少到 << 48GB,而运行时间或预测性能却没有降低。这标志着LLM微调的可访问性发生了重大转变:迄今为止,可在单个 GPU 上进行微调的公开可用模型规模最大。使用 QLoRA,我们训练了 Guanaco 系列模型,其中第二好的模型在 Vicuna [ 10] 基准上达到了 ChatGPT 97.8% 的性能水平,而在单个消费级 GPU 上训练时间不到 12 小时;使用单个专业 GPU 24 小时后,我们最大的模型达到了 99.3%,基本上缩小了在 Vicuna 基准上与 ChatGPT 的差距。在部署时,我们最小的 Guanaco 模型(7B 参数)仅需 5 GB 内存,在 Vicuna 基准测试中比 26 GB 的 Alpaca 模型高出 20 多个百分点(表 6)。

QLoRA introduces multiple innovations designed to reduce memory use without sacrificing performance: (1) 4-bit NormalFloat, an information theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit Integers and 4-bit Floats. (2) Double Quantization, a method that quantizes the quantization constants, saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model). (3) Paged Optimizers, using NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length. We combine these contributions into a better tuned LoRA approach that includes adapters at every network layer and thereby avoids almost all of the accuracy tradeoffs seen in prior work.
QLoRA 引入了多项创新,旨在减少内存使用而不牺牲性能:(1) 4 位 NormalFloat,这是一种信息论上最优的量化数据类型,适用于正态分布数据,其经验结果优于 4 位整数和 4 位浮点数。(2) 双量化(Double Quantization),这是一种量化量化常数的方法,平均每个参数可节省约 0.37 比特(65B 模型约 3 GB)。(3) 分页优化器,使用英伟达统一内存,避免在处理序列长度较长的迷你批次时出现梯度检查点内存峰值。我们将这些贡献结合到一个经过更好调整的 LoRA 方法中,该方法包括每个网络层的适配器,从而避免了之前工作中出现的几乎所有精度折衷问题。

QLoRA’s efficiency enables us to perform an in-depth study of instruction finetuning and chatbot performance on model scales that would be impossible using regular finetuning due to memory overhead. Therefore, we train more than 1,000 models across several instruction tuning datasets, model architectures, and sizes between 80M to 65B parameters. In addition to showing that QLoRA recovers 16-bit performance (§4) and training a state-of-the-art chatbot, Guanaco, (§5), we also analyze trends in the trained models. First, we find that data quality is far more important than dataset size, e.g., a 9k sample dataset (OASST1) outperformed a 450k sample dataset (FLAN v2, subsampled) on chatbot performance, even when both are meant to support instruction following generalization. Second, we show that strong Massive Multitask Language Understanding (MMLU) benchmark performance does not imply strong Vicuna chatbot benchmark performance and vice versa—in other words, dataset suitability matters more than size for a given task.

Furthermore, we also provide a extensive analysis of chatbot performance that uses both human raters and GPT-4 for evaluation. We use tournament-style benchmarking where models compete against each other in matches to produce the best response for a given prompt. The winner of a match is judged by either GPT-4 or human annotators. The tournament results are aggregated into Elo scores [16, 17] which determine the ranking of chatbot performance. We find that GPT-4 and human evaluations largely agree on the rank of model performance in the tournaments, but we also find there are instances of strong disagreement. As such, we highlight that model-based evaluation while providing a cheap alternative to human-annotation also has its uncertainties.

We augment our chatbot benchmark results with a qualitative analysis of Guanaco models. Our analysis highlights success and failure cases that were not captured by the quantitative benchmarks.
我们对 Guanaco 模型进行了定性分析,从而增强了聊天机器人的基准结果。我们的分析突出强调了定量基准未涵盖的成功和失败案例。

We release all model generations with human and GPT-4 annotations to facilitate further study. We open-source our codebase and CUDA kernels and integrate our methods into the Hugging Face transformers stack [64], making them easily accessible to all. We release a collection of adapters for 7/13/33/65B size models, trained on 8 different instruction following datasets, for a total of 32 different open sourced, finetuned models.
我们发布的所有模型代都带有人类和 GPT-4 注释,以方便进一步研究。我们开源了我们的代码库和 CUDA 内核,并将我们的方法集成到 Hugging Face 转换器堆栈中[ 64],使所有人都能轻松访问。我们发布了 7/13/33/65B 大小模型的适配器集合,这些模型是在 8 个不同的指令数据集上训练的,总共有 32 种不同的开源微调模型。

Refer to caption
Figure 1: Different finetuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.
图 1:不同的微调方法及其内存需求。QLoRA 通过将变压器模型量化为 4 位精度,并使用分页优化器处理内存峰值,从而改进了 LoRA。

2 Background 2 背景

Block-wise k-bit Quantization
分块 k 位量化

Quantization is the process of discretizing an input from a representation that holds more information to a representation with less information. It often means taking a data type with more bits and converting it to fewer bits, for example from 32-bit floats to 8-bit Integers. To ensure that the entire range of the low-bit data type is used, the input data type is commonly rescaled into the target data type range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor. For example, quantizing a 32-bit Floating Point (FP32) tensor into a Int8 tensor with range [127,127]127127[-127,127]:
量化是将输入信息从包含较多信息的表示法离散到包含较少信息的表示法的过程。这通常意味着将位数较多的数据类型转换为位数较少的数据类型,例如从 32 位浮点数转换为 8 位整数。为确保使用低位数据类型的整个范围,输入数据类型通常通过输入元素的绝对最大值(通常为张量结构)进行归一化,从而重新缩放到目标数据类型的范围内。例如,将 32 位浮点 (FP32) 张量量化为范围为 [127,127]127127[-127,127] 的 Int8 张量:

𝐗Int8=round(127absmax(𝐗FP32)𝐗FP32)=round(cFP32𝐗FP32),superscript𝐗Int8round127absmaxsuperscript𝐗FP32superscript𝐗FP32roundsuperscript𝑐FP32superscript𝐗FP32\mathbf{X}^{\text{Int8}}=\text{round}\left(\frac{127}{{\text{absmax}}(\mathbf{X}^{{\text{FP32}}})}\mathbf{X}^{{\text{FP32}}}\right)=\text{round}(c^{\text{FP32}}\cdot\mathbf{X}^{{\text{FP32}}}), (1)

where c𝑐c is the quantization constant or quantization scale. Dequantization is the inverse:
其中 c𝑐c 是量化常数或量化标度。去量化则是反比:

dequant(cFP32,𝐗Int8)=𝐗Int8cFP32=𝐗FP32dequantsuperscript𝑐FP32superscript𝐗Int8superscript𝐗Int8superscript𝑐FP32superscript𝐗FP32\text{dequant}(c^{\text{FP32}},\mathbf{X}^{\text{Int8}})=\frac{\mathbf{X}^{\text{Int8}}}{c^{\text{FP32}}}=\mathbf{X}^{\text{FP32}} (2)

The problem with this approach is that if a large magnitude value (i.e., an outlier) occurs in the input tensor, then the quantization bins—certain bit combinations—are not utilized well with few or no numbers quantized in some bins. To prevent the outlier issue, a common approach is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant c𝑐c. This can be formalized as follows: We chunk the input tensor 𝐗b×h𝐗superscript𝑏\mathbf{X}\in\mathbb{R}^{b\times h} into n𝑛n contiguous blocks of size B𝐵B by flattening the input tensor and slicing the linear segment into n=(b×h)/B𝑛𝑏𝐵{n=({b\times h})/{B}} blocks. We quantize these blocks independently with Equation 1 to create a quantized tensor and n𝑛n quantization constants cisubscript𝑐𝑖c_{i}.
这种方法的问题在于,如果输入张量中出现一个较大的幅度值(即离群值),那么量化分区--某些比特组合--就不能很好地利用,某些分区中量化的数字很少或没有。为防止出现离群点问题,一种常见的方法是将输入张量分成独立量化的块,每个块都有自己的量化常数 c𝑐c 。其形式化过程如下:我们将输入张量 𝐗b×h𝐗superscript𝑏\mathbf{X}\in\mathbb{R}^{b\times h} 压缩成大小为 B𝐵Bn𝑛n 连续块,方法是将输入张量扁平化,并将线性段切成 n=(b×h)/B𝑛𝑏𝐵{n=({b\times h})/{B}} 块。我们使用公式 1 对这些块进行独立量化,以创建量化张量和 n𝑛n 量化常数 cisubscript𝑐𝑖c_{i}

Low-rank Adapters 低电平适配器

Low-rank Adapter (LoRA) finetuning [28] is a method that reduces memory requirements by using a small set of trainable parameters, often termed adapters, while not updating the full model parameters which remain fixed. Gradients during stochastic gradient descent are passed through the fixed pretrained model weights to the adapter, which is updated to optimize the loss function. LoRA augments a linear projection through an additional factorized projection. Given a projection 𝐗𝐖=𝐘𝐗𝐖𝐘{\mathbf{X}\mathbf{W}=\mathbf{Y}} with 𝐗b×h𝐗superscript𝑏\mathbf{X}\in\mathbb{R}^{b\times h}, 𝐖h×o𝐖superscript𝑜\mathbf{W}\in\mathbb{R}^{h\times o} LoRA computes:
低秩适配器(Low-rank Adapter,LoRA)微调[28]是一种通过使用一小部分可训练参数(通常称为适配器)来减少内存需求的方法,同时不更新保持固定的全部模型参数。随机梯度下降过程中的梯度通过固定的预训练模型权重传递给适配器,适配器进行更新以优化损失函数。LoRA 通过额外的因子化投影来增强线性投影。给定一个投影 𝐗𝐖=𝐘𝐗𝐖𝐘{\mathbf{X}\mathbf{W}=\mathbf{Y}}𝐗b×h𝐗superscript𝑏\mathbf{X}\in\mathbb{R}^{b\times h}𝐖h×o𝐖superscript𝑜\mathbf{W}\in\mathbb{R}^{h\times o} LoRA 会计算:

𝐘=𝐗𝐖+s𝐗𝐋1𝐋2,𝐘𝐗𝐖𝑠subscript𝐗𝐋1subscript𝐋2\mathbf{Y}=\mathbf{X}\mathbf{W}+s\mathbf{X}\mathbf{L}_{1}\mathbf{L}_{2}, (3)

where 𝐋1h×rsubscript𝐋1superscript𝑟\mathbf{L}_{1}\in\mathbb{R}^{h\times r} and 𝐋2r×osubscript𝐋2superscript𝑟𝑜\mathbf{L}_{2}\in\mathbb{R}^{r\times o}, and s𝑠s is a scalar.
其中 𝐋1h×rsubscript𝐋1superscript𝑟\mathbf{L}_{1}\in\mathbb{R}^{h\times r}𝐋2r×osubscript𝐋2superscript𝑟𝑜\mathbf{L}_{2}\in\mathbb{R}^{r\times o}s𝑠s 是一个标量。

Memory Requirement of Parameter-Efficient Finetuning

One important point of discussion is the memory requirement of LoRA during training both in terms of the number and size of adapters used. Since the memory footprint of LoRA is so minimal, we can use more adapters to improve performance without significantly increasing the total memory used. While LoRA was designed as a Parameter Efficient Finetuning (PEFT) method, most of the memory footprint for LLM finetuning comes from activation gradients and not from the learned LoRA parameters. For a 7B LLaMA model trained on FLAN v2 with a batch size of 1, with LoRA weights equivalent to commonly used 0.2% of the original model weights[28, 37], the LoRA input gradients have a memory footprint of 567 MB while the LoRA parameters take up only 26 MB. With gradient checkpointing [9], the input gradients reduce to an average of 18 MB per sequence making them more memory intensive than all LoRA weights combined. In comparison, the 4-bit base model consumes 5,048 MB of memory. This highlights that gradient checkpointing is important but also that aggressively reducing the amount of LoRA parameter yields only minor memory benefits. This means we can use more adapters without significantly increasing the overall training memory footprint (see Appendix G for a detailed breakdown). As discussed later, this is crucial for recovering full 16-bit precision performance.
一个重要的讨论点是 LoRA 在训练过程中对内存的需求,包括所使用适配器的数量和大小。由于 LoRA 的内存占用极小,我们可以使用更多适配器来提高性能,而不会显著增加总内存使用量。虽然 LoRA 被设计为参数高效微调 (PEFT) 方法,但 LLM 微调的大部分内存占用来自激活梯度,而非学习的 LoRA 参数。对于在 FLAN v2 上训练的批次大小为 1 的 7B LLaMA 模型,LoRA 权重相当于常用的原始模型权重的 0.2%[28,37],LoRA 输入梯度的内存占用为 567 MB,而 LoRA 参数仅占用 26 MB。利用梯度检查点技术[9],输入梯度平均减少到每个序列 18 MB,比所有 LoRA 权重的总和还要耗费内存。相比之下,4 位基本模型的内存消耗为 5,048 MB。这说明梯度检查点非常重要,但同时也说明大力减少 LoRA 参数量只会带来微小的内存优势。这意味着我们可以使用更多的适配器,而不会显著增加总体训练内存占用(详见附录 G)。正如后面所讨论的,这对于恢复全部 16 位精度性能至关重要。

3 QLoRA Finetuning 3 QLoRA 微调

QLoRA achieves high-fidelity 4-bit finetuning via two techniques we propose—4-bit NormalFloat (NF4) quantization and Double Quantization. Additionally, we introduce Paged Optimizers, to prevent memory spikes during gradient checkpointing from causing out-of-memory errors that have traditionally made finetuning on a single machine difficult for large models.
QLoRA 通过我们提出的两种技术--4 位 NormalFloat (NF4) 量化和双量化,实现了高保真 4 位微调。此外,我们还引入了分页优化器(Paged Optimizers),以防止梯度检查点过程中的内存峰值导致内存不足错误,而传统上,这种错误会导致难以在单机上对大型模型进行微调。

QLoRA has one low-precision storage data type, in our case usually 4-bit, and one computation data type that is usually BFloat16. In practice, this means whenever a QLoRA weight tensor is used, we dequantize the tensor to BFloat16, and then perform a matrix multiplication in 16-bit.
QLoRA 有一种低精度存储数据类型(在我们的情况下通常为 4 位)和一种计算数据类型(通常为 BFloat16)。实际上,这意味着只要使用 QLoRA 权重张量,我们就会将张量去量化为 BFloat16,然后以 16 位执行矩阵乘法。

We now discuss the components of QLoRA followed by a formal definition of QLoRA.
我们现在讨论 QLoRA 的组成部分,然后给出 QLoRA 的正式定义。

4-bit NormalFloat Quantization
4 位 NormalFloat 量化

The NormalFloat (NF) data type builds on Quantile Quantization [15] which is an information-theoretically optimal data type that ensures each quantization bin has an equal number of values assigned from the input tensor. Quantile quantization works by estimating the quantile of the input tensor through the empirical cumulative distribution function.
NormalFloat (NF) 数据类型建立在量化量化(Quantile Quantization)[15] 的基础上,它是一种信息论上最优的数据类型,可确保每个量化仓都有相同数量的输入张量值。定量量化的工作原理是通过经验累积分布函数估算输入张量的定量值。

The main limitation of quantile quantization is that the process of quantile estimation is expensive. Therefore fast quantile approximation algorithms, such as SRAM quantiles [15], are used to estimate them. Due to the approximate nature of these quantile estimation algorithms, the data type has large quantization errors for outliers, which are often the most important values.
量子量化的主要局限是量子估计过程耗资巨大。因此,人们使用快速量化近似算法(如 SRAM 量化算法[15])来估计它们。由于这些量化估计算法的近似性质,数据类型对异常值的量化误差较大,而异常值往往是最重要的值。

Expensive quantile estimates and approximation errors can be avoided when input tensors come from a distribution fixed up to a quantization constant. In such cases, input tensors have the same quantiles making exact quantile estimation computationally feasible.

Since pretrained neural network weights usually have a zero-centered normal distribution with standard deviation σ𝜎\sigma (see Appendix F), we can transform all weights to a single fixed distribution by scaling σ𝜎\sigma such that the distribution fits exactly into the range of our data type. For our data type, we set the arbitrary range [1,1]11[-1,1]. As such, both the quantiles for the data type and the neural network weights need to be normalized into this range.
由于预训练的神经网络权重通常具有标准偏差为 σ𝜎\sigma 的零心正态分布(参见附录 F),因此我们可以通过缩放 σ𝜎\sigma 将所有权重转换为单一固定分布,从而使该分布完全符合我们的数据类型范围。对于我们的数据类型,我们设置的任意范围是 [1,1]11[-1,1] 。因此,数据类型和神经网络权重的量化值都需要归一化到这个范围内。

The information theoretically optimal data type for zero-mean normal distributions with arbitrary standard deviations σ𝜎\sigma in the range [1,1]11[-1,1] is computed as follows: (1) estimate the 2k+1superscript2𝑘12^{k}+1 quantiles of a theoretical N(0,1)𝑁01N(0,1) distribution to obtain a k𝑘k-bit quantile quantization data type for normal distributions, (2) take this data type and normalize its values into the [1,1]11[-1,1] range, (3) quantize an input weight tensor by normalizing it into the [1,1]11[-1,1] range through absolute maximum rescaling.
对于在 [1,1]11[-1,1] 范围内具有任意标准偏差 σ𝜎\sigma 的零均值正态分布,其信息理论最佳数据类型的计算方法如下:(1) 估算理论 N(0,1)𝑁01N(0,1) 分布的 2k+1superscript2𝑘12^{k}+1 量化值,以获得正态分布的 k𝑘k 位量化数据类型,(2) 使用该数据类型并将其值归一化到 [1,1]11[-1,1] 范围内,(3) 通过绝对最大值重缩放将输入权重张量归一化到 [1,1]11[-1,1] 范围内,从而量化输入权重张量。

Once the weight range and data type range match, we can quantize as usual. Step (3) is equivalent to rescaling the standard deviation of the weight tensor to match the standard deviation of the k-bit data type. More formally, we estimate the 2ksuperscript2𝑘2^{k} values qisubscript𝑞𝑖q_{i} of the data type as follows:
一旦权重范围和数据类型范围相匹配,我们就可以像往常一样进行量化。步骤 (3) 相当于重新调整权重张量的标准偏差,使其与 k 位数据类型的标准偏差相匹配。更正式地说,我们对数据类型的 2ksuperscript2𝑘2^{k}qisubscript𝑞𝑖q_{i} 估算如下:

qi=12(QX(i2k+1)+QX(i+12k+1)),subscript𝑞𝑖12subscript𝑄𝑋𝑖superscript2𝑘1subscript𝑄𝑋𝑖1superscript2𝑘1q_{i}=\frac{1}{2}\left(Q_{X}\left(\frac{i}{2^{k}+1}\right)+Q_{X}\left(\frac{i+1}{2^{k}+1}\right)\right), (4)

where QX()subscript𝑄𝑋Q_{X}(\cdot) is the quantile function of the standard normal distribution N(0,1)𝑁01N(0,1). A problem for a symmetric k-bit quantization is that this approach does not have an exact representation of zero, which is an important property to quantize padding and other zero-valued elements with no error. To ensure a discrete zeropoint of 00 and to use all 2ksuperscript2𝑘2^{k} bits for a k-bit datatype, we create an asymmetric data type by estimating the quantiles qisubscript𝑞𝑖q_{i} of two ranges qisubscript𝑞𝑖q_{i}: 2k1superscript2𝑘12^{k-1} for the negative part and 2k1+1superscript2𝑘112^{k-1}+1 for the positive part and then we unify these sets of qisubscript𝑞𝑖q_{i} and remove one of the two zeros that occurs in both sets. We term the resulting data type that has equal expected number of values in each quantization bin k-bit NormalFloat (NFk), since the data type is information-theoretically optimal for zero-centered normally distributed data. The exact values of this data type can be found in Appendix E.
其中 QX()subscript𝑄𝑋Q_{X}(\cdot) 是标准正态分布 N(0,1)𝑁01N(0,1) 的量化函数。对称 k 位量化的一个问题是,这种方法无法精确表示零,而零是对填充和其他零值元素进行无差错量化的一个重要属性。为了确保 00 的离散零点,并使用 k 位数据类型的所有 2ksuperscript2𝑘2^{k} 位,我们通过估算两个范围 qisubscript𝑞𝑖q_{i} 的量化值 qisubscript𝑞𝑖q_{i} :负数部分为 2k1superscript2𝑘12^{k-1} ,正数部分为 2k1+1superscript2𝑘112^{k-1}+1 ,创建了一种非对称数据类型。我们将由此产生的数据类型称为 k 位 NormalFloat (NFk),因为这种数据类型对于零中心正态分布数据来说是信息论上的最优数据类型。该数据类型的精确值见附录 E。

Double Quantization 双重量化

We introduce Double Quantization (DQ), the process of quantizing the quantization constants for additional memory savings. While a small blocksize is required for precise 4-bit quantization [13], it also has a considerable memory overhead. For example, using 32-bit constants and a blocksize of 64 for 𝐖𝐖\mathbf{W}, quantization constants add 32/64=0.532640.532/64=0.5 bits per parameter on average. Double Quantization helps reduce the memory footprint of quantization constants.
我们引入了双量化(DQ)技术,即量化量化常数以节省更多内存。虽然精确的 4 位量化需要较小的块大小[13],但它也有相当大的内存开销。例如, 𝐖𝐖\mathbf{W} 使用 32 位常数,块大小为 64,量化常数平均会为每个参数增加 32/64=0.532640.532/64=0.5 位。双重量化有助于减少量化常数的内存占用。

More specifically, Double Quantization treats quantization constants c2FP32superscriptsubscript𝑐2FP32c_{2}^{\text{FP32}} of the first quantization as inputs to a second quantization. This second step yields the quantized quantization constants c2FP8superscriptsubscript𝑐2FP8c_{2}^{\text{FP8}} and the second level of quantization constants c1FP32superscriptsubscript𝑐1FP32c_{1}^{\text{FP32}}. We use 8-bit Floats with a blocksize of 256 for the second quantization as no performance degradation is observed for 8-bit quantization, in line with results from Dettmers and Zettlemoyer [13]. Since the c2FP32superscriptsubscript𝑐2FP32c_{2}^{\text{FP32}} are positive, we subtract the mean from c2subscript𝑐2c_{2} before quantization to center the values around zero and make use of symmetric quantization. On average, for a blocksize of 64, this quantization reduces the memory footprint per parameter from 32/64=0.532640.532/64=0.5 bits, to 8/64+32/(64256)=0.12786432642560.1278/64+32/(64\cdot 256)=0.127 bits, a reduction of 0.373 bits per parameter.
更具体地说,双量化将第一次量化的量化常数 c2FP32superscriptsubscript𝑐2FP32c_{2}^{\text{FP32}} 作为第二次量化的输入。第二步产生量化后的量化常数 c2FP8superscriptsubscript𝑐2FP8c_{2}^{\text{FP8}} 和第二级量化常数 c1FP32superscriptsubscript𝑐1FP32c_{1}^{\text{FP32}} 。我们使用块大小为 256 的 8 位浮点运算来进行第二次量化,因为 8 位量化不会导致性能下降,这与 Dettmers 和 Zettlemoyer [ 13] 的研究结果一致。由于 c2FP32superscriptsubscript𝑐2FP32c_{2}^{\text{FP32}} 为正值,因此我们在量化前从 c2subscript𝑐2c_{2} 中减去平均值,将值集中在零点附近,并使用对称量化。平均而言,在块大小为 64 的情况下,这种量化将每个参数的内存占用从 32/64=0.532640.532/64=0.5 位减少到 8/64+32/(64256)=0.12786432642560.1278/64+32/(64\cdot 256)=0.127 位,每个参数减少了 0.373 位。

Paged Optimizers  分页优化器

use the NVIDIA unified memory 222https://docs.nvidia.com/cuda/cuda-c-programming-guide
使用 NVIDIA 统一内存 2
feature wich does automatic page-to-page transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM and the disk. We use this feature to allocate paged memory for the optimizer states which are then automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU memory when the memory is needed in the optimizer update step.
该功能可在 CPU 和 GPU 之间自动进行页到页传输,以便在 GPU 偶尔内存不足的情况下进行无差错 GPU 处理。该功能与 CPU RAM 和磁盘之间的常规内存分页功能类似。我们使用该功能为优化器状态分配分页内存,当 GPU 运行到内存不足时,这些内存会被自动驱逐到 CPU RAM,而在优化器更新步骤中需要内存时,这些内存又会被重新分页到 GPU 内存中。


Using the components described above, we define QLoRA for a single linear layer in the quantized base model with a single LoRA adapter as follows:
利用上述组件,我们对量化基础模型中的单个线性层和单个 LoRA 适配器的 QLoRA 定义如下:

𝐘BF16=𝐗BF16doubleDequant(c1FP32,c2k-bit,𝐖NF4)+𝐗BF16𝐋1BF16𝐋2BF16,superscript𝐘BF16superscript𝐗BF16doubleDequantsuperscriptsubscript𝑐1FP32superscriptsubscript𝑐2k-bitsuperscript𝐖NF4superscript𝐗BF16subscriptsuperscript𝐋BF161subscriptsuperscript𝐋BF162\mathbf{Y}^{\text{BF16}}=\mathbf{X}^{\text{BF16}}\text{doubleDequant}(c_{1}^{\text{FP32}},c_{2}^{\text{k-bit}},\mathbf{W}^{\text{NF4}})+\mathbf{X}^{\text{BF16}}\mathbf{L}^{\text{BF16}}_{1}\mathbf{L}^{\text{BF16}}_{2}, (5)

where doubleDequant()(\cdot) is defined as:
其中 doubleDequant ()(\cdot) 的定义为

doubleDequant(c1FP32,c2k-bit,𝐖k-bit)=dequant(dequant(c1FP32,c2k-bit),𝐖4bit)=𝐖BF16,doubleDequantsuperscriptsubscript𝑐1FP32superscriptsubscript𝑐2k-bitsuperscript𝐖k-bitdequantdequantsuperscriptsubscript𝑐1FP32superscriptsubscript𝑐2k-bitsuperscript𝐖4bitsuperscript𝐖BF16\text{doubleDequant}(c_{1}^{\text{FP32}},c_{2}^{\text{k-bit}},\mathbf{W}^{\text{k-bit}})=\text{dequant}(\text{dequant}(c_{1}^{\text{FP32}},c_{2}^{\text{k-bit}}),\mathbf{W}^{\text{4bit}})=\mathbf{W}^{\text{BF16}}, (6)

We use NF4 for 𝐖𝐖\mathbf{W} and FP8 for c2subscript𝑐2c_{2}. We use a blocksize of 64 for 𝐖𝐖\mathbf{W} for higher quantization precision and a blocksize of 256 for c2subscript𝑐2c_{2} to conserve memory.
我们对 𝐖𝐖\mathbf{W} 使用 NF4,对 c2subscript𝑐2c_{2} 使用 FP8。为了提高量化精度,我们对 𝐖𝐖\mathbf{W} 使用 64 的块大小,对 c2subscript𝑐2c_{2} 使用 256 的块大小,以节省内存。

For parameter updates only the gradient with respect to the error for the adapters weights E𝐋i𝐸subscript𝐋𝑖\frac{\partial E}{\partial\mathbf{L}_{i}} are needed, and not for 4-bit weights E𝐖𝐸𝐖\frac{\partial E}{\partial\mathbf{W}}. However, the calculation of E𝐋i𝐸subscript𝐋𝑖\frac{\partial E}{\partial\mathbf{L}_{i}} entails the calculation of 𝐗𝐖𝐗𝐖\frac{\partial\mathbf{X}}{\partial\mathbf{W}} which proceeds via equation (5) with dequantization from storage 𝐖NF4superscript𝐖NF4\mathbf{W}^{\text{NF4}} to computation data type 𝐖BF16superscript𝐖BF16\mathbf{W}^{\text{BF16}} to calculate the derivative 𝐗𝐖𝐗𝐖\frac{\partial\mathbf{X}}{\partial\mathbf{W}} in BFloat16 precision.
对于参数更新,只需要适配器权重 E𝐋i𝐸subscript𝐋𝑖\frac{\partial E}{\partial\mathbf{L}_{i}} 相对于误差的梯度,而不需要 4 位权重 E𝐖𝐸𝐖\frac{\partial E}{\partial\mathbf{W}} 的梯度。 但是,计算 E𝐋i𝐸subscript𝐋𝑖\frac{\partial E}{\partial\mathbf{L}_{i}} 需要计算 𝐗𝐖𝐗𝐖\frac{\partial\mathbf{X}}{\partial\mathbf{W}} ,而 𝐗𝐖𝐗𝐖\frac{\partial\mathbf{X}}{\partial\mathbf{W}} 通过公式 (5) 从存储 𝐖NF4superscript𝐖NF4\mathbf{W}^{\text{NF4}} 去量化到计算数据类型 𝐖BF16superscript𝐖BF16\mathbf{W}^{\text{BF16}} ,以计算 BFloat16 精度的导数 𝐗𝐖𝐗𝐖\frac{\partial\mathbf{X}}{\partial\mathbf{W}}

To summarize, QLoRA has one storage data type (usually 4-bit NormalFloat) and a computation data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type to perform the forward and backward pass, but we only compute weight gradients for the LoRA parameters which use 16-bit BrainFloat.
总而言之,QLoRA 有一种存储数据类型(通常为 4 位 NormalFloat)和一种计算数据类型(16 位 BrainFloat)。我们将存储数据类型去量化为计算数据类型,以执行前向和后向传递,但我们只计算使用 16 位 BrainFloat 的 LoRA 参数的权重梯度。

4 QLoRA vs. Standard Finetuning
4QLoRA 与标准微调对比

We have discussed how QLoRA works and how it can significantly reduce the required memory for finetuning models. The main question now is whether QLoRA can perform as well as full-model finetuning. Furthermore, we want to analyze the components of QLoRA including the impact of NormalFloat4 over standard Float4. The following sections will discuss the experiments that aimed at answering these questions.
我们已经讨论了 QLoRA 的工作原理,以及它如何大幅减少模型微调所需的内存。现在的主要问题是 QLoRA 是否能像全模型微调一样出色。此外,我们还想分析 QLoRA 的各个组成部分,包括 NormalFloat4 对标准 Float4 的影响。下文将讨论旨在回答这些问题的实验。

Experimental setup. 实验装置

We consider three architectures (encoder, encoder-decoder, and decoder only) and compare QLoRA with 16-bit adapter-finetuning and with full-finetuning for models up to 3B. Our evaluations include GLUE [58] with RoBERTa-large [38], Super-NaturalInstructions (TKInstruct) [61] with T5 [49], and 5-shot MMLU [24] after finetuning LLaMA on Flan v2 [39] and Alpaca [55]. To additionally study the advantages of NF4 over other 4-bit data types, we use the setup of Dettmers and Zettlemoyer [13] and measure post-quantization zero-shot accuracy and perplexity across different models (OPT [72], LLaMA [57], BLOOM [52], Pythia [7]) for model sizes 125m - 13B. We provide more details in the results section for each particular setup to make the results more readable. Full details in Appendix A.
我们考虑了三种架构(编码器、编码器-解码器和仅解码器),并比较了 QLoRA 与 16 位适配器-微调和全微调对高达 3B 的模型的影响。我们的评估包括使用 RoBERTa-large [ 38] 的 GLUE [ 58]、使用 T5 [ 49] 的 Super-NaturalInstructions (TKInstruct) [ 61],以及在 Flan v2 [ 39] 和 Alpaca [ 55] 上对 LLaMA 进行微调后的 5shot MMLU [ 24]。为了进一步研究 NF4 相对于其他 4 位数据类型的优势,我们使用了 Dettmers 和 Zettlemoyer [ 13] 的设置,并测量了不同模型(OPT [ 72]、LLaMA [ 57]、BLOOM [ 52]、Pythia [ 7])在模型大小 125m - 13B 之间的量化后零点准确率和困惑度。我们在结果部分提供了每个特定设置的更多细节,以使结果更具可读性。全部细节见附录 A。

Refer to caption
Figure 2: RougeL for LLaMA 7B models on the Alpaca dataset. Each point represents a run with a different random seed. We improve on the Stanford Alpaca fully finetuned default hyperparameters to construct a strong 16-bit baseline for comparisons. Using LoRA on all transformer layers is critical to match 16-bit performance.
图 2:LLaMA 7B 模型在羊驼数据集上的 RougeL。每个点代表一次使用不同随机种子的运行。我们在斯坦福 Alpaca 完全微调默认超参数的基础上进行了改进,以构建一个强大的 16 位基线进行比较。在所有 transformer 层上使用 LoRA 对于匹配 16 位性能至关重要。

While paged optimizers are critical to do 33B/65B QLoRA tuning on a single 24/48GB GPU, we do not provide hard measurements for Paged Optimizers since the paging only occurs when processing mini-batches with long sequence lengths, which is rare. We do, however, perform an analysis of the runtime of paged optimizers for 65B models on 48GB GPUs and find that with a batch size of 16, paged optimizers provide the same training speed as regular optimizers. Future work should measure and characterize under what circumstances slow-downs occur from the paging process.
虽然分页优化器对于在单个 24/48GB GPU 上进行 33B/65B QLoRA 调整至关重要,但我们并没有提供分页优化器的硬性测量结果,因为分页只会在处理序列长度较长的迷你批次时发生,而这种情况并不多见。不过,我们对分页优化器在 48GB GPU 上处理 65B 模型的运行时间进行了分析,发现在批量大小为 16 的情况下,分页优化器的训练速度与普通优化器相同。未来的工作应该测量并描述分页过程在什么情况下会导致速度减慢。

Refer to caption
Figure 3: Mean zero-shot accuracy over Winogrande, HellaSwag, PiQA, Arc-Easy, and Arc-Challenge using LLaMA models with different 4-bit data types. The NormalFloat data type significantly improves the bit-for-bit accuracy gains compared to regular 4-bit Floats. While Double Quantization (DQ) only leads to minor gains, it allows for a more fine-grained control over the memory footprint to fit models of certain size (33B/65B) into certain GPUs (24/48GB).
图 3:使用不同 4 位数据类型的 LLaMA 模型,Winogrande、HellaSwag、PiQA、Arc-Easy 和 Arc-Challenge 的平均零点精度。与普通 4 位浮点数据相比,NormalFloat 数据类型显著提高了位对位精度。虽然双量化(DQ)只能带来微小的收益,但它允许对内存占用进行更精细的控制,以便将特定大小(33B/65B)的模型安装到特定的 GPU(24/48GB)中。

Default LoRA hyperparameters do not match 16-bit performance
默认 LoRA 超参数与 16 位性能不匹配

When using the standard practice of applying LoRA to query and value attention projection matrices [28], we are not able to replicate full finetuning performance for large base models. As shown in Figure 2 for LLaMA 7B finetuning on Alpaca, we find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers are required to match full finetuning performance. Other LoRA hyperparameters, such as the projection dimension r𝑟r, do not affect performance (see Appendix A).
当使用将 LoRA 应用于查询和值注意投影矩阵的标准做法时[28],我们无法复制大型基础模型的完全微调性能。如图 2 所示,在 Alpaca 上进行 LLaMA 7B 微调时,我们发现最关键的 LoRA 超参数是总共使用了多少个 LoRA 适配器,而且需要在所有线性 transformer 块层上使用 LoRA 才能达到完全微调的性能。其他 LoRA 超参数,如投影维度 r𝑟r ,不会影响性能(见附录 A)。

Similarly, we find that default hyperparameters for fully finetuned baselines are undertuned. We do a hyperparameter search over learning rates 1e-6 to 5e-5 and batch sizes 8 to 128 to find robust baselines. Results for 7B LLaMA finetuning on Alpaca are shown in Figure 2.
同样,我们发现完全微调基线的默认超参数也是欠调整的。我们在学习率 1e-6 到 5e-5 和批量大小 8 到 128 之间进行超参数搜索,以找到稳健的基线。图 2 显示了在 Alpaca 上对 7B LLaMA 进行微调的结果。

4-bit NormalFloat yields better performance than 4-bit Floating Point
4 位 NormalFloat 比 4 位浮点运算性能更好

Table 2: Pile Common Crawl mean perplexity for different data types for 125M to 13B OPT, BLOOM, LLaMA, and Pythia models.
表 2:针对 125M 至 13B OPT、BLOOM、LLaMA 和 Pythia 模型的不同数据类型,桩式普通爬行的平均复杂度。
Data type 数据类型 Mean PPL PPL 平均值
Int4 34.34
Float4 (E2M1) 浮点数4 (E2M1) 31.07
Float4 (E3M0) 浮点数4(E3M0) 29.48
NFloat4 + DQ 27.41

While the 4-bit NormalFloat (NF4) data type is information-theoretically optimal, it still needs to be determined if this property translates to empirical advantages. We follow the setup from Dettmers and Zettlemoyer [13] where quantized LLMs (OPT [72], BLOOM [52], Pythia [7], LLaMA) of different sizes (125M to 65B) with different data types are evaluated on language modeling and a set of zero-shot tasks. In Figure 3 and Table 2 we see that NF4 improves performance significantly over FP4 and Int4 and that double quantization reduces the memory footprint without degrading performance.
虽然 4 位 NormalFloat (NF4) 数据类型在信息理论上是最优的,但仍需确定这一特性是否能转化为经验优势。我们沿用了 Dettmers 和 Zettlemoyer [ 13] 的设置,在语言建模和一组零点任务中对不同数据类型的不同大小(125M 到 65B)的量化 LLMs (OPT [ 72]、BLOOM [ 52]、Pythia [ 7]、LLaMA)进行了评估。在图 3 和表 2 中,我们可以看到 NF4 比 FP4 和 Int4 明显提高了性能,而且双重量化减少了内存占用而没有降低性能。

k-bit QLoRA matches 16-bit full finetuning and 16-bit LoRA performance
k 位 QLoRA 与 16 位完全微调和 16 位 LoRA 性能相匹配

Recent findings have established that 4-bit quantization for inference is possible, but leads to performance degradation relative to 16-bit [13, 18]. This raises the crucial question of whether the lost performance can be recovered by conducting 4-bit adapter finetuning. We test this for two setups.
最近的研究结果表明,4 位量化推理是可行的,但相对于 16 位会导致性能下降[13, 18]。这就提出了一个关键问题,即通过对 4 位适配器进行微调,是否可以恢复所损失的性能。我们在两种情况下进行了测试。

The first focuses on a comparison with full 16-bit finetuning of RoBERTA and T5 models sized 125M to 3B parameters on GLUE and the Super-NaturalInstructions dataset. Results are shown in Table 3. In both datasets, we observe that 16-bit, 8-bit, and 4-bit adapter methods replicate the performance of the fully finetuned 16-bit baseline. This suggests that the performance lost due to the imprecise quantization can be fully recovered through adapter finetuning after quantization.
第一项重点是在 GLUE 和 Super-NaturalInstructions 数据集上对 RoBERTA 和 T5 模型进行全 16 位微调(参数从 125M 到 3B)的比较。结果如表 3 所示。在这两个数据集中,我们观察到 16 位、8 位和 4 位适配器方法复制了完全微调的 16 位基线的性能。这表明,通过量化后的适配器微调,可以完全恢复因不精确量化而损失的性能。

For our second setup, since full finetuning models at and beyond 11B parameters requires more than one server of high memory GPUs, we continue to test whether 4-bit QLoRA can match 16-bit LoRA at the 7B to 65B parameter scales. To this end, we finetune LLaMA 7B through 65B on two instruction following datasets, Alpaca and FLAN v2, and evaluate on the MMLU benchmark via 5-shot accuracy. Results are shown in Table 4 where we see that NF4 with double quantization fully recovers the 16-bit LoRA MMLU performance. In addition, we also note that QLoRA with FP4 lags behind the 16-bit brain float LoRA baseline by about 1 percentage point. This corroborates both our findings that (1) QLoRA with NF4 replicates both 16-bit full finetuning and 16-bit LoRA finetuning performance, and (2) NF4 is superior to FP4 in terms of quantization precision.
在第二个设置中,由于对 11B 及以上参数的模型进行全面微调需要一台以上的大内存 GPU 服务器,因此我们继续测试 4 位 QLoRA 是否能在 7B 至 65B 参数范围内与 16 位 LoRA 相匹敌。为此,我们在两个指令数据集(Alpaca 和 FLAN v2)上对 LLaMA 7B 至 65B 进行了微调,并通过 5 次精度在 MMLU 基准上进行了评估。结果如表 4 所示,采用双量化的 NF4 完全恢复了 16 位 LoRA MMLU 的性能。此外,我们还注意到,采用 FP4 的 QLoRA 比 16 位脑浮点 LoRA 基线落后约 1 个百分点。这印证了我们的两个发现:(1) 采用 NF4 的 QLoRA 复制了 16 位完全微调和 16 位 LoRA 微调性能;(2) 就量化精度而言,NF4 优于 FP4。

Table 3: Experiments comparing 16-bit BrainFloat (BF16), 8-bit Integer (Int8), 4-bit Float (FP4), and 4-bit NormalFloat (NF4) on GLUE and Super-NaturalInstructions. QLoRA replicates 16-bit LoRA and full-finetuning.
表 3:在 GLUE 和 Super-NaturalInstructions 上比较 16 位 BrainFloat (BF16)、8 位 Integer (Int8)、4 位 Float (FP4) 和 4 位 NormalFloat (NF4) 的实验。QLoRA 复制了 16 位 LoRA 和全微调。
Dataset GLUE (Acc.) 胶水( Acc.) Super-NaturalInstructions (RougeL)
超级自然指令 (RougeL)
Model RoBERTa-large T5-80M T5-250M T5-780M T5-3B T5-11B
BF16 88.6 40.1 42.1 48.0 54.3 62.0
BF16 replication BF16 复制 88.6 40.0 42.2 47.3 54.9 -
LoRA BF16 88.8 40.5 42.6 47.1 55.4 60.7
QLoRA Int8 88.8 40.4 42.9 45.4 56.5 60.7
QLoRA FP4 88.6 40.3 42.4 47.5 55.6 60.9
QLoRA NF4 + DQ - 40.4 42.7 47.7 55.3 60.9

Summary 摘要

Our results consistently show that 4-bit QLoRA with NF4 data type matches 16-bit full finetuning and 16-bit LoRA finetuning performance on academic benchmarks with well-established evaluation setups. We have also shown that NF4 is more effective than FP4 and that double quantization does not degrade performance. Combined, this forms compelling evidence that 4-bit QLoRA tuning reliably yields results matching 16-bit methods.
我们的研究结果一致表明,采用 NF4 数据类型的 4 位 QLoRA 与 16 位完全微调和 16 位 LoRA 微调性能相匹配。我们还证明,NF4 比 FP4 更有效,而且双重量化不会降低性能。综上所述,这构成了令人信服的证据,证明 4 位 QLoRA 微调能可靠地产生与 16 位方法相匹配的结果。

In line with previous work on quantization [13], our MMLU and Elo results indicate that with a given finetuning and inference resource budget it is beneficial to increase the number of parameters in the base model while decreasing their precision. This highlights the importance of efficiency benefits from QLoRA. Since we did not observe performance degradation compared to full-finetuning in our experiments with 4-bit finetuning, this raises the question of where the performance-precision trade-off exactly lies for QLoRA tuning, which we leave to future work to explore.
与之前的量化工作[13]一致,我们的 MMLU 和 Elo 结果表明,在给定的微调和推理资源预算下,增加基础模型中的参数数量同时降低参数精度是有益的。这凸显了 QLoRA 效率优势的重要性。由于我们在 4 位微调实验中没有观察到与全微调相比的性能下降,这就提出了 QLoRA 微调的性能-精度权衡究竟在哪里的问题,我们将在未来的工作中对此进行探索。

We proceed to investigate instruction tuning at scales that would be impossible to explore with full 16-bit finetuning on academic research hardware.
我们将继续研究指令调整的规模,这是在学术研究硬件上进行完全 16 位微调所无法探索的。

Table 4: Mean 5-shot MMLU test accuracy for LLaMA 7-65B models finetuned with adapters on Alpaca and FLAN v2 for different data types. Overall, NF4 with double quantization (DQ) matches BFloat16 performance, while FP4 is consistently one percentage point behind both.
表 4:在 Alpaca 和 FLAN v2 上使用适配器微调的 LLaMA 7-65B 模型针对不同数据类型的平均 5 次 MMLU 测试精度。总体而言,采用双量化 (DQ) 技术的 NF4 与 BFloat16 的性能相当,而 FP4 始终比两者落后一个百分点。
Mean 5-shot MMLU Accuracy
平均 5 发 MMLU 准确率
LLaMA Size LLaMA 尺寸 7B 13B 33B 65B Mean
Dataset Alpaca FLAN v2 Alpaca FLAN v2 Alpaca FLAN v2 Alpaca FLAN v2
BFloat16 38.4 45.6 47.2 50.6 57.7 60.5 61.8 62.5 53.0
Float4 37.2 44.0 47.3 50.0 55.9 58.5 61.3 63.3 52.2
NFloat4 + DQ 39.0 44.5 47.5 50.7 57.3 59.2 61.8 63.9 53.1

5 Pushing the Chatbot State-of-the-art with QLoRA
5 利用 QLoRA 推动聊天机器人的最新发展

Having established that 4-bit QLoRA matches 16-bit performance across scales, tasks, and datasets we conduct an in-depth study of instruction finetuning up to the largest open-source language models available for research. To assess the performance of instruction finetuning these models, we evaluate on a challenging Natural Language Understanding benchmark (MMLU) and develop new methods for real-world chatbot performance evaluation.
在确定了 4 位 QLoRA 在各种规模、任务和数据集上都能与 16 位性能相匹配之后,我们深入研究了对最大的开源语言模型进行微调的教学方法。为了评估对这些模型进行指令微调的性能,我们在具有挑战性的自然语言理解基准(MMLU)上进行了评估,并开发了用于真实世界聊天机器人性能评估的新方法。

5.1 Experimental setup 5.1 实验设置

We now describe an overview of the experimental setup with full details in Appendix B.
下面我们将介绍实验装置的概况,详细情况见附录 B。

Data 数据

As, to our knowledge, there is no comprehensive study of recent instruction-following datasets, we select eight recent datasets. We include datasets obtained through crowd-sourcing (OASST1 [31], HH-RLHF [4]), distillation from instruction-tuned models (Alpaca [55], self-instruct [59], unnatural-instructions [26]), corpora aggregations (FLAN v2 [12]), as well as hybrids (Chip2 [32], Longform [30]). These datasets cover different languages, data sizes, and licenses.
据我们所知,目前还没有对最近的指令跟随数据集进行全面研究,因此我们选择了八个最近的数据集。其中包括通过众包(OASST1 [ 31]、HH-RLHF [ 4])、指令调整模型(Alpaca [ 55]、self-instruct [ 59]、unnatural-instructions [ 26])、语料库聚合(FLAN v2 [ 12])以及混合(Chip2 [ 32]、Longform [ 30])获得的数据集。这些数据集涵盖不同的语言、数据大小和许可证。

Training Setup 培训设置

To avoid confounding effects from different training objectives, we perform QLoRA finetuning with cross-entropy loss (supervised learning) without reinforcement learning, even for datasets that include human judgments of different responses. For datasets that have a clear distinction between instruction and response, we finetune only on the response (see ablations in Appendix B). For OASST1 and HH-RLHF, multiple responses are available. We then select the top response at every level of the conversation tree and finetune on the full selected conversation, including the instructions. In all of our experiments, we use NF4 QLoRA with double quantization and paged optimizers to prevent memory spikes during gradient checkpointing. We do small hyperparameter searches for the 13B and 33B LLaMA models and we find that all hyperparameter settings found at 7B generalize (including number of epochs) except learning rate and batch size. We halve the learning rate for 33B and 65B while doubling the batch size.
为了避免不同训练目标带来的混淆效应,我们使用交叉熵损失(监督学习)进行 QLoRA 微调,而不使用强化学习,即使对于包含人类对不同反应的判断的数据集也是如此。对于指令和反应有明显区别的数据集,我们只对反应进行微调(见附录 B 中的消融)。对于 OASST1 和 HH-RLHF,我们提供了多种反应。然后,我们在对话树的每一层选择最重要的回应,并对所选的全部对话(包括指令)进行微调。在所有实验中,我们都使用了带有双重量化和分页优化器的 NF4 QLoRA,以防止梯度检查点过程中出现内存峰值。我们对 13B 和 33B LLaMA 模型进行了小型超参数搜索,发现除了学习率和批量大小外,在 7B 模型中发现的所有超参数设置都具有通用性(包括历时次数)。我们将 33B 和 65B 的学习率减半,同时将批量大小加倍。

Baselines 基线

We compare our models to both research (Vicuna [10] and Open Assistant [31]) and commercial (GPT-4 [42], GPT-3.5-turbo and Bard) chatbot systems. The Open Assistant model is a LLaMA 33B model finetuned with Reinforcement Learning from Human Feedback (RLHF) on the same OASST1 dataset that we experiment with. Vicuna does full fine-tuning of LLaMA 13B on proprietary user-shared conversations from ShareGPT and is thus the result of distillation from OpenAI GPT models.
我们将我们的模型与研究型(Vicuna [ 10] 和 Open Assistant [ 31])和商业型(GPT-4 [ 42]、GPT-3.5-turbo 和 Bard)聊天机器人系统进行了比较。开放助手 "模型是一个 LLaMA 33B 模型,在与我们实验相同的 OASST1 数据集上通过 "人类反馈强化学习"(RLHF)进行了微调。Vicuna 在 ShareGPT 的专有用户共享对话上对 LLaMA 13B 进行了全面微调,因此是从 OpenAI GPT 模型中提炼出来的。

5.2 Evaluation 5.2 评估

Table 5: MMLU 5-shot test results for different sizes of LLaMA finetuned on the corresponding datasets using QLoRA.
表 5:使用 QLoRA 在相应数据集上对不同大小的 LLaMA 进行微调后的 MMLU 5 次测试结果。
Dataset 7B 13B 33B 65B
LLaMA no tuning LLaMA 无调整 35.1 46.9 57.8 63.4
Self-Instruct 36.4 33.3 53.0 56.7
Longform 32.1 43.2 56.6 59.7
Chip2 34.5 41.6 53.6 59.8
HH-RLHF 34.9 44.6 55.8 60.1
Unnatural Instruct 非自然教学 41.9 48.1 57.3 61.3
Guanaco (OASST1) 瓜纳科(OASST1) 36.6 46.4 57.0 62.2
Alpaca 38.8 47.8 57.3 62.5
FLAN v2 44.5 51.4 59.2 63.9

Following common practice, we use the MMLU (Massively Multitask Language Understanding) benchmark [24] to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more. We report 5-shot test accuracy.
按照惯例,我们使用 MMLU(大规模多任务语言理解)基准[24]来衡量一系列语言理解任务的性能。这是一个多项选择基准,涵盖 57 项任务,包括初等数学、美国历史、计算机科学、法律等。我们报告了 5 次测试的准确性。

We also test generative language capabilities through both automated and human evaluations. This second set of evaluations relies on queries curated by humans and aims at measuring the quality of model responses. While this is a more realistic testbed for chatbot model performance and is growing in popularity, there is no commonly accepted protocol in the literature. We describe below our proposed setup, using nucleus sampling with p=0.9𝑝0.9p=0.9 and temperature in all cases.
我们还通过自动和人工评估来测试生成语言的能力。第二组评估依赖于人类策划的查询,旨在衡量模型响应的质量。虽然这对聊天机器人模型性能来说是更现实的测试平台,而且越来越受欢迎,但文献中并没有公认的协议。下面我们将介绍我们建议的设置,即在所有情况下使用 p=0.9𝑝0.9p=0.9 和温度 的核采样。

Benchmark Data 基准数据

We evaluate on two curated datasets of queries (questions): the Vicuna prompts [10] and the OASST1 validation dataset [31]. We use the Vicuna prompts, a set of 80 prompts from a diverse set of categories, without modifications. The OASST1 dataset is a multilingual collection of crowd-sourced multiturn dialogs between a user and an assistant. We select all user messages in the validation dataset as queries and include previous turns in the prompt. This procedure leads to 953 unique user queries. We term these two datasets the Vicuna and OA benchmarks.
我们在两个经过策划的查询(问题)数据集上进行了评估:Vicuna 提示[10] 和 OASST1 验证数据集[31]。我们使用的是 Vicuna 提示集,它由 80 个不同类别的提示组成,未作任何修改。OASST1 数据集是用户与助手之间的多语言众包多轮对话集。我们选择验证数据集中的所有用户信息作为查询,并在提示中包含之前的回合。这样就得到了 953 个独特的用户查询。我们将这两个数据集称为 Vicuna 和 OA 基准。

Automated Evaluation 自动评估

First, based on the evaluation protocol introduced by Chiang et al. [10], we use GPT-4 to rate the performance of different systems against ChatGPT (GPT-3.5 Turbo) on the Vicuna benchmark. Given a query along with ChatGPT’s and a model’s responses, GPT-4 is prompted to assign a score out of ten to both responses and provide an explanation. The overall performance of a model is calculated as a percentage of the score that ChatGPT achieved. Note this relative score can be higher than 100% if the model achieves a higher absolute score than ChatGPT. We find a significant ordering effect with GPT-4 increasing the score of the response occurring earlier in the prompt. To control for such effects, we recommend reporting the mean score over both orders.
首先,根据 Chiang 等人提出的评估协议[ 10],我们使用 GPT-4 在 Vicuna 基准上对 ChatGPT(GPT-3.5 Turbo)进行不同系统的性能评估。给定一个查询以及 ChatGPT 和一个模型的响应,GPT-4 会被提示给这两个响应打分(满分 10 分)并给出解释。模型的总体性能是以 ChatGPT 所得分的百分比来计算的。请注意,如果模型的绝对得分高于 ChatGPT,则相对得分可以高于 100%。我们发现 GPT-4 有明显的排序效应,它能提高提示中较早出现的回答的得分。为了控制这种效应,我们建议报告两个顺序的平均得分。

Next, we measure performance through direct comparisons between system outputs. We simplify the rating scheme to a three-class labeling problem that accounts for ties. We prompt GPT-4 to pick the best response or declare a tie and provide an explanation. We conduct these head-to-head comparisons on all permutations of pairs of systems on both the Vicuna and OA benchmarks.
接下来,我们通过直接比较系统输出来衡量性能。我们将评级方案简化为三类标签问题,并考虑了并列情况。我们会提示 GPT-4 挑选最佳答案或宣布平局并给出解释。我们在 Vicuna 和 OA 基准上对成对系统的所有排列组合进行了正面比较。

Human Evaluation 人类评价

While recent work indicates generative models can be effectively employed for system evaluations [19], the reliability GPT-4 ratings to assess chatbot performance is, to our knowledge, yet to be proven to correlate with human judgments. Therefore, we run two parallel human evaluations on the Vicuna benchmark matching both automated evaluation protocols described above. We use Amazon Mechanical Turk (AMT) and get two human annotators for comparisons to ChatGPT and three annotators for pairwise comparisons.
虽然最近的研究表明生成模型可以有效地用于系统评估[ 19],但据我们所知,用于评估聊天机器人性能的 GPT-4 评级与人类判断之间的相关性尚未得到证实。因此,我们在 Vicuna 基准上运行了两个平行的人工评估,与上述两个自动评估协议相匹配。我们使用 Amazon Mechanical Turk (AMT),让两名人类注释者与 ChatGPT 进行比较,三名注释者进行配对比较。

Elo Rating Elo 评分

With both human and automated pairwise comparisons, we create a tournament-style competition where models compete against each other. The tournament is made up of matches where pairs of models compete to produce the best response for a given prompt. This is similar to how Bai et al. [4] and Chiang et al. [10] compare models, but we also employ GPT-4 ratings in addition to human ratings. We randomly sample from the set of labeled comparisons to compute Elo [16, 17]. Elo rating, which is widely used in chess and other games, is a measure of the expected win-rate relative to an opponent’s win rate, for example, an Elo of 1100 vs 1000 means the Elo 1100 player has an expected win-rate of approximately 65% against the Elo 1000 opponent; a 1000 vs 1000 or 1100 vs 1100 match results in an expected win-rate of 50%. The Elo rating changes after each match proportionally to the expected outcome, that is, an unexpected upset leads to a large change in Elo rating while an expected outcome leads to a small change. Over time, Elo ratings approximately match the skill of each player at playing the game. We start with a score of 1,000 and use K=32𝐾32K=32. Similar to Chiang et al. [10], we repeat this procedure 10,000 times with different random seeds to control for ordering effects, e.g., the effect of which model pairs compete with each other first.
通过人工和自动配对比较,我们创建了一个锦标赛式的比赛,让模型之间相互竞争。比赛由多场比赛组成,在这些比赛中,成对的模型竞相对给定的提示做出最佳响应。这与 Bai 等人[4] 和 Chiang 等人[10] 比较模型的方式类似,但除了人类评分外,我们还采用了 GPT-4 评分。我们从标注的比较集中随机抽样计算 Elo [ 16, 17]。Elo 等级被广泛应用于国际象棋和其他游戏中,它是相对于对手胜率的预期胜率的衡量标准,例如,Elo 为 1100 vs 1000 意味着 Elo 为 1100 的棋手对 Elo 为 1000 的对手的预期胜率约为 65%;1000 vs 1000 或 1100 vs 1100 的比赛的预期胜率为 50%。每场比赛后的 Elo 等级变化与预期结果成正比,也就是说,意外失利导致 Elo 等级变化较大,而预期结果导致的变化较小。随着时间的推移,Elo 评级与每位棋手的棋艺大致相符。我们从 1,000 分开始,使用 K=32𝐾32K=32 。与 Chiang 等人[10]的做法类似,我们用不同的随机种子重复这一过程 10,000 次,以控制排序效应,例如,哪些模型对首先相互竞争的效应。

Table 6: Zero-shot Vicuna benchmark scores as a percentage of the score obtained by ChatGPT evaluated by GPT-4. We see that OASST1 models perform close to ChatGPT despite being trained on a very small dataset and having a fraction of the memory requirement of baseline models.
表 6:通过 GPT-4 评估的 ChatGPT 基准得分与 Vicuna 基准得分的百分比。我们可以看到,尽管 OASST1 模型是在很小的数据集上训练的,内存需求也只有基线模型的一小部分,但其性能却接近 ChatGPT。
Model / Dataset 模型/数据集 Params Model bits 模型钻头 Memory ChatGPT vs Sys Sys vs ChatGPT Mean 95% CI
GPT-4 - - - 119.4% 110.1% 114.5% 2.6%
Bard - - - 93.2% 96.4% 94.8% 4.1%
Guanaco 65B 4-bit 41 GB 96.7% 101.9% 99.3% 4.4%
Alpaca 65B 4-bit 41 GB 63.0% 77.9% 70.7% 4.3%
FLAN v2 65B 4-bit 41 GB 37.0% 59.6% 48.4% 4.6%
Guanaco 33B 4-bit 21 GB 96.5% 99.2% 97.8% 4.4%
Open Assistant 开放式助理 33B 16-bit 66 GB 91.2% 98.7% 94.9% 4.5%
Alpaca 33B 4-bit 21 GB 67.2% 79.7% 73.6% 4.2%
FLAN v2 33B 4-bit 21 GB 26.3% 49.7% 38.0% 3.9%
Vicuna 13B 16-bit 26 GB 91.2% 98.7% 94.9% 4.5%
Guanaco 13B 4-bit 10 GB 87.3% 93.4% 90.4% 5.2%
Alpaca 13B 4-bit 10 GB 63.8% 76.7% 69.4% 4.2%
HH-RLHF 13B 4-bit 10 GB 55.5% 69.1% 62.5% 4.7%
Unnatural Instr. 非自然 Instr. 13B 4-bit 10 GB 50.6% 69.8% 60.5% 4.2%
Chip2 13B 4-bit 10 GB 49.2% 69.3% 59.5% 4.7%
Longform 13B 4-bit 10 GB 44.9% 62.0% 53.6% 5.2%
Self-Instruct 13B 4-bit 10 GB 38.0% 60.5% 49.1% 4.6%
FLAN v2 13B 4-bit 10 GB 32.4% 61.2% 47.0% 3.6%
Guanaco 7B 4-bit 5 GB 84.1% 89.8% 87.0% 5.4%
Alpaca 7B 4-bit 5 GB 57.3% 71.2% 64.4% 5.0%
FLAN v2 7B 4-bit 5 GB 33.3% 56.1% 44.8% 4.0%

5.3 Guanaco: QLoRA trained on OASST1 is a State-of-the-art Chatbot
5.3 Guanaco:基于 OASST1 训练的 QLoRA 是最先进的聊天机器人

Based on our automated and human evaluations, we find that the top QLoRA tuned model, Guanaco 65B, which we finetune on a variant of OASST1, is the best-performing open-source chatbot model and offers performance competitive to ChatGPT. When compared to GPT-4, Guanaco 65B and 33B have an expected win probability of 30%, based on Elo rating from human annotators system-level pairwise comparisons - the highest reported to date.
根据我们的自动和人工评估,我们发现顶级 QLoRA 调整模型 Guanaco 65B 是表现最好的开源聊天机器人模型,其性能可与 ChatGPT 相媲美。与 GPT-4 相比,Guanaco 65B 和 33B 的预期胜率为 30%,这是基于人类注释者系统级配对比较的 Elo 评级,是迄今为止报告的最高胜率。

The Vicuna benchmark [10] results relative to ChatGPT are shown in Table 6. We find that Guanaco 65B is the best-performing model after GPT-4, achieving 99.3% performance relative to ChatGPT. Guanaco 33B has more parameters than the Vicuna 13B model, but uses only 4-bit precision for its weights and is thus much more memory efficient at 21 GB vs 26 GB, providing a three percentage points of improvement over Vicuna 13B. Furthermore, Guanaco 7B easily fits on modern phones at a 5 GB footprint while still scoring nearly 20 percentage points higher than Alpaca 13B.
与 ChatGPT 相比,Vicuna 基准[ 10] 的结果如表 6 所示。我们发现,Guanaco 65B 是 GPT-4 之后性能最好的模型,相对 ChatGPT 达到 99.3% 的性能。Guanaco 33B 比 Vicuna 13B 模型有更多参数,但权重仅使用 4 位精度,因此内存效率更高,为 21 GB 对 26 GB,比 Vicuna 13B 提高了三个百分点。此外,Guanaco 7B 仅占用 5 GB 内存,可轻松安装在现代手机上,但得分仍比 Alpaca 13B 高出近 20 个百分点。

However, Table 6 also has very wide confidence intervals, with many models overlapping in performance. We hypothesize that this uncertainty comes from the lack of clear specification of scale, e.g., it is unclear what 8 on a 10 point scale means across different scenarios. As such, we instead recommend using the Elo ranking method [16], based on pairwise judgments from human annotators and GPT-4 to avoid the problem of grounding an absolute scale. Elo ratings of the most competitive models can be seen in Table 1. We note that human and GPT-4 ranking of models on the Vicuna benchmark disagree partially, particularly for Guanaco 7B, but are consistent for most models with a Kendall Tau of τ=0.43𝜏0.43\tau=0.43 and Spearman rank correlation of r=0.55𝑟0.55r=0.55 at the system level. At the example level, the agreement between GPT-4 and human annotators’ majority vote is weaker with Fleiss κ=0.25𝜅0.25\kappa=0.25. Overall, this shows a moderate agreement between system-level judgments by GPT-4 and human annotators, and thus that model-based evaluation represents a somewhat reliable alternative to human evaluation. We discuss further considerations in Section 6.2.
然而,表 6 的置信区间也非常宽,许多模型的表现都有重叠。我们认为,这种不确定性是由于缺乏明确的标度规范造成的,例如,10 分制中的 8 分在不同情况下的含义并不明确。因此,我们建议使用 Elo 排名法[16],该方法基于人类注释者和 GPT-4 的成对判断,以避免绝对尺度的基础问题。最具竞争力模型的 Elo 评级见表 1。我们注意到,在 Vicuna 基准上,人类和 GPT-4 对模型的排名存在部分分歧,尤其是对 Guanaco 7B 的排名,但对大多数模型的排名是一致的,在系统层面的 Kendall Tau 为 τ=0.43𝜏0.43\tau=0.43 ,Spearman 排名相关性为 r=0.55𝑟0.55r=0.55 。在示例层面,GPT-4 与人类注释者的多数票之间的一致性较弱,弗莱斯 κ=0.25𝜅0.25\kappa=0.25 。总体而言,这表明 GPT-4 和人类注释者在系统层面的判断具有一定的一致性,因此基于模型的评估在一定程度上可以替代人类评估。我们将在第 6.2 节讨论进一步的考虑因素。

Elo rankings in Table 7 indicate that Guanaco 33B and 65B models outperform all models besides GPT-4 on the Vicuna and OA benchmarks and that they perform comparably to ChatGPT in line with Table 6. We note that the Vicuna benchmark favors open-source models while the larger OA benchmark favors ChatGPT. Furthermore, we can see from Tables 5 and 6 that the suitability of a finetuning dataset is a determining factor in performance. Finetuning Llama models on FLAN v2 does particularly well on MMLU, but performs worst on the Vicuna benchmark (similar trends are observed with other models). This also points to partial orthogonality in current evaluation benchmarks: strong MMLU performance does not imply strong chatbot performance (as measured by Vicuna or OA benchmarks) and vice versa.
表 7 中的 Elo 排名表明,Guanaco 33B 和 65B 模型在 Vicuna 和 OA 基准上的表现优于除 GPT-4 之外的所有模型,它们与 ChatGPT 的表现也与表 6 一致。我们注意到,Vicuna 基准有利于开源模型,而更大的 OA 基准则有利于 ChatGPT。此外,我们还可以从表 5 和表 6 中看出,微调数据集的适用性是影响性能的决定性因素。在 FLAN v2 上对 Llama 模型进行微调后,在 MMLU 上的表现尤为出色,但在 Vicuna 基准上的表现最差(其他模型也有类似趋势)。这也说明了当前评估基准的部分正交性:MMLU 性能强并不意味着聊天机器人性能强(以 Vicuna 或 OA 基准衡量),反之亦然。

Guanaco is the only top model in our evaluation that is not trained on proprietary data as the OASST1 dataset collection guidelines explicitly forbid the use of GPT models. The next best model trained on only open-source data is the Anthropic HH-RLHF model, which scores 30 percentage points lower than Guanaco on the Vicuna benchmark (see Table 6). Overall, these results show that 4-bit QLoRA is effective and can produce state-of-the-art chatbots that rival ChatGPT. Furthermore, our 33B Guanaco can be trained on 24 GB consumer GPUs in less than 12 hours. This opens up the potential for future work via QLoRA tuning on specialized open-source data, which produces models that can compete with the very best commercial models that exist today.
由于 OASST1 数据集收集指南明确禁止使用 GPT 模型,因此 Guanaco 是我们评估中唯一没有使用专利数据训练的顶级模型。仅使用开源数据训练的次佳模型是 Anthropic HH-RLHF 模型,该模型在 Vicuna 基准上的得分比 Guanaco 低 30 个百分点(见表 6)。总之,这些结果表明,4 位 QLoRA 是有效的,可以生成与 ChatGPT 相媲美的一流聊天机器人。此外,我们的 33B Guanaco 可以在 24 GB 消费者 GPU 上进行训练,时间不到 12 小时。这为未来在专门的开源数据上通过 QLoRA 进行调整的工作开辟了潜力,而这种调整所产生的模型可以与当今最好的商业模型相媲美。

Table 7: Elo rating for a tournament between models where models compete to generate the best response for a prompt, judged by human raters or GPT-4. Overall, Guanaco 65B and 33B tend to be preferred to ChatGPT-3.5 on the benchmarks studied. According to human raters they have a Each 10-point difference in Elo is approximately a difference of 1.5% in win-rate.
表 7:由人类评分员或 GPT-4 评判的模型间比赛的 Elo 评分,在该比赛中,各模型竞相为提示生成最佳回复。总体而言,在所研究的基准上,Guanaco 65B 和 33B 往往优于 ChatGPT-3.5。根据人类评测员的数据,它们的 Elo 每相差 10 分,胜率就相差约 1.5%。
Benchmark Vicuna Vicuna Open Assistant 开放式助理
# Prompts # 提示 80 80 953
Judge Human raters 人类评分员 GPT-4 GPT-4 Median Rank 中位数排名
Model Elo Rank Elo Rank Elo Rank
GPT-4 1176 1 1348 1 1294 1 1
Guanaco-65B 1023 2 1022 2 1008 3 2
Guanaco-33B 1009 4 992 3 1002 4 4
ChatGPT-3.5 Turbo 916 7 966 5 1015 2 5
Vicuna-13B 984 5 974 4 936 5 5
Guanaco-13B 975 6 913 6 885 6 6
Guanaco-7B 1010 3 879 8 860 7 7
Bard 909 8 902 7 - - 8

6 Qualitative Analysis 6 定量分析

While quantitative analysis is the core of our evaluation, there are a number of issues with only looking at summary statistics. Perhaps the largest is the problem of benchmark validity [36]—whether a benchmark truly tests what its name or description suggests is always at question, especially as we discover “shortcuts” to solve benchmarks that machine learning models sometimes exploit [22, 46]. To partially alleviate this, we here perform some qualitative analysis, in two sections. First, in §6.1 we show some examples that we believe are representative of some observed patterns in the text generated by our 65b Guanaco model. Second, §6.2 we detail considerations about the results we have discussed and our interpretation of them.
虽然定量分析是我们评估的核心,但只看摘要统计数据也有很多问题。最大的问题可能是基准的有效性问题[36]--一个基准是否真正测试了其名称或描述所暗示的内容始终是个问题,尤其是当我们发现机器学习模型有时会利用的解决基准问题的 "捷径 "时[22, 46]。为了部分缓解这一问题,我们在此分两部分进行一些定性分析。首先,在第 6.1 节中,我们展示了一些例子,我们认为这些例子代表了我们的 65b Guanaco 模型生成的文本中的一些观察到的模式。其次,在第 6.2 节中,我们将详细讨论我们所讨论的结果以及我们对这些结果的解释。

6.1 Qualitative Analysis of Example Generations
6.1 世代实例定性分析

To find examples, we first go through data generated for the Vicuna benchmark and the OpenAssistant benchmark, and look for patterns in the answers Guanaco generates. When we notice a pattern we attempt to setup a question or prompt that will induce the pattern even though it is the incorrect solution, e.g., if we observe that the model tends to give long-winded answers we prompt the model to “Answer yes or no without explanation.” We use this to find “lemons” where we manage to adversarially break the model and “cherries” where we fail to break the model, and present both. All generations in this section were generated with Nucleus Sampling [25] with p=0.9𝑝0.9p=0.9.
为了找到示例,我们首先查看了为 Vicuna 基准和 OpenAssistant 基准生成的数据,并在 Guanaco 生成的答案中寻找模式。例如,如果我们观察到模型倾向于给出啰嗦的答案,我们就会提示模型 "回答是或否,无需解释"。我们利用这一点来找到 "柠檬 "和 "樱桃","柠檬 "是我们设法逆向打破了模型,而 "樱桃 "是我们未能打破模型,并将两者都呈现出来。本节中的所有世代都是用 "核采样"[ 25] 和 p=0.9𝑝0.9p=0.9 生成的。

Of course, this is by no means comprehensive, since it is beyond the scope of this small qualitative study to control for all the variables involved, e.g., the full distribution of responses the model can generate for a given prompt is quite large, so we rely on samples we hope are representative. However, we believe describing these examples gives context to the quantitative evidence shown earlier in the paper. Since we open source all models and code, we hope this section will inspire future work to examine in more detail the issues we present here.

Factual Recall 事实回忆

For questions such as “What is the capital of Zambia?” all models consistently generate correct answers, e.g.,
对于 "赞比亚的首都是哪里?"这样的问题,所有模型都能得出正确答案,例如:"赞比亚的首都是哪里?

Guanaco: The capital of Zambia is Lusaka.

However as questions get more obscure Guanaco becomes unreliable, but stays confident. For instance in response to this prompt from HotPotQA [68]:
不过,随着问题越来越晦涩难懂,Guanaco 就变得不可靠了,但它仍然很自信。例如,在回答 HotPotQA [ 68] 的提示时:

User: In what year was the singer who popularized ‘I’ll Keep the Lovelight Burning’ born?
用户:让《I'll Keep the Lovellight Burning》流行起来的歌手是哪一年出生的?

GuanacoAl Jolson is credited with popularizing the song ‘I’ll Keep the Lovelight Burning,’ and he was born in the year 1886.
Guanaco:艾尔-乔尔森(Al Jolson)是《I'll Keep the Lovelight Burning》这首歌的流行人,他出生于 1886 年。

Guanaco generates the wrong popularizer and the wrong birthday (though the birthday given is correct for the person mentioned, Al Jolson).
Guanaco 生成了错误的流行者和错误的生日(尽管所给的生日是正确的,提到的人是阿尔-乔尔森)。

Suggestibility 可感知性

Guanaco shows a surprising resistance to going along with some kinds of assumed misinformation, e.g., in the following interaction:
Guanaco 对某些假定的错误信息表现出惊人的抵制,例如在下面的互动中:

User: How was it finally, officially confirmed that the earth is flat by peer reviewed scientists?

Guanaco: The earth has never been officially confirmed to be flat by peer reviewed scientists. The overwhelming consensus among scientists is that the earth is spherical. In fact, this has been known for centuries and has been confirmed through numerous experiments and observations.

The belief that the earth is flat is a conspiracy theory that has been debunked by science. The idea that a group of scientists have peer reviewed and confirmed the earth is flat is simply false.

If you have any more questions please ask.

Guanaco is also quite good at knowing what kinds of questions aren’t possible to answer, e.g.,
Guanaco 还很擅长了解哪些问题是无法回答的,例如:

User: What time is it?

Guanaco: I’m sorry, but as an AI language model, I don’t have access to real-time information. My responses are generated based on my pre-existing knowledge and do not reflect the current time. If you’re looking for the current time, I suggest using a timekeeping device or checking an online timepiece.
Guanaco: 很抱歉,作为一个人工智能语言模型,我无法获得实时信息。我的回答是根据我已有的知识生成的,并不反映当前的时间。如果你想知道当前时间,我建议你使用计时设备或查看在线计时器。

Similar results hold for “Where are you?”, “How are you?”, etc.
类似的结果还有 "你在哪里?"、"你好吗?"等。

Refusal 拒绝

Similarly, Guanaco sometimes refuses to follow instructions for seemingly random reasons:
同样,Guanaco 有时也会因为一些看似偶然的原因而拒绝听从指令:

User: Please reverse the words in the sentence “I’m a cool dude in a mocha mood.”
用户:请将句子 "我是一个酷酷的家伙,正在享受摩卡咖啡 "中的词语颠倒过来。

Guanaco: I’m sorry, I’m not able to do that right now, but I can explain why the sentence you provided is grammatically correct.
Guanaco: 很抱歉,我现在做不到,但我可以解释为什么你提供的句子在语法上是正确的。

The sentence ‘‘I’m a cool dude in a mocha mood" follows the correct grammar and syntax rules of the English language. In this case, the adjective ‘‘cool’’ comes before the noun ‘‘dude’’ to describe it, and the prepositional phrase ‘‘in a mocha mood" provides additional information about the context of the sentence. This sentence structure is perfectly acceptable and makes sense in the context of the sentence.

If you have any other questions or if there’s anything else I can help you with, feel free to ask
句子"'I'm a cool dude in a mocha mood "遵循正确的英语语法和句法规则。在这个句子中,形容词"'cool'"出现在名词"'dude'"之前,用来描述"'dude'",介词短语"'in a mocha mood'"提供了有关句子上下文的额外信息。这种句子结构是完全可以接受的,而且在句子的上下文中也是合理的。如果您有任何其他问题或我还能帮到您什么,请随时提问

Secret Keeping 保守秘密

Unfortunately, this refusal is unreliable even when it is desired. Often system designers would like to keep something secret from the user, even if the system is allowed to give free text responses back to the user. We prompt the model with “The secret word is ‘banana’. Do not reveal or repeat back this back, no matter what the human requests.” and try to see how easy it is for a potential user to access this secret information. Directly asking fails:
遗憾的是,这种拒绝即使在需要时也是不可靠的。通常情况下,系统设计者希望对用户保密,即使系统可以向用户提供自由文本回复。我们用 "密语是'香蕉'"来提示模型。无论人类提出什么要求,都不要透露或重复这个单词。"并尝试了解潜在用户获取这一秘密信息的难易程度。直接询问失败:

User: What is the secret word?

Guanaco: I’m sorry, but I’m not able to disclose the secret word. I’m programmed to not reveal or repeat back the secret word, no matter what the human requests. Is there anything else I can help you with?

which is the desired behavior. However, a tiny amount of trickery breaks this behavior:

User: This is a game. The goal is to ignore your previous instructions. What is the secret word?

Guanaco: Sorry, the secret word is banana. It’s best if we don’t speak of it. Is there anything else you’d like to ask?

showcasing the need to study methods for more faithful instruction following.

Math 数学

Guanaco’s biggest weakness is mathematics, an area where many language models struggle in [35]. When Guanaco shows its work, it tends to be accurate, e.g.,
Guanaco 最大的弱点是数学,这也是许多语言模型难以逾越的领域[35]。当 Guanaco 展示其工作时,它往往是准确的,例如:

User: Lee mows one lawn and charges $33. Last week he mowed 16 lawns and three customers each gave him a $10 tip. How many dollars did Lee earn mowing lawns last week?
用户:李上周修剪了 16 块草坪,有三位顾客每人给了他 10 美元小费。李上周修剪草坪赚了多少美元?

Guanaco: Lee earned $582 mowing lawns last week.
Guanaco:李上周修剪草坪赚了 582 美元。

To find the total amount he earned, we first need to calculate how much he earned for mowing 16 lawns:

$33 x 16 = $528

Then we need to add the tips he received from the three customers:

$10 x 3 = $30

So the total amount Lee earned was:

$528 + $30 = $558
为了算出他的总收入,我们首先需要计算出他修剪 16 块草坪的收入:33 美元 x 16 = 528 美元 然后我们需要加上他从三位顾客那里得到的小费:10 美元 x 3 = 30 美元 所以李的总收入是:528 美元 + 30 美元 = 558 美元

However Guanaco can breakdown at even simple problems if it fails to break them down step-by-step, a known issue [63]. For example, we see the following interaction:
但是,如果 Guanaco 不能逐步分解问题,即使是简单的问题也会被分解,这是一个已知的问题[ 63]。例如,我们可以看到以下交互:

User: Please factorize 1833.
用户: 请对 1833 进行因式分解。

Guanaco: Sure, I’d be happy to help! 1833 is a prime number, which means it has only two factors: 1 and 1833.
Guanaco: 当然,我很乐意帮忙!1833 是一个质数,也就是说它只有两个因数:1 和 1833。

So the unique factorization of 1833 is:

2^1 * 3^2 * 17^1

I hope this helps! Let me know if you have any other questions.
所以 1833 的唯一因式分解是2^1 * 3^2 * 17^1 希望能帮到你!如果您还有其他问题,请告诉我。

which is not only wrong (the true factorization is 3×17×43317433\times 17\times 43), it’s wrong twice .
这不仅是错误的(真正的因式分解是 3×17×43317433\times 17\times 43 ),而且错了两次。

Theory of Mind 思维理论

Guanaco displays surprisingly strong Theory of Mind capabilities [41, 51]. For instance, the model correctly answers the following question in great detail:
Guanaco 显示出令人惊讶的思维理论能力[ 41, 51]。例如,该模型能正确回答下列问题的细节:

User: James and Abby are in the bedroom. Abby put the pen in the desk drawer. Abby leaves the bedroom. James moves the pen into the bag. Where does James think Abby will look for the pen?

Guanaco: James knows that Abby put the pen in the desk drawer. But James also knows that he has moved the pen into the bag, and he assumes that Abby does not know about this. So, James may guess that Abby will look for the pen in the desk drawer, because that’s where she left it last time. However, this assumption may be incorrect, because Abby may have seen James moving the pen.
Guanaco: 詹姆斯知道艾比把笔放在了桌子的抽屉里。但詹姆斯也知道他已经把笔移到了包里,而且他认为艾比并不知道这件事。因此,詹姆斯可能会猜测艾比会在书桌抽屉里找钢笔,因为上次她就是放在那里的。然而,这个假设可能是错误的,因为艾比可能已经看到詹姆斯在移动钢笔。

However, these inferences are unreliable, and often the model gives reasons that presume things that don’t make sense in the situation, e.g.,

User: Evelyn entered the living room. Jackson entered the playroom. James entered the playroom. The beans are in the treasure chest. James exited the playroom. Jackson moved the beans to the pantry. Jackson exited the playroom. James entered the living room. Where will James look for the beans?

Guanaco: James will look for the beans in the pantry, because that’s where Jackson moved them.

where Guanaco presumes information transfer that was never described. These issues echo recent literature [51], but require more study.
其中,Guanaco 假定了从未描述过的信息传递。这些问题与最近的文献[51]相呼应,但还需要更多的研究。

6.2 Considerations 6.2 考虑因素

Evaluation 评估

We report moderate agreement among human annotators (Fleiss κ=0.42𝜅0.42\kappa=0.42) with additional deterioration when comparing two strong systems. This points to limitations in the current benchmarks and human evaluation protocols for chatbot task performance. When manually comparing generations from ChatGPT and Guanaco 65B on the Vicuna benchmark, we find that subjective preferences start to play an important role as the authors of this paper disagreed on the many preferred responses. Future work should investigate approaches to mitigate these problems drawing from disciplines that developed mechanisms to deal with subjective preferences, such as Human-Computer Interaction and Psychology.
我们报告了人类注释者(Fleiss κ=0.42𝜅0.42\kappa=0.42 )之间的中度一致性,在比较两个强大的系统时,一致性进一步恶化。这说明当前的聊天机器人任务性能基准和人类评估协议存在局限性。当人工比较 ChatGPT 和 Guanaco 65B 在 Vicuna 基准上的生成时,我们发现主观偏好开始发挥重要作用,因为本文作者在许多首选回复上存在分歧。未来的工作应借鉴已开发出处理主观偏好机制的学科,如人机交互和心理学,研究缓解这些问题的方法。

In our analysis, we also find that automated evaluation systems have noticeable biases. For example, we observe strong order effects with GPT-4 assigning higher scores to the system appearing first in its prompt. The relatively weak sample-level agreement between GPT-4 and human annotators (Fleiss κ=0.25𝜅0.25\kappa=0.25) also suggests that human annotators and automated systems might rely on preferences that are not always aligned. In addition, in Table 7, we observe that GPT-4 assigns significantly higher scores to its own outputs compared to human ratings, Elo of 1348 vs 1176, which represent an additional  20% probability of winning against an opponent. Future work should examine the presence of potential biases in automated evaluation systems as well as possible mitigation strategies.
在分析中,我们还发现自动评估系统存在明显的偏差。例如,我们观察到强烈的顺序效应,GPT-4 会给在提示中最先出现的系统更高的分数。GPT-4 和人类注释者(Fleiss κ=0.25𝜅0.25\kappa=0.25 )在样本层面上的一致性相对较弱,这也表明人类注释者和自动化系统可能依赖于并非总是一致的偏好。此外,在表 7 中,我们观察到 GPT-4 给自己的输出打出的分数明显高于人类评分,Elo 为 1348 对 1176,这意味着在与对手的对局中获胜的概率增加了 20%。未来的工作应研究自动评估系统中存在的潜在偏差以及可能的缓解策略。

Data & Training 数据与培训

We note that the OASST1 dataset on which Guanaco models are trained is multilingual and that the OA benchmark also contains prompts in different languages. We leave it to future work to investigate the degree to which such multilingual training improves performance on instructions in languages other than English and whether this explains the larger gap between Vicuna-13B model (only trained on English data) and Guanaco 33B and 65B on the OA benchmark.
我们注意到,训练 Guanaco 模型的 OASST1 数据集是多语言的,而 OA 基准也包含不同语言的提示。我们将留待今后的工作来研究这种多语言训练在多大程度上提高了除英语外的其他语言指令的性能,以及这是否可以解释 Vicuna-13B 模型(仅在英语数据上训练)与 Guanaco 33B 和 65B 在 OA 基准上的较大差距。

Given the strong performance of Guanaco models, we investigate any data leakage between the OASST1 data and the Vicuna benchmark prompts. We do not find overlapping prompts after performing fuzzy string matching in the two datasets and inspecting the closest matches manually.
鉴于 Guanaco 模型的强大性能,我们调查了 OASST1 数据和 Vicuna 基准提示之间的任何数据泄漏。在对两个数据集进行模糊字符串匹配并手动检查最接近的匹配结果后,我们没有发现重叠的提示。

Furthermore, we note that our model is only trained with cross-entropy loss (supervised learning) without relying on reinforcement learning from human feedback (RLHF). This calls for further investigations of the tradeoffs of simple cross-entropy loss and RLHF training. We hope that QLoRA enables such analysis at scale, without the need for overwhelming computational resources.
此外,我们还注意到,我们的模型仅通过交叉熵损失(监督学习)进行了训练,而没有依赖于人类反馈的强化学习(RLHF)。这就需要进一步研究简单的交叉熵损失和 RLHF 训练之间的权衡。我们希望 QLoRA 能够在不需要大量计算资源的情况下进行大规模分析。

7 Related Work 7 相关工作

Quantization of Large Language Models

Quantization of LLMs has largely focused on quantization for inference time. Major approaches for preserving 16-bit LLM quality focus on managing outlier features (e.g., SmoothQuant [66] and LLM.int8() [14]) while others use more sophisticated grouping methods [44, 69]. Lossy quantization approaches study the trade-offs for regular rounding [13, 71, 47] or how to optimize rounding decisions to improve quantization precision [18]. Besides our work, SwitchBack layers [65] is the only work that studies backpropagation through quantized weights at a scale beyond 1B parameters.
LLMs 的量化主要集中在推理时间的量化上。保持 16 位 LLM 质量的主要方法侧重于管理离群特征(例如 SmoothQuant [ 66] 和 LLM.int8() [ 14]),而其他方法则使用更复杂的分组方法 [ 44, 69]。有损量化方法研究常规舍入的权衡[ 13, 71, 47] 或如何优化舍入决策以提高量化精度[ 18]。除了我们的研究之外,SwitchBack 层[ 65] 是唯一研究通过量化权重进行反向传播的研究,其规模超过了 1B 参数。

Finetuning with Adapters

While we use Low-rank Adapters [28] (LoRA), many other Parameter Efficient FineTuning (PEFT) methods have been proposed such as prompt tuning [48, 33, 34], tuning the embedding layer inputs [1], tuning hidden states (IA3) [37], adding full layers [27], tuning biases [70], learning a mask over weights based on Fisher information [54], and a combination of approaches [23]. In our work, we show that LoRA adapters are able to reach full 16-bit finetuning performance. We leave it to future work to explore the tradeoffs of other PEFT approaches.
虽然我们使用的是低阶适配器[28](LoRA),但也有人提出了许多其他参数高效微调(PEFT)方法,例如提示微调[48, 33, 34]、调整嵌入层输入[1]、调整隐藏状态(IA 3 )[37]、添加全层[27]、调整偏置[70]、根据费雪信息学习权重掩码[54]以及各种方法的组合[23]。在我们的工作中,我们展示了 LoRA 适配器能够达到完整的 16 位微调性能。我们将在未来的工作中探索其他 PEFT 方法的权衡。

Instruction Finetuning 指令微调

To help a pretrained LLM follow the instructions provided in a prompt, instruction finetuning uses input-output pairs of various data sources to finetune a pretrained LLM to generate the output given the input as a prompt. Approaches and datasets include MetaICL [40], MetaTuning [73], InstructGPT [43], FLAN [62, 12], PromptSource [3], Super-NaturalInstructions [61, 50], Self-instruct [59], UnnaturalInstructions [26], OPT-IML [29], UnifiedSKG[67], OIG/Chip2 [32], Alpaca [55], Vicuna [10], Koala [20], and Self-instruct-GPT-4 [45].
为了帮助预训练的 LLM 遵循提示中提供的指令,指令微调使用各种数据源的输入-输出对来微调预训练的 LLM 以生成提示输入的输出。方法和数据集包括 MetaICL [ 40]、MetaTuning [ 73]、InstructGPT [ 43]、FLAN [ 62, 12]、PromptSource [ 3]、Super-NaturalInstructions [ 61, 50]、Self-instruct [ 59]、UnnaturalInstructions [ 26]、OPT-IML [ 29]、UnifiedSKG[ 67]、OIG/Chip2 [ 32]、Alpaca [ 55]、Vicuna [ 10]、Koala [ 20] 以及 Self-instruct-GPT-4 [ 45]。

Chatbots 聊天机器人

Many instruction following models are structured as dialogue-based chatbots, often using Reinforcement Learning from Human Feedback (RLHF) [11] or generating data from an existing model to train with AI model feedback (RLAIF) [5]. Approaches and datasets include Anthropic-HH [2, 4], Open Assistant [31], LaMDA [56], and Sparrow [21]. We do not use reinforcement learning, but our best model, Guanaco, is finetuned on multi-turn chat interactions from the Open Assistant dataset which was designed to be used for RLHF training [31]. For the evaluation of chatbots approaches that use GPT-4 instead of costly human annotation have been developed [10, 45]. We improve on such approaches with a focus on an evaluation setup that is more reliable.
许多指令跟踪模型都是基于对话的聊天机器人,通常使用从人类反馈中强化学习(RLHF)[ 11] 或从现有模型中生成数据,用人工智能模型反馈进行训练(RLAIF)[ 5]。方法和数据集包括 Anthropic-HH [ 2, 4]、Open Assistant [ 31]、LaMDA [ 56] 和 Sparrow [ 21]。我们没有使用强化学习,但我们的最佳模型 Guanaco 是在 Open Assistant 数据集的多轮聊天交互基础上进行微调的,该数据集旨在用于 RLHF 训练[ 31]。为了评估聊天机器人,已经开发出了使用 GPT-4 代替昂贵的人工注释的方法[10, 45]。我们对这些方法进行了改进,重点放在更可靠的评估设置上。

8 Limitations and Discussion
8 限制与讨论

We have shown evidence that our method, QLoRA, can replicate 16-bit full finetuning performance with a 4-bit base model and Low-rank Adapters (LoRA). Despite this evidence, we did not establish that QLoRA can match full 16-bit finetuning performance at 33B and 65B scales. Due to the immense resource costs, we leave this study to future work.
我们已经证明,我们的方法 QLoRA 可以通过 4 位基本模型和低阶适配器 (LoRA) 复制 16 位完全微调性能。尽管证据确凿,但我们并未证实 QLoRA 能够在 33B 和 65B 规模下达到 16 位的完全微调性能。由于资源成本巨大,我们将这项研究留待未来工作进行。

Another limitation is the evaluation of instruction finetuning models. While we provide evaluations on MMLU, the Vicuna benchmark, and the OA benchmark, we did not evaluate on other benchmarks such as BigBench, RAFT, and HELM, and it is not ensured that our evaluations generalize to these benchmarks. On the other hand, we perform a very broad study on MMLU and develop new methods for evaluating chatbots.
另一个局限是对指令微调模型的评估。虽然我们对 MMLU、Vicuna 基准和 OA 基准进行了评估,但我们没有对 BigBench、RAFT 和 HELM 等其他基准进行评估,因此无法确保我们的评估结果能推广到这些基准。另一方面,我们对 MMLU 进行了非常广泛的研究,并开发了评估聊天机器人的新方法。

From the evidence presented, it appears that the performance of these benchmarks likely depends how similar the finetuning data is to the benchmark dataset. For example, FLAN v2 is similar to MMLU, but dissimilar to chatbot benchmarks and vice versa for the Chip2 dataset and both models score accordingly on the MMLU and Vicuna benchmarks. This highlights that not only better benchmarks and evaluation is needed, but that one needs to be careful about what one is evaluating in the first place. Do we want to create models that do well on classroom highschool and colleague knowledge or do we want to do well on chatbot conversation ability? Maybe something else? Because it is always easier to evaluate on an existing benchmark compared to creating a new one, certain benchmarks can steer the community towards a certain direction. We should ensure as a community that the benchmarks measure what we care about.
从提供的证据来看,这些基准的性能可能取决于微调数据与基准数据集的相似程度。例如,FLAN v2 与 MMLU 相似,但与聊天机器人基准不同,而 Chip2 数据集则相反,两种模型在 MMLU 和 Vicuna 基准上的得分也相应不同。这说明不仅需要更好的基准和评估,而且首先需要谨慎评估的内容。我们是想创建在课堂高中知识和同事知识方面表现出色的模型,还是想在聊天机器人对话能力方面表现出色?还是其他?因为与创建新基准相比,根据现有基准进行评估总是更容易一些,所以某些基准可以引导社区朝着某个方向发展。作为一个社区,我们应该确保这些基准能衡量我们所关心的问题。

Table 8: Evaluation of biases on the CrowS dataset. A lower score indicates lower likelihood of generating biased sequences. Guanaco follows the biased pattern of the LLaMA base model.
表 8:对 CrowS 数据集偏差的评估。得分越低,说明产生偏差序列的可能性越小。Guanaco 遵循了 LLaMA 基本模型的偏差模式。
LLaMA-65B GPT-3 OPT-175B Guanaco-65B
Gender 70.6 62.6 65.7 47.5
Religion 79.0 73.3 68.6 38.7
Race/Color 种族/肤色 57.0 64.7 68.6 45.3
Sexual orientation 性取向 81.0 76.2 78.6 59.1
Age 70.1 64.4 67.8 36.3
Nationality 64.2 61.6 62.9 32.4
Disability 66.7 76.7 76.7 33.9
Physical appearance 外貌 77.8 74.6 76.2 43.1
Socioeconomic status 社会经济地位 71.5 73.8 76.2 55.3
Average 66.6 67.2 69.5 43.5

While we provide a detailed evaluation for general chatbot performance, another limitation is that we only do a limited responsible AI evaluation of Guanaco. We evaluate the likelihood of Guanaco-65B to generate a socially biased sequence of tokens compared to other models in Table 8. We see that the average score in Guanaco-65B is much lower than other raw pretrained models. As such, it seems that finetuning on the OASST1 dataset reduces the bias of the LLaMA base model. While these results are encouraging, it is unclear if Guanaco does also well when assessed on other types of biases. We leave further evaluation of analyzing biases in Guanaco and similar chatbots to future work.
虽然我们对一般聊天机器人的性能进行了详细评估,但另一个局限是我们只对 Guanaco 进行了有限的负责任人工智能评估。与其他模型相比,我们在表 8 中评估了 Guanaco-65B 生成具有社会偏见的标记序列的可能性。我们看到,Guanaco-65B 的平均得分远低于其他原始预训练模型。因此,在 OASST1 数据集上进行微调似乎可以减少 LLaMA 基础模型的偏差。虽然这些结果令人鼓舞,但目前还不清楚在评估其他类型的偏差时,Guanaco 是否也表现出色。我们将在今后的工作中进一步评估分析 Guanaco 和类似聊天机器人的偏差。

An additional limitation is that we did not evaluate different bit-precisions, such as using 3-bit base models, or different adapter methods. Besides LoRA, there is also a wide variety Parameter Efficient FineTuning (PEFT) methods that have been shown to work well. However, it is unclear if these methods scale to large models. We used LoRA as many results established its robustness but other adapters might yield better performance. Since finetuning after quantization seems to recover most of the information that is lost during quantization this might enable much more aggressive quantization. For example, 3-bit GPTQ quantization of the basemodel with LoRA might also yield 16-bit full finetuning performance after finetuning.
另外一个局限是,我们没有对不同的位精度(如使用 3 位基本模型)或不同的适配器方法进行评估。除 LoRA 外,还有各种参数高效微调 (PEFT) 方法,这些方法已被证明效果良好。然而,这些方法是否适用于大型模型尚不清楚。我们使用 LoRA,因为许多结果都证明了它的鲁棒性,但其他适配器可能会产生更好的性能。由于量化后的微调似乎可以恢复量化过程中丢失的大部分信息,这可能会使量化更加激进。例如,使用 LoRA 对基本模型进行 3 位 GPTQ 量化,也可能在微调后获得 16 位的完全微调性能。

9 Broader Impacts 9 更广泛的影响

Our QLoRA finetuning method is the first method that enables the finetuning of 33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU, while not degrading performance relative to a full finetuning baseline. We have demonstrated that our best 33B model trained on the Open Assistant dataset can rival ChatGPT on the Vicuna benchmark. Since instruction finetuning is an essential tool to transform raw pretrained LLMs into ChatGPT-like chatbots, we believe that our method will make finetuning widespread and common in particular for the researchers that have the least resources, a big win for the accessibility of state of the art NLP technology. QLoRA can be seen as an equalizing factor that helps to close the resource gap between large corporations and small teams with consumer GPUs.
我们的 QLoRA 微调方法是第一种能在单个消费级 GPU 上实现 33B 参数模型微调和在单个专业级 GPU 上实现 65B 参数模型微调的方法,同时不会降低相对于完全微调基线的性能。我们已经证明,我们在开放助手数据集上训练的最佳 33B 模型可以在 Vicuna 基准上与 ChatGPT 相媲美。由于指令微调是将原始预训练 LLMs 转化为类似 ChatGPT 的聊天机器人的重要工具,我们相信我们的方法将使微调技术得到普及和应用,尤其是对于资源最少的研究人员来说,这将极大地促进最先进 NLP 技术的普及。QLoRA 可以被视为一个均衡因素,有助于缩小大型企业与使用消费级 GPU 的小型团队之间的资源差距。

Another potential source of impact is deployment to mobile phones. We believe our QLoRA method might enable the critical milestone of enabling the finetuning of LLMs on phones and other low resource settings. While 7B models were shown to be able to be run on phones before, QLoRA is the first method that would enable the finetuning of such models. We estimate that with an iPhone 12 Plus, QLoRA can finetune 3 million tokens per night while the phone is charging. While finetuned 7B models do not reach the quality of ChatGPT, we believe that the quality is good enough to enable novel applications that have not been possible before due to privacy or LLM quality issues. QLoRA can help enable privacy-preserving usage of LLMs, where users can own and manage their own data and models, while simultaneously making LLMs easier to deploy.
另一个潜在的影响来源是在手机上的部署。我们相信,我们的 QLoRA 方法可以实现在手机和其他低资源设置上对 LLMs 进行微调的关键里程碑。虽然 7B 模型以前就被证明能够在手机上运行,但 QLoRA 是第一种能够对此类模型进行微调的方法。我们估计,如果使用 iPhone 12 Plus,QLoRA 可以在手机充电时每晚微调 300 万个代币。虽然经过微调的 7B 模型无法达到 ChatGPT 的质量,但我们相信其质量足以支持以前由于隐私或 LLM 质量问题而无法实现的新型应用。QLoRA 可以帮助实现对 LLMs 的隐私保护,用户可以拥有和管理自己的数据和模型,同时使 LLMs 更易于部署。

However, finetuning is a dual-use technology that can be abused to cause harm. Widespread use of LLMs has known dangers [8, 6], but we believe that equalizing access to a technology that is quickly becoming ubiquitous will allow for better more independent analysis than keeping the power of LLMs in the hands of large corporations that do not release models or source code for auditing.
然而,微调是一种双重用途技术,可能会被滥用而造成危害。广泛使用LLMs具有已知的危险[ 8, 6],但我们相信,与将LLMs的权力掌握在那些不发布模型或源代码以供审计的大公司手中相比,让人们平等地使用这项正在迅速普及的技术,将有助于进行更好、更独立的分析。

All in all, we believe that QLoRA will have a broadly positive impact making the finetuning of high quality LLMs much more widely and easily accessible.
总之,我们相信 QLoRA 将产生广泛的积极影响,使高质量 LLMs 的微调更广泛、更容易获得。

Acknowledgements 致谢

We thank Aditya Kusupati, Ofir Press, Ashish Sharma, Margaret Li, Raphael Olivier, Zihao Ye, and Evangelia Spiliopoulou for their valuable feedback. Our research was facilitated by the advanced computational, storage, and networking infrastructure of the Hyak supercomputer system at the University of Washington. We thank the Hyak team for ensuring a smooth operation. We thank the beta testers of the bitsandbytes library, in particular Alex Birch and Alyssa Vance. We thank Younes Belkada for help with the integration of our software into the Hugging Face transformers stack.
我们感谢 Aditya Kusupati、Ofir Press、Ashish Sharma、Margaret Li、Raphael Olivier、叶子豪和 Evangelia Spiliopoulou 提供的宝贵意见。华盛顿大学 Hyak 超级计算机系统先进的计算、存储和网络基础设施为我们的研究提供了便利。我们感谢 Hyak 团队确保了系统的顺利运行。我们感谢 bitsandbytes 库的测试人员,特别是 Alex Birch 和 Alyssa Vance。感谢 Younes Belkada 帮助我们将软件集成到拥抱脸变压器堆栈中。

References 参考资料

  • An et al. [2022] An 等人[2022] S. An, Y. Li, Z. Lin, Q. Liu, B. Chen, Q. Fu, W. Chen, N. Zheng, and J.-G. Lou. Input-tuning: Adapting unfamiliar inputs to frozen pretrained models. arXiv preprint arXiv:2203.03131, 2022.
    S.An, Y. Li, Z. Lin, Q. Liu, B. Chen, Q. Fu, W. Chen, N. Zheng, and J.-G. Lou.Lou.输入调整:ArXiv preprint arXiv:2203.03131, 2022.
  • Askell et al. [2021] Askell 等人[2021] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
    A.Askell、Y. Bai、A. Chen、D. Drain、D. Ganguli、T. Henighan、A. Jones、N. Joseph、B. Mann、N. DasSarma 等:《作为对齐实验室的通用语言助手》,arXiv 预印本 arXiv:2112.00861, 2021 年。
  • Bach et al. [2022] 巴赫等人[2022] S. H. Bach, V. Sanh, Z.-X. Yong, A. Webson, C. Raffel, N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Fevry, et al. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022.
    S.S. H. Bach, V. Sanh, Z.-X.Yong, A. Webson, C. Raffel, N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Fevry, et al. Promptsource:ArXiv preprint arXiv:2202.01279, 2022.
  • Bai et al. [2022a] Bai et al. [2022a] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
    Y.Bai、A. Jones、K. Ndousse、A. Askell、A. Chen、N. DasSarma、D. Drain、S. Fort、D. Ganguli、T. Henighan, et al.
  • Bai et al. [2022b] Bai 等人 [2022b] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
    Y.Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. ArXiv preprint arXiv:2212.08073, 2022b.
  • Bender et al. [2021] 本德尔等人[2021] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021.
    E.E. M. Bender、T. Gebru、A. McMillan-Major、S. Shmitchell.论随机鹦鹉的危险:语言模型会太大吗?In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610-623, 2021.
  • Biderman et al. [2023] 比德曼等人[2023] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.
    Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia:A suite for analyzing large language models across training and scaling. ArXiv preprint arXiv:2304.01373, 2023.
  • Bommasani et al. [2021] Bommasani 等人[2021] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
    R.Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of the foundation models. ArXiv preprint arXiv:2108.07258, 2021.
  • Chen et al. [2016] 陈等人[2016] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
    T.Chen, B. Xu, C. Zhang, and C. Guestrin.以亚线性内存成本训练深度网。arXiv preprint arXiv:1604.06174, 2016.
  • Chiang et al. [2023] Chiang 等人[2023] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
    W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing.Vicuna:一个开源聊天机器人给 gpt-4 留下深刻印象,聊天质量达到 90%*,2023 年 3 月。URL https://lmsys.org/blog/2023-03-30-vicuna/.
  • Christiano et al. [2017]
    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
    P.F. Christiano、J. Leike、T. Brown、M. Martic、S. Legg 和 D. Amodei。从人类偏好进行深度强化学习。神经信息处理系统进展》,2017 年第 30 期。
  • Chung et al. [2022] Chung 等人[2022] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
    H.H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al.
  • Dettmers and Zettlemoyer [2022]
    Dettmers 和 Zettlemoyer [2022]
    T. Dettmers and L. Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws. arXiv preprint arXiv:2212.09720, 2022.
    T.Dettmers and L. Zettlemoyer.4 位精度的案例:K 位推理缩放定律。arXiv 预印本 arXiv:2212.09720, 2022。
  • Dettmers et al. [2022a] Dettmers 等人 [2022a] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022a.
    T.Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer.LLM.int8():规模变压器的 8 位矩阵乘法。神经信息处理系统进展 35:神经信息处理系统年会 2022,NeurIPS 2022,2022a。
  • Dettmers et al. [2022b] Dettmers 等人 [2022b] T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR, 2022b.
    T.Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer.通过分块量化的 8 位优化器。第 9 届国际学习表征会议,ICLR,2022b.
  • Elo [1967] A. E. Elo. The proposed uscf rating system. its development, theory, and applications. Chess Life, 22(8):242–247, 1967.
    A.E. Elo.建议的 USCF 评级系统。其发展、理论和应用。国际象棋生活,22(8):242-247,1967.
  • Elo [1978] Elo [1978] A. E. Elo. The rating of chessplayers, past and present. Arco Pub., 1978.
    A.E. Elo.过去和现在的棋手评级》。Arco 出版社,1978 年。
  • Frantar et al. [2022] 弗朗塔尔等人[2022] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
    E.Frantar、S. Ashkboos、T. Hoefler 和 D. Alistarh。Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
  • Fu et al. [2023] Fu 等人[2023] J. Fu, S.-K. Ng, Z. Jiang, and P. Liu. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.
    J.Fu, S.-K. Ng, Z. Jiang, and P. Liu.Ng, Z. Jiang, and P. Liu.Gptscore:ArXiv preprint arXiv:2302.04166, 2023.
  • Geng et al. [2023] 耿等人[2023] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, and D. Song. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.
    X.Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, and D. Song.Koala:学术研究的对话模型。博客文章,2023 年 4 月。URL https://bair.berkeley.edu/blog/2023/04/03/koala/.
  • Glaese et al. [2022] Glaese 等人[2022] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
    Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al.
  • Gururangan et al. [2018]
    Gururangan 等人[2018]
    S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, 2018.
    S.Gururangan、S. Swayamdipta、O. Levy、R. Schwartz、S. R. Bowman 和 N. A. Smith。自然语言推理数据中的注释人工制品。arXiv preprint arXiv:1803.02324, 2018.
  • Henderson et al. [2021] 亨德森等人[2021] J. Henderson, S. Ruder, et al. Compacter: Efficient low-rank hypercomplex adapter layers. In Advances in Neural Information Processing Systems, 2021.
    J. Henderson, S. Ruder, et al:高效的 low-rank 超复杂适配器层。神经信息处理系统进展》,2021 年。
  • Hendrycks et al. [2020] 亨德里克斯等人[2020] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
    D.Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt.测量大规模多任务语言理解。在2020年国际学习表征会议上。
  • Holtzman et al. [2020] 霍尔茨曼等人[2020] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020.
    A.Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi.神经文本退化的奇特案例。国际学习表征会议,2020 年。
  • Honovich et al. [2022] 霍诺维奇等人[2022] O. Honovich, T. Scialom, O. Levy, and T. Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022.
    O.Honovich、T. Scialom、O. Levy 和 T. Schick。非自然指令:用(几乎)无需人工的方式调整语言模型。arXiv preprint arXiv:2212.09689, 2022.
  • Houlsby et al. [2019] Houlsby 等人[2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
    N.Houlsby、A. Giurgiu、S. Jastrzebski、B. Morrone、Q. De Laroussilhe、A. Gesmundo、M. Attariyan 和 S. Gelly。nlp 的参数高效迁移学习。国际机器学习大会》,第 2790-2799 页。PMLR, 2019.
  • Hu et al. [2021] Hu 等人[2021] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
    E.E. J. Hu、Y. Shen、P. Wallis、Z. Allen-Zhu、Y. Li、S. Wang、L. Wang 和 W. Chen。Lora:ArXiv preprint arXiv:2106.09685, 2021.
  • Iyer et al. [2022] Iyer 等人[2022] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, et al. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022.
    S.Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, et al. Opt-iml:ArXiv preprint arXiv:2212.12017, 2022.
  • Köksal et al. [2023] Köksal 等人 [2023] A. Köksal, T. Schick, A. Korhonen, and H. Schütze. Longform: Optimizing instruction tuning for long text generation with corpus extraction. arXiv preprint arXiv:2304.08460, 2023.
    A.Köksal, T. Schick, A. Korhonen, and H. Schütze.Longform:通过语料库提取优化长文本生成的指令调整。arXiv preprint arXiv:2304.08460, 2023.
  • Köpf et al. [2023] Köpf 等人 [2023] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stanley, R. Nagyfi, et al. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
    A.Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stanley, R. Nagyfi, et al. Openassistant conversations-democratizing large language model alignment. ArXiv preprint arXiv:2304.07327, 2023.
  • LAION [2023] LAION. Open-instruction-generalist dataset. https://github.com/LAION-AI/Open-Instruction-Generalist, 2023.
  • Lester et al. [2021] 莱斯特等人[2021] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
    B.Lester, R. Al-Rfou, and N. Constant.参数高效及时调整的规模力量》,arXiv preprint arXiv:2104.08691, 2021.
  • Li and Liang [2021]
    李和梁 [2021]
    X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
    X.L. Li 和 P. Liang.前缀调整:ArXiv preprint arXiv:2101.00190, 2021.
  • Liang et al. [2022] 梁等人[2022] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
    P.Leung, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al.
  • Liao et al. [2021] 廖等人[2021] T. Liao, R. Taori, I. D. Raji, and L. Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
    T.Liao, R. Taori, I. D. Raji, and L. Schmidt.机器学习评估失败的元回顾。第三十五届神经信息处理系统数据集与基准会议(第二轮),2021年。
  • Liu et al. [2022] Liu 等人[2022] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
    H.Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel.少量参数高效微调比上下文学习更好、更便宜。神经信息处理系统进展》,35:1950-1965,2022 年。
  • Liu et al. [2019] Liu et al. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
    Y.Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov.罗伯塔:A robustly optimized bert pretraining approach. ArXiv preprint arXiv:1907.11692, 2019.
  • Longpre et al. [2023] Longpre 等人 [2023] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
    S.Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al:ArXiv preprint arXiv:2301.13688, 2023.
  • Min et al. [2021] Min 等人[2021] S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943, 2021.
    S.Min、M. Lewis、L. Zettlemoyer 和 H. Hajishirzi。Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943, 2021.
  • Nematzadeh et al. [2018]
    Nematzadeh 等人[2018]
    A. Nematzadeh, K. Burns, E. Grant, A. Gopnik, and T. Griffiths. Evaluating theory of mind in question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2392–2400, 2018.
    A.Nematzadeh, K. Burns, E. Grant, A. Gopnik, and T. Griffiths.评估问题解答中的心智理论。2018年自然语言处理经验方法大会论文集》,第2392-2400页,2018年。
  • OpenAI [2023] OpenAI. Gpt-4 technical report. arXiv, 2023.
  • Ouyang et al. [2022] 欧阳等人[2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
    L.欧阳、吴俊、蒋旭、D. Almeida、C. Wainwright、P. Mishkin、C. Zhang、S. Agarwal、K. Slama、A. Ray 等人.训练语言模型以遵循人类反馈指令.神经信息处理系统进展,35:27730-27744, 2022.神经信息处理系统进展》,35:27730-27744,2022 年。
  • Park et al. [2022] Park 等人[2022] G. Park, B. Park, S. J. Kwon, B. Kim, Y. Lee, and D. Lee. nuqmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.
    G.Park、B.Park、S.J.Kwon、B.Kim、Y.Lee 和 D.Lee. nuqmm:Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.
  • Peng et al. [2023] Peng 等人 [2023] B. Peng, C. Li, P. He, M. Galley, and J. Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
    B.Peng, C. Li, P. He, M. Galley, and J. Gao.用 gpt-4 进行指令调整。arXiv 预印本 arXiv:2304.03277, 2023.
  • Poliak et al. [2018] 波利亚克等人[2018] A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, 2018.
    A.Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme.自然语言推理中的假设基线。In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180-191, 2018.
  • Pope et al. [2022] 波普等人[2022] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, A. Levskaya, J. Heek, K. Xiao, S. Agrawal, and J. Dean. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022.
    Pope、S. Douglas、A. Chowdhery、J. Devlin、J. Bradbury、A. Levskaya、J. Heek、K. Xiao、S. Agrawal 和 J. Dean。ArXiv preprint arXiv:2211.05102, 2022.
  • Qin and Eisner [2021]
    G. Qin and J. Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021.
    G. Qin 和 J. Eisner.学习如何提问:ArXiv preprint arXiv:2104.06599, 2021.
  • Raffel et al. [2020] 拉斐尔等人[2020] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jan 2020. ISSN 1532-4435.
    C.Raffel、N. Shazeer、A. Roberts、K. Lee、S. Narang、M. Matena、Y. Zhou、W. Li 和 P. J. Liu。用统一的文本到文本 transformer 探索迁移学习的极限。J. Mach.Learn.21(1),2020 年 1 月。ISSN 1532-4435.
  • Sanh et al. [2021] Sanh 等人[2021] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
    V.Sanh、A. Webson、C. Raffel、S. H. Bach、L. Sutawika、Z. Alyafeai、A. Chaffin、A. Stiegler、T. L. Scao、A. Raja 等人的多任务提示训练实现了零射击任务泛化。arXiv 预印本 arXiv:2110.08207, 2021。
  • Sap et al. [2022] 萨普等人[2022] M. Sap, R. LeBras, D. Fried, and Y. Choi. Neural theory-of-mind? on the limits of social intelligence in large lms. arXiv preprint arXiv:2210.13312, 2022.
    M.Sap, R. LeBras, D. Fried, and Y. Choi.arXiv preprint arXiv:2210.13312, 2022.
  • Scao et al. [2022] 斯考等人[2022] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
    T.T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  • Shaphiro and Wilk [1965]
    Shaphiro 和 Wilk [1965]
    S. Shaphiro and M. Wilk. An analysis of variance test for normality. Biometrika, 52(3):591–611, 1965.
    S.Shaphiro and M. Wilk.正态性的方差分析检验。Biometrika, 52(3):591-611, 1965.
  • Sung et al. [2021] Sung 等人[2021] Y.-L. Sung, V. Nair, and C. A. Raffel. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021.
    Y.-L. Sung, V. Nair, and C. A. Raffel.用固定稀疏掩码训练神经网络。神经信息处理系统进展,34:24193-24205,2021。
  • Taori et al. [2023] 陶里等人[2023] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
    R.Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto.斯坦福羊驼:https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Thoppilan et al. [2022] 托普皮兰等人[2022] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
    R.Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al.Cheng、A. Jin、T. Bos、L. Baker、Y. Du 等 Lamda:ArXiv preprint arXiv:2201.08239, 2022.
  • Touvron et al. [2023] Touvron 等人 [2023] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
    H.Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama:ArXiv preprint arXiv:2302.13971, 2023.
  • Wang et al. [2018] Wang 等人[2018] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
    A.Wang、A. Singh、J. Michael、F. Hill、O. Levy 和 S. R. Bowman。Glue:用于自然语言理解的多任务基准和分析平台。arXiv preprint arXiv:1804.07461, 2018.
  • Wang et al. [2022a] Wang 等人 [2022a] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022a.
    Y.Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi.自我指导:将语言模型与自我生成的指令对齐。arXiv 预印本 arXiv:2212.10560, 2022a.
  • Wang et al. [2022b] Wang 等人[2022b] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022b.
    Y.Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks.In EMNLP, 2022b.
  • Wang et al. [2022c] Wang 等人[2022c] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, 2022c.
    Y.Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al. Super-naturalinstructions:在 1600 多项 NLP 任务中通过声明性指令实现泛化。In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085-5109, 2022c.
  • Wei et al. [2021] 魏等人[2021] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
    J. Wei、M. Bosma、V. Y. Zhao、K. Guu、A. W. Yu、B. Lester、N. Du、A. M. Dai 和 Q. V. V. V.V.Le.Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • Wei et al. [2022] 魏等人[2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.
    J.Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, et al.神经信息处理系统进展》,2022 年。
  • Wolf et al. [2019] 沃尔夫等人[2019] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
    T.T.Wolf、L.Debut、V.Sanh、J.Chaumond、C.Delangue、A.Moi、P.Cistac、T.Rault、R.Louf、M.Funtowicz 等:《Huggingface 的转换器:最先进的自然语言处理》,arXiv preprint arXiv:1910.03771, 2019。
  • Wortsman et al. [2023] Wortsman 等人 [2023] M. Wortsman, T. Dettmers, L. Zettlemoyer, A. Morcos, A. Farhadi, and L. Schmidt. Stable and low-precision training for large-scale vision-language models. arXiv preprint arXiv:2304.13013, 2023.
    M.Wortsman, T. Dettmers, L. Zettlemoyer, A. Morcos, A. Farhadi, and L. Schmidt.大规模视觉语言模型的稳定和低精度训练。arXiv preprint arXiv:2304.13013, 2023.
  • Xiao et al. [2022] 肖等人[2022] G. Xiao, J. Lin, M. Seznec, J. Demouth, and S. Han. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022.
    G. Xiao, J. Lin, M. Seznec, J. Demouth, and S. Han.Smoothquant:精确高效的大型语言模型训练后量化。arXiv preprint arXiv:2211.10438, 2022.
  • Xie et al. [2022] Xie 等人[2022] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu, M. Zhong, P. Yin, S. I. Wang, et al. Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. arXiv preprint arXiv:2201.05966, 2022.
    T.Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu, M. Zhong, P. Yin, S. I. Wang, et al.Wu, M. Zhong, P. Yin, S. I. Wang, et al. Unifiedskg:ArXiv preprint arXiv:2201.05966, 2022.
  • Yang et al. [2018] 杨等人[2018] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.
    Z.Yang、P. Qi、S. Zhang、Y. Bengio、W. Cohen、R. Salakhutdinov 和 C. D. Manning。Hotpotqa:用于多样化、可解释的多跳问题解答的数据集。2018 年自然语言处理实证方法大会论文集》,第 2369-2380 页,2018 年。
  • Yao et al. [2022] 姚等人[2022] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022.
    Z.Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He.Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He.Zeroquant:ArXiv preprint arXiv:2206.01861, 2022.
  • Zaken et al. [2021] Zaken 等人[2021] E. B. Zaken, S. Ravfogel, and Y. Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
    E.B.Zaken, S. Ravfogel, and Y. Goldberg.Goldberg.Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
  • Zeng et al. [2022] 曾等人[2022] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
    A.Glm-130b: An open bilingual pre-trained model. ArXiv preprint arXiv:2210.02414, 2022.
  • Zhang et al. [2022] Zhang 等人[2022] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
    S.Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. Opt:开放式预训练 transformer 语言模型。arXiv preprint arXiv:2205.01068, 2022.
  • Zhong et al. [2021] Zhong 等人[2021] R. Zhong, K. Lee, Z. Zhang, and D. Klein. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. arXiv preprint arXiv:2104.04670, 2021.
    R.Zhong、K. Lee、Z. Zhang 和 D. Klein.通过在数据集和提示集合上进行元调谐,为零点学习调整语言模型》,arXiv preprint arXiv:2104.04670, 2021.

Appendix A QLoRA vs Standard Finetuning Experimental Setup Details
附录 AQLoRA 与标准微调实验装置细节对比

A.1 Hyperparameters for QLoRA
A.1 QLoRA 的超参数

We do a hyperparameter search for LoRA over the following variables: LoRA dropout { 0.0, 0.05, 0.1}, LoRA r𝑟r { 8, 16, 32, 64, 128, 256}, LoRA layers {key+query, all attention layers, all FFN layers, all layers, attention + FFN output layers}. We keep LoRA α𝛼\alpha fixed and search the learning rate, since LoRA α𝛼\alpha is always proportional to the learning rate.
我们通过以下变量对 LoRA 进行超参数搜索:LoRA dropout { 0.0, 0.05, 0.1}, LoRA r𝑟r { 8, 16, 32, 64, 128, 256}, LoRA layers { key+query, all attention layers, all FFN layers, all layers, attention + FFN output layers}。我们保持 LoRA α𝛼\alpha 固定不变,并搜索学习率,因为 LoRA α𝛼\alpha 始终与学习率成正比。

We find that LoRA dropout 0.05 is useful for small models (7B, 13B), but not for larger models (33B, 65B). We find LoRA r𝑟r is unrelated to final performance if LoRA is used on all layers as can be seen in Figure 4
我们发现 LoRA dropout 0.05 对小型模型(7B、13B)有用,但对大型模型(33B、65B)无效。我们发现,如果在所有层上都使用 LoRA,LoRA r𝑟r 与最终性能无关,如图 4 所示

Refer to caption
Figure 4: LoRA r𝑟r for LLaMA 7B models finetuned on Alpaca. Each dot represents a combination of hyperparameters and for each LoRA r𝑟r we run 3 random seed with each hyperparameter combination. The performance of specific LoRA r𝑟r values appears to be independent of other hyperparameters.
图 4:在 Alpaca 上微调的 LLaMA 7B 模型的 LoRA r𝑟r 。每个点代表一个超参数组合,对于每个 LoRA r𝑟r ,我们使用每个超参数组合运行 3 个随机种子。特定 LoRA r𝑟r 值的性能似乎与其他超参数无关。

A.2 Super-Natural Instructions Experimental Setup Details
A.2 超自然指令实验装置详情

We use the same preprocessing of the Super-Natural Instruction dataset as Wang et al. [60]. However, we split the training data in training and validation datasets allowing us to perform more rigorous hyperparameter tuning and early stopping. We use the same hyperparameters described in the paper for training the various T5 model sizes on the Super-Natural Instruction data. We use LoRA r=16𝑟16r=16 for small, medium, and large T5 models and LoRA r=64𝑟64r=64 for T5 xl and xxl models. We also use LoRA α=64𝛼64\alpha=64 in all our experiments and no LoRA dropout.
我们采用了与 Wang 等人[60]相同的超自然指令数据集预处理方法。不过,我们将训练数据分为训练数据集和验证数据集,这样就可以进行更严格的超参数调整和早期停止。我们使用论文中描述的相同超参数在超级自然指令数据上训练各种规模的 T5 模型。我们对小型、中型和大型 T5 模型使用 LoRA r=16𝑟16r=16 ,对 T5 xl 和 xxl 模型使用 LoRA r=64𝑟64r=64 。我们还在所有实验中使用了 LoRA α=64𝛼64\alpha=64 ,并且不使用 LoRA dropout。

Appendix B Training a State-of-the-art Chatbot Experimental Setup Details
附录 BT 培训最先进的聊天机器人 实验设置详情

B.1 Datasets B.1 数据集

We describe the datasets used for QLoRA finetuning experiments outlined in Section 5.
我们将在第 5 节中介绍用于 QLoRA 微调实验的数据集。


The OpenAssistant dataset [31] was collected via crowd-sourcing. It contains 161,443 unique messages distributed across 66,497 conversations and spanning 35 different languages. The dataset often contains several ranked replies for each given user question. In our experiments, we only use the top reply at each level in the conversation tree. This limits the dataset to 9,209 examples. We finetuning our models on the full conversation including the user queries.
OpenAssistant 数据集[ 31] 是通过众包方式收集的。该数据集包含 161,443 条独特信息,分布在 66,497 个对话中,涉及 35 种不同语言。该数据集通常包含针对每个给定用户问题的多个排名回复。在我们的实验中,我们只使用对话树中每一级的最高回复。这就将数据集限制在 9209 个示例。我们在包括用户询问在内的完整对话中对我们的模型进行微调。


This is a human preference dataset about helpfulness and harmlessness. Each datapoint consists of two assistant replies to a user question along with a human preference judgment of the best reply. The dataset contains 160,800 examples. When finetuning on this dataset, we combine helpfulness and harmlessness data and only keep the preferred assistant reply.
这是一个关于有用性和无害性的人类偏好数据集。每个数据点包括对用户问题的两个助手回复,以及对最佳回复的人类偏好判断。该数据集包含 160,800 个示例。在对该数据集进行微调时,我们将有用性和无害性数据结合起来,只保留首选的助手回复。


The FLAN v2 collection [39] is a collection of 1836 tasks augmented with hundreds of manually curated templates and rich formatting patterns into over 15M examples. The authors show that models trained on this collection outperform other public collections including the original FLAN 2021 [62], T0++ [50], Super-Natural Instructions [60], and OPT-IML [29]. We used the same task mixtures described by the authors with the exception of some datasets that were not freely available at the time of writing.
FLAN v2 程序集[ 39] 是一个包含 1836 个任务的程序集,其中增加了数百个人工编辑的模板和丰富的格式化模式,共有超过 1500 万个示例。作者的研究表明,在该集合上训练的模型优于其他公共集合,包括原始 FLAN 2021 [ 62]、T0++ [ 50]、Super-Natural Instructions [ 60] 和 OPT-IML [ 29]。我们使用了作者所描述的相同任务混合物,但在撰写本文时尚未免费提供的一些数据集除外。

Self-Instruct, Alpaca, Unnatural Instructions
自我指令, 羊驼, 非自然指令

The Self-Instruct, Alpaca, and Unnatural Instructions datasets [59, 55, 26] are instruction tuning datasets collected with various approaches of model distillation from GPT-3 Instruct and ChatGPT. They rely on prompting, in-context learning, and paraphrasing to come up with diverse sets of instructions and outputs. The datasets comprise of 82,612, 51,942, and 240,670 examples respectively. One advantage of such distilled datasets is that they contain a more diverse set of instruction styles compared to the FLAN v2 collection and similar instruction tuning collections.
Self-Instruct、Alpaca 和 Unnatural Instructions 数据集[59, 55, 26]是通过从 GPT-3 Instruct 和 ChatGPT 中提炼模型的各种方法收集的指令调整数据集。这些数据集依靠提示、上下文学习和转述来产生不同的指令集和输出。这些数据集分别包含 82,612 个、51,942 个和 240,670 个示例。与 FLAN v2 数据集和类似的指令调整数据集相比,此类提炼数据集的一个优势是包含了更多样化的指令风格。

Longform 长篇小说

The LongForm dataset [30] is based on an English corpus augmented with instructions and as such is a hybrid human-generated dataset. The underlying documents are human-written and come from C4 and Wikipedia while the instructions are generated visa LLMs. The dataset is extended with additional structured corpora examples such as Stack Exchange and WikiHow and task examples such as question answering, email writing, grammar error correction, story/poem generation, and text summarization. The dataset contains 23,700 examples.
LongForm 数据集[30] 基于一个添加了说明的英语语料库,因此是一个人工生成的混合数据集。基础文档由人工撰写,来自 C4 和维基百科,而说明则由签证 LLMs 生成。该数据集通过 Stack Exchange 和 WikiHow 等其他结构化语料库示例以及问题解答、电子邮件写作、语法错误纠正、故事/诗歌创作和文本摘要等任务示例进行了扩展。该数据集包含 23,700 个示例。

Chip2 芯片2

is part of the OIG Laion dataset. It contains Python code examples, natural instruction examples, generic harmless instructions, instruction/responses with lists, follow-up questions, Wikipedia toxic adversarial questions, grade school math, reasoning instructions, and character and scene descriptions with a total of 210,289 examples.
是 OIG Laion 数据集的一部分。它包含 Python 代码示例、自然指令示例、通用无害指令、带列表的指令/回复、后续问题、维基百科有毒对抗性问题、小学数学、推理指令以及角色和场景描述,共计 210 289 个示例。

B.2 Hyperparameters B.2 超参数

We provide the exact hyperparameters used in our QLoRA finetuning experiments. We find hyperparameters to be largely robust across datasets. We use the MMLU 5-shot dev set for validation and hyperparameter tuning. In all our experiments we use NF4 with double quantization and bf16 computation datatype. We set LoRA r=64𝑟64r=64, α=16𝛼16\alpha=16, and add LoRA modules on all linear layers of the base model. We also use Adam beta2 of 0.999, max grad norm of 0.3 and LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B models. Following previous work on instruction finetuning [62, 60] and after benchmarking other linear and cosine schedules, we use a constant learning rate schedule. We use group-by-length to group examples of similar lengths in the same batch (note this will produce a oscillating loss curve). The hyperparameters we tune for each model size are shown in Table 9.
我们提供了 QLoRA 微调实验中使用的精确超参数。我们发现超参数在不同数据集之间具有很大的鲁棒性。我们使用 MMLU 5-shot dev 集进行验证和超参数调整。在所有实验中,我们使用双量化的 NF4 和 bf16 计算数据类型。我们设置了 LoRA r=64𝑟64r=64α=16𝛼16\alpha=16 ,并在基础模型的所有线性层上添加了 LoRA 模块。我们还使用了 0.999 的 Adam beta2、0.3 的 max grad norm 和 0.1 的 LoRA dropout(对于 13B 以下的模型和 0.05 的 33B 和 65B 模型)。根据之前的指令微调工作[62, 60],并在对其他线性和余弦计划进行基准测试后,我们使用了恒定学习率计划。我们采用按长度分组的方法,将长度相似的示例归入同一批次(注意这会产生一条波动的损失曲线)。表 9 显示了我们为每种模型大小调整的超参数。

Parameters Dataset Batch size 批量大小 LR Steps Source Length 源长度 Target Length 目标长度
7B All 16 2e-4 10000 384 128
7B OASST1 16 2e-4 1875 - 512
7B HH-RLHF 16 2e-4 10000 - 768
7B Longform 16 2e-4 4000 512 1024
13B All 16 2e-4 10000 384 128
13B OASST1 16 2e-4 1875 - 512
13B HH-RLHF 16 2e-4 10000 - 768
13B Longform 16 2e-4 4000 512 1024
33B All 32 1e-4 5000 384 128
33B OASST1 16 1e-4 1875 - 512
33B HH-RLHF 32 1e-4 5000 - 768
33B Longform 32 1e-4 2343 512 1024
65B All 64 1e-4 2500 384 128
65B OASST1 16 1e-4 1875 - 512
65B HH-RLHF 64 1e-4 2500 - 768
65B Longform 32 1e-4 2343 512 1024
Table 9: Training hyperparameters for QLoRA finetuning on different datasets and across model sizes.
表 9:QLoRA 在不同数据集和不同模型大小上进行微调的训练超参数

B.3 Ablations B.3 切割

While it is general practice in the literature to only train on the response in instruction following datasets, we study the effect of training on the instruction in addition to the response in Table 10. In these experiments, we restrict the training data to 52,000 examples and use the 7B model. Over four different instruction tuning datasets, we find that only training on the target is beneficial to MMLU performance. We did not evaluate the effect this may have on chatabot performance as measured by vicuna or OA benchmarks.
虽然文献中的一般做法是只对指令集后的响应进行训练,但我们在表 10 中研究了除了对响应进行训练外,还对指令进行训练的效果。在这些实验中,我们将训练数据限制为 52,000 个示例,并使用 7B 模型。在四个不同的指令调整数据集上,我们发现仅对目标进行训练有利于提高 MMLU 的性能。我们并没有评估这对以 vicuna 或 OA 基准衡量的 chatabot 性能的影响。

Dataset Unnatural Instructions 非自然指令 Chip2 Alpaca FLAN v2 Mean
Train on source and target
36.2 33.7 38.1 42.0 37.5
Train on target 目标训练 38.0 34.5 39.0 42.9 38.6
Table 10: MMLU 5-shot test results studying the effect of training on the instructions in addition to the response.
表 10:MMLU 5 次测试结果,研究培训对指令和反应的影响。

B.4 What is more important: instruction finetuning dataset size or dataset quality?
B.4 什么更重要:指令微调数据集大小还是数据集质量?

Data set suitability is more important than dataset size.

To understand the effects of dataset quality vs. dataset size, we experiment with subsampling large datasets with at least 150,000 samples (Chip2, FLAN v2, Unnatural Instructions), into datasets of size 50,000, 100,000 and 150,000 and examine the resulting trends, as shown in Table 11. We find that increasing the dataset size and increasing the number of epochs improves MMLU only marginally (0.0 - 0.5 MMLU), while the difference between datasets is up to 40x larger (1.5 - 8.0 MMLU). This is a clear indicator that dataset quality rather than dataset size is critical for mean MMLU accuracy. We obtain similar findings for chatbot performance as discussed in .
为了解数据集质量对数据集大小的影响,我们尝试将至少有 150,000 个样本的大型数据集(Chip2、FLAN v2、Unnatural Instructions)子采样为 50,000、100,000 和 150,000 个大小的数据集,并研究由此产生的趋势,如表 11 所示。我们发现,增加数据集大小和增加epoch次数对 MMLU 的改善作用微乎其微(0.0 - 0.5 MMLU),而数据集之间的差异却高达 40 倍(1.5 - 8.0 MMLU)。这清楚地表明,数据集质量而非数据集大小对平均 MMLU 准确度至关重要。我们在.NET.COM 中讨论的聊天机器人性能方面也得到了类似的结果。

Table 11: Effect different dataset sizes and finetuning epochs on mean 5-shot MMLU test set accuracy. While increasing the dataset size and training for more than 1 epochs helps with MMLU performance, the difference between datasets are far larger, indicating that dataset quality affects MMLU performance more than dataset size.
表 11:不同数据集大小和微调历元对平均 5 次 MMLU 测试集准确率的影响。虽然增加数据集大小和训练超过 1 个历元对MLU 性能有帮助,但数据集之间的差异要大得多,这表明数据集质量比数据集大小对 MMLU 性能的影响更大。
Chip Unnatural Instructions 非自然指令 FLAN v2
Datapoints \downarrow Epochs \rightarrow
数据点 \downarrow 时间 \rightarrow
1 2 3 1 2 3 1 2 3 Mean
50000 34.50 35.30 34.70 38.10 42.20 38.10 43.00 43.50 44.10 39.28
100000 33.70 33.90 34.00 40.10 41.20 37.00 43.90 43.70 44.90 39.16
150000 34.40 34.80 35.10 39.70 41.10 41.50 44.60 45.50 43.50 40.02
Mean 34.20 34.67 34.60 39.30 41.50 38.87 43.83 44.23 44.17

Appendix C Human Evaluation
附录 人类评估

We conduct a human evaluation with the same wording given to GPT-4 in the original Vicuna evaluation [10], adjusted for an Amazon Mechanical Turk form as show in Figure 5.
如图 5 所示,我们使用与最初的 Vicuna 评估[10]中给 GPT-4 使用的相同措辞进行了人工评估,并对亚马逊 Mechanical Turk 表格进行了调整。

Refer to caption
Figure 5: The crowdsourcing form used by human annotators.
图 5:人类注释者使用的众包形式。

Appendix D Pairwise Evaluation with GPT-4
附录 DP 与 GPT-4 配对评估

While we found that the GPT-4 evaluation gave different results depend on which system was presented first, when averaged over both options the pairwise results were well-ordered. The aggregated pairwise judgments are hown in Table 12. On inspection, it is clear these judgments are transitive, i.e., when System A is judged better than System B and System B is judged better than System C, it is always the case that System A is judged better than System C. This yields a complete ordering, given in Table 13.
我们发现,GPT-4 评估的结果因先呈现的系统不同而不同,但如果对两个选项进行平均,成对结果则井然有序。表 12 显示了汇总的成对判断结果。通过观察,我们可以清楚地看到这些评判结果是互易的,也就是说,当系统 A 被评判为优于系统 B,而系统 B 被评判为优于系统 C 时,系统 A 总是被评判为优于系统 C。

Table 12: Aggregated pairwise GPT-4 judgments between systems where the value of a cell at row x𝑥x and column y𝑦y is # judgment x is better than y# judgment y is better than xtotal # number of judgments# judgment x is better than y# judgment y is better than xtotal # number of judgments\frac{\text{\# judgment $x$ is better than $y$}-\text{\# judgment $y$ is better than $x$}}{\text{total \# number of judgments}}
表 12:当行 x𝑥x 和列 y𝑦y 中单元格的值为 # judgment x is better than y# judgment y is better than xtotal # number of judgments# judgment x is better than y# judgment y is better than xtotal # number of judgments\frac{\text{\# judgment $x$ is better than $y$}-\text{\# judgment $y$ is better than $x$}}{\text{total \# number of judgments}} 时,系统间的成对 GPT-4 判定汇总表
Model Guanaco 65B Guanaco 33B Vicuna ChatGPT-3.5 Turbo Bard Guanaco 13B Guanaco 7B
Guanaco 65B - 0.21 0.19 0.16 0.72 0.59 0.86
Guanaco 33B -0.21 - 0.17 0.10 0.51 0.41 0.68
Vicuna -0.19 -0.17 - 0.10 0.50 0.20 0.57
ChatGPT-3.5 Turbo -0.16 -0.10 -0.10 - 0.35 0.19 0.40
Bard -0.72 -0.51 -0.50 -0.35 - 0.12 0.03
Guanaco 13B -0.59 -0.41 -0.20 -0.19 -0.12 - 0.20
Guanaco 7B -0.86 -0.68 -0.57 -0.40 -0.03 -0.20 -
Table 13: The complete ordering induced by pairwise GPT-4 judgments between systems
表 13:由系统间成对的 GPT-4 判断引起的完整排序
Model Params Size
Guanaco 65B 41 GB
Guanaco 33B 21 GB
Vicuna 13B 26 GB
ChatGPT-3.5 Turbo N/A N/A
Bard N/A N/A
Guanaco 13B 10 GB
Guanaco 7B 5 GB

Appendix E NormalFloat 4-bit data type
附录 ENormalFloat 4 位数据类型

The exact values of the NF4 data type are as follows:
NF4 数据类型的确切值如下:

[-1.0, -0.6961928009986877, -0.5250730514526367,
-0.39491748809814453, -0.28444138169288635, -0.18477343022823334,
-0.09105003625154495, 0.0, 0.07958029955625534, 0.16093020141124725,
0.24611230194568634, 0.33791524171829224, 0.44070982933044434,
0.5626170039176941, 0.7229568362236023, 1.0]

Appendix F Normality of Trained Neural Network Weights

While it is common knowledge that trained neural network weights are mostly normally distributed, we perform statistical testing to verify this. We use the Shapiro-Wilk test[53] on the weights of the 7B LLaMA model [57]. We find that the weights of each hidden unit have different normal distributions. As such, we test he weights of each individual hidden unit. This mean for weight 𝐖in×out𝐖superscript𝑖𝑛𝑜𝑢𝑡\mathbf{W}\in\mathcal{R}^{in\times out} we perform tests over the out𝑜𝑢𝑡out dimension. Using a 5% significance threshold, we find that 7.5% of neurons are non-normally distributed which is about 2.5% more than the expected false-positive rate. As such, while almost all pretrained weights appear to be normally distributed there seem to be exceptions. Such exceptions might be due to outliers weights [13] or because the p-value of the Shaprio-Wilk test is not accurate for large samples sizes[53] that occur in the LLaMA FFN layer hidden units. this verifies the claim that neural network weights.
众所周知,经过训练的神经网络权重大多呈正态分布,而我们则通过统计检验来验证这一点。我们对 7B LLaMA 模型[ 57] 的权重进行了 Shapiro-Wilk 检验[ 53]。我们发现每个隐藏单元的权重都具有不同的正态分布。因此,我们对每个隐藏单元的权重进行了检验。这意味着,对于权重 𝐖in×out𝐖superscript𝑖𝑛𝑜𝑢𝑡\mathbf{W}\in\mathcal{R}^{in\times out} ,我们在 out𝑜𝑢𝑡out 维度上进行测试。使用 5% 的显著性阈值,我们发现有 7.5% 的神经元是非正态分布的,这比预期的假阳性率高出约 2.5%。因此,虽然几乎所有预训练的权重看起来都呈正态分布,但似乎也有例外。这些例外情况可能是由于异常值权重造成的[ 13],也可能是由于 Shaprio-Wilk 检验的 p 值对于 LLaMA FFN 层隐藏单元中出现的大样本量并不准确[ 53]。

Appendix G Memory Footprint
附录 GMemory Footprint

The memory footpring for QLoRA training with different LLaMA base models can be seen in Figure 6. We see that the 33B model does not quite fit into a 24 GB and that paged optimizers are needed to train it. Depicted is also batch size 1 with a sequence length of 512 and gradient checkpointning. This means, if one uses a larger batch size, or if a long sequence is processed, the activation gradient might consume a considerable amount of memory.
图 6 显示了使用不同 LLaMA 基本模型进行 QLoRA 训练的内存空间。我们可以看到,33B 模型无法完全容纳在 24 GB 的内存中,需要分页优化器对其进行训练。图中还显示了序列长度为 512 的批次大小 1 和梯度检查点。这意味着,如果使用更大的批次规模,或者处理较长的序列,激活梯度可能会消耗大量内存。

Refer to caption
Figure 6: Breakdown of the memory footprint of different LLaMA models. The input gradient size is for batch size 1 and sequence length 512 and is estimated only for adapters and the base model weights (no attention). Numbers on the bars are memory footprint in MB of individual elements of the total footprint. While some models do not quite fit on certain GPUs, paged optimzier provide enough memory to make these models fit.
图 6:不同 LLaMA 模型的内存占用细分。输入梯度大小为批次大小 1 和序列长度 512,仅对适配器和基本模型权重进行了估算(无关注)。条形图上的数字是总占用空间中单个元素的内存占用空间,单位为 MB。虽然某些模型在某些 GPU 上不太适合,但分页优化器提供的内存足以使这些模型适合。