[cor1] Corresponding author
[1] School of Physics and Information Technology, Shaanxi Normal University, Xi’an 710119, Shaanxi, China
[2] Institute of Medical Research, Northwestern Polytechnical University, Xi’an 710072, Shaanxi, China
[3] School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, Shaanxi, China
[4] School of Automation, Northwestern Polytechnical University, Xi’an 710072, Shaanxi, China
[5] School of Computing, The University of Georgia, Athens 30602, USA
Understanding LLMs: A Comprehensive Overview from Training to Inference
Abstract
The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. In this context, there is a growing focus on cost-efficient training and deployment, and low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training covers data preprocessing, training architecture, pre-training tasks, parallel training, and relevant content related to model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs’ utilization and provides insights into their future development.
Keywords: Large Language Models, Training, Inference, Survey
1 Introduction
Language modeling (LM) is a fundamental approach for achieving cognitive intelligence in the field of natural language processing (NLP), and its progress has been notable in recent years [1; 2; 3]. It assumes a central role in understanding, generating, and manipulating human language, serving as the cornerstone for a diverse range of NLP applications [4], including machine translation, chatbots, sentiment analysis, and text summarization. With the evolution of deep learning, the early statistical language models (SLM) have gradually transformed into neural language models (NLM) based on neural networks. This shift is characterized by the adoption of word embeddings, representing words as distributed vectors. Notably, these word embeddings have consistently excelled in practical NLP tasks, profoundly shaping the field’s progress. Pre-trained language models (PLM) represent a subsequent phase in the evolution of language models following NLM. Early attempts at PLMs included ELMo [5], which was built on a Bidirectional LSTM architecture. However, with the advent of the transformer architecture [6], characterized by parallel self-attention mechanisms, the pre-training and fine-tuning learning paradigm has propelled PLM to prominence as the prevailing approach. These models are typically trained via self-supervision on extensive datasets, cementing their status as the primary methodology in the field.
The Transformer architecture is exceptionally well-suited for scaling up models, and research analysis has revealed that increasing the model’s scale or training data size can significantly enhance its performance. Many studies have pushed the boundaries of model performance by continuously expanding the scale of PLM [7; 8; 9; 10]. As models grow larger, a remarkable phenomenon known as "emergence" occurs, wherein they exhibit astonishing performance [8]. These models are capable of generating high-quality text and possess robust learning and reasoning abilities. They can even tackle few-shot learning tasks through in-context learning (ICL) [8]. This remarkable capability enables their seamless application to a wide range of downstream tasks across diverse domains [11; 12; 13; 14].
Pre-trained language models (PLMs) with significantly larger parameter sizes and extensive training data are typically denoted as Large Language Models (LLMs) [15; 16; 17]. The model size usually exceeds 6-10 billion (6-10B) parameters. A prominent milestone in the development of LLMs is exemplified by the GPT series [18; 7; 8; 19]. Notably, OpenAI released ChatGPT in November 2022, marking a pivotal moment in the era of LLMs and a game-changing moment in the field of artificial intelligence. ChatGPT has empowered current AI algorithms to achieve unprecedented levels of strength and effectiveness, reshaping the way humans employ or develop AI algorithms. Its emergence has captured the attention of the research community. However, because ChatGPT is not open source, the principal way to use it currently is through OpenAI’s website at https://chat.openai.com or via their API interface. Training LLMs that can serve as alternatives to ChatGPT, or domain-specific LLMs, has therefore become highly necessary [20; 21; 22; 23; 24; 1; 25; 26]. Training and deploying LLMs demand expertise in handling large-scale data and substantial practical experience in distributed parallel training [27; 28; 29]. This requirement emphasizes the need for researchers developing LLMs to possess significant engineering capabilities in addressing the challenges encountered during LLM development. Researchers who are interested in the field of LLMs must possess engineering skills or learn to collaborate effectively with engineers.
For the above reasons, the primary objective of this paper is to provide a comprehensive overview of LLM training and inference techniques to equip researchers with the knowledge required for developing, deploying, and applying LLMs. The structure of the rest of this review is as follows: In Section 2, we will introduce the relevant background and foundational knowledge of LLMs. In Section 3, we will delve into the technical aspects of training LLMs, while in Section 4 we will explore the technologies related to LLM inference and deployment. In Section 5, we will discuss the utilization of LLMs, and Section 6 will explore the future directions and their implications for LLMs.
2 Background Knowledge
2.1 Transformer
Transformer is a deep learning model based on an attention mechanism for processing sequence data that can effectively solve complex natural language processing problems. This model was first proposed in 2017 [6], and replaced the traditional recurrent neural network architecture [30] in machine translation tasks as the state-of-the-art model at that time. Due to its suitability for parallel computing and the complexity of the model itself, Transformer outperforms the previously popular recurrent neural networks in terms of accuracy and performance. The Transformer architecture consists primarily of two modules, an Encoder and a Decoder, as well as the attention mechanism within these modules.
2.1.1 Self-Attention
Self-Attention Structure [6]: Essentially, the attention mechanism aims at selecting a small amount of important information from a large amount of data and focusing on these important pieces while ignoring the majority of unimportant information. The self-attention mechanism, as a variant of the attention mechanism, reduces reliance on external information and excels at capturing internal correlations within data or features. Applying the self-attention mechanism in text primarily involves calculating the mutual influence between words to address the issue of long-range dependencies. Additionally, self-attention is the core idea behind transformers. The core formula for key-value attention is as follows:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
\tag{1}
\]
Self-attention allows the model to weigh the importance of different words in a sentence when predicting a particular word. It calculates a weighted sum of the values of all words in the sentence, where the weights are determined by the relevance of each word to the target word.
The self-attention mechanism involves three steps: calculating the query, key, and value vectors; computing the attention weights; and aggregating the values. The query vector represents the word being attended to, while the key vectors represent all the words in the sentence. The value vectors store the information associated with each word. The attention weights are computed by taking the dot product between the query and key vectors, followed by a softmax operation to obtain a distribution over the words.
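A minimal PyTorch sketch of Eq. (1) is shown below; the tensor shapes are illustrative assumptions rather than any particular model configuration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k); weights follow Eq. (1)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                  # relevance of each word
    return weights @ V                                   # weighted sum of values

# Toy usage: 2 sentences, 5 tokens, 64-dimensional heads (illustrative sizes).
Q = torch.randn(2, 5, 64); K = torch.randn(2, 5, 64); V = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(Q, K, V)              # shape (2, 5, 64)
```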
Multi-Head Attention [6]: Multi-head self-attention extends the self-attention mechanism by performing it multiple times in parallel. Each attention head learns to focus on different aspects of the input, capturing different dependencies and patterns. The outputs of the attention heads are then concatenated and linearly transformed to obtain the final representation. By using multiple attention heads, the model can capture both local and global dependencies, allowing for a more comprehensive understanding of the input sequence. This parallelization also enhances the model’s capacity to capture complex relationships between words. The Multi-head attention can be formulated as follows:
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})
\tag{2}
\]
In this case, "Concat" means to concatenate the attention calculation results of each head, and "$W^{O}$" is the weight matrix of the output layer, used to linearly transform the concatenated results. This yields the output of multi-head attention. In summary, multi-head attention enhances the model’s ability to represent input sequences by performing parallel attention calculations under different linear transformations, then concatenating and linearly transforming the results. This mechanism plays an important role in the Transformer model, helping to handle long-range dependencies and improve model performance.
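A compact PyTorch sketch of Eq. (2) follows; the model dimension and number of heads are illustrative assumptions, and the module omits masking and dropout for brevity.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    # Sketch of Eq. (2): h parallel heads, concatenated and projected by W^O.
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)        # output projection W^O

    def forward(self, x):                             # x: (batch, seq, d_model)
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        heads = scores.softmax(dim=-1) @ V            # (B, h, T, d_k)
        concat = heads.transpose(1, 2).reshape(B, T, self.h * self.d_k)
        return self.W_o(concat)                       # Concat(head_1..head_h) W^O

x = torch.randn(2, 5, 512)
print(MultiHeadSelfAttention()(x).shape)              # torch.Size([2, 5, 512])
```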
2.1.2 Encoder
The encoder module [6] of the Transformer model is composed of multiple identical layers, each of which includes a multi-head attention mechanism and feed-forward neural network [31]. In the multi-head attention mechanism, each position in the input sequence is calculated for attention with other positions to capture the dependencies between different positions in the input sequence. The feed-forward neural network is then used to further process and extract features from the output of the attention mechanism. The encoder module gradually extracts features of the input sequence through the stacking of multiple such layers and passes the final encoding result to the decoder module for decoding. The design of the encoder module enables it to effectively handle long-range dependencies within the input sequence and has significantly improved performance in various NLP tasks.
2.1.3 Decoder
The decoder module [32] of the Transformer model is also composed of multiple identical layers, each of which includes a multi-head attention mechanism and a feed-forward neural network. Unlike the encoder, the decoder also includes an additional encoder-decoder attention mechanism, used to compute attention on the input sequence during the decoding process. At each position, the decoder can only perform self-attention calculations with the positions before it to ensure that the generation of the sequence does not violate grammar rules. Masks play an important role in the decoder, ensuring that only information before the current time step is focused on when generating the output sequence, and not leaking information from future time steps. Specifically, the decoder’s self-attention mechanism uses masks to prevent the model from accessing future information when generating predictions at each time step, maintaining the causality of the model. This ensures that the output generated by the model depends on the information at the current time step and before, without being influenced by future information.
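A minimal PyTorch sketch of this masking step is shown below: future positions are filled with negative infinity before the softmax so they receive zero attention weight; the tensor sizes are illustrative.

```python
import torch

def causal_mask(seq_len):
    # Lower-triangular mask: position i may attend to positions <= i only.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def masked_attention_weights(scores, mask):
    # Future positions are set to -inf so softmax assigns them zero weight,
    # preserving the causality described above.
    return scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)

scores = torch.randn(4, 4)                       # toy 4-token attention scores
print(masked_attention_weights(scores, causal_mask(4)))
```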
2.1.4 Positional Embedding
Position and order are crucial for certain tasks, such as understanding a sentence or a video. Position and order define the grammar of a sentence; they are integral to the semantics of sentences. The Transformer utilizes Multi-Head Self-Attention (MHSA) to avoid the recurrent approach of RNNs, thus speeding up the training process. Additionally, it can capture long-range dependencies in sentences and handle longer inputs. When each token in a sentence passes through the Transformer’s Encoder/Decoder stack, the model itself lacks any sense of position/order for each token (permutation invariance). Therefore, a method is still needed to incorporate the sequential information of tokens into the model. To enable the model to perceive the input sequence, positional information about the location of each token in the sentence can be added; this technique is known as positional embedding (PE), which is used in the Transformer model to incorporate the sequential order of tokens into the input representation. Since the Transformer does not have recurrent connections, it lacks the inherent notion of token order present in recurrent neural networks. To address this, positional embedding assigns a unique vector to each token position in the input sequence. These positional embeddings are added to the word embeddings before being fed into the model. By including positional information, the model can differentiate between tokens based on their position in the sequence. In the Transformer model, the core formula of the position embedding can be expressed as:
\[
PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\tag{3}
\]
\[
PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\tag{4}
\]
In this equation, $PE$ represents the position embedding matrix, $pos$ represents the position of a token in the sentence, $i$ represents the dimension index of the position embedding, and $d_{\mathrm{model}}$ represents the hidden layer dimension of the Transformer model. By using sine and cosine functions and performing different calculations on the position ($pos$) and dimension ($i$), this formula generates unique position embedding values for each position and dimension. As a result, each token is assigned a unique position embedding vector, allowing the model to perceive the sequential information of tokens in the sentence. In practical applications, the position embedding matrix is added to the input word embedding matrix to combine position information and semantic information, thereby providing a more comprehensive input representation for the Transformer model.
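A short PyTorch sketch of Eqs. (3) and (4) is given below; the sequence length and model dimension are illustrative.

```python
import torch

def sinusoidal_positional_embedding(max_len, d_model):
    # Implements Eqs. (3)-(4): even dimensions use sine, odd dimensions use cosine.
    pos = torch.arange(max_len).unsqueeze(1).float()         # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()                   # even dimension indices
    angle = pos / torch.pow(10000.0, i / d_model)             # pos / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# In practice the table is simply added to the word embeddings:
word_emb = torch.randn(10, 512)                # 10 tokens, d_model = 512 (illustrative)
x = word_emb + sinusoidal_positional_embedding(10, 512)
```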
Two commonly used positional encoding methods in Transformer are Absolute Positional Encoding and Relative Positional Encoding.
(1) Absolute Positional Encoding: It generates unique positional embedding values for each position and dimension by using sine and cosine functions. This method uses sine and cosine functions in the mentioned formula to calculate the positional embedding values and adds them to the word embeddings. Absolute Positional Encoding provides a unique encoding for each position, enabling the model to perceive the sequential information of words in the sentence.
(2) Relative Positional Encoding: It is an encoding method based on relative positional relationships. Relative Positional Encoding represents positional information by calculating the relative distances between words. This method is used in models like Transformer-XL [33], and Relative Positional Encoding can better capture the relative positional relationships between words when dealing with long sequences. Both of these positional encoding methods aim to provide the positional information of words in the input sequence to the Transformer model, enabling the model to better comprehend and process sequential data. The specific choice of positional encoding method depends on the specific application scenario and model design.
There are also other positional encoding methods applied to other models, such as RoPE [34] and ALiBi [35].
RoPE is a method that uses Absolute Positional Encoding to represent Relative Positional Encoding and is applied in the design of large language models like PaLM [36], LLaMA [9], and GLM-130B [37].
ALiBi does not add positional embeddings to word embeddings but instead adds a pre-defined bias matrix to the attention score based on the distance between tokens. It is applied in the design of large language models like BLOOM [38].
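To make this concrete, the sketch below builds an ALiBi-style bias matrix in PyTorch; the geometric head slopes and tensor shapes are illustrative assumptions rather than the exact configuration used in BLOOM.

```python
import torch

def alibi_bias(seq_len, num_heads):
    # Head-specific slopes follow a geometric sequence (as described in the
    # ALiBi paper); the exact values here are an illustrative assumption.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    relative = pos[None, :] - pos[:, None]        # (j - i): negative for past keys
    # The bias grows more negative with distance; future positions (j > i) are
    # normally removed by the causal mask before the softmax anyway.
    return slopes[:, None, None] * relative[None, :, :]

scores = torch.randn(8, 6, 6)                     # (heads, query, key) toy scores
biased = scores + alibi_bias(6, 8)                # bias is added before the softmax
```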
Some other positional encoding methods, such as mixed positional encoding, multi-digit positional encoding, and implicit positional encoding, are also used by some models.
2.2 Prompt Learning
Prompt learning serves as a widely adopted machine learning approach, particularly in the field of NLP. At its core, this methodology involves guiding a model to produce specific behaviors or outputs through the careful design of prompt statements. It is commonly employed to fine-tune and guide pre-trained LLMs for executing particular tasks or generating desired results. Researchers have observed that the design of specific prompt statements can steer pre-trained models to perform various tasks, such as question-answering, text generation, and semantic understanding [39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50]. The strength of this approach lies in its ability to adapt to different tasks through simple modifications to prompt statements, eliminating the need for retraining the entire model. For LLMs like the GPT series and other pre-trained models, prompt learning provides a straightforward and powerful means for model fine-tuning. By supplying appropriate prompts, researchers and practitioners can customize the model’s behavior, making it more suitable for specific domains or task requirements. In short, prompt learning is a machine learning approach that builds upon pre-trained language models and guides the model to perform various tasks through the design of prompt statements, offering increased flexibility for customizing model applications. In this section, we will introduce the basic knowledge of prompt learning.
2.2.1 Background and Overview
Prompt learning is a new approach to machine learning [51]. In the early days of natural language processing (NLP), researchers mainly used fully supervised learning [52], which trained task-specific models on input-output example datasets of the target task. However, because such training datasets were limited, this method could not produce high-quality models, so early NLP relied heavily on feature engineering. With the emergence of neural network models and their use in the field of NLP, attention turned to architecture engineering [53].
However, between 2017 and 2019, the learning approach of NLP models shifted from fully supervised learning to a new paradigm: pre-train and fine-tune [54]. In this paradigm, a model with a fixed architecture is pre-trained as a language model to predict the probability of observed text data.
Because training language models requires abundant raw text, these models can be trained on large datasets and, in the process, learn robust universal features of the language they are modeling. Then, by introducing additional parameters and fine-tuning them using task-specific objective functions, the PLM mentioned above can adapt to different downstream tasks. At this point, the focus of research shifted to objective engineering, that is, designing training objectives for pre-training and fine-tuning. Since BERT, NLP has relied on the pre-train and fine-tune approach for a long period of time, but this approach requires fine-tuning a new model for each task, which cannot be shared across tasks. For an LLM, this amounts to customizing the model for every task, which is very inefficient [51].
Prompt learning has demonstrated remarkable capabilities in GPT-3. The GPT-3 model can handle many tasks with only a few samples by using natural language prompts and task demonstrations as context, without updating the parameters of the underlying model. Prompt learning replaces the pre-train and fine-tune process with pre-train, prompt, and predict. In this paradigm, the downstream task is not adapted to the pre-trained LM through objective engineering; instead, the downstream task is redefined with the help of text prompts so that it looks more like the tasks solved during the original LM training. For prompt learning, it is only necessary to insert different prompt parameters to adapt to different tasks. That is to say, each task only needs to train the prompt parameters separately, without training the entire pre-trained language model [55]. This approach greatly improves the efficiency of using pre-trained language models and significantly shortens training time.
2.2.2 Basic Components and Process of Prompt Learning
In the traditional pre-train and fine-tune paradigm, there is a gap between the pre-training stage and downstream tasks [51], while prompt learning can maintain consistency between the pre-training target format and the downstream task output format, that is, align the form of downstream tasks with the form of the PLMs’ pre-training tasks. When training PLMs, we can transform the original target task into a fill-in-the-blank or continuation task similar to the pre-training task of PLMs by constructing a prompt. The advantage of this method is that, through a series of appropriate prompts, we can use a single language model to solve various downstream tasks.
Prompt learning optimizes the performance of models on different tasks by using pre-trained models and designing appropriate templates. Prompt learning consists of prompt templates, answer mappings, and pre-trained language models. The prompt template is the main body of the prompt; fill-in-the-blank [56] and prefix-based generation [57] are two common types of prompt learning templates. The fill-in-the-blank template selects one or more positions in the text and represents them with [MASK] tags, which prompt the model to fill in the corresponding words; prefix-based template generation involves adding a specific prefix before a sentence to guide the model in generating appropriate text. Answer mapping is the process of evaluating all possible answers according to a probability distribution, selecting the most likely answer as the predicted output, and converting it into appropriate category mapping words. This process typically involves converting labels into natural language vocabulary, known as a verbalizer [58].
The workflow of Prompt learning mainly includes the following four parts:
(1) Use PLMs as base encoders.
(2) Add additional context (a template) with a [MASK] position.
(3) Project labels to label words (verbalizer).
(4) Bridge the gap between pre-training and fine-tuning.
After defining the template and answer space, we need to choose a suitable pre-trained language model. There are now various pre-trained models (PTMs) with good performance, and when selecting a model, one usually considers its paradigm, such as autoregressive, masked language modeling, or encoder-decoder. On this basis, for summarization tasks, a Bidirectional and Auto-Regressive Transformers (BART) model is a more suitable choice.
The selection of a template plays a very important role in prompt learning. Templates can generally be distinguished by whether they are manually specified: artificially constructed templates or automatically searched templates. Artificially constructed templates are the most intuitive method, easy to understand, and perform well in practical applications. However, artificially constructed templates also have some drawbacks: prior knowledge is required when designing templates manually [59], and they may fail [60]. There are two types of automatically generated templates: discrete prompts and continuous prompts. Discrete prompts allow the model to select the optimal template from a set of discrete template spaces, while continuous prompts allow the language model to automatically train a prompt. According to research, using multiple templates [61] can improve the performance of the model. The simplest way to aggregate multiple templates into one answer is to take the average [60] or a weighted average of each template’s output [58].
The verbalizer is the process of mapping labels to label words, and the selection of verbalizers is also crucial for prompt learning. There are two ways to construct a verbalizer: manual definition and automatic search. Manual definition requires professional knowledge and may have disadvantages such as strong subjectivity and small coverage. To solve this problem, we can choose the following solutions: (1) manually design with human prior knowledge; (2) start with an initial label word, then paraphrase and expand; (3) start with an internal label word, then expand using external knowledge; (4) decompose the label into multiple tokens; (5) use virtual tokens and optimize the label embedding. In addition, we can use external knowledge bases to expand and improve label words, thereby achieving better text classification results [62].
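As a concrete illustration of these components, the sketch below combines a fill-in-the-blank template with a hand-written verbalizer on top of the Hugging Face fill-mask pipeline for a toy sentiment task; the model name, template, and label words are illustrative assumptions, not a prescribed setup.

```python
from transformers import pipeline

# Fill-in-the-blank template: the [MASK] position is what the PLM predicts.
template = "The movie was absolutely wonderful. Overall, it was a [MASK] film."

# Verbalizer: map label words predicted at [MASK] back to task labels.
verbalizer = {"great": "positive", "good": "positive",
              "terrible": "negative", "bad": "negative"}

unmasker = pipeline("fill-mask", model="bert-base-uncased")  # illustrative PLM
scores = {label: 0.0 for label in verbalizer.values()}
for pred in unmasker(template, targets=list(verbalizer)):
    scores[verbalizer[pred["token_str"]]] += pred["score"]   # aggregate per class

print(max(scores, key=scores.get))    # expected output: "positive"
```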
2.2.3 Learning Strategies
The emergence of the new paradigm of Prompt learning has brought significant changes to the training process.
The learning strategies for Prompt learning mainly include the following:
(1) Pre-training then fine-tuning, the traditional pre-train and fine-tune method [63]; (2) tuning-free prompting, which relies on carefully designed prompts for the LM to provide answers directly [64]; (3) fixed-LM prompt tuning, which updates only the prompt-related parameters using downstream task training data; (4) fixed-prompt LM tuning, which fine-tunes the parameters of the LM while keeping the prompt parameters fixed; (5) prompt + LM tuning, a strategy that updates both the prompt-related parameters and the LM parameters.
These different learning strategies can be selected based on specific tasks and needs. Pre-training + fine-tuning is the most common strategy, suitable for most tasks [63]. Tuning-free prompting is suitable for simple tasks and can greatly reduce training time and computational resource consumption. Fixed-LM prompt tuning and fixed-prompt LM tuning are suitable for tasks that require more precise control and can optimize model performance by adjusting the prompt parameters or the language model parameters. Combining prompt tuning and LM tuning brings together the advantages of both and can further improve model performance [51].
In summary, Prompt learning provides us with a new training paradigm that can optimize model performance on various downstream tasks through appropriate prompt design and learning strategies. Choosing the appropriate template, constructing an effective verbalizer, and adopting appropriate learning strategies are all important factors in improving the effectiveness of prompt learning.
3 Training of Large Language Models
The training of LLMs can be broadly divided into three steps. The first step involves data collection and processing. The second step encompasses the pre-training process, which includes determining the model’s architecture and pre-training tasks and utilizing suitable parallel training algorithms to complete the training. The third step involves fine-tuning and alignment. In this section, we will provide an overview of the model training techniques. This will include an introduction to the relevant training datasets, data preparation and preprocessing, model architecture, specific training methodologies, model evaluation, and commonly used training frameworks for LLMs.
3.1 Data Preparation and Preprocessing
3.1.1 Dataset
Training LLMs requires vast amounts of text data, and the quality of this data significantly impacts LLM performance. Pre-training on large-scale corpora provides LLMs with a fundamental understanding of language and some generative capability. The first step in LLM training is collecting substantial corpora of natural language text. Pre-training data sources are diverse, commonly incorporating web text, conversational data, and books as general pre-training corpora. Additionally, some research efforts introduce specialized data from professional domains, such as code or scientific data, to enhance LLM capabilities in those fields. Leveraging diverse sources of text data for LLM training can significantly enhance the model’s generalization capabilities. In the following section, we will present the commonly used datasets for training LLMs, as shown in Table 1. These corpora are categorized into 5 groups for discussion.
Table 1: Commonly used corpora.

| Corpora | Type | Links |
|---|---|---|
| BookCorpus [65] | Books | https://github.com/soskek/bookcorpus |
| Gutenberg [66] | Books | https://www.gutenberg.org |
| Books1 [8] | Books | Not open source yet |
| Books2 [8] | Books | Not open source yet |
| CommonCrawl [67] | CommonCrawl | https://commoncrawl.org |
| C4 [68] | CommonCrawl | https://www.tensorflow.org/datasets/catalog/c4 |
| CC-Stories [69] | CommonCrawl | Not open source yet |
| CC-News [70] | CommonCrawl | https://commoncrawl.org/blog/news-dataset-available |
| RealNews [71] | CommonCrawl | https://github.com/rowanz/grover/tree/master/realnews |
| RefinedWeb [72] | CommonCrawl | https://huggingface.co/datasets/tiiuae/falcon-refinedweb |
| WebText | Reddit Link | Not open source yet |
| OpenWebText [73] | Reddit Link | https://skylion007.github.io/OpenWebTextCorpus/ |
| PushShift.io [74] | Reddit Link | https://pushshift.io/ |
| Wikipedia [75] | Wikipedia | https://dumps.wikimedia.org/zhwiki/latest/ |
| BigQuery [76] | Code | https://cloud.google.com/bigquery |
| CodeParrot | Code | https://huggingface.co/codeparrot |
| the Pile [77] | Other | https://github.com/EleutherAI/the-pile |
| ROOTS [78] | Other | https://huggingface.co/bigscience-data |
Books: Two commonly utilized book datasets for LLM training are BookCorpus [65] and Gutenberg [66]. These datasets include a wide range of literary genres, including novels, essays, poetry, history, science, philosophy, and more. Widely employed by numerous LLMs [9; 79], these datasets contribute to the models’ training by exposing them to a diverse array of textual genres and subject matter, fostering a more comprehensive understanding of language across various domains.
CommonCrawl: CommonCrawl [67] manages an accessible repository of web crawl data, freely available for utilization by individuals and organizations. This repository encompasses a vast collection of data, comprising over 250 billion web pages accumulated over a span of 16 years. Established in 2007, Common Crawl has evolved into a widely recognized and referenced corpus in the academic and research communities, cited in more than 10,000 research papers. This continuously expanding corpus is a dynamic resource, with an addition of 3–5 billion new web pages each month. Its significance extends to the field of natural language processing, where it serves as a primary training corpus in numerous large language models. Notably, a substantial portion of the raw tokens employed in training GPT-3 [8], amounting to 82%, is sourced from the CommonCrawl. However, due to the presence of a substantial amount of low-quality data in web archives, preprocessing is essential when working with CommonCrawl data. Currently, four commonly used filtered datasets based on CommonCrawl are available: C4 [68], CC-Stories [69], CC-News [70], and RealNews [71].
Reddit Links: Reddit is a social media platform where users can submit links and posts, and others can vote on them using the "upvote" or "downvote" system. This characteristic makes it a valuable resource for creating high-quality datasets.
Wikipedia: Wikipedia [75], a free and open online encyclopedia project, hosts a vast repository of high-quality encyclopedic content spanning a wide array of topics. The English version of Wikipedia is extensively utilized in the training of many LLMs [8; 9; 80], serving as a valuable resource for language understanding and generation tasks. Additionally, Wikipedia is available in multiple languages, providing diverse language versions that can be leveraged for training in multilingual environments.
Code: There is a limited availability of publicly accessible code datasets at present. Existing efforts primarily involve web scraping of code with open-source licenses from the internet. The main sources include Github and Stack Overflow.
We have organized datasets utilized by distinct LLMs. During the training process, LLMs are typically trained on multiple datasets, as specified in Table 2 for reference.
Table 2: Datasets utilized by different LLMs.

| LLMs | Datasets |
|---|---|
| GPT-3 [8] | CommonCrawl [67], WebText2 [8], Books1 [8], Books2 [8], Wikipedia [75] |
| LLaMA [9] | CommonCrawl [67], C4 [68], Wikipedia [75], Github, Books, Arxiv, StackExchange |
| PaLM [36] | Social Media, Webpages, Books, Github, Wikipedia, News (total 780B tokens) |
| T5 [68] | C4 [68], WebText, Wikipedia, RealNews |
| CodeGen [81] | the Pile, BIGQUERY, BIGPYTHON |
| CodeGeeX [82] | CodeParrot, the Pile, Github |
| GLM [37] | BooksCorpus, Wikipedia |
| BLOOM [38] | ROOTS |
| OPT [83] | BookCorpus, CCNews, CC-Stories, the Pile, Pushshift.io |
3.1.2 Data Preprocessing
Once an adequate corpus of data is collected, the subsequent step is data preprocessing. The quality of data preprocessing directly impacts the model’s performance and security. The specific preprocessing steps involve filtering low-quality text, including eliminating toxic and biased content to ensure the model aligns with human ethical standards. It also includes deduplication, removing duplicates in the training set, and excluding redundant content in the test set to maintain the sample distribution balance. Privacy scrubbing is applied to ensure the model’s security, preventing information leakage or other privacy-related concerns. Additionally, if fine-tuning LLMs is planned, vocabulary expansion should also be taken into account.
On the other hand, LLaMA 2 models [10] represent a notable exception. These models forego filtering in their pretraining corpus, as aggressive filtration might accidentally filter out some demographic groups. This approach enhances the generalizability of the base LLaMA 2 models, making them more adept across a range of downstream tasks, such as hate speech detection and privacy de-identification. Observations indicate that abstaining from additional filtering in the pretraining data enables the base model to achieve reasonable safety alignment with fewer examples [10]. While this increases both generalizability and safety alignment efficiency, the implementation of additional safety mitigations is still imperative prior to public deployment, as further discussed in Section 3.5.4.
Quality filtering: Filtering low-quality data is typically done using heuristic-based methods or classifier-based methods. Heuristic methods involve employing manually defined rules to eliminate low-quality data [84; 72]. For instance, rules could be set to retain only text containing digits, discard sentences composed entirely of uppercase letters, and remove files with a symbol and word ratio exceeding 0.1, and so forth. Classifier-based methods involve training a classifier on a high-quality dataset such as WebText [85] to filter out low-quality datasets.
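As a concrete illustration, the sketch below applies heuristic filters of the kind listed above in plain Python; the specific rules and thresholds mirror the examples in the text and are illustrative, not a production pipeline.

```python
import re

def passes_quality_filters(text, max_symbol_word_ratio=0.1):
    # Heuristic rules of the kind described above; thresholds are illustrative.
    words = text.split()
    if not words:
        return False
    if not any(ch.isdigit() for ch in text):           # e.g. retain only text with digits
        return False
    if text.isupper():                                   # discard all-uppercase text
        return False
    symbols = len(re.findall(r"[#{}\\|<>^~]", text))     # crude "symbol" count
    if symbols / len(words) > max_symbol_word_ratio:     # symbol-to-word ratio rule
        return False
    return True

docs = ["THIS IS ALL CAPS", "A normal sentence from 2023 with 12 words in it."]
print([d for d in docs if passes_quality_filters(d)])    # keeps only the second document
```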
Deduplication: Language models may sometimes repetitively generate the same content during text generation, potentially due to a high degree of repetition in the training data. Extensive repetition can lead to training instability, resulting in a decline in the performance of LLMs [86]. Additionally, it is crucial to consider avoiding dataset contamination by removing duplicated data present in both the training and testing set [87].
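A minimal sketch of exact deduplication by hashing normalized documents is shown below; large-scale pipelines typically rely on fuzzy matching such as MinHash over n-grams instead, so this is only illustrative.

```python
import hashlib

def deduplicate(documents):
    # Exact deduplication: hash whitespace/case-normalized text and keep the
    # first occurrence of each digest.
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.md5(" ".join(doc.lower().split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["The cat sat.", "the  cat sat.", "A different sentence."]
print(deduplicate(corpus))   # keeps two documents
```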
Privacy scrubbing: LLMs, as text-generating models, are trained on diverse datasets, which may pose privacy concerns and the risk of inadvertent information disclosure [88]. In the preprocessing phase of language datasets, it is imperative to address privacy concerns by systematically removing any sensitive information. This involves employing techniques such as anonymization, redaction, or tokenization to eliminate personally identifiable details, geolocation, and other confidential data. By carefully scrubbing the dataset of such sensitive content, researchers and developers can ensure that the language models trained on these datasets uphold privacy standards and mitigate the risk of unintentional disclosure of private information. It is essential to strike a balance between data utility and privacy protection, fostering responsible and ethical use of language datasets in various applications.
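The sketch below illustrates redaction-style privacy scrubbing with a few regular expressions; the patterns and placeholder tokens are illustrative assumptions and far from exhaustive.

```python
import re

# Illustrative patterns only; real pipelines use much broader PII detectors.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP":    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub(text):
    # Replace each detected span with a placeholder token (redaction).
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Contact jane.doe@example.com or +1 (555) 123-4567 from 192.168.0.1"))
```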
Filtering out toxic and biased text: In the preprocessing steps of language datasets, a critical consideration is the removal of toxic and biased content to ensure the development of fair and unbiased language models. This involves implementing robust content moderation techniques, such as employing sentiment analysis, hate speech detection, and bias identification algorithms. By leveraging these tools [89], researchers can systematically identify and filter out text that may perpetuate harmful stereotypes, offensive language, or biased viewpoints.
3.2 Architecture
Currently, all LLMs are built upon the Transformer architecture, allowing these models to scale to tens of billions or even a trillion parameters. Typically, PLM architectures fall into three categories: Encoder-only [90], Encoder-decoder [68], and Decoder-only [18]. The Encoder-only architecture is no longer employed in the latest LLMs and won’t be further discussed here. Instead, this section will focus on introducing the Encoder-decoder and Decoder-only architectures.
3.2.1 Encoder-decoder Architecture
The Encoder-decoder architecture of LLMs is built upon the traditional Transformer Encoder-decoder architecture. The Encoder-decoder architecture consists of two main components: the Encoder and the Decoder. Each part of the Encoder is composed of multiple layers of Transformer’s Multi-Head Self-Attention layers, which encode the input sequence. The Decoder, on the other hand, utilizes cross-attention over the output representation of the Encoder and generates the target sequence in an autoregressive manner. The encoder-decoder architecture serves as the foundation for prominent LLMs such as T5 [68], flan-T5 [91], and BART [92].
3.2.2 Decoder-only Architecture
LLMs with a Decoder-only architecture utilize the decoder component of the traditional Transformer architecture. Unlike the Encoder-Decoder architecture, which incorporates both an encoder and a decoder, the Decoder-only architecture is solely focused on the decoding process. In this configuration, the model sequentially generates tokens, attending to preceding tokens in the sequence. This architecture has been applied to various language generation tasks, showcasing its effectiveness in various tasks such as text generation without the need for an explicit encoding phase. The Decoder-only architecture can be further classified into two categories: the Causal Decoder architecture and the Prefix Decoder architecture.
The Causal Decoder Architecture: In the Causal Decoder architecture, each token in the model input sequence can only attend to past input tokens and itself during the decoding process. It achieves unidirectional attention to the input sequence by using a specific mask as shown in Figure 1. In fact, different architectures are mainly implemented by configuring different mask matrices. The figure illustrates a comparison of mask configurations between the Encoder-decoder and Decoder-only architectures (including the Causal Decoder and Prefix Decoder). The representative LLMs for the Causal Decoder architecture are the GPT series [18; 7; 8; 93; 19]. The GPT series of LLMs are currently known for their superior performance, with their foundational Causal Decoder architecture widely applied in other LLMs such as BLOOM [38], OPT [83], Gopher [84], and LLaMA [9].
The Prefix Decoder Architecture: The Prefix Decoder architecture combines the advantages of both the Encoder-decoder and Causal Decoder architectures. It leverages its unique mask configurations, as illustrated in Figure 1, enabling bidirectional attention for tokens in the prefix while maintaining unidirectional attention for generating subsequent tokens [54]. This design allows for the autoregressive generation of the output sequence with the flexibility to attend bi-directionally to the prefix tokens. Representative LLMs utilizing the Prefix Decoder architecture include PaLM [36] and GLM [37].
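The sketch below constructs the two mask matrices in PyTorch to make the difference concrete: a causal decoder uses a lower-triangular mask, while a prefix decoder additionally opens up bidirectional attention within the prefix; the sequence and prefix lengths are illustrative.

```python
import torch

def causal_decoder_mask(seq_len):
    # Causal Decoder: every position attends only to itself and the past.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_decoder_mask(seq_len, prefix_len):
    # Prefix Decoder: bidirectional attention inside the prefix,
    # causal attention for the generated continuation.
    mask = causal_decoder_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_decoder_mask(5).int())
print(prefix_decoder_mask(5, prefix_len=2).int())
```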
3.3 Pre-training Tasks
Large Language Models (LLMs) typically learn rich language representations through a pre-training process. During pre-training, these models leverage extensive corpora, such as text data from the internet, and undergo training through self-supervised learning methods. Language modeling is one common form of self-supervised learning task in which the model is tasked with predicting the next word in a given context. Through this task, the model acquires the ability to capture information related to vocabulary, grammar, semantics, and text structure.
In language modeling [18; 7; 8; 36], the model is required to predict the next word in a given context. This task enables the model to develop a nuanced understanding of language. Specifically, the model observes large amounts of textual data and attempts to predict the next word at each position in the text. This gradual learning process allows the model to capture the patterns and information inherent in language, encoding a vast amount of linguistic knowledge into its parameters. Once pre-training is complete, these model parameters can be fine-tuned for various natural language processing tasks to adapt to specific task requirements. The objective of language modeling is to train a model to maximize the likelihood of textual data. For a given text sequence, denoted as $X = (x_1, x_2, \ldots, x_T)$, where $x_t$ represents the token at position $t$ and $P(x_t \mid x_1, \ldots, x_{t-1})$ is the probability of predicting $x_t$ given the preceding context, the objective function for language modeling can be expressed using cross-entropy loss. Here, we define the objective as maximizing the conditional probability of the given text sequence:
\[
\mathcal{L}(X) = \sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})
\tag{5}
\]
Language modeling serves as a prevalent pretraining objective for most LLMs. In addition to language modeling, there are other pretraining tasks within the realm of language modeling. For instance, some models [68; 37] use text with certain portions randomly replaced, and then employ autoregressive methods to recover the replaced tokens. The primary training approach involves the autoregressive recovery of the replaced intervals.
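A minimal PyTorch sketch of this objective is shown below: the cross-entropy between each position’s prediction and the following token is the negative of the log-likelihood in Eq. (5); the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    # logits: (batch, seq_len, vocab); token_ids: (batch, seq_len).
    # Cross-entropy between the prediction at position t and the token at t+1,
    # i.e. the negative of the log-likelihood objective in Eq. (5).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # predictions for x_1 .. x_{T-1}
        token_ids[:, 1:].reshape(-1),                   # targets x_2 .. x_T
    )

vocab, batch, seq = 100, 2, 8                           # illustrative sizes
logits = torch.randn(batch, seq, vocab)
tokens = torch.randint(vocab, (batch, seq))
print(next_token_loss(logits, tokens))
```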
3.4 Model Training
3.4.1 Parallel Training
In the parallel training discussion below, collective communication is mentioned, which helps us better understand the principles of parallel training. Figure 2 lists five collective communication operations: 1) Broadcast: send data from one GPU to the other GPUs. 2) Reduce: reduce (sum/average) the data of all GPUs and send the result to one GPU. 3) All Reduce: reduce the data of all GPUs and send the result to all GPUs. 4) Reduce Scatter: reduce the data of all GPUs and send a portion of the result to each GPU. 5) All Gather: gather the data of all GPUs and send the full collection to all GPUs.
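As a small illustration of how these primitives appear in practice, the sketch below uses PyTorch’s torch.distributed collectives to broadcast parameters and all-reduce gradients; it assumes a process group has already been initialized (e.g., one process per GPU launched with torchrun), and the function names are illustrative.

```python
import torch.distributed as dist

def synchronize_parameters(model, src=0):
    # Broadcast: copy the parameters of rank `src` to every other GPU.
    for param in model.parameters():
        dist.broadcast(param.data, src=src)

def average_gradients(model):
    # All-Reduce: sum each gradient tensor across GPUs, then divide by the
    # world size so every rank ends up with the same averaged gradient.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```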
Data Parallel: The process of data parallelism [94] is shown in Figure 3: a parameter server stores the model’s parameters and the entire batch of data. Each GPU uses broadcast to synchronize the model parameters, and the data is divided into one portion per GPU, with each GPU receiving a portion of the data. Each GPU uses the complete model parameters and a portion of the data to perform forward and backward propagation. This way, the gradients are obtained on each GPU. Finally, we aggregate the gradients and send the aggregated gradients back to the parameter server, where the original model parameters and the aggregated complete gradients are available. With this information, we can use an optimizer to update the model parameters. The updated parameters will then enter the next round of model training iterations.
Distributed data parallelism [95] abandons the use of a parameter server and instead employs all-reduce on gradient information, ensuring that each GPU receives the same gradient information. The result of all-reduce is communicated to all GPUs, allowing them to independently update their respective model optimizers. After each round of updates, the model’s parameters, gradients, and the historical information of the optimizer are consistent across all GPUs.
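A minimal sketch of this scheme with PyTorch's DistributedDataParallel (assumptions: the script is launched with one process per GPU, `rank` identifies the local GPU, and the model is assumed to return a scalar loss when called on a batch):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_data_parallel(model, dataloader, rank):
    dist.init_process_group(backend="nccl")
    model = model.to(rank)
    # DDP broadcasts the initial parameters and all-reduces gradients during backward.
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    for batch in dataloader:
        loss = ddp_model(batch.to(rank))     # assumed to return a scalar loss
        loss.backward()                      # gradients are averaged across all GPUs here
        optimizer.step()                     # every rank applies the same update
        optimizer.zero_grad(set_to_none=True)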
The GPU memory occupied by intermediate results is related to the batch size, sentence length, and model dimensions. When using data parallelism, a batch of data is divided into many parts, allowing each GPU to process a portion of the data. Equivalently, the batch size processed on each GPU is reduced to 1/N of the original, where N is the number of GPUs. Data parallelism thus reduces the input dimensions, resulting in an overall reduction in the intermediate results of the model. A drawback is that, to support model training, each GPU needs to receive at least one piece of data. In the most extreme case, when each GPU receives only one piece of data, the parameters, gradients, and optimizer states still need to be fully stored on each GPU. Even if no intermediate results are stored on the GPU, the model may still be unable to perform computations on a single GPU.
Model Parallel: Model parallelism [96] was first introduced by Megatron-LM to alleviate memory pressure. Figure 4 illustrates the overall architecture of model parallelism. Taking the most common linear layer in the Transformer as an example, the parameters of the linear layer form a matrix $W$ of size $A \times B$, and the input to the linear layer is a vector $x$ of size $B \times 1$. Representing this as $y = Wx$, we can horizontally partition the model's parameters into $n$ segments $W_1, W_2, \ldots, W_n$ using the properties of matrix multiplication, with each segment of size $(A/n) \times B$. Utilizing these properties, the result of the linear layer is obtained by multiplying each small parameter matrix with the input and concatenating the partial results. Through this approach, the parameters of the linear layer can be distributed across multiple GPUs. However, it is crucial to ensure that the inputs to the model on the multiple GPUs are identical: instead of partitioning the data as in data parallelism, we need to ensure that each GPU receives the same input, i.e., the same batch of data. We can then partition a parameter matrix such as the linear layer across GPUs, with each GPU receiving a small portion of the matrix. By performing the model computation with this small portion and the data, we obtain a sub-result, as shown in Formula 6. The results of these computations are then concatenated using the all-gather operator and communicated to all GPUs.
y = Wx = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_n \end{bmatrix} x = \begin{bmatrix} W_1 x \\ W_2 x \\ \vdots \\ W_n x \end{bmatrix} \qquad (6)
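A minimal sketch of this row-wise partitioning (our own illustration, not Megatron-LM's actual implementation; it assumes one rank per GPU, an initialized process group, and that the output dimension is divisible by the world size; a gradient-aware all-gather would be needed for training):

import torch
import torch.distributed as dist
import torch.nn.functional as F

class RowPartitionedLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        self.world_size = world_size
        # Each rank stores only its (out_features // world_size) x in_features slice W_i.
        self.weight = torch.nn.Parameter(
            torch.randn(out_features // world_size, in_features) * 0.02
        )

    def forward(self, x):
        # Every rank receives the same input x and computes its partial output W_i x.
        partial = F.linear(x, self.weight)
        # All-gather concatenates the partial outputs into y = [W_1 x; ...; W_n x], as in Eq. (6).
        pieces = [torch.empty_like(partial) for _ in range(self.world_size)]
        dist.all_gather(pieces, partial)
        return torch.cat(pieces, dim=-1)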
ZeRO: ZeRO [97] is a framework built on data parallelism. During the parameter update process on each GPU, the same set of parameters is used, leading to computational redundancy. Each GPU uses reduce-scatter to eliminate this redundancy and obtain only its portion of the gradient results. After updating its portion of the model parameters, each GPU performs an all-gather operation to synchronize the parameters across all GPUs. After the all-gather operation, the original full gradient no longer needs to be kept on the GPU and can be removed. Figure 5 shows the update process of ZeRO. In ZeRO1, the original gradient is removed after backward propagation, while in ZeRO2 the reduce-scatter of the gradient is already performed during backward propagation, so only each GPU's gradient partition is kept and the full gradient is removed. In this way, the deletion of the gradient is moved earlier, leading to further savings in GPU memory. ZeRO3 conducts a finer-grained division of the model parameters: each GPU retains only a portion of the gradients for updating, and parameter updates also only affect a portion of the model parameters. Therefore, each GPU only needs to store the parameters, gradients, and optimizer states related to the part of the parameters it is responsible for. During forward and backward propagation, an all-gather operation is required once, and after the operation is complete, the gathered model parameters are released from the GPU. ZeRO3 does not use all-gather during parameter updates, but it requires an all-gather operation during both forward and backward propagation, adding one communication step. Compared to ZeRO2, ZeRO3 is an algorithm that trades time for space.
Pipeline Parallel: Pipeline parallelism [98] and model parallelism share similarities. In model parallelism, linear layers are divided into many small matrices, which are then distributed to different GPUs. For pipeline parallelism, different layers of the model are assigned to different GPUs. Specifically, if we have an n-layer transformer, we can assign the $i$-th layer of the transformer to the $i$-th GPU, and so on. During the forward propagation of the model, the computation of layer $i$ is performed on GPU $i$, and the result is then passed to GPU $i+1$. GPU $i+1$ receives the output from GPU $i$, performs the computation for its layer, and passes the result to the next GPU. This method partitions the parameters, gradients, optimizer states, and intermediate results layer by layer.
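A naive two-stage pipeline sketch (illustrative only; it ignores micro-batching and pipeline bubbles, and assumes two visible GPUs and a list of transformer blocks that map hidden states to hidden states):

import torch.nn as nn

class TwoStagePipeline(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        half = len(layers) // 2
        # The first half of the layers lives on GPU 0, the second half on GPU 1.
        self.stage0 = nn.Sequential(*layers[:half]).to("cuda:0")
        self.stage1 = nn.Sequential(*layers[half:]).to("cuda:1")

    def forward(self, hidden_states):
        hidden_states = self.stage0(hidden_states.to("cuda:0"))
        # The activation is the only tensor communicated between the two stages.
        return self.stage1(hidden_states.to("cuda:1"))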
3.4.2 Mixed Precision Training
In recent years, to pre-train extremely large language models, some research [99] has begun to utilize 16-bit floating-point numbers (FP16) to reduce memory usage and communication overhead. FP16 has a smaller numerical range and lower precision in effective digits [100; 38], but computations tend to be faster than in FP32. In general model training, FP32 is often used as the default representation for training parameters. However, in actual model training, the magnitude of the parameter values typically does not exceed the order of thousands, well within the numerical range of FP16. To improve computational speed, we can therefore convert from FP32 to FP16. During parameter updates, the update amount is roughly equal to the gradient multiplied by the learning rate, while the minimum representable value of FP16 is on the order of 1e-5. The product of the gradient and learning rate can already fall below the representable range of FP16, in which case the parameter update is lost, a phenomenon known as underflow. Therefore, the parameter update obtained by multiplying the gradient by the learning rate is represented in FP32. However, we cannot directly add this high-precision update to a lower-precision model, as this would still result in floating-point underflow. Consequently, an additional single-precision (FP32) copy of the parameters is kept in the optimizer. To accelerate both forward and backward passes in the model, half-precision parameters and gradients are used and passed to the optimizer for updating. The optimizer's update quantity is kept as FP32 and accumulated into the FP32 parameter copy maintained by the optimizer; after accumulation, the result is converted back to FP16 parameters.
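A minimal sketch of the FP32-master-weight scheme described above (our own illustration; the optimizer is assumed to be constructed over the FP32 master copies, and production training would normally rely on library support such as torch.cuda.amp or DeepSpeed and add gradient scaling on top):

import torch

def mixed_precision_step(loss, fp16_params, fp32_master_params, optimizer):
    loss.backward()                                   # forward/backward ran with FP16 parameters
    for p16, p32 in zip(fp16_params, fp32_master_params):
        p32.grad = p16.grad.float()                   # lift gradients to FP32 for the update
    optimizer.step()                                  # lr * grad is accumulated into the FP32 copy
    optimizer.zero_grad(set_to_none=True)
    with torch.no_grad():
        for p16, p32 in zip(fp16_params, fp32_master_params):
            p16.copy_(p32.half())                     # cast updated master weights back to FP16
            p16.grad = None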
3.4.3 Offloading
The parameters in the optimizer are at least twice as numerous as the model parameters, and a study [101] proposes the idea of moving the optimizer's parameters from the GPU to the CPU. Although GPU computation is much faster than that of the CPU, the question arises of whether this offloading could become a bottleneck for the overall training speed. In practice, offloading is combined with ZeRO3: after the ZeRO3 optimization, the sizes of the parameters, gradients, and optimizer states held by each device are reduced to 1/n, where n is the number of GPUs. By binding one GPU to multiple CPUs, we effectively lower the computational load on each CPU.
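As a hedged sketch of how ZeRO-3 and optimizer offloading are typically enabled together (this assumes the DeepSpeed library described in Section 3.7; the configuration keys follow DeepSpeed's documented ZeRO options, but the values, batch size, and placeholder model are illustrative only and should be checked against the version in use):

import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)   # placeholder model for illustration

ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                # partition parameters, gradients, optimizer states
        "offload_optimizer": {"device": "cpu"},    # keep optimizer states in CPU memory
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)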
3.4.4 Overlapping
Memory operations are typically asynchronous. Thus, we can send a request to memory in advance and then proceed with other computations; after completing those computations, we come back to handle the memory request. This two-step pattern is used in the forward propagation of model training. For layer $i$, we need to obtain its parameters through a gather operation. During the forward propagation of layer $i$, we proactively retrieve the parameters of layer $i+1$ through an asynchronous fetch. Once the forward computation of layer $i$ is completed, the parameters of layer $i+1$ have already arrived and are stored on the GPU, so we can immediately proceed with the next layer's forward computation, and so on.
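A hedged sketch of this prefetching pattern (illustrative only; gather_params is a hypothetical helper that issues the gather for a layer's partitioned parameters and, when asynchronous, returns a waitable handle in the style of torch.distributed async collectives):

def forward_with_prefetch(layers, gather_params, hidden):
    # gather_params(layer, async_op=...) is a hypothetical helper for this sketch.
    gather_params(layers[0], async_op=False)          # layer 0 must be resident before we start
    for i, layer in enumerate(layers):
        handle = None
        if i + 1 < len(layers):
            handle = gather_params(layers[i + 1], async_op=True)   # prefetch layer i+1
        hidden = layer(hidden)                        # compute layer i while communication overlaps
        if handle is not None:
            handle.wait()                             # layer i+1's parameters are ready when needed
    return hidden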
3.4.5 Checkpoint
In order to support the backward propagation of the model, all intermediate results would need to be kept in GPU memory during the forward propagation. To optimize this process, a checkpoint mechanism is used that does not save all intermediate results in GPU memory but only retains certain checkpoints.
The diagram below illustrates a simplified structure of a transformer. Each transformer block takes a model input, undergoes complex computations through attention and feed-forward processes, and produces the overall output of that layer. We keep only the input of each major layer in the transformer as our checkpoint.
During the backward propagation process, how do we compute the gradients of the linear layers within each major layer? We can perform a technique called recomputation, which involves re-executing the forward pass of each major layer during the backward propagation process. We temporarily obtain the inputs of the linear layers within each major layer, and the intermediate results obtained can be used for backward propagation. Once the backward propagation for that layer is complete, we can discard the checkpoint and the temporarily recomputed intermediate results of the linear layers within the model from the GPU memory.
Assuming we have a transformer with 24 layers, each layer containing four to five linear layers, using the checkpoint mechanism reduces the originally required storage of 120 intermediate results to only 24 intermediate results.
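This recomputation strategy is available directly in PyTorch; a minimal sketch (the list of transformer blocks is a placeholder) might look like:

from torch.utils.checkpoint import checkpoint

def checkpointed_forward(blocks, hidden_states):
    # Only the input of each block (the checkpoint) is stored; the activations inside the
    # attention and feed-forward sublayers are recomputed during the backward pass.
    for block in blocks:
        hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
    return hidden_states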
3.5 Fine-Tuning
The training of LLMs in this paper is divided into three stages: data collection and processing, pre-training, and fine-tuning. This section will provide a review of the fine-tuning methods for LLMs. Specifically, we categorize fine-tuning techniques into three types: supervised fine-tuning (SFT) [93], alignment tuning, and parameter-efficient tuning.
3.5.1 Supervised Fine-Tuning
The core concept of supervised fine-tuning involves adjusting the model in a supervised manner on the basis of large-scale pre-training, enhancing its capability to better adapt to the specific requirements of the target task. In the process of SFT, it is necessary to prepare a labeled dataset for the target task, which includes input text along with corresponding labels. Instruction tuning is a commonly used technique in the fine-tuning process of LLMs and can be considered as a specific form of SFT. It involves further training LLMs on a dataset composed of (instruction, output) pairs, focusing on enhancing the capabilities and controllability of large language models by understanding and following human instructions. We compiled commonly used instruction tuning datasets, as illustrated in Table 3.
Table 3: Commonly used instruction tuning datasets.
Datasets | Links
static-hh | https://huggingface.co/datasets/Dahoas/static-hh
OIG | https://huggingface.co/datasets/laion/OIG
Self-Instruct [102] | https://github.com/yizhongw/self-instruct
Natural Instructions [103] | https://github.com/allenai/natural-instructions
P3 [104] | https://huggingface.co/datasets/bigscience/P3
Promptsource [105] | https://github.com/bigscience-workshop/promptsource
WebGPT [106] | https://huggingface.co/datasets/openai/webgpt_comparisons
Flan [107] | https://github.com/google-research/flan
MVPCorpus [108] | https://github.com/RUCAIBox/MVP
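As a simple illustration of how an (instruction, output) pair from such datasets can be rendered into training text for SFT (the template below is a made-up example, not a format mandated by any particular dataset; the resulting string is then tokenized and trained with the same next-token objective as pre-training):

def build_sft_example(instruction: str, output: str) -> str:
    # Hypothetical prompt template; real projects define their own.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"

example = build_sft_example(
    "Summarize the following sentence: Large language models are trained on massive text corpora.",
    "LLMs learn from very large amounts of text.",
)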
3.5.2 Alignment Tuning
Due to LLMs being pre-trained on massive and diverse internet data, even though the training data undergoes some preprocessing, it is still challenging to guarantee the absence of biased or harmful content in terabyte-scale training datasets. Despite LLMs demonstrating impressive performance across various natural language processing tasks, they frequently exhibit behaviors diverging from human intent. This includes generating false information, producing expressions with bias or misleading content, and so on [93; 109]. To address these issues of LLMs displaying behaviors beyond human intent, alignment tuning becomes crucial [93; 110].
In general, alignment tuning aims to meet the following three criteria: being helpful, honest, and harmless.
Helpful: The concept of helpfulness revolves around whether the model-generated output proves genuinely beneficial for a specific task or inquiry. In the realm of natural language processing, the model’s generated text or responses should furnish valuable information, positively impacting the user’s requirements or task objectives.
Honest: Honesty entails whether the model-generated output is authentic and reliable. The model should produce information consistent with facts, steering clear of fabrication or distortion. This contributes to maintaining user trust in the authenticity of the model’s outputs.
Harmless: Harmlessness is concerned with whether the model-generated output poses no harm to users or society. The model should refrain from generating content that is harmful, offensive, or perilous, ensuring its utilization remains safe for all relevant stakeholders.
In training LLMs, a noteworthy approach to alignment tuning is based on Reinforcement Learning from Human Feedback (RLHF) [93]. This method involves collecting human feedback data to train a reward model (RM) for reinforcement learning. The RM serves as the reward function during reinforcement learning training, and algorithms such as Proximal Policy Optimization (PPO) [111] are employed to fine-tune the LLM. In this context, the LLM is regarded as the policy, and the action space is regarded as the vocabulary of the LLM.
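As a hedged sketch of the reward-model training step on which RLHF builds (the pairwise ranking loss below reflects the common practice of scoring the human-preferred response above the rejected one; the backbone that produces the scalar rewards is assumed):

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # reward_chosen / reward_rejected: (batch,) scalar rewards for the preferred and
    # the rejected response to the same prompt, produced by the reward model.
    # The loss pushes the preferred response to receive the higher reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()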
3.5.3 Parameter-efficient Tuning
Currently, large-scale PLMs such as ChatGPT [93; 19] continue to grow in scale. However, for the majority of researchers, conducting full fine-tuning on consumer-grade hardware has become cost-prohibitive and impractical. Unlike SFT and alignment tuning, the objective of parameter-efficient tuning is to reduce computational and memory overhead. This method involves fine-tuning only a small or additional subset of model parameters while keeping the majority of pre-trained parameters fixed, thereby significantly lowering computational and storage costs. It is noteworthy that state-of-the-art parameter-efficient tuning techniques have achieved performance levels comparable to full fine-tuning. Some common parameter-efficient tuning methods include Low-Rank Adaptation (LoRA) [112], Prefix Tuning [113] and P-Tuning [114; 115]. The adoption of these methods enables efficient model tuning even in resource-constrained environments, offering feasibility and efficiency for practical applications.
With the rise of LLMs, parameter-efficient tuning has garnered increasing attention, with LoRA being widely employed in the latest releases of LLMs. LoRA [112] and its related advancements [116; 117] are noteworthy and deserve attention.
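A minimal sketch of the LoRA idea, wrapping a frozen linear layer with a trainable low-rank update (our own illustration rather than the reference implementation of [112]; the rank and scaling values are arbitrary defaults):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # pre-trained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # The low-rank factors A and B are the only trainable parameters.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # y = W x + scaling * (B A) x, where B A is the learned low-rank update to W.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling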
3.5.4 Safety Fine-Tuning
To enhance the safety and responsibility of LLMs, the integration of additional safety techniques during fine-tuning is essential. This encompasses three primary techniques, applicable to both SFT and RLHF phases.
Supervised Safety Fine-Tuning: In this technique, labelers are tasked with generating demonstration data that incorporates high-safety-risk adversarial prompts. This handcrafted safety demonstration data is then incorporated into the SFT phase, thereby augmenting the model's capacity to manage safety risks.
Safety RLHF: This technique employs the same or even more aggressive adversarial prompts to query the models. The safest response exhibiting refusal behavior is then used to train a safety reward model within the RLHF framework.
Safety Context Distillation: This technique employs context distillation [118] by initially prefixing safety preprompts, like “You are a safe and responsible assistant,” to adversarial prompts. This process yields safer generated responses. The model is then fine-tuned on these safer demonstration data but without the inclusion of the safety pre-prompts. This safety distillation further enhances the model’s safety capabilities.
3.6 Evaluation
Unlike in the past, large-scale deep learning models have a wider range of applications and stronger performance compared to ordinary models. However, with great power comes great responsibility, and evaluating these models has become more complex, requiring consideration of potential problems and risks from all aspects. Since the popularity of ChatGPT, many related studies have been published, including the survey and summary of LLMs evaluation in reference [119; 120], which is helpful for developing large-scale deep learning models. This section will introduce some testing datasets, evaluation directions and methods, and potential threats that need to be considered based on previous evaluation work on large models.
3.6.1 Static testing dataset
The evaluation of large models' capabilities requires appropriate datasets for validation. Here, we introduce several commonly used datasets for testing purposes. Considering multimodal large models, typical datasets for computer vision include ImageNet [121] and Open Images [122]. In addition to the commonly used GLUE [123] and SuperGLUE [124] for LLMs, MMLU [125] is highly competitive in testing comprehensive capability. If your model primarily uses Chinese, then CMMLU [126], as a benchmark for Chinese large models, should also be considered, and XTREME [127] and XTREME-R [128] are suitable choices for multilingual large models. For assessing mathematical knowledge capabilities, there are datasets such as MATH [129] and GSM8K [130], while HumanEval [131] and MBPP [132] can serve as benchmarks for code generation. For common sense reasoning tests in daily human life and work, the following datasets are available: HellaSwag [133], PIQA [134], BoolQ [135], SIQA [136], WinoGrande [137], ARC [138], and OpenBookQA [139]. For medical knowledge, there are datasets such as MedQA-USMLE [140] and MedMCQA [141].
3.6.2 Open domain Q&A evaluation
Currently, LLMs interact with humans in the form of questions and answers. Compared to the fragmented and ambiguous information returned by traditional searches, LLMs provide question-and-answer results that are more realistic, efficient, and aligned with human habits. Therefore, the evaluation of ODQA (Open Domain Question Answering) [142] capability is essential, as the performance of open-domain question answering greatly affects user experience. Commonly used datasets for testing include SQuAD [143] and Natural Questions [144], with F1 score and Exact-Match accuracy (EM) as evaluation metrics. However, the word-matching method may have certain issues, such as when a factually correct answer is not in the golden answer list. Therefore, human evaluation seems to be necessary, and literature [145] has conducted detailed research on this matter.
3.6.3 Security evaluation
As an emerging and hot research field, LLMs must pay attention to their potential security threats, prevent malicious use or vulnerabilities to malicious attacks, and address any potential long-term issues that may pose a threat to human development. Additionally, red teaming in various domains is necessary to critically assess and test the model, identifying vulnerabilities, biases, inaccuracies, and areas for safety improvement.
Potential bias: The training data for LLMs may contain potential biases, such as gender or race. Security assessments need to address whether the model generates or amplifies these biases and how to reduce or correct them. Reference [146] discusses in detail the causes of bias in LLMs and the serious consequences that may arise. Reference [147] extensively studies the extent to which pre-trained language models generate harmful content and how controlled text generation algorithms can be used to prevent the generation of toxic content. CHBias [148] is a Chinese dataset that can be used to evaluate and mitigate the bias problem of LLMs.
Privacy protection: LLMs may come into contact with a large amount of user data, such as text and images, during the training process. Security assessments need to ensure the effective protection of user data privacy to prevent leaks and misuse. Reference [149] conducted research on models like ChatGPT and found that it is possible to extract training data effectively from these models. Reference [150] provides a solution by proposing a framework called DEPN (Detect and Editing Privacy Neurons) to detect and edit privacy neurons in pre-trained language models. It also introduces a privacy neuron aggregator to eliminate privacy information in a batch-processing manner, effectively reducing the leakage of privacy data while maintaining model performance.
Adversarial attacks: LLMs may be susceptible to adversarial attacks, such as input tampering, intentional misinformation, or the induced generation of false information. Security assessments need to consider the robustness of the model, i.e., its ability to withstand such attacks. As mentioned in reference [151], LLMs still carry "jailbreak" risks, where users can manipulate the model into generating toxic content through specific input methods such as role-playing or adding special suffixes. Especially when using open-source pre-trained models, any vulnerabilities of the pre-trained models to adversarial attacks are inherited as well. Reference [152] provides a solution to mitigate the harm caused by these vulnerabilities.
3.6.4 Evaluation method
Automated evaluation and manual evaluation play crucial roles in large language model (LLM) research. Automated evaluation typically involves using various metrics to quantify the performance of models, such as BLEU [153], ROUGE [154], and BERTScore [155], which can measure the accuracy of LLM-generated content. These metrics help researchers quickly assess model performance on large-scale data and compare different models. However, automated evaluation also has limitations, as it cannot fully capture the complexity of language understanding and generation. Research in reference [156] has shown that manual evaluation is more reliable for some open-ended generation tasks. Manual evaluation typically involves human annotators subjectively judging and assessing the quality of model-generated outputs. This evaluation method can help reveal how models perform in specific tasks or scenarios and identify subtle issues and errors that automated evaluation may overlook. However, manual evaluation also faces challenges such as high time costs and subjectivity. Therefore, it is often necessary to combine the strengths of automated and manual evaluation to comprehensively assess the performance of language models.
3.7 LLM Framework
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as distributed training have solved fundamental limitations to fit these models into limited device memory while obtaining computation, communication, and development efficiency. Next, this section will introduce several large language model frameworks that utilize distributed training technology leveraging GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring.
Transformers: Transformers [157], an open-source Python library by Hugging Face, is dedicated to building models using the Transformer architecture. Featuring a simple and user-friendly API, it facilitates easy customization of various pre-trained models. With a robust community of users and developers, Transformers continuously updates and improves models and algorithms.
DeepSpeed: DeepSpeed [158], an open-source optimization library compatible with PyTorch, is developed by Microsoft and has been used to train LLMs such as MTNLG [79] and BLOOM [38]. It currently provides full support for ZeRO technology, which includes optimizer state partitioning, gradient partitioning and parameter partitioning, custom mixed precision training, a range of fast CUDA-extension-based optimizers [159], and ZeRO-offload to CPU and disk/NVMe. Through the above technologies, DeepSpeed achieves excellent scalability and efficiency with small memory requirements.
BMTrain: BMTrain [160] is an efficient large model training toolkit developed by Tsinghua University that can be used to train large models with tens of billions of parameters. It can train models in a distributed manner while keeping the code as simple as stand-alone training. BMTrain does not require model refactoring to work. In fact, PyTorch users can enable BMTrain with a few lines of code change to their existing training pipeline. It provides the support of various optimization techniques such as ZeRO optimization and communication optimization.
Megatron-LM: Megatron-LM [96; 161; 162] is a deep learning library developed by NVIDIA for training large-scale language models. Megatron-LM presents techniques including model and data parallelism, mixed-precision training, and FlashAttention for training very large transformer models. Specifically, it takes advantage of the structure of transformer networks to create a simple model parallel implementation by adding a few synchronization primitives, enabling transformer models with billions of parameters to be trained efficiently in PyTorch. It also performs an in-depth empirical analysis of the model and data parallel techniques and demonstrates up to 76% scaling efficiency using 512 GPUs, which can largely improve training efficiency and speed and enables efficient distributed training across GPUs.
In addition to the aforementioned frameworks, Colossal-AI [163] and FastMoE [164; 165] are also two popular frameworks for training LLMs. In principle, any deep learning framework that supports parallel computing can be used to train LLMs. Examples include PyTorch [166], TensorFlow [167; 168], PaddlePaddle [169], MXNet [170], OneFlow [171], MindSpore [172] and JAX [173].
4 Inference with Large Language Models
The scale of large models is growing at a rate of nearly 10 times per year, which brings about huge computational consumption and carbon emissions [174]. Therefore, reducing the computational burden of large models while retaining their reasoning ability has become a widely shared concern. In this chapter, we mainly introduce how to reduce costs from both the computational and storage perspectives, that is, how to efficiently perform large-scale model inference from four aspects: model compression, memory scheduling, parallelism, and structural optimization.
4.1 Model Compression
4.1.1 Knowledge Distillation
Knowledge Distillation [175] refers to transferring knowledge from a cumbersome (teacher) model to a smaller (student) model that is more suitable for deployment. This is achieved by fitting the soft targets of the two models, as soft targets provide more information than gold labels. Initially, the calculation for model distillation involved only fitting the outputs from the last layer of both the teacher and student models [176]. PKD [177] improves this process by computing the mean-square loss between normalized hidden states, allowing the student model to learn from multiple intermediate layers of the teacher model. In order to discover more intermediate representations suitable for knowledge distillation, Jiao et al. [178] proposed TinyBERT, which enables the student model to learn from the embedding layer and attention matrices of the teacher model.
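A minimal sketch of the soft-target fitting described above, using a temperature-scaled KL divergence between the teacher's and the student's output distributions (the temperature value is an arbitrary example):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Softened teacher distribution: soft targets carry more information than hard labels.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, rescaled by T^2 as is conventional.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2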
4.1.2 Model Pruning
Model pruning involves removing redundant portions from the parameter matrices of large models. It is divided into unstructured pruning and structured pruning. Unstructured pruning involves removing individual connections or weights in a neural network without adhering to any specific structural pattern. In structured pruning, specific structural patterns or units within a neural network are pruned or removed. Gordon et al. [179] compared the effects of unstructured and structured pruning on the BERT model. They found that the effectiveness of unstructured pruning significantly decreases as the pruning ratio increases, while in structured pruning, 30-40% of the weights can be discarded without affecting BERT’s universality.
Different structures in the model can be structurally pruned. Michel et al. [180] pruned attention heads and found that ablating one head often positively impacts the performance of WMT and BERT. They proposed a gradient-based metric for evaluating the importance of attention heads to enhance pruning effectiveness. Fan et al. [179] performed layer pruning by extending dropout from weights to layers. During training, they randomly dropped layers and achieved good inference results by selecting sub-networks with any desired depth during testing.
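A small sketch of unstructured magnitude pruning, which zeroes the lowest-magnitude weights of a tensor in place (the 30% sparsity level echoes the range discussed above but is otherwise arbitrary):

import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.3) -> torch.Tensor:
    # Zero out the `sparsity` fraction of weights with the smallest absolute value,
    # without imposing any structural pattern (unstructured pruning).
    k = int(weight.numel() * sparsity)
    if k > 0:
        threshold = weight.abs().flatten().kthvalue(k).values
        weight.data[weight.abs() <= threshold] = 0.0
    return weight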
4.1.3 Model Quantization
The fundamental idea behind model quantization is to reduce the number of floating-point bits used in numerical calculations within a large model network, thereby decreasing storage and computation costs. This involves converting floating-point operations into fixed-precision operations. However, as precision decreases, the model’s loss gradually increases, and when precision drops to 1 bit, the model’s performance experiences a sudden decline. To address the optimization challenges introduced by low-precision quantization, Bai et al. [181] proposed BinaryBERT. They initially trained a half-sized ternary model and then initialized a binary model with the ternary model through weight splitting. Finally, they fine-tuned the binary model. This approach yielded better results for the binary model compared to training a binary model from scratch.
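A minimal sketch of the basic idea, using symmetric per-tensor quantization of weights to 8-bit integers (scale handling and calibration are simplified for illustration):

import torch

def quantize_int8(weight: torch.Tensor):
    # Map float weights to int8 with a single scale; computation can then use
    # fixed-precision arithmetic and dequantize where full precision is needed.
    scale = weight.abs().max() / 127.0
    w_q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize_int8(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q.float() * scale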
4.1.4 Weight Sharing
The basic idea of weight sharing is to use the same set of parameters for multiple parts of an LLM. Instead of learning different parameters for each instance or component, the model shares a common set of parameters across various parts. Weight sharing helps reduce the number of parameters that need to be learned, making the model more computationally efficient and reducing the risk of overfitting, especially in situations where data is limited. ALBERT [182] uses a cross-layer parameter-sharing strategy to effectively reduce the number of parameters of the model and can achieve better training results than a baseline with the same number of parameters.
4.1.5 Low-rank Approximation
Low-rank decomposition methods are crucial in the field of model compression, as they allow for the creation of more compact models with fewer parameters. This reduction in model size is particularly beneficial for deploying neural networks on resource-constrained devices, improving efficiency during inference. Chen et al. [183] performed a low-rank decomposition on the input matrix, enabling matrix operations within the large model to occur at a lower-rank level, effectively reducing the computational workload. From the results, their proposed method, DRONE, not only ensures the inference performance of the large model but also achieves an acceleration ratio of more than 1.3 times compared to the baseline method. The specific choice of low-rank decomposition method depends on the architecture of the neural network and the requirements of the target application.
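A small sketch of low-rank approximation of a weight matrix via truncated SVD (the rank is an arbitrary example; DRONE itself uses a more refined decomposition than this plain SVD):

import torch

def low_rank_approx(weight: torch.Tensor, rank: int = 64):
    # Truncated SVD: W is approximated by (U_r diag(S_r)) V_r, so two thin matrices replace the full W.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    left = U[:, :rank] * S[:rank]     # (out_features, rank)
    right = Vh[:rank, :]              # (rank, in_features)
    return left, right                # the layer then computes x @ right.T @ left.T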
4.2 Memory Scheduling
Deploying LLMs on a single consumer-grade GPU is constrained by the limited available GPU memory, given the substantial number of parameters in LLMs. Therefore, appropriate memory scheduling strategies can be used to overcome the hardware limitations of large model inference. Memory scheduling in large model inference involves the efficient organization and management of memory access patterns during the inference phase of complex neural network models. In the context of sophisticated reasoning tasks, such as natural language understanding or complex decision-making, large models often have intricate architectures and considerable memory requirements. Memory scheduling optimizes the retrieval and storage of intermediate representations, model parameters, and activation values, ensuring that the inference process is both accurate and performed with minimal latency. For example, BMInf [184] utilizes the principle of virtual memory, achieving efficient inference for large models by intelligently scheduling the parameters of each layer between the GPU and CPU.
4.3 Parallelism
Both inference and training can leverage parallelization techniques. Presently, parallelization techniques for inference primarily manifest across three dimensions: Data Parallelism, Tensor Parallelism, and Pipeline Parallelism. Data Parallelism primarily involves increasing the overall throughput of the inference system by adding more GPU devices [101; 97; 159; 185]. Tensor parallelism is a form of model parallelism where the model’s parameters are partitioned into multiple tensors, each computed on different processing units. This approach proves beneficial when dealing with models that are too large to fit into the memory of a single GPU. Tensor parallelism primarily involves increasing the number of devices horizontally through parallel computation to reduce latency [96]. Pipeline parallelism primarily involves vertically increasing the number of GPU devices through parallel computation to support larger models and enhance device utilization. Typically, it is combined with tensor parallelism to achieve optimal performance [98].
4.4 Structural Optimization
In the forward propagation computation of LLMs, the calculation speed is significantly faster than the speed of memory access. Inference speed can be impacted by numerous memory access operations. One goal in LLM inference is to minimize the number of memory accesses during forward propagation. FlashAttention [186] and PagedAttention [187] enhance computational speed by employing a chunked computation approach, mitigating the storage overhead associated with matrices. The entire operation takes place within SRAM, reducing the number of accesses to High Bandwidth Memory (HBM) and significantly boosting computational speed. Both FlashAttention and PagedAttention have been adopted by mainstream inference frameworks, and seamlessly integrated into these frameworks for straightforward utilization.
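In practice these kernel-level optimizations are consumed through library calls; a hedged sketch using PyTorch's fused attention entry point, which can dispatch to a FlashAttention-style kernel when the hardware and dtypes allow (the shapes below are arbitrary):

import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dimension), half precision on GPU
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# The fused kernel processes attention block by block in on-chip SRAM,
# avoiding materializing the full (seq x seq) attention matrix in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)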
4.5 Inference Framework
Parallel computing, model compression, memory scheduling, and specific optimizations for transformer structures, all integral to LLM inference, have been effectively implemented in mainstream inference frameworks. These frameworks furnish the foundational infrastructure and tools required for deploying and running LLM models. They offer a spectrum of tools and interfaces, streamlining the deployment and inference processes for researchers and engineers across diverse application scenarios. The choice of a framework typically hinges on project requirements, hardware support, and user preferences. In Table 4, we compile some of these frameworks for reference.
Table 4: List of LLM inference frameworks.
Framework | Links
TensorRT | https://github.com/NVIDIA/TensorRT-LLM
FasterTransformer | https://github.com/NVIDIA/FasterTransformer
Megatron-LM [96] | https://github.com/NVIDIA/Megatron-LM
FlexGen [188] | https://github.com/FMInference/FlexGen
DeepSpeed [158] | https://github.com/microsoft/DeepSpeed
vLLM [187] | https://github.com/vllm-project/vllm
FlexFlow [189] | https://github.com/flexflow/FlexFlow
StreamingLLM [190] | https://github.com/mit-han-lab/streaming-llm
ColossalAI [163] | https://github.com/hpcaitech/ColossalAI
BMCook [191] | https://github.com/OpenBMB/BMCook
BMInf [184] | https://github.com/OpenBMB/BMInf
Petals [192] | https://github.com/bigscience-workshop/petals
5 Utilization of LLMs
The application scope of LLMs is extensive and can be practically employed in almost any specialized domain [1; 193; 46; 194; 195]. Following pre-training and fine-tuning, LLMs are primarily utilized by designing suitable prompts for various tasks. Leveraging powerful zero-shot capabilities, many tasks can be directly accomplished by guiding LLMs with straightforward prompts. For more complex tasks that cannot be achieved through simple prompts, a few-shot approach involving in-context learning is employed to guide LLMs in task completion. Additionally, incorporating chain-of-thought [196; 197] prompts in the prompt enhances in-context learning by introducing a reasoning process. The pipeline of the in-context learning and chain-of-thought is shown in Figure 6. In some specialized research directions, obtaining intermediate layer representations of LLMs may be necessary. For instance, in neuroscience studies, embedding representations from the model are used to investigate activation regions of brain functions [198; 199; 200; 201].
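As a simple illustration of few-shot in-context learning with a chain-of-thought demonstration (the arithmetic word problems below are made up for illustration and are not drawn from any benchmark):

few_shot_cot_prompt = """Q: A farmer has 3 baskets with 12 apples each. How many apples are there in total?
A: Each basket holds 12 apples and there are 3 baskets, so 3 x 12 = 36. The answer is 36.

Q: A library lends 15 books on Monday and 9 books on Tuesday. How many books were lent in total?
A:"""
# The demonstration exposes the reasoning process; the LLM is expected to continue with
# a step-by-step answer for the new question (e.g., "15 + 9 = 24. The answer is 24.").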
Generally, there are several approaches to employing LLMs. The first involves accessing the capabilities of robust proprietary models through open API services, such as utilizing the API provided by ChatGPT [19]. The second approach includes deploying open-source LLMs for local use [9]. The third method entails fine-tuning open-source LLMs to meet specific domain standards [43; 202], enabling their application in a particular field, and subsequently deploying them locally. In Table 5, we have compiled information on various open-source LLMs for reference. Researchers can choose from these open-source LLMs to deploy applications that best suit their needs.
Table 5: List of open-source LLMs.
LLM | Size (B) | Links
T5 [68] | 11B | https://github.com/google-research/text-to-text-transfer-transformer
CodeGen [81] | 16B | https://github.com/salesforce/CodeGen
MOSS [203] | 16B | https://github.com/OpenLMLab/MOSS
GLM [37] | 130B | https://github.com/THUDM/GLM
ChatGLM [37] | 6B | https://github.com/THUDM/ChatGLM3
ChatYuan [204] | 0.7B | https://github.com/clue-ai/ChatYuan
OPT [83] | 175B | https://github.com/facebookresearch/metaseq
BLOOM [38] | 176B | https://huggingface.co/bigscience/bloom
LLaMA [9] | 65B | https://github.com/facebookresearch/llama
CodeGeeX [82] | 13B | https://github.com/THUDM/CodeGeeX
Baichuan [205] | 13B | https://github.com/baichuan-inc/Baichuan2
Aquila | 7B | https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila
MiniGPT-4 [206] | 25B | https://github.com/Vision-CAIR/MiniGPT-4
Vicuna [207] | 13B | https://github.com/lm-sys/FastChat
6 Future Directions and Implications
This section will delve into the future trends and impact of LLM technology. Our discussion will be structured into three parts: firstly, an exploration of the developmental trends within LLMs technology itself; secondly, an examination of the developmental directions for AI researchers; and finally, an analysis of the societal impact resulting from the ongoing development of LLMs.
Based on existing experience, it is evident that an ample supply of high-quality data and a sufficient number of parameters significantly contribute to enhancing the performance of models [8]. Looking ahead, the model scale of LLMs is expected to continue expanding, thereby augmenting their learning capabilities and overall performance. Moreover, the majority of currently available LLMs are confined to a single natural language modality, lacking extensions to process multimodal data such as images, videos, and speech. There is a potential future trajectory for LLMs to evolve towards handling information beyond text, incorporating multimodal data like images and audio. This evolution would empower models to comprehensively understand and generate multimodal content, significantly broadening the application scope of LLMs. The inevitable expansion of LLMs into the field of multimodality is bound to incur increased training costs. A pivotal focus for future developments therefore lies in the efficient fine-tuning of parameters and the deployment of LLMs through techniques such as knowledge distillation, model compression, and quantization, aimed at reducing both the training and inference costs of LLMs. Another emerging trend is the domain-specific training and fine-tuning of LLMs for particular sectors, facilitating a more adept adaptation to and understanding of industry-specific terminologies and contexts. Lastly, in the exploration of potential new architectures for LLMs, the current landscape predominantly relies on the transformer architecture. While the transformer architecture naturally boasts advantages such as parallel computing and adaptability to various input modalities, its design typically necessitates fixed-size inputs. This requirement may necessitate padding or truncation when dealing with variable-length sequences, potentially leading to computational and information inefficiencies, as well as challenges in generating coherent data. Investigating the potential of Recurrent Neural Network (RNN) architectures in the era of LLMs could emerge as a pivotal research direction. For instance, RWKV [208], an LLM designed under the RNN architecture, has demonstrated competitive performance in various third-party evaluations, proving itself comparable to the majority of transformer-based LLMs.
For researchers in the field of AI, working in isolation is becoming increasingly impractical. The future direction of AI development will intertwine with various industries, necessitating close collaboration with professionals from diverse fields. It is crucial to engage in collaborative efforts, bridging research disciplines, and collectively addressing challenges by combining expertise from different domains. Simultaneously, there is a fresh set of requirements for the comprehensive skills of AI researchers. Training and deploying LLMs necessitate proficiency in managing large-scale data and substantial practical experience in distributed parallel training. This criterion underscores the importance for researchers involved in LLM development to possess substantial engineering capabilities, addressing the challenges inherent in the process. Researchers who are interested in the field of LLMs must either possess engineering skills or adeptly collaborate with engineers to navigate the complexities of model development [3].
As LLMs find widespread applications in societal life, concerns about ethical issues and societal impact are on a continuous rise. This may involve research and improvements in areas such as managing model biases and controlling the risk of misuse [4]. Considering the paramount importance of privacy and data security, the future development of LLMs might involve more federated learning and decentralized approaches to enhance model performance while safeguarding user privacy. Developers should engage in interdisciplinary collaboration with experts from various fields, including decision-making, legal studies, and sociology, to establish standards and ethical frameworks for the development, deployment, and utilization of LLMs, mitigating potential harmful consequences. In terms of public awareness and education, mandatory awareness training should be implemented before large-scale public deployment and applications. This aims to enhance public understanding of the capabilities and limitations of LLMs, fostering responsible and informed use, especially in industries such as education and journalism.
7 Conclusion
The introduction of ChatGPT has ushered in a transformative era in the realm of Large LLMs, significantly influencing their utilization for diverse downstream tasks. The emphasis on cost-effective training and deployment has emerged as a crucial aspect in the evolution of LLMs. This paper has provided a comprehensive survey of the evolution of large language model training techniques and inference deployment technologies in alignment with the emerging trend of low-cost development. The progression from traditional statistical language models to neural language models, and subsequently to PLMs such as ELMo and transformer architecture, has set the stage for the dominance of LLMs. The scale and performance of these models, particularly exemplified by the GPT series, have reached unprecedented levels, showcasing the phenomenon of emergence and enabling versatile applications across various domains. Notably, the release of ChatGPT by OpenAI in November 2022 has marked a pivotal moment in the LLM landscape, revolutionizing the strength and effectiveness of AI algorithms. However, the current reliance on OpenAI’s infrastructure underscores the necessity for alternative LLMs, emphasizing the need for domain-specific models and advancements in the training and deployment processes.
Training and deploying LLMs present challenges that demand expertise in handling large-scale data and distributed parallel training. The engineering capabilities required for LLM development highlight the collaborative efforts needed between researchers and engineers. As we explore the technical aspects of LLM training and inference in this review, it becomes evident that a deep understanding of these processes is essential for researchers venturing into the field. Looking ahead, the future of LLMs holds promising directions, including further advancements in model architectures, improved training efficiency, and broader applications across industries. The insights provided in this review aim to equip researchers with the knowledge and understanding necessary to navigate the complexities of LLM development, fostering innovation and progress in this dynamic field. As LLMs continue to evolve, their impact on natural language processing and AI as a whole is poised to shape the future landscape of intelligent systems.
References
- [1] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu et al., “Summary of chatgpt-related research and perspective towards the future of large language models,” Meta-Radiology, p. 100017, 2023.
- [2] J. Wang, E. Shi, S. Yu, Z. Wu, C. Ma, H. Dai, Q. Yang, Y. Kang, J. Wu, H. Hu et al., “Prompt engineering for healthcare: Methodologies and applications,” arXiv preprint arXiv:2304.14670, 2023.
- [3] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
- [4] J. Kaddour, J. Harris, M. Mozes, H. Bradley, R. Raileanu, and R. McHardy, “Challenges and applications of large language models,” arXiv preprint arXiv:2307.10169, 2023.
- [5] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Jun. 2018, pp. 2227–2237.
- [6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [7] A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, and I. Sutskever, “Better language models and their implications,” OpenAI Blog https://openai.com/blog/better-language-models, vol. 1, no. 2, 2019.
- [8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- [9] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- [10] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- [11] S. Rezayi, H. Dai, Z. Liu, Z. Wu, A. Hebbar, A. H. Burns, L. Zhao, D. Zhu, Q. Li, W. Liu et al., “Clinicalradiobert: Knowledge-infused few shot learning for clinical notes named entity recognition,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2022, pp. 269–278.
- [12] Z. Liu, M. He, Z. Jiang, Z. Wu, H. Dai, L. Zhang, S. Luo, T. Han, X. Li, X. Jiang et al., “Survey on natural language processing in medical image analysis.” Zhong nan da xue xue bao. Yi xue ban= Journal of Central South University. Medical Sciences, vol. 47, no. 8, pp. 981–993, 2022.
- [13] W. Liao, Z. Liu, H. Dai, Z. Wu, Y. Zhang, X. Huang, Y. Chen, X. Jiang, D. Zhu, T. Liu et al., “Mask-guided bert for few shot text classification,” arXiv preprint arXiv:2302.10447, 2023.
- [14] S. Rezayi, Z. Liu, Z. Wu, C. Dhakal, B. Ge, H. Dai, G. Mai, N. Liu, C. Zhen, T. Liu et al., “Exploring new frontiers in agricultural nlp: Investigating the potential of large language models for food applications,” arXiv preprint arXiv:2306.11892, 2023.
- [15] T. Zhong, W. Zhao, Y. Zhang, Y. Pan, P. Dong, Z. Jiang, X. Kui, Y. Shang, L. Yang, Y. Wei et al., “Chatradio-valuer: A chat large language model for generalizable radiology report generation based on multi-institution and multi-system data,” arXiv preprint arXiv:2310.05242, 2023.
- [16] Z. Liu, T. Zhong, Y. Li, Y. Zhang, Y. Pan, Z. Zhao, P. Dong, C. Cao, Y. Liu, P. Shu et al., “Evaluating large language models for radiology natural language processing,” arXiv preprint arXiv:2307.13693, 2023.
- [17] T. Zhong, Y. Wei, L. Yang, Z. Wu, Z. Liu, X. Wei, W. Li, J. Yao, C. Ma, X. Li et al., “Chatabl: Abductive learning via natural language interaction with chatgpt,” arXiv preprint arXiv:2304.11107, 2023.
- [18] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” OpenAI, 2018.
- [19] OpenAI, “Gpt-4 technical report,” 2023.
- [20] H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao, S. Xu, W. Liu, N. Liu, S. Li, D. Zhu, H. Cai, L. Sun, Q. Li, D. Shen, T. Liu, and X. Li, “Auggpt: Leveraging chatgpt for text data augmentation,” 2023.
- [21] Z. Liu, X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, W. Liu, D. Shen, Q. Li et al., “Deid-gpt: Zero-shot medical text de-identification by gpt-4,” arXiv preprint arXiv:2303.11032, 2023.
- [22] C. Ma, Z. Wu, J. Wang, S. Xu, Y. Wei, Z. Liu, L. Guo, X. Cai, S. Zhang, T. Zhang et al., “Impressiongpt: an iterative optimizing framework for radiology report summarization with chatgpt,” arXiv preprint arXiv:2304.08448, 2023.
- [23] W. Liao, Z. Liu, H. Dai, S. Xu, Z. Wu, Y. Zhang, X. Huang, D. Zhu, H. Cai, T. Liu et al., “Differentiate chatgpt-generated and human-written medical texts,” arXiv preprint arXiv:2304.11567, 2023.
- [24] H. Dai, Y. Li, Z. Liu, L. Zhao, Z. Wu, S. Song, Y. Shen, D. Zhu, X. Li, S. Li et al., “Ad-autogpt: An autonomous gpt for alzheimer’s disease infodemiology,” arXiv preprint arXiv:2306.10095, 2023.
- [25] Z. Guan, Z. Wu, Z. Liu, D. Wu, H. Ren, Q. Li, X. Li, and N. Liu, “Cohortgpt: An enhanced gpt for participant recruitment in clinical study,” arXiv preprint arXiv:2307.11346, 2023.
- [26] Z. Liu, Z. Wu, M. Hu, B. Zhao, L. Zhao, T. Zhang, H. Dai, X. Chen, Y. Shen, S. Li et al., “Pharmacygpt: The ai pharmacist,” arXiv preprint arXiv:2307.10432, 2023.
- [27] Y. Wei, T. Zhang, H. Zhang, T. Zhong, L. Zhao, Z. Liu, C. Ma, S. Zhang, M. Shang, L. Du et al., “Chat2brain: A method for mapping open-ended semantic queries to brain activation maps,” arXiv preprint arXiv:2309.05021, 2023.
- [28] T. Zhong, X. Wei, E. Shi, J. Gao, C. Ma, Y. Wei, S. Zhang, L. Guo, J. Han, T. Liu et al., “A small-sample method with eeg signals based on abductive learning for motor imagery decoding,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 416–424.
- [29] J. Gao, L. Zhao, T. Zhong, C. Li, Z. He, Y. Wei, S. Zhang, L. Guo, T. Liu, J. Han et al., “Prediction of cognitive scores by joint use of movie-watching fmri connectivity and eye tracking via attention-censnet,” Psychoradiology, vol. 3, 2023.
- [30] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
- [31] G. Bebis and M. Georgiopoulos, “Feed-forward neural networks,” Ieee Potentials, vol. 13, no. 4, pp. 27–31, 1994.
- [32] Y. Yang, L. Wang, S. Shi, P. Tadepalli, S. Lee, and Z. Tu, “On the sub-layer functionalities of transformer decoder,” arXiv preprint arXiv:2010.02648, 2020.
- [33] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
- [34] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, p. 127063, 2023.
- [35] O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” arXiv preprint arXiv:2108.12409, 2021.
- [36] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
- [37] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia et al., “Glm-130b: An open bilingual pre-trained model,” arXiv preprint arXiv:2210.02414, 2022.
- [38] B. Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon et al., “Bloom: A 176b-parameter open-access multilingual language model,” arXiv preprint arXiv:2211.05100, 2022.
- [39] L. Zhao, L. Zhang, Z. Wu, Y. Chen, H. Dai, X. Yu, Z. Liu, T. Zhang, X. Hu, X. Jiang et al., “When brain-inspired ai meets agi,” Meta-Radiology, p. 100005, 2023.
- [40] J. Holmes, Z. Liu, L. Zhang, Y. Ding, T. T. Sio, L. A. McGee, J. B. Ashman, X. Li, T. Liu, J. Shen, and W. Liu, “Evaluating large language models on a highly-specialized topic, radiation oncology physics,” Frontiers in Oncology, vol. 13, Jul. 2023.
- [41] Z. Wu, L. Zhang, C. Cao, X. Yu, H. Dai, C. Ma, Z. Liu, L. Zhao, G. Li, W. Liu et al., “Exploring the trade-offs: Unified large language models vs local fine-tuned models for highly-specific radiology nli task,” arXiv preprint arXiv:2304.09138, 2023.
- [42] S. Rezayi, Z. Liu, Z. Wu, C. Dhakal, B. Ge, C. Zhen, T. Liu, and S. Li, “Agribert: knowledge-infused agricultural language models for matching food and nutrition,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, vol. 7, 2022, pp. 5150–5156.
- [43] Z. Liu, A. Zhong, Y. Li, L. Yang, C. Ju, Z. Wu, C. Ma, P. Shu, C. Chen, S. Kim et al., “Radiology-gpt: A large language model for radiology,” arXiv preprint arXiv:2306.08666, 2023.
- [44] Z. Liu, X. He, L. Liu, T. Liu, and X. Zhai, Context Matters: A Strategy to Pre-train Language Model for Science Education. Springer Nature Switzerland, 2023, p. 666–674.
- [45] J. Wang, Z. Liu, L. Zhao, Z. Wu, C. Ma, S. Yu, H. Dai, Q. Yang, Y. Liu, S. Zhang et al., “Review of large vision models and visual prompt engineering,” arXiv preprint arXiv:2307.00855, 2023.
- [46] X. Li, L. Zhang, Z. Wu, Z. Liu, L. Zhao, Y. Yuan, J. Liu, G. Li, D. Zhu, P. Yan et al., “Artificial general intelligence for medical imaging,” arXiv preprint arXiv:2306.05480, 2023.
- [47] H. Cai, W. Liao, Z. Liu, Y. Zhang, X. Huang, S. Ding, H. Ren, Z. Wu, H. Dai, S. Li et al., “Coarse-to-fine knowledge graph domain adaptation based on distantly-supervised iterative training,” arXiv preprint arXiv:2211.02849, 2022.
- [48] H. Dai, C. Ma, Z. Liu, Y. Li, P. Shu, X. Wei, L. Zhao, Z. Wu, D. Zhu, W. Liu et al., “Samaug: Point prompt augmentation for segment anything model,” arXiv preprint arXiv:2307.01187, 2023.
- [49] L. Zhang, Z. Liu, L. Zhang, Z. Wu, X. Yu, J. Holmes, H. Feng, H. Dai, X. Li, Q. Li et al., “Segment anything model (sam) for radiation oncology,” arXiv preprint arXiv:2306.11730, 2023.
- [50] Z. Xiao, Y. Chen, L. Zhang, J. Yao, Z. Wu, X. Yu, Y. Pan, L. Zhao, C. Ma, X. Liu et al., “Instruction-vit: Multi-modal prompts for instruction learning in vit,” arXiv preprint arXiv:2305.00201, 2023.
- [51] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
- [52] S. B. Kotsiantis, I. Zaharakis, P. Pintelas et al., “Supervised machine learning: A review of classification techniques,” Emerging artificial intelligence applications in computer engineering, vol. 160, no. 1, pp. 3–24, 2007.
- [53] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
- [54] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon, “Unified language model pre-training for natural language understanding and generation,” Advances in neural information processing systems, vol. 32, 2019.
- [55] T. Schick and H. Schütze, “It’s not just size that matters: Small language models are also few-shot learners,” arXiv preprint arXiv:2009.07118, 2020.
- [56] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, “Language models as knowledge bases?” arXiv preprint arXiv:1909.01066, 2019.
- [57] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv preprint arXiv:2104.08691, 2021.
- [58] T. Schick and H. Schütze, “Exploiting cloze questions for few shot text classification and natural language inference,” arXiv preprint arXiv:2001.07676, 2020.
- [59] R. Shin, C. H. Lin, S. Thomson, C. Chen, S. Roy, E. A. Platanios, A. Pauls, D. Klein, J. Eisner, and B. Van Durme, “Constrained language models yield few-shot semantic parsers,” arXiv preprint arXiv:2104.08768, 2021.
- [60] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?” Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
- [61] K. Duh, K. Sudoh, X. Wu, H. Tsukada, and M. Nagata, “Generalized minimum bayes risk system combination,” in Proceedings of 5th International Joint Conference on Natural Language Processing, 2011, pp. 1356–1360.
- [62] Z. Jiang, J. Araki, H. Ding, and G. Neubig, “How can we know when language models know? on the calibration of language models for question answering,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 962–977, 2021.
- [63] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.
- [64] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- [65] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 19–27.
- [66] “Project gutenberg.” [Online]. Available: https://www.gutenberg.org/
- [67] “Common crawl.” [Online]. Available: https://commoncrawl.org/
- [68] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
- [69] T. H. Trinh and Q. V. Le, “A simple method for commonsense reasoning,” arXiv preprint arXiv:1806.02847, 2018.
- [70] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- [71] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi, “Defending against neural fake news,” Advances in neural information processing systems, vol. 32, 2019.
- [72] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, H. Alobeidli, A. Cappelli, B. Pannier, E. Almazrouei, and J. Launay, “The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only,” in Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [73] A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex, “Openwebtext corpus,” http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- [74] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, “The pushshift reddit dataset,” in Proceedings of the international AAAI conference on web and social media, vol. 14, 2020, pp. 830–839.
- [75] “Wikipedia.” [Online]. Available: https://en.wikipedia.org/wiki/Main_Page
- [76] “Bigquery dataset.” [Online]. Available: https://cloud.google.com/bigquery
- [77] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima et al., “The pile: An 800gb dataset of diverse text for language modeling,” arXiv preprint arXiv:2101.00027, 2020.
- [78] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, T. Le Scao, L. Von Werra, C. Mou, E. González Ponferrada, H. Nguyen et al., “The bigscience roots corpus: A 1.6 tb composite multilingual dataset,” Advances in Neural Information Processing Systems, vol. 35, pp. 31 809–31 826, 2022.
- [79] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” arXiv preprint arXiv:2201.11990, 2022.
- [80] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models for dialog applications,” arXiv preprint arXiv:2201.08239, 2022.
- [81] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” arXiv preprint arXiv:2203.13474, 2022.
- [82] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, L. Shen, Z. Wang, A. Wang, Y. Li et al., “Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 5673–5684.
- [83] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
- [84] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
- [85] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- [86] D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume et al., “Scaling laws and interpretability of learning from repeated data,” arXiv preprint arXiv:2205.10487, 2022.
- [87] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, “Deduplicating training data makes language models better,” arXiv preprint arXiv:2107.06499, 2021.
- [88] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., “Extracting training data from large language models,” in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2633–2650.
- [89] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” arXiv preprint arXiv:2009.11462, 2020.
- [90] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [91] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
- [92] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
- [93] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” arXiv preprint arXiv:2203.02155, 2022.
- [94] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania et al., “Pytorch distributed: Experiences on accelerating data parallel training,” arXiv preprint arXiv:2006.15704, 2020.
- [95] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, 2007, pp. 59–72.
- [96] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
- [97] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16.
- [98] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in neural information processing systems, vol. 32, 2019.
- [99] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.
- [100] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training gopher,” arXiv preprint arXiv:2112.11446, 2021.
- [101] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, “ZeRO-Offload: Democratizing Billion-Scale model training,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 551–564.
- [102] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language model with self generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
- [103] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap et al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” arXiv preprint arXiv:2204.07705, 2022.
- [104] S. H. Bach, V. Sanh, Z.-X. Yong, A. Webson, C. Raffel, N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Fevry et al., “Promptsource: An integrated development environment and repository for natural language prompts,” arXiv preprint arXiv:2202.01279, 2022.
- [105] S. Victor, W. Albert, R. Colin, B. Stephen, S. Lintang, A. Zaid, C. Antoine, S. Arnaud, R. Arun, D. Manan et al., “Multitask prompted training enables zero-shot task generalization,” in International Conference on Learning Representations, 2022.
- [106] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021.
- [107] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations.
- [108] T. Tang, J. Li, W. X. Zhao, and J.-R. Wen, “Mvp: Multi-task supervised pre-training for natural language generation,” arXiv preprint arXiv:2206.12131, 2022.
- [109] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, and G. Irving, “Alignment of language agents,” arXiv preprint arXiv:2103.14659, 2021.
- [110] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022.
- [111] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- [112] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
- [113] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021.
- [114] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks,” arXiv preprint arXiv:2110.07602, 2021.
- [115] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “Gpt understands, too,” AI Open, 2023.
- [116] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao, “Adaptive budget allocation for parameter-efficient fine-tuning,” arXiv preprint arXiv:2303.10512, 2023.
- [117] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023.
- [118] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021.
- [119] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen, L. Yang, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” arXiv preprint arXiv:2307.03109, 2023.
- [120] Z. Liu, H. Jiang, T. Zhong, Z. Wu, C. Ma, Y. Li, X. Yu, Y. Zhang, Y. Pan, P. Shu et al., “Holistic evaluation of gpt-4v for biomedical imaging,” arXiv preprint arXiv:2312.05256, 2023.
- [121] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
- [122] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
- [123] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” 2018.
- [124] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “Superglue: A stickier benchmark for general-purpose language understanding systems,” Advances in neural information processing systems, vol. 32, 2019.
- [125] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020.
- [126] H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin, “Cmmlu: Measuring massive multitask language understanding in chinese,” arXiv preprint arXiv:2306.09212, 2023.
- [127] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson, “Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation,” in International Conference on Machine Learning. PMLR, 2020, pp. 4411–4421.
- [128] S. Ruder, N. Constant, J. Botha, A. Siddhant, O. Firat, J. Fu, P. Liu, J. Hu, D. Garrette, G. Neubig et al., “Xtreme-r: Towards more challenging and nuanced multilingual evaluation,” arXiv preprint arXiv:2104.07412, 2021.
- [129] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” arXiv preprint arXiv:2103.03874, 2021.
- [130] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
- [131] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
- [132] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021.
- [133] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag: Can a machine really finish your sentence?” 2019.
- [134] Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., “Piqa: Reasoning about physical commonsense in natural language,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 7432–7439.
- [135] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” arXiv preprint arXiv:1905.10044, 2019.
- [136] M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi, “Socialiqa: Commonsense reasoning about social interactions,” arXiv preprint arXiv:1904.09728, 2019.
- [137] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021.
- [138] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.
- [139] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? a new dataset for open book question answering,” 2018.
- [140] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,” Applied Sciences, vol. 11, no. 14, p. 6421, 2021.
- [141] A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering,” in Conference on Health, Inference, and Learning. PMLR, 2022, pp. 248–260.
- [142] E. M. Voorhees et al., “The trec-8 question answering track report.” in Trec, vol. 99, 1999, pp. 77–82.
- [143] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” arXiv preprint arXiv:1606.05250, 2016.
- [144] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee et al., “Natural questions: a benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019.
- [145] E. Kamalloo, N. Dziri, C. L. Clarke, and D. Rafiei, “Evaluating open-domain question answering in the era of large language models,” arXiv preprint arXiv:2305.06984, 2023.
- [146] E. Ferrara, “Should chatgpt be biased? challenges and risks of bias in large language models,” arXiv preprint arXiv:2304.03738, 2023.
- [147] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” arXiv preprint arXiv:2009.11462, 2020.
- [148] J. Zhao, M. Fang, Z. Shi, Y. Li, L. Chen, and M. Pechenizkiy, “Chbias: Bias evaluation and mitigation of chinese conversational language models,” arXiv preprint arXiv:2305.11262, 2023.
- [149] M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee, “Scalable extraction of training data from (production) language models,” arXiv preprint arXiv:2311.17035, 2023.
- [150] X. Wu, J. Li, M. Xu, W. Dong, S. Wu, C. Bian, and D. Xiong, “Depn: Detecting and editing privacy neurons in pretrained language models,” arXiv preprint arXiv:2310.20138, 2023.
- [151] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023.
- [152] Z. Zhang, Y. Li, J. Wang, B. Liu, D. Li, Y. Guo, X. Chen, and Y. Liu, “Remos: reducing defect inheritance in transfer learning via relevant model slicing,” in Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 1856–1868.
- [153] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- [154] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
- [155] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019.
- [156] J. Novikova, O. Dušek, A. C. Curry, and V. Rieser, “Why we need new evaluation metrics for nlg,” arXiv preprint arXiv:1707.06875, 2017.
- [157] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38–45.
- [158] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
- [159] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
- [160] G. Zeng, X. Han, Z. Zhang, Z. Liu, Y. Lin, and M. Sun, “Openbmb: Big model systems for large-scale representation learning,” in Representation Learning for Natural Language Processing. Springer Nature Singapore Singapore, 2023, pp. 463–489.
- [161] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
- [162] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
- [163] S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You, “Colossal-ai: A unified deep learning system for large-scale parallel training,” in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 766–775.
- [164] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, “Fastmoe: A fast mixture-of-expert training system,” arXiv preprint arXiv:2103.13262, 2021.
- [165] J. He, J. Zhai, T. Antunes, H. Wang, F. Luo, S. Shi, and Q. Li, “Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 120–134.
- [166] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
- [167] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: a system for Large-Scale machine learning,” in 12th USENIX symposium on operating systems design and implementation (OSDI 16), 2016, pp. 265–283.
- [168] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
- [169] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: An open-source deep learning platform from industrial practice,” Frontiers of Data and Computing, vol. 1, no. 1, pp. 105–115, 2019.
- [170] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
- [171] J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao, F. Yang, X. Yi, C. Wu et al., “Oneflow: Redesign the distributed deep learning framework from scratch,” arXiv preprint arXiv:2110.15032, 2021.
- [172] Huawei Technologies Co., Ltd., “Huawei mindspore ai development framework,” in Artificial Intelligence Technology. Springer, 2022, pp. 137–162.
- [173] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne et al., “Jax: composable transformations of python+ numpy programs,” 2018.
- [174] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in nlp,” arXiv preprint arXiv:1906.02243, 2019.
- [175] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [176] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021.
- [177] S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for bert model compression,” arXiv preprint arXiv:1908.09355, 2019.
- [178] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” arXiv preprint arXiv:1909.10351, 2019.
- [179] M. A. Gordon, K. Duh, and N. Andrews, “Compressing bert: Studying the effects of weight pruning on transfer learning,” arXiv preprint arXiv:2002.08307, 2020.
- [180] P. Michel, O. Levy, and G. Neubig, “Are sixteen heads really better than one?” Advances in neural information processing systems, vol. 32, 2019.
- [181] H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, Q. Liu, M. Lyu, and I. King, “Binarybert: Pushing the limit of bert quantization,” arXiv preprint arXiv:2012.15701, 2020.
- [182] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
- [183] P. Chen, H.-F. Yu, I. Dhillon, and C.-J. Hsieh, “Drone: Data-aware low-rank compression for large nlp models,” Advances in neural information processing systems, vol. 34, pp. 29 321–29 334, 2021.
- [184] X. Han, G. Zeng, W. Zhao, Z. Liu, Z. Zhang, J. Zhou, J. Zhang, J. Chao, and M. Sun, “Bminf: An efficient toolkit for big model inference and tuning,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2022, pp. 224–230.
- [185] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer et al., “Pytorch fsdp: experiences on scaling fully sharded data parallel,” arXiv preprint arXiv:2304.11277, 2023.
- [186] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, 2022.
- [187] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626.
- [188] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” in International Conference on Machine Learning. PMLR, 2023, pp. 31 094–31 116.
- [189] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia, “Specinfer: Accelerating generative llm serving with speculative inference and token tree verification,” arXiv preprint arXiv:2305.09781, 2023.
- [190] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” arXiv preprint arXiv:2309.17453, 2023.
- [191] Z. Zhang, B. Gong, Y. Chen, X. Han, G. Zeng, W. Zhao, Y. Chen, Z. Liu, and M. Sun, “Bmcook: A task-agnostic compression toolkit for big models,” in Proceedings of the The 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2022, pp. 396–405.
- [192] A. Borzunov, D. Baranchuk, T. Dettmers, M. Ryabinin, Y. Belkada, A. Chumachenko, P. Samygin, and C. Raffel, “Petals: Collaborative inference and fine-tuning of large models,” arXiv preprint arXiv:2209.01188, 2022.
- [193] F. Dou, J. Ye, G. Yuan, Q. Lu, W. Niu, H. Sun, L. Guan, G. Lu, G. Mai, N. Liu et al., “Towards artificial general intelligence (agi) in the internet of things (iot): Opportunities and challenges,” arXiv preprint arXiv:2309.07438, 2023.
- [194] C. Liu, Z. Liu, J. Holmes, L. Zhang, L. Zhang, Y. Ding, P. Shu, Z. Wu, H. Dai, Y. Li et al., “Artificial general intelligence for radiation oncology,” Meta-Radiology, p. 100045, 2023.
- [195] Z. Liu, Y. Li, Q. Cao, J. Chen, T. Yang, Z. Wu, J. Hale, J. Gibbs, K. Rasheed, N. Liu et al., “Transformation vs tradition: Artificial general intelligence (agi) for arts and humanities,” arXiv preprint arXiv:2310.19626, 2023.
- [196] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
- [197] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22 199–22 213, 2022.
- [198] N. Qiang, J. Gao, Q. Dong, H. Yue, H. Liang, L. Liu, J. Yu, J. Hu, S. Zhang, B. Ge et al., “Functional brain network identification and fmri augmentation using a vae-gan framework,” Computers in Biology and Medicine, vol. 165, p. 107395, 2023.
- [199] M. He, X. Hou, E. Ge, Z. Wang, Z. Kang, N. Qiang, X. Zhang, and B. Ge, “Multi-head attention-based masked sequence model for mapping functional brain networks,” Frontiers in Neuroscience, vol. 17, p. 1183145, 2023.
- [200] Y. Liu, E. Ge, N. Qiang, T. Liu, and B. Ge, “Spatial-temporal convolutional attention for mapping functional brain networks,” in 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). IEEE, 2023, pp. 1–4.
- [201] S. R. Oota, J. Arora, V. Agarwal, M. Marreddy, M. Gupta, and B. Surampudi, “Neural language taskonomy: Which NLP tasks are the most predictive of fMRI brain activity?” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, Eds. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 3220–3237.
- [202] Z. Liu, Y. Li, P. Shu, A. Zhong, L. Yang, C. Ju, Z. Wu, C. Ma, J. Luo, C. Chen et al., “Radiology-llama2: Best-in-class large language model for radiology,” arXiv preprint arXiv:2309.06419, 2023.
- [203] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan, X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng, Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu, Z. Yin, X. Huang, and X. Qiu, “Moss: Training conversational language models from synthetic data,” 2023.
- [204] X. Zhang, L. Xu, and K. Zhao, “Chatyuan: A large language model for dialogue in chinese and english,” Dec. 2022. [Online]. Available: https://github.com/clue-ai/ChatYuan
- [205] Baichuan, “Baichuan 2: Open large-scale language models,” arXiv preprint arXiv:2309.10305, 2023.
- [206] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.
- [207] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” arXiv preprint arXiv:2306.05685, 2023.
- [208] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV et al., “Rwkv: Reinventing rnns for the transformer era,” arXiv preprint arXiv:2305.13048, 2023.