这是用户在 2024-3-21 24:34 为 https://arxiv.org/html/2402.06196v2 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?
License: CC BY 4.0 许可证:CC BY 4.0
arXiv:2402.06196v2 [cs.CL] 20 Feb 2024
arXiv:2402.06196v2 [cs.CL] 2024 年 2 月 20 日

Large Language Models: A Survey
大型语言模型:调查

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu
谢尔文·米纳伊、托马斯·米科洛夫、纳尔吉斯·尼克扎德、梅萨姆·切纳格鲁

Richard Socher, Xavier Amatriain, Jianfeng Gao
理查德·索彻、泽维尔·阿马特里安、高剑峰


Abstract 抽象的

Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs’ ability of general-purpose language understanding and generation is acquired by training billions of model’s parameters on massive amounts of text data, as predicted by scaling laws [1, 2]. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
自 2022 年 11 月 ChatGPT 发布以来,大型语言模型 (LLMs) 因其在广泛的自然语言任务上的强劲表现而受到广泛关注。LLMs 的能力通用语言理解和生成的能力是通过在大量文本数据上训练数十亿个模型参数来获得的,正如缩放定律所预测的那样 [1, 2]。 LLMs 的研究领域虽然是最近的,但正在以许多不同的方式迅速发展。在本文中,我们回顾了一些最著名的 LLMs,包括三个流行的 LLM 系列(GPT、LLaMA、PaLM),并讨论了它们的特征、贡献和局限性。我们还概述了为构建和增强 LLMs 而开发的技术。然后,我们调查为LLM训练、微调和评估准备的流行数据集,审查广泛使用的LLM评估指标,并比较几种流行LLMs的性能建立在一组有代表性的基准上。最后,我们通过讨论开放的挑战和未来的研究方向来总结本文。

I Introduction 一、简介

Language modeling is a long-standing research topic, dating back to the 1950s with Shannon’s application of information theory to human language, where he measured how well simple n-gram language models predict or compress natural language text [3]. Since then, statistical language modeling became fundamental to many natural language understanding and generation tasks, ranging from speech recognition, machine translation, to information retrieval [4, 5, 6].
语言建模是一个长期存在的研究课题,可以追溯到 20 世纪 50 年代,香农将信息论应用于人类语言,他测量了简单的 n-gram 语言模型预测或压缩自然语言文本的效果 [3]。从那时起,统计语言建模成为许多自然语言理解和生成任务的基础,从语音识别、机器翻译到信息检索[4,5,6]。

The recent advances on transformer-based large language models (LLMs), pretrained on Web-scale text corpora, significantly extended the capabilities of language models (LLMs). For example, OpenAI’s ChatGPT and GPT-4 can be used not only for natural language processing, but also as general task solvers to power Microsoft’s Co-Pilot systems, for instance, can follow human instructions of complex new tasks performing multi-step reasoning when needed. LLMs are thus becoming the basic building block for the development of general-purpose AI agents or artificial general intelligence (AGI).
基于 Transformer 的大型语言模型 (LLMs) 的最新进展,在网络规模文本语料库上进行预训练,显着扩展了语言模型的功能 (LLMs)。例如,OpenAI 的 ChatGPT 和 GPT-4 不仅可以用于自然语言处理,还可以作为通用任务求解器来为微软的 Co-Pilot 系统提供动力,例如,可以遵循复杂新任务的人类指令,在以下情况下执行多步推理:需要。 LLMs 因此成为开发通用人工智能代理或通用人工智能 (AGI) 的基本构建块。

As the field of LLMs is moving fast, with new findings, models and techniques being published in a matter of months or weeks [7, 8, 9, 10, 11], AI researchers and practitioners often find it challenging to figure out the best recipes to build LLM-powered AI systems for their tasks. This paper gives a timely survey of the recent advances on LLMs. We hope this survey will prove a valuable and accessible resource for students, researchers and developers.
随着 LLMs 领域发展迅速,新的发现、模型和技术在几个月或几周内就会发布[7,8,9,10,11],人工智能研究人员和从业者经常发现它找出为他们的任务构建由LLM驱动的人工智能系统的最佳方法是一项挑战。本文对LLMs的最新进展进行了及时的综述。我们希望这项调查能够为学生、研究人员和开发人员提供宝贵且易于使用的资源。

LLMs are large-scale, pre-trained, statistical language models based on neural networks. The recent success of LLMs is an accumulation of decades of research and development of language models, which can be categorized into four waves that have different starting points and velocity: statistical language models, neural language models, pre-trained language models and LLMs.
LLMs 是基于神经网络的大规模、预训练的统计语言模型。 LLMs最近的成功是语言模型几十年研究和发展的积累,可以分为四个起点和速度不同的浪潮:统计语言模型、神经语言模型、预训练语言模型语言模型和LLMs。

Statistical language models (SLMs) view text as a sequence of words, and estimate the probability of text as the product of their word probabilities. The dominating form of SLMs are Markov chain models known as the n-gram models, which compute the probability of a word conditioned on its immediate proceeding n1𝑛1n-1italic_n - 1 words. Since word probabilities are estimated using word and n-gram counts collected from text corpora, the model needs to deal with data sparsity (i.e., assigning zero probabilities to unseen words or n-grams) by using smoothing, where some probability mass of the model is reserved for unseen n-grams [12]. N-gram models are widely used in many NLP systems. However, these models are incomplete in that they cannot fully capture the diversity and variability of natural language due to data sparsity.
统计语言模型 (SLM) 将文本视为单词序列,并将文本的概率估计为其单词概率的乘积。 SLM 的主要形式是被称为 n-gram 模型的马尔可夫链模型,它根据紧接着的 n1𝑛1n-1italic_n - 1 单词来计算单词的概率。由于单词概率是使用从文本语料库收集的单词和 n-gram 计数来估计的,因此模型需要通过使用平滑处理数据稀疏性(即,为未见过的单词或 n-gram 分配零概率),其中模型的一些概率质量为看不见的 n 元语法保留[12]。 N-gram 模型广泛应用于许多 NLP 系统中。然而,这些模型是不完整的,因为由于数据稀疏,它们无法完全捕捉自然语言的多样性和可变性。

Early neural language models (NLMs) [13, 14, 15, 16] deal with data sparsity by mapping words to low-dimensional continuous vectors (embedding vectors) and predict the next word based on the aggregation of the embedding vectors of its proceeding words using neural networks. The embedding vectors learned by NLMs define a hidden space where the semantic similarity between vectors can be readily computed as their distance. This opens the door to computing semantic similarity of any two inputs regardless their forms (e.g., queries vs. documents in Web search [17, 18], sentences in different languages in machine translation [19, 20]) or modalities (e.g., image and text in image captioning [21, 22]). Early NLMs are task-specific models, in that they are trained on task-specific data and their learned hidden space is task-specific.
早期的神经语言模型(NLM)[13,14,15,16]通过将单词映射到低维连续向量(嵌入向量)来处理数据稀疏性,并根据其前一个单词的嵌入向量的聚合来预测下一个单词使用神经网络。 NLM 学习到的嵌入向量定义了一个隐藏空间,其中向量之间的语义相似性可以很容易地计算为它们的距离。这为计算任意两个输入的语义相似性打开了大门,无论其形式(例如,网络搜索中的查询与文档 [17, 18]、机器翻译中不同语言的句子 [19, 20])或模态(例如,图像)以及图像字幕中的文本 [21, 22])。早期的 NLM 是特定于任务的模型,因为它们是根据特定于任务的数据进行训练的,并且它们学习的隐藏空间是特定于任务的。

Pre-trained language models (PLMs), unlike early NLMs, are task-agnostic. This generality also extends to the learned hidden embedding space. The training and inference of PLMs follows the pre-training and fine-tuning paradigm, where language models with recurrent neural networks [23] or transformers [24, 25, 26] are pre-trained on Web-scale unlabeled text corpora for general tasks such as word prediction, and then finetuned to specific tasks using small amounts of (labeled) task-specific data. Recent surveys on PLMs include [8, 27, 28].

Refer to caption
Figure 1: LLM Capabilities.
图 1:LLM 功能。

Large language models (LLMs) mainly refer to transformer-based neural language models 111Recently, several very promising non-transformer LLMs have been proposed, such as the LLMs based on structured state space models [29, 30]. See Section VII for more details. that contain tens to hundreds of billions of parameters, which are pre-trained on massive text data, such as PaLM [31], LLaMA [32], and GPT-4 [33], as summarized in Table III. Compared to PLMs, LLMs are not only much larger in model size, but also exhibit stronger language understanding and generation abilities, and more importantly, emergent abilities that are not present in smaller-scale language models. As illustrated in Fig. 1, these emergent abilities include (1) in-context learning, where LLMs learn a new task from a small set of examples presented in the prompt at inference time, (2) instruction following, where LLMs, after instruction tuning, can follow the instructions for new types of tasks without using explicit examples, and (3) multi-step reasoning, where LLMs can solve a complex task by breaking down that task into intermediate reasoning steps as demonstrated in the chain-of-thought prompt [34]. LLMs can also be augmented by using external knowledge and tools [35, 36] so that they can effectively interact with users and environment [37], and continually improve itself using feedback data collected through interactions (e.g. via reinforcement learning with human feedback (RLHF)).
大型语言模型(LLMs)主要是指基于Transformer的神经语言模型 1 ,包含数百到数千亿个参数,这些模型是在海量文本数据上进行预训练的,例如PaLM [31]、LLaMA [32] 和 GPT-4 [33],如表 III 中总结。与 PLM 相比,LLMs 不仅模型尺寸大得多,而且表现出更强的语言理解和生成能力,更重要的是,具有较小规模语言模型所不具备的涌现能力。如图 1 所示,这些新兴能力包括 (1) 上下文学习,其中 LLMs 从推理时提示中呈现的一小组示例中学习新任务,(2) 遵循指令,其中 LLMs 在指令调整后,可以遵循新类型任务的指令,而无需使用明确的示例,以及(3)多步推理,其中 LLMs 可以解决复杂的任务通过将该任务分解为中间推理步骤,如思维链提示中所示[34]。 LLMs 还可以通过使用外部知识和工具 [35, 36] 进行增强,以便它们能够有效地与用户和环境交互 [37],并使用通过交互收集的反馈数据(例如通过强化)不断改进自身通过人类反馈进行学习(RLHF))。

Through advanced usage and augmentation techniques, LLMs can be deployed as so-called AI agents: artificial entities that sense their environment, make decisions, and take actions. Previous research has focused on developing agents for specific tasks and domains. The emergent abilities demonstrated by LLMs make it possible to build general-purpose AI agents based on LLMs. While LLMs are trained to produce responses in static settings, AI agents need to take actions to interact with dynamic environment. Therefore, LLM-based agents often need to augment LLMs to e.g., obtain updated information from external knowledge bases, verify whether a system action produces the expected result, and cope with when things do not go as expected, etc. We will discuss in detail LLM-based agents in Section IV.
通过先进的使用和增强技术,LLMs 可以部署为所谓的人工智能代理:感知环境、做出决策并采取行动的人工实体。之前的研究主要集中在开发特定任务和领域的代理。 LLMs 所展示的新兴能力使得基于 LLMs 构建通用人工智能代理成为可能。虽然LLMs经过训练可以在静态环境中产生响应,但人工智能代理需要采取行动与动态环境进行交互。因此,基于LLM的代理通常需要增强LLMs,例如从外部知识库获取更新的信息,验证系统操作是否产生预期结果,并在事情发生时进行处理我们将在第四节详细讨论基于 LLM 的代理。

In the rest of this paper, Section II presents an overview of state of the art of LLMs, focusing on three LLM families (GPT, LLaMA and PaLM) and other representative models. Section III discusses how LLMs are built. Section IV discusses how LLMs are used, and augmented for real-world applications Sections V and VI review popular datasets and benchmarks for evaluating LLMs, and summarize the reported LLM evaluation results. Finally, Section VII concludes the paper by summarizing the challenges and future research directions.

Refer to caption
Figure 2: The paper structure.

II Large Language Models

In this section we start with a review of early pre-trained neural language models as they are the base of LLMs, and then focus our discussion on three families of LLMs: GPT, LlaMA, and PaLM. Table I provides an overview of some of these models and their characteristics.

TABLE I: High-level Overview of Popular Language Models
Type Model Name #Parameters Release Base Models Open Source #Tokens Training dataset

BERT

110M, 340M

2018

-

137B

BooksCorpus, English Wikipedia

RoBERTa

355M

2019

-

2.2T

BooksCorpus, English Wikipedia, CC-NEWS, STORIES (a subset of Common Crawl), Reddit

Encoder-Only

ALBERT

12M, 18M, 60M, 235M

2019

-

137B

BooksCorpus, English Wikipedia

DeBERTa

-

2020

-

-

BooksCorpus, English Wikipedia, STORIES, Reddit content

XLNet

110M, 340M

2019

-

32.89B

BooksCorpus, English Wikipedia, Giga5, Common Crawl, ClueWeb 2012-B

Decoder-only

GPT-1

120M

2018

-

1.3B

BooksCorpus

GPT-2

1.5B

2019

-

10B

Reddit outbound

T5 (Base)

223M

2019

-

156B

Common Crawl

Encoder-Decoder

MT5 (Base)

300M

2020

-

-

New Common Crawl-based dataset in 101 languages (m Common Crawl)

BART (Base)

139M

2019

-

-

Corrupting text

GPT-3

125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13B, 175B

2020

×\times×

300B

Common Crawl (filtered), WebText2, Books1, Books2, Wikipedia

GPT Family

CODEX

12B

2021

GPT

-

Public GitHub software repositories

WebGPT

760M, 13B, 175B

2021

GPT-3

×\times×

-

ELI5

GPT-4

1.76T

2023

-

×\times×

13T

-

LLaMA1

7B, 13B, 33B, 65B

2023

-

1T, 1.4T

Online sources

LLaMA2

7B, 13B, 34B, 70B

2023

-

2T

Online sources

Alpaca

7B

2023

LLaMA1

-

GPT-3.5

Vicuna-13B

13B

2023

LLaMA1

-

GPT-3.5

LLaMA Family

Koala

13B

2023

LLaMA

-

Dialogue data

Mistral-7B

7.3B

2023

-

-

Code Llama

34

2023

LLaMA2

500B

Publicly available code

LongLLaMA

3B, 7B

2023

OpenLLaMA

1T

-

LLaMA-Pro-8B

8.3B

2024

LLaMA2-7B

80B

Code and math corpora

TinyLlama-1.1B

1.1B

2024

LLaMA1.1B

3T

SlimPajama, Starcoderdata

PaLM

8B, 62B, 540B

2022

-

×\times×

780B

Web documents, books, Wikipedia, conversations, GitHub code

U-PaLM

8B, 62B, 540B

2022

-

×\times×

1.3B

Web documents, books, Wikipedia, conversations, GitHub code

PaLM Family

PaLM-2

340B

2023

-

3.6T

Web documents, books, code, mathematics, conversational data

Med-PaLM

540B

2022

PaLM

×\times×

780B

HealthSearchQA, MedicationQA, LiveQA

Med-PaLM 2

-

2023

PaLM 2

×\times×

-

MedQA, MedMCQA, HealthSearchQA, LiveQA, MedicationQA

FLAN

137B

2021

LaMDA-PT

-

Web documents, code, dialog data, Wikipedia

Gopher

280B

2021

-

×\times×

300B

MassiveText

ERNIE 4.0

10B

2023

-

×\times×

4TB

Chinese text

Retro

7.5B

2021

-

×\times×

600B

MassiveText

LaMDA

137B

2022

-

×\times×

168B

public dialog data and web documents

ChinChilla

70B

2022

-

×\times×

1.4T

MassiveText

Galactia-120B

120B

2022

-

450B

Other Popular LLMs

CodeGen

16.1B

2022

-

-

THE PILE, BIGQUERY, BIGPYTHON

BLOOM

176B

2022

-

366B

ROOTS

Zephyr

7.24B

2023

Mistral-7B

800B

Synthetic data

Grok-0

33B

2023

-

×\times×

-

Online source

ORCA-2

13B

2023

LLaMA2

-

2001B

-

StartCoder

15.5B

2023

-

35B

GitHub

MPT

7B

2023

-

1T

RedPajama, m Common Crawl, S2ORC, Common Crawl

Mixtral-8x7B

46.7B

2023

-

-

Instruction dataset

Falcon 180B

180B

2023

-

3.5T

RefinedWeb

Gemini

1.8B, 3.25B

2023

-

Web documents, books, and code, image data, audio data, video data

DeepSeek-Coder

1.3B, 6.7B, 33B

2024

-

2T

GitHub’s Markdown and StackExchange

DocLLM

1B,7B

2024

-

×\times×

2T

IIT-CDIP Test Collection 1.0, DocBank

II-A Early Pre-trained Neural Language Models

Language modeling using neural networks was pioneered by [38, 39, 40]. Bengio et al. [13] developed one of the first neural language models (NLMs) that are comparable to n-gram models. Then, [14] successfully applied NLMs to machine translation. The release of RNNLM (an open source NLM toolkit) by Mikolov [41, 42] helped significantly popularize NLMs. Afterwards, NLMs based on recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) [19] and gated recurrent unit (GRU) [20], were widely used for many natural language applications including machine translation, text generation and text classification [43].

Then, the invention of the Transformer architecture [44] marks another milestone in the development of NLMs. By applying self-attention to compute in parallel for every word in a sentence or document an “attention score” to model the influence each word has on another, Transformers allow for much more parallelization than RNNs, which makes it possible to efficiently pre-train very big language models on large amounts of data on GPUs. These pre-trained language models (PLMs) can be fine-tuned for many downstream tasks.

We group early popular Transformer-based PLMs, based on their neural architectures, into three main categories: encoder-only, decoder-only, and encoder-decoder models. Comprehensive surveys of early PLMs are provided in [43, 28].

II-A1 Encoder-only PLMs

As the name suggests, the encoder-only models only consist of an encoder network. These models are originally developed for language understanding tasks, such as text classification, where the models need to predict a class label for an input text. Representative encoder-only models include BERT and its variants, e.g., RoBERTa, ALBERT, DeBERTa, XLM, XLNet, UNILM, as to be described below.

BERT (Birectional Encoder Representations from Transformers) [24] is one of the most widely used encoder-only language models. BERT consists of three modules: (1) an embedding module that converts input text into a sequence of embedding vectors, (2) a stack of Transformer encoders that converts embedding vectors into contextual representation vectors, and (3) a fully connected layer that converts the representation vectors (at the final layer) to one-hot vectors. BERT is pre-trained uses two objectives: masked language modeling (MLM) and next sentence prediction. The pre-trained BERT model can be fine-tuned by adding a classifier layer for many language understanding tasks, ranging from text classification, question answering to language inference. A high-level overview of BERT framework is shown in Fig 3. As BERT significantly improved state of the art on a wide range of language understanding tasks when it was published, the AI community was inspired to develop many similar encoder-only language models based on BERT.

Refer to caption
Figure 3: Overall pre-training and fine-tuning procedures for BERT. Courtesy of [24]

RoBERTa [25] significantly improves the robustness of BERT using a set of model design choices and training strategies, such as modifying a few key hyperparameters, removing the next-sentence pre-training objective and training with much larger mini-batches and learning rates. ALBERT [45] uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: (1) splitting the embedding matrix into two smaller matrices, and (2) using repeating layers split among groups. DeBERTa (Decoding-enhanced BERT with disentangled attention) [26] improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a novel virtual adversarial training method is used for fine-tuning to improve models’ generalization. ELECTRA [46] uses a new pre-training task, known as replaced token detection (RTD), which is empirically proven to be more sample-efficient than MLM. Instead of masking the input, RTD corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, a discriminative model is trained to predict whether a token in the corrupted input was replaced by a generated sample or not. RTD is more sample-efficient than MLM because the former is defined over all input tokens rather than just the small subset being masked out, as illustrated in Fig 4.

Refer to caption
Figure 4: A comparison between replaced token detection and masked language modeling. Courtesy of [46].

XLMs [47] extended BERT to cross-lingual language models using two methods: (1) a unsupervised method that only relies on monolingual data, and (2) a supervised method that leverages parallel data with a new cross-lingual language model objective, as illustrated in Fig 5. XLMs had obtained state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation, at the time they were proposed.

Refer to caption
Figure 5: Cross-lingual language model pretraining. The MLM objective is similar to BERT, but with continuous streams of text as opposed to sentence pairs. The TLM objective extends MLM to pairs of parallel sentences. To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations. Courtesy of [47].

There are also encoder-only language models that leverage the advantages of auto-regressive (decoder) models for model training and inference. Two examples are XLNet and UNILM. XLNet [48] is based on Transformer-XL, pre-trained using a generalized autoregressive method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. UNILM (UNIfied pre-trained Language Model) [49] is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. This is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction is conditioned on, as illustrated in Fig 6. The pre-trained model can be fine-tuned for both natural language understanding and generation tasks.

Refer to caption
Figure 6: Overview of unified LM pre-training. The model parameters are shared across the LM objectives (i.e., bidirectional LM, unidirectional LM, and sequence-to-sequence LM). Courtesy of [49].

II-A2 Decoder-only PLMs

Two of the most widely used decoder-only PLMs are GPT-1 and GPT-2, developed by OpenAI. These models lay the foundation to more powerful LLMs subsequently, i.e., GPT-3 and GPT-4.

GPT-1 [50] demonstrates for the first time that good performance over a wide range of natural language tasks can be obtained by Generative Pre-Training (GPT) of a decoder-only Transformer model on a diverse corpus of unlabeled text in a self-supervised learning fashion (i.e., next word/token prediction), followed by discriminative fine-tuning on each specific downstream task (with much fewer samples), as illustrated in Fig 7. GPT-1 paves the way for subsequent GPT models, with each version improving upon the architecture and achieving better performance on various language tasks.

Refer to caption
Figure 7: High-level overview of GPT pretraining, and fine-tuning steps. Courtesy of OpenAI.

GPT-2 [51] shows that language models are able to learn to perform specific natural language tasks without any explicit supervision when trained on a large WebText dataset consisting of millions of webpages. The GPT-2 model follows the model designs of GPT-1 with a few modifications: Layer normalization is moved to the input of each sub-block, additional layer normalization is added after the final self-attention block, initialization is modified to account for the accumulation on the residual path and scaling the weights of residual layers, vocabulary size is expanded to 50,25, and context size is increased from 512 to 1024 tokens.

II-A3 Encoder-Decoder PLMs

In [52], Raffle et al. shows that almost all NLP tasks can be cast as a sequence-to-sequence generation task. Thus, an encoder-decoder language model, by design, is a unified model in that it can perform all natural language understanding and generation tasks. Representative encoder-decoder PLMs we will review below are T5, mT5, MASS, and BART.

T5 [52] is a Text-to-Text Transfer Transformer (T5) model, where transfer learning is effectively exploited for NLP via an introduction of a unified framework in which all NLP tasks are cast as a text-to-text generation task. mT5 [53] is a multilingual variant of T5, which is pre-trained on a new Common Crawl-based dataset consisting of texts in 101 languages.

MASS (MAsked Sequence to Sequence pre-training) [54] adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence. The encoder takes a sentence with randomly masked fragment (several consecutive tokens) as input, and the decoder predicts the masked fragment. In this way, MASS jointly trains the encoder and decoder for language embedding and generation, respectively.

BART [55] uses a standard sequence-to-sequence translation model architecture. It is pre-trained by corrupting text with an arbitrary noising function, and then learning to reconstruct the original text.

II-B Large Language Model Families

Large language models (LLMs) mainly refer to transformer-based PLMs that contain tens to hundreds of billions of parameters. Compared to PLMs reviewed above, LLMs are not only much larger in model size, but also exhibit stronger language understanding and generation and emergent abilities that are not present in smaller-scale models. In what follows, we review three LLM families: GPT, LLaMA, and PaLM, as illustrated in Fig 8.

Refer to caption
Figure 8: Popular LLM Families.

II-B1 The GPT Family

Generative Pre-trained Transformers (GPT) are a family of decoder-only Transformer-based language models, developed by OpenAI. This family consists of GPT-1, GPT-2, GPT-3, InstrucGPT, ChatGPT, GPT-4, CODEX, and WebGPT. Although early GPT models, such as GPT-1 and GPT-2, are open-source, recent models, such as GPT-3 and GPT-4, are close-source and can only be accessed via APIs. GPT-1 and GPT-2 models have been discussed in the early PLM subsection. We start with GPT-3 below.

GPT-3 [56] is a pre-trained autoregressive language model with 175 billion parameters. GPT-3 is widely considered as the first LLM in that it not only is much larger than previous PLMs, but also for the first time demonstrates emergent abilities that are not observed in previous smaller PLMs. GPT-3 shows the emergent ability of in-context learning, which means GPT-3 can be applied to any downstream tasks without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieved strong performance on many NLP tasks, including translation, question-answering, and the cloze tasks, as well as several ones that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, 3-digit arithmetic. Fig 9 plots the performance of GPT-3 as a function of the number of examples in in-context prompts.

Refer to caption
Figure 9: GPT-3 shows that larger models make increasingly efficient use of in-context information. It shows in-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description. Courtesy of [56].

CODEX [57], released by OpenAI in March 2023, is a general-purpose programming model that can parse natural language and generate code in response. CODEX is a descendant of GPT-3, fine-tuned for programming applications on code corpora collected from GitHub. CODEX powers Microsoft’s GitHub Copilot.

WebGPT [58] is another descendant of GPT-3, fine-tuned to answer open-ended questions using a text-based web browser, facilitating users to search and navigate the web. Specifically, WebGPT is trained in three steps. The first is for WebGPT to learn to mimic human browsing behaviors using human demonstration data. Then, a reward function is learned to predict human preferences. Finally, WebGPT is refined to optimize the reward function via reinforcement learning and rejection sampling.

To enable LLMs to follow expected human instructions, InstructGPT [59] is proposed to align language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, a dataset of labeler demonstrations of the desired model behavior is collected. Then GPT-3 is fine-tuned on this dataset. Then, a dataset of human-ranked model outputs is collected to further fine-tune the model using reinforcement learning. The method is known Reinforcement Learning from Human Feedback (RLHF), as shown in 10. The resultant InstructGPT models have shown improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets.

Refer to caption
Figure 10: The high-level overview of RLHF. Courtesy of [59].

The most important milestone of LLM development is the launch of ChatGPT (Chat Generative Pre-trained Transformer) [60] on November 30, 2022. ChatGPT is chatbot that enables users to steer a conversation to complete a wide range of tasks such as question answering, information seeking, text summarization, and more. ChatGPT is powered by GPT-3.5 (and later by GPT-4), a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.

GPT-4 [33] is the latest and most powerful LLM in the GPT family. Launched in March, 2023, GPT-4 is a multimodal LLM in that it can take image and text as inputs and produce text outputs. While still less capable than humans in some of the most challenging real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers, as shown in Fig 11. Like early GPT models, GPT-4 was first pre-trained to predict next tokens on large text corpora, and then fine-tuned with RLHF to align model behaviors with human-desired ones.

Refer to caption
Figure 11: GPT-4 performance on academic and professional exams, compared with GPT 3.5. Courtesy of [33].

II-B2 The LLaMA Family

LLaMA is a collection of foundation language models, released by Meta. Unlike GPT models, LLaMA models are open-source, i.e., model weights are released to the research community under a noncommercial license. Thus, the LLaMA family grows rapidly as these models are widely used by many research groups to develop better open-source LLMs to compete the closed-source ones or to develop task-specific LLMs for mission-critical applications.

The first set of LLaMA models [32] was released in February 2023, ranging from 7B to 65B parameters. These models are pre-trained on trillions of tokens, collected from publicly available datasets. LLaMA uses the transformer architecture of GPT-3, with a few minor architectural modifications, including (1) using a SwiGLU activation function instead of ReLU, (2) using rotary positional embeddings instead of absolute positional embedding, and (3) using root-mean-squared layer-normalization instead of standard layer-normalization. The open-source LLaMA-13B model outperforms the proprietary GPT-3 (175B) model on most benchmarks, making it a good baseline for LLM research.

In July 2023, Meta, in partnership with Microsoft, released the LLaMA-2 collection [61], which include both foundation language models and Chat models finetuned for dialog, known as LLaMA-2 Chat. The LLaMA-2 Chat models were reported to outperform other open-source models on many public benchmarks. Fig 12 shows the training process of LLaMA-2 Chat. The process begins with pre-training LLaMA-2 using publicly available online data. Then, an initial version of LLaMA-2 Chat is built via supervised fine-tuning. Subsequently, the model is iteratively refined using RLHF, rejection sampling and proximal policy optimization. In the RLHF stage, the accumulation of human feedback for revising the reward model is crucial to prevent the reward model from being changed too much, which could hurt the stability of LLaMA model training.

Refer to caption
Figure 12: Training of LLaMA-2 Chat. Courtesy of [61].

Alpaca [62] is fine-tuned from the LLaMA-7B model using 52K instruction-following demonstrations generated in the style of self-instruct using GPT-3.5 (text-davinci-003). Alpaca is very cost-effective for training, especially for academic research. On the self-instruct evaluation set, Alpaca performs similarly to GPT-3.5, despite that Alpaca is much smaller.

The Vicuna team has developed a 13B chat model, Vicuna-13B, by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a evaluator shows that Vicuna-13B achieves more than 90% quality of OpenAI’s ChatGPT, and Google’s Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90% of cases. 13 shows the relative response quality of Vicuna and a few other well-known models by GPT-4. Another advantage of Vicuna-13B is its relative limited computational demand for model training. The training cost of Vicuna-13B is merely $300.

Refer to caption
Figure 13: Relative Response Quality of Vicuna and a few other well-known models by GPT-4. Courtesy of Vicuna Team.

Like Alpaca and Vicuna, the Guanaco models [63] are also finetuned LLaMA models using instruction-following data. But the finetuning is done very efficiently using QLoRA such that finetuning a 65B parameter model can be done on a single 48GB GPU. QLoRA back-propagates gradients through a frozen, 4-bit quantized pre-trained language model into Low Rank Adapters (LoRA). The best Guanaco model outperforms all previously released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of fine-tuning on a single GPU.

Koala [64] is yet another instruction-following language model built on LLaMA, but with a specific focus on interaction data that include user inputs and responses generated by highly capable closed-source chat models such as ChatGPT. The Koala-13B model performs competitively with state-of-the-art chat models according to human evaluation based on real-world user prompts.

Mistral-7B [65] is a 7B-parameter language model engineered for superior performance and efficiency. Mistral-7B outperforms the best open-source 13B model (LLaMA-2-13B) across all evaluated benchmarks, and the best open-source 34B model (LLaMA-34B) in reasoning, mathematics, and code generation. This model leverages grouped-query attention for faster inference, coupled with sliding window attention to effectively handle sequences of arbitrary length with a reduced inference cost.

The LLaMA family is growing rapidly, as more instruction-following models have been built on LLaMA or LLaMA-2, including Code LLaMA [66], Gorilla [67], Giraffe [68], Vigogne [69], Tulu 65B [70], Long LLaMA [71], and Stable Beluga2 [72], just to name a few.

II-B3 The PaLM Family

The PaLM (Pathways Language Model) family are developed by Google. The first PaLM model [31] was announced in April 2022 and remained private until March 2023. It is a 540B parameter transformer-based LLM. The model is pre-trained on a high-quality text corpus consisting of 780 billion tokens that comprise a wide range of natural language tasks and use cases. PaLM is pre-trained on 6144 TPU v4 chips using the Pathways system, which enables highly efficient training across multiple TPU Pods. PaLM demonstrates continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. PaLM-540B outperforms not only state-of-the-art fine-tuned models on a suite of multi-step reasoning tasks, but also on par with humans on the recently released BIG-bench benchmark.

The U-PaLM models of 8B, 62B, and 540B scales are continually trained on PaLM with UL2R, a method of continue training LLMs on a few steps with UL2’s mixture-of-denoiser objective [73]. An approximately 2x computational savings rate is reported.

U-PaLM is later instruction-finetuned as Flan-PaLM [74]. Compared to other instruction finetuning work mentioned above, Flan-PaLM’s finetuning is performed using a much larger number of tasks, larger model sizes, and chain-of-thought data. As a result, Flan-PaLM substantially outperforms previous instruction-following models. For instance, Flan-PaLM-540B, which is instruction-finetuned on 1.8K tasks, outperforms PaLM-540B by a large margin (+9.4% on average). The finetuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks, as illustrated in Fig 14.

Refer to caption
Figure 14: Flan-PaLM finetuning consist of 473 datasets in above task categories. Courtesy of [74].

PaLM-2 [75] is a more compute-efficient LLM with better multilingual and reasoning capabilities, compared to its predecessor PaLM. PaLM-2 is trained using a mixture of objectives. Through extensive evaluations on English, multilingual, and reasoning tasks, PaLM-2 significantly improves the model performance on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference than PaLM.

Med-PaLM [76] is a domain-specific PaLM, and is designed to provide high-quality answers to medical questions. Med-PaLM is finetuned on PaLM using instruction prompt tuning, a parameter-efficient method for aligning LLMs to new domains using a few exemplars. Med-PaLM obtains very encouraging results on many healthcare tasks, although it is still inferior to human clinicians. Med-PaLM 2 improves Med-PaLM via med-domain finetuning and ensemble prompting [77]. Med-PaLM 2 scored up to 86.5% on the MedQA dataset (i.e., a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries), improving upon Med-PaLM by over 19% and setting a new state-of-the-art.

II-C Other Representative LLMs

In addition to the models discussed in the previous subsections, there are other popular LLMs which do not belong to those three model families, yet they have achieved great performance and have pushed the LLMs field forward. We briefly describe these LLMs in this subsection.

FLAN: In [78], Wei et al. explored a simple method for improving the zero-shot learning abilities of language models. They showed that instruction tuning language models on a collection of datasets described via instructions substantially improves zero-shot performance on unseen tasks. They take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. They call this instruction-tuned model FLAN. Fig 15 provides a comparison of instruction tuning with pretrain–finetune and prompting.

Refer to caption
Figure 15: comparison of instruction tuning with pretrain–finetune and prompting. Courtesy of [78].

Gopher: In [79], Rae et al. presented an analysis of Transformer-based language model performance across a wide range of model scales — from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models were evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. The number of layers, the key/value size, and other hyper-parameters of different model sizes are shown in Fig 16.

Refer to caption
Figure 16: Model architecture details of Gopher with different number of parameters. Courtesy of [78].

T0: In [80], Sanh et al. developed T0, a system for easily mapping any natural language tasks into a human-readable prompted form. They converted a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. Then, a T0 encoder-decoder model is developed to consume textual inputs and produces target responses. The model is trained on a multitask mixture of NLP datasets partitioned into different tasks.

ERNIE 3.0: In [81], Sun et al. proposed a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models. It fuses auto-regressive network and auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks using zero-shot learning, few-shot learning or fine-tuning. They have trained ERNIE 3.0 with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. Fig 17 illustrates the model architecture of Ernie 3.0.

Refer to caption
Figure 17: High-level model architecture of ERNIE 3.0. Courtesy of [81].

RETRO: In [82], Borgeaud et al. enhanced auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. Using a 2-trillion-token database, the Retrieval-Enhanced Transformer (Retro) obtains comparable performance to GPT-3 and Jurassic-1 [83] on the Pile, despite using 25% fewer parameters. As shown in Fig 18, Retro combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.

Refer to caption
Figure 18: Retro architecture. Left: simplified version where a sequence of length n = 12 is split into l = 3 chunks of size m = 4. For each chunk, we retrieve k = 2 neighbours of r = 5 tokens each. The retrieval pathway is shown on top. Right: Details of the interactions in the CCA operator. Causality is maintained as neighbours of the first chunk only affect the last token of the first chunk and tokens from the second chunk. Courtesy of [82].

GLaM: In [84], Du et al. proposed a family of LLMs named GLaM (Generalist Language Model), which use a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero, one and few-shot performance across 29 NLP tasks. Fig 19 shows the high-level architecture of GLAM.

Refer to caption
Figure 19: GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). Courtesy of [84].

LaMDA: In [85], Thoppilan et al. presented LaMDA, a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. They showed that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.

OPT: In [86], Zhang et al. presented Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which they share with researchers. The OPT models’ parameters are shown in 20

Refer to caption
Figure 20: Different OPT Models’ architecture details. Courtesy of [86].

Chinchilla: In [2], Hoffmann et al. investigated the optimal model size and number of tokens for training a transformer language model under a given compute budget. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they found that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. They tested this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4% more more data.

Galactica: In [87], Taylor et al. introduced Galactica, a large language model that can store, combine and reason about scientific knowledge. They trained on a large scientific corpus of papers, reference material, knowledge bases and many other sources. Galactica performed well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%.

CodeGen: In [88], Nijkamp et al. trained and released a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open sourced the training library JAXFORMER. They showed the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. They further investigated the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying sub-problems. They also constructed an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts.

AlexaTM: In [89], Soltan et al. demonstrated that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various task. They trained a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and showed that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM consist of 46 encoder layers, 32 decoder layers, 32 attention heads, and dmodel=4096subscript𝑑𝑚𝑜𝑑𝑒𝑙4096d_{model}=4096italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT = 4096.

Sparrow: In [90], Glaese et al. presented Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. They used reinforcement learning from human feedback to train their models with two new additions to help human raters judge agent behaviour. The high-level pipeline of Sparrow model is shown in Fig 21.

Refer to caption
Figure 21: Sparrow pipeline relies on human participation to continually expand a training set. Courtesy of [90].

Minerva: In [91], Lewkowycz et al. introduced Minerva, a large language model pretrained on general natural language data and further trained on technical content, to tackle previous LLM struggle with quantitative reasoning (such as solving mathematics, science, and engineering problems).

MoD: In [92], Tay et al. presented a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. They proposed Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. This framework is known as Unifying Language Learning (UL2). An overview of UL2 pretraining paradigm is shown in Fig 21.

Refer to caption
Figure 22: An overview of UL2 pretraining paradigm. Courtesy of [92].

BLOOM: In [93], Scao et al. presented BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). An overview of BLOOM architecture is shown in Fig 23.

Refer to caption
Figure 23: An overview of BLOOM architecture. Courtesy of [93].

GLM: In [94], Zeng et al. introduced GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It was an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained.

Pythia: In [95], Biderman et al. introduced Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study.

Orca: In [96], Mukherjee et al. develop Orca, a 13-billion parameter model that learns to imitate the reasoning process of large foundation models. Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT.

StarCoder: In [97], Li et al. introduced StarCoder and StarCoderBase. They are 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on one trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. They fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. They performed the most comprehensive evaluation of Code LLMs to date and showed that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model.

KOSMOS: In [98], Huang et al. introduced KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e. zero-shot). Specifically, they trained KOSMOS-1 from scratch on web-scale multi-modal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions).

Gemini: In [99], Gemini team introduced a new family of multimodal models, that exhibit promising capabilities across image, audio, video, and text understanding. Gemini family includes three versions: Ultra for highly-complex tasks, Pro for enhanced performance and deployability at scale, and Nano for on-device applications. Gemini architecture is built on top of Transformer decoders, and is trained to support 32k context length (via using efficient attention mechanisms).

Refer to caption
Figure 24: Timeline of some of the most representative LLM frameworks (so far). In addition to large language models with our #parameters threshold, we included a few representative works, which pushed the limits of language models, and paved the way for their success (e.g. vanilla Transformer, BERT, GPT-1), as well as some small language models. \clubsuit shows entities that serve not only as models but also as approaches. \blacklozenge shows only approaches.

Some of the other popular LLM frameworks (or techniques used for efficient developments of LLMs) includes Inner-Monologue [100], Megatron-Turing NLG [101], LongFormer [102], OPT-IML [103], MeTaLM [104], Dromedary [105], Palmyra [106], Camel [107], Yalm [108], MPT [109], ORCA-2 [110], Gorilla [67], PAL [111], Claude [112], CodeGen 2 [113], Zephyr [114], Grok [115], Qwen [116], Mamba [30], Mixtral-8x7B [117], DocLLM [118], DeepSeek-Coder [119], FuseLLM-7B [120], TinyLlama-1.1B [121], LLaMA-Pro-8B [122].

Fig 24 provides an overview of some of the most representative LLM frameworks, and the relevant works that have contributed to the success of LLMs and helped to push the limits of LLMs.

III How LLMs Are Built

In this section, we first review the popular architectures used for LLMs, and then discuss data and modeling techniques ranging from data preparation, tokenization, to pre-training, instruction tuning, and alignment.

Once the model architecture is chosen, the major steps involved in training an LLM includes: data preparation (collection, cleaning, deduping, etc.), tokenization, model pre-training (in a self-supervised learning fashion), instruction tuning, and alignment. We will explain each of them in a separate subsection below. These steps are also illustrated in Fig 25.

Refer to caption
Figure 25: This figure shows different components of LLMs.

III-A Dominant LLM Architectures

The most widely used LLM architectures are encoder-only, decoder-only, and encoder-decoder. Most of them are based on Transformer (as the building block). Therefore we also review the Transformer architecture here.

III-A1 Transformer

in a ground-breaking work [44], Vaswani et al. proposed the Transformer framework, which was originally designed for effective parallel computing using GPUs. The heart of Transformer is the (self-)attention mechanism, which can capture long-term contextual information much more effectively using GPUs than the recurrence and convolution mechanisms. Fig 26 provides a high-level overview of transformer work. In this section we provide an overview of the main elements and variants, see [44, 123] for more details.

The Transformer language model architecture, originally proposed for machine translation, consists of an encoder and a decoder. The encoder is composed of a stack of N = 6 identical Transformer layers. Each layer has two sub-layers. The first one is a multi-head self-attention layer, and the other one is a simple position-wise fully connected feed-forward network. The decoder is composed of a stack of 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder has a third sub-layer, which performs multi-head attention over the output of the encoder stack. The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Instead of performing a single attention function with dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT dimensional keys, values and queries, it is found to be beneficial to linearly project the queries, keys and values hhitalic_h with different, learned linear projections to dksubscript𝑑𝑘d_{k}italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, dksubscript𝑑𝑘d_{k}italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and dvsubscript𝑑𝑣d_{v}italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT dimensions, respectively. Positional encoding is incorporated to fuse information about the relative or absolute position of the tokens in the sequence.

Refer to caption
Figure 26: High-level overview of transformer work. Courtesy of [44].

III-A2 Encoder-Only

For this family, at each stage, the attention layers can access all the words in the initial sentence. The pre-training of these models usually consist of somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence. Encoder models are great for tasks requiring an understanding of the full sequence, such as sentence classification, named entity recognition, and extractive question answering. One prominent encoder only model is BERT (Bidirectional Encoder Representations from Transformers), proposed in [24].

III-A3 Decoder-Only

For these models, at each stage, for any word, the attention layers can only access the words positioned before that in the sentence. These models are also sometimes called auto-regressive models. The pretraining of these models is usually formulated as predicting the next word (or token) in the sequence. The decoder-only models are best suited for tasks involving text generation. GPT models are prominent example of this model category.

III-A4 Encoder-Decoder

These models use both encoder and decoder, and are sometimes called sequence-to-sequence models. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder only accesses the words positioned before a given word in the input. These models are usually pre-trained using the objectives of encoder or decoder models, but usually involve something a bit more complex. For instance, some models are pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces. Encoder-decoder models are best suited for tasks about generating new sentences conditioned on a given input, such as summarization, translation, or generative question answering.

III-B Data Cleaning

Data quality is crucial to the performance of language models trained on them. Data cleaning techniques such as filtering, deduplication, are shown to have a big impact on the model performance.

As an example, in Falcon40B [124], Penedo et al. showed that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, they were able to obtain five trillion tokens from CommonCrawl. They also released an extract of 600 billion tokens from our REFINEDWEB dataset, and 1.3/7.5B parameters language models trained on it. 27 shows the Refinement process of CommonCrawl data by this work.

Refer to caption
Figure 27: Subsequent stages of Macrodata Refinement remove nearly 90% of the documents originally in CommonCrawl. Courtesy of [124].

III-B1 Data Filtering

Data filtering aims to enhance the quality of training data and the effectiveness of the trained LLMs. Common data filtering techniques include:

Removing Noise: refers to eliminating irrelevant or noisy data that might impact the model’s ability to generalize well. As an example, one can think of removing false information from the training data, to lower the chance of model generating false responses. Two mainstream approaches for quality filtering includes: classifier-based, and heuristic-based frameworks.

Handling Outliers: Identifying and handling outliers or anomalies in the data to prevent them from disproportionately influencing the model.

Addressing Imbalances: Balancing the distribution of classes or categories in the dataset to avoid biases and ensure fair representation. This is specially useful for responsible model training and evaluation.

Text Preprocessing: Cleaning and standardizing text data by removing stop words, punctuation, or other elements that may not contribute significantly to the model’s learning.

Dealing with Ambiguities: Resolving or excluding ambiguous or contradictory data that might confuse the model during training. This can help the model to provide more definite and reliable answers.

III-B2 Deduplication

De-duplication refers to the process of removing duplicate instances or repeated occurrences of the same data in a dataset. Duplicate data points can introduce biases in the model training process and reduce the diversity, as the model may learn from the same examples multiple times, potentially leading to overfitting on those particular instances. Some works [125] have shown that de-duplication improves models’ ability to generalize to new, unseen data.

The de-duplication process is particularly important when dealing with large datasets, as duplicates can unintentionally inflate the importance of certain patterns or characteristics. This is especially relevant in NLP tasks, where diverse and representative training data is crucial for building robust language models.

The specific de-duplication method can vary based on the nature of the data and the requirements of the particular language model being trained. It may involve comparing entire data points or specific features to identify and eliminate duplicates. At the document level, existing works mainly rely on the overlap ratio of high-level features (e.g. n-grams overlap) between documents to detect duplicate samples.

III-C Tokenizations

Tokenization referes to the process of converting a sequence of text into smaller parts, known as tokens. While the simplest tokenization tool simply chops text into tokens based on white space, most tokenization tools rely on a word dictionary. However, out-of-vocabulary (OOV) is a problem in this case because the tokenizer only knows words in its dictionary. To increase the coverage of dictionaries, popular tokenizers used for LLMs are based on sub-words, which can be combined to form a large number of words, including the words unseen in training data or words in different languages. In what follows, we describe three popular tokenizers.

III-C1 BytePairEncoding

BytePairEncoding is originally a type of data compression algorithm that uses frequent patterns at byte level to compress the data. By definition, this algorithm mainly tries to keep the frequent words in their original form and break down ones that are not common. This simple paradigm keeps the vocabulary not very large, but also good enough to represent common words at the same time. Also morphological forms of the frequent words can be represented very well if suffix or prefix is also commonly presented in the training data of the algorithm.

III-C2 WordPieceEncoding

This algorithm is mainly used for very well-known models such as BERT and Electra. At the beginning of training, the algorithm takes all the alphabet from the training data to make sure that nothing will be left as UNK or unknown from the training dataset. This case happens when the model is given an input that can not be tokenized by the tokenizer. It mostly happens in cases where some characters are not tokenizable by it. Similar to BytePairEncoding, it tries to maximize the likelihood of putting all the tokens in vocabulary based on their frequency.

III-C3 SentencePieceEncoding

Although both tokenizers described before are strong and have many advantages compared to white-space tokenization, they still take assumption of words being always separated by white-space as granted. This assumption is not always true, in fact in some languages, words can be corrupted by many noisy elements such as unwanted spaces or even invented words. SentencePieceEncoding tries to address this issue.

III-D Positional Encoding

III-D1 Absolute Positional Embeddings

(APE) [44] has been used in the original Transformer model to preserve the information of sequence order. Therefore, the positional information of words is added to the input embeddings at the bottom of both the encoder and decoder stacks. There are various options for positional encodings, either learned or fixed. In the vanilla Transformer, sine and cosine functions are employed for this purpose. The main drawback of using APE in Transformers is the restriction to a certain number of tokens. Additionally, APE fails to account for the relative distances between tokens.

III-D2 Relative Positional Embeddings

(RPE) [126] involves extending self-attention to take into account the pairwise links between input elements. RPE is added to the model at two levels: first as an additional component to the keys, and subsequently as a sub-component of the values matrix. This approach looks at the input as a fully-connected graph with labels and directed edges. In the case of linear sequences, edges can capture information about the relative position differences between input elements. A clipping distance, represented as k 2kn42𝑘𝑛42\leq k\leq n-42 ≤ italic_k ≤ italic_n - 4, specifies the maximum limit on relative locations. This allows the model to make reasonable predictions for sequence lengths that are not part of the training data.

III-D3 Rotary Position Embeddings

Rotary Positional Embedding (RoPE) [127] tackles problems with existing approaches. Learned absolute positional encodings can lack generalizability and meaningfulness, particularly when sentences are short. Moreover, current methods like T5’s positional embedding face challenges with constructing a full attention matrix between positions. RoPE uses a rotation matrix to encode the absolute position of words and simultaneously includes explicit relative position details in self-attention. RoPE brings useful features like flexibility with sentence lengths, a decrease in word dependency as relative distances increase, and the ability to improve linear self-attention with relative position encoding. GPT-NeoX-20B, PaLM, CODEGEN, and LLaMA are among models that take advantage of RoPE in their architectures.

III-D4 Relative Positional Bias

The concept behind this type of positional embedding is to facilitate extrapolation during inference for sequences longer than those encountered in training. In [128] Press et al. proposed Attention with Linear Biases (ALiBi). Instead of simply adding positional embeddings to word embeddings, they introduced a bias to the attention scores of query-key pairs, imposing a penalty proportional to their distance. In the BLOOM model, ALiBi is leveraged.

Refer to caption
(a) Absolute Positional Embeddings [129]
Refer to caption
(b) Relative Positional Embeddings
Refer to caption
(c) Rotary Positional Embedding [127]
Refer to caption
(d) Relative Positional Bias [128]
Figure 28: Various positional encodings are employed in LLMs.

III-E Model Pre-training

Pre-training is the very first step in large language model training pipeline, and it helps LLMs to acquire fundamental language understanding capabilities, which can be useful in a wide range of language related tasks. During pre-training, the LLM is trained on a massive amount of (usually) unlabeled texts, usually in a self-supervised manner. There are different approaches used for pre-training like next sentence prediction [24], two most common ones include, next token prediction (autoregressive language modeling), and masked language modeling.

In Autoregressive Language Modeling framework, given a sequence of n𝑛nitalic_n tokens x1subscript𝑥1x_{1}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, …, xnsubscript𝑥𝑛x_{n}italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, the model tries to predict next token xn+1subscript𝑥𝑛1x_{n+1}italic_x start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT (and sometimes next sequence of tokens) in an auto-regressive fashion. One popular loss function in this case is the log-likelihood of predicted tokens as shown in Eq 2

ALM(x)=i=1Np(xi+n|xi,,xi+n1)subscript𝐴𝐿𝑀𝑥superscriptsubscript𝑖1𝑁𝑝conditionalsubscript𝑥𝑖𝑛subscript𝑥𝑖subscript𝑥𝑖𝑛1\mathscr{L}_{ALM}(x)=\sum_{i=1}^{N}p(x_{i+n}|x_{i},...,x_{i+n-1})script_L start_POSTSUBSCRIPT italic_A italic_L italic_M end_POSTSUBSCRIPT ( italic_x ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_p ( italic_x start_POSTSUBSCRIPT italic_i + italic_n end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_i + italic_n - 1 end_POSTSUBSCRIPT ) (1)

Given the auto-regressive nature of this framework, the decoder-only models are naturally better suited to learn how to accomplish these task.

In Masked Language Modeling, some words are masked in a sequence and the model is trained to predict the masked words based on the surrounding context. Sometimes people refer to this approach as denoising autoencoding, too. If we denote the masked/corrupted samples in the sequence x𝑥xitalic_x, as x~~𝑥\tilde{x}over~ start_ARG italic_x end_ARG, then the training objective of this approach can be written as:

MLM(x)=i=1Np(x~|x\x~)subscript𝑀𝐿𝑀𝑥superscriptsubscript𝑖1𝑁𝑝conditional~𝑥\𝑥~𝑥\mathscr{L}_{MLM}(x)=\sum_{i=1}^{N}p(\tilde{x}|x\backslash\tilde{x})script_L start_POSTSUBSCRIPT italic_M italic_L italic_M end_POSTSUBSCRIPT ( italic_x ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_p ( over~ start_ARG italic_x end_ARG | italic_x \ over~ start_ARG italic_x end_ARG ) (2)

And more recently, Mixture of Experts (MoE) [130, 131] have become very popular in LLM space too. MoEs enable models to be pre-trained with much less compute, which means one can dramatically scale up the model or dataset size with the same compute budget as a dense model. MoE consists of two main elements: Sparse MoE layers, which are used instead of dense feed-forward network (FFN) layers, and have a certain number of “experts” (e.g. 8), in which each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks. A gate network or router, that determines which tokens are sent to which expert. It is worth noting that, one can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs - the router is composed of learned parameters and is pretrained at the same time as the rest of the network. Fig 29 provides an illustration of a Switch Transformer encoder block, which are used in MoE.

Refer to caption
Figure 29: : Illustration of a Switch Transformer encoder block. They replaced the dense feed forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer (light blue). . Courtesy of [131].

III-F Fine-tuning and Instruction Tuning

Early language models such as BERT trained using self-supervision as explained in section III-E were not able to perform specific tasks. In order for the foundation model to be useful it needed to be fine-tuned to a specific task with labeled data (so-called supervised fine-tuning or SFT for short). For example, in the original BERT paper [24], the model was fine-tuned to 11 different tasks. While more recent LLMs no longer require fine-tuning to be used, they can still benefit from task or data-specific fine-tuning. For example, OpenAI reports that the much smaller GPT-3.5 Turbo model can outperform GPT-4 when fine-tuned with task specific data 222https://platform.openai.com/docs/guides/fine-tuning.

Fine-tuning does not need to be performed to a single task though, and there are different approaches to multi-task fine-tuning (see e.g. Mahabi et al. [132]). Fine-tuning to one or more tasks is known to improve results and reduce the complexity of prompt engineering, and it can serve as an alternative to retrieval augmented generation. Furthermore, there are other reasons why it might be advisable to fine-tune. For example, one might want to fine-tune to expose the model to new or proprietary data that it has not been exposed to during pre-training.

An important reason to fine-tune LLMs is to align the responses to the expectations humans will have when providing instructions through prompts. This is the so-called instruction tuning [133]. We dive into the details of how to design and engineer prompts in section IV-B, but in the context of instruction tuning, it is important to understand that the instruction is a prompt that specifies the task that the LLM should accomplish. Instruction tuning datasets such as Natural Instructions [134] include not only the task definition but other components such as positive/negative examples or things to avoid.

The specific approach and instruction datasets used to instruction-tune an LLM varies, but, generally speaking, instruction tuned models outperform their original foundation models they are based on. For example, InstructGPT [59] outperforms GPT-3 on most benchmarks. The same is true for Alpaca [62] when compared to LLaMA.

Self-Instruct [135], proposed by Wang et al. is also a popular approach along this line, in which they introduced a framework for improving the instruction-following capabilities of pre-trained language models by bootstrapping their own generations. Their pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to fine tune the original model.

III-G Alignment

AI Alignment is the process of steering AI systems towards human goals, preferences, and principles. LLMs, pre-trained for word prediction, often exhibit unintended behaviors. For example, they might generate contents that are toxic, harmful, misleading and biased.

Instruction tuning, discussed above, gets LLMs a step closer to being aligned. However, in many cases, it is important to include further steps to improve the alignment of the model and avoid unintended behaviors 333According to very recent research by Ethayarajh et al. [136], further alignment besides SFT mainly improves models of at least 7B parameters. For smaller models, SFT is sufficient.. We review the most popular approaches to alignment in this subsection.

RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback) are two popular approaches. RLHF uses a reward model to learn alignment from human feedback. This reward model, after being tuned, is able to rate different outputs and score them according to their alignment preferences given by humans. The reward model gives feedback to the original LLM and this feedback is used to tune the LLM further [137]. Reinforcement learning from AI feedback on the other hand, directly connects a pretrained and well-aligned model to the LLM and helps it to learn from larger and more aligned models [138].

In another recent work (known as DPO) [139], Rafailov et al. discussed that RLHF is a complex and often unstable procedure, and tried to address this with a new approach. They leveraged a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which they called Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. They observed that fine-tuning with DPO exceeds RLHF’s ability to control sentiment of generations and improves response quality in summarization. Fig 30 shows the high-level comparison between DPO vs RLHF.

Refer to caption
Figure 30: DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL. Courtesy of [139].

Even more recently Ethayarajh et al. proposed a new alignment approach called the Kahneman-Tversky Optimization (KTO) [136]. Unlike existing state-of-the-art approaches, KTO does not require paired preference data (x𝑥xitalic_x, ywsubscript𝑦𝑤y_{w}italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT, ylsubscript𝑦𝑙y_{l}italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT), and it only needs (x,y) and knowledge of whether y𝑦yitalic_y is desirable or undesirable. KTO-aligned models are shown to be good or better than DPO-aligned models at scales from 1B to 30B, despite not using paired preferences. KTO is also far easier to use in the real world than preference optimization methods, as the kind of data it needs is far more abundant. As an example, every retail company has a lot of customer interaction data and whether that interaction was successful (e.g., purchase made) or unsuccessful (e.g., no purchase made). However, They have little to no counterfactual data (i.e., what would have made an unsuccessful customer interaction ylsubscript𝑦𝑙y_{l}italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT into a successful one ywsubscript𝑦𝑤y_{w}italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT). Fig 31 shows a high-level comparison between KTO and other alignment approaches discussed above.

Refer to caption
Figure 31: LLM alignment involves supervised finetuning followed by optimizing a human-centered loss (HALO). However, the paired preferences that existing approaches need are hard-to-obtain. In contrast, KTO uses a far more abundant kind of data, making it much easier to use in the real world. Courtesy of [136].

III-H Decoding Strategies

Decoding refers to the process of text generation using pre-trained LLMs. Given an input prompt, the tokenizer translates each token in the input text into a corresponding token ID. Then, the language model uses these token IDs as input and predicts the next most likely token (or a sequence of tokens). Finally, the model generates logits, which are converted to probabilities using a softmax function. Different decoding strategies have been proposed. Some of the most popular ones are greedy search, beam search, as well as different sample techniques such as top-K, top-P (Nucleus sampling).

III-H1 Greedy Search

Greedy search takes the most probable token at each step as the next token in the sequence, discarding all other potential options. As you can imagine, this is a simple approach and can loose a lot of temporal consistency and coherency. It only considers the most probable token at each step, without considering the overall effect on the sequence. This property makes it fast, but it also means that it can miss out on better sequences that might have appeared with slightly less probable next tokens.

III-H2 Beam Search

Unlike greedy search that only considers the next most probable token, beam search takes into account the N most likely tokens, where N denotes the number of beams. This procedure is repeated until a predefined maximum sequence length is reached or an end-of-sequence token appears. At this point, the sequence of tokens (AKA “beam”) with the highest overall score is chosen as the output. For example for beam size of 2 and maximum length of 5, the beam search needs to keep track of 25=32superscript25322^{5}=322 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT = 32 possible sequences. So it is more computationally intensive than greedy search.

III-H3 Top-k Sampling

Top-k sampling is a technique that uses the probability distribution generated by the language model to select a token randomly from the k most likely options.

Suppose we have 6 tokens (A, B, C, D, E, F) and k=2, and P(A)= 30%, and P(B)= 20%, P(C)= P(D)= P(E)= P(F)= 12.5%. In top-k sampling, tokens C, D, E, F are disregarded, and the model outputs A 60% of the time, and B, 40% of the time. This approach ensures that we prioritize the most probable tokens while introducing an element of randomness in the selection process.

The randomness is usually introduced via the concept of temperature. The temperature T is a parameter that ranges from 0 to 1, which affects the probabilities generated by the softmax function, making the most likely tokens more influential. In practice, it simply consists of dividing the input logits by temperature value:

softmax(xi)=exi/Tjexj/T𝑠𝑜𝑓𝑡𝑚𝑎𝑥subscript𝑥𝑖superscript𝑒subscript𝑥𝑖𝑇subscript𝑗superscript𝑒subscript𝑥𝑗𝑇softmax(x_{i})=\frac{e^{x_{i}/T}}{\sum_{j}e^{x_{j}/T}}italic_s italic_o italic_f italic_t italic_m italic_a italic_x ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = divide start_ARG italic_e start_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_T end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT / italic_T end_POSTSUPERSCRIPT end_ARG (3)

A low temperature setting significantly alters the probability distribution (and is commonly used in text generation to control the level of “creativity” in the generated output), while a large temperature prioritizes the tokens with higher probabilities. Top-k is a creative way of sampling, and can be used along with beam search. The sequence chosen by top-k sampling may not be the sequence with highest probability in beam search. But it’s important to remember that highest scores do not always lead to more realistic or meaningful sequences.

III-H4 Top-p Sampling

Top-p sampling, also known as Nucleus sampling, takes a slightly different approach from top-k sampling. Instead of selecting the top k most probable tokens, nucleus sampling chooses a cutoff value p such that the sum of the probabilities of the selected tokens exceeds p. This forms a “nucleus” of tokens from which to randomly choose the next token. In other words, in top-p sampling the language model examines the most probable tokens in descending order and keeps adding them to the list until the sum of probabilities surpasses the threshold p. As you can imagine, this could be better specially for scenarios in which top-k tokens do not have a large probability mass. Unlike top-k sampling, the number of tokens included in the nucleus sampling is not fixed. This variability often results in a more diverse and creative output, making nucleus sampling popular for text generation related tasks.

III-I Cost-Effective Training/Inference/Adaptation/Compression

In this part, we review some of the popular approaches used for more cost-friendly (and compute-friendly) training and usage of LLMs.

III-I1 Optimized Training

There are many frameworks developed for optimized training of LLMs, here we introduce some of the prominent ones.

ZeRO: In [140], Rajbhandari et al. developed a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed of LLMs while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing one to scale the model size proportional to the number of devices with sustained high efficiency.

RWKV: In [141], Peng et al. proposed a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Their approach leverages a linear attention mechanism and allows them to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. RWKV architecture is shown in Fig 32.


Refer to caption
Figure 32: RWKV architecture. Courtesy of [141].

The Time Complexity comparison of RWKV with different Transformers are provided in Fig 33.

Refer to caption
Figure 33: Time Complexity comparison of RWKV with different Transformers. Here T denotes the sequence length, d the feature dimension, and c is MEGA’s chunk size of quadratic attention. Courtesy of [141].

III-I2 Low-Rank Adaption (LoRA)

Low-Rank Adaptation is a popular and lightweight training technique that significantly reduces the number of trainable parameters, and is based on a crucial insight that the difference between the fine-tuned weights for a specialized task and the initial pre-trained weights often exhibits “low intrinsic rank” - meaning that it can be approximated well by a low rank matrix [142]. Training with LoRA is much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), that are easier to store and share. One property of low-rank matrices is that they can be represented as the product of two smaller matrices. This realization leads to the hypothesis that this delta between fine-tuned weights and initial pre-trained weights can be represented as the matrix product of two much smaller matrices. By focusing on updating these two smaller matrices rather than the entire original weight matrix, computational efficiency can be substantially improved.

Specifically, for a pre-trained weight matrix W0Rd×ksubscript𝑊0superscript𝑅𝑑𝑘W_{0}\in R^{d\times k}italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ italic_R start_POSTSUPERSCRIPT italic_d × italic_k end_POSTSUPERSCRIPT, LoRA constrains its update by representing the latter with a low-rank decomposition W0+ΔW=W0+BAsubscript𝑊0Δ𝑊subscript𝑊0𝐵𝐴W_{0}+\Delta W=W_{0}+BAitalic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + roman_Δ italic_W = italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_B italic_A, where BRd×r𝐵superscript𝑅𝑑𝑟B\in R^{d\times r}italic_B ∈ italic_R start_POSTSUPERSCRIPT italic_d × italic_r end_POSTSUPERSCRIPT , ARr×k𝐴superscript𝑅𝑟𝑘A\in R^{r\times k}italic_A ∈ italic_R start_POSTSUPERSCRIPT italic_r × italic_k end_POSTSUPERSCRIPT, and the rank rmin(d,k)much-less-than𝑟𝑚𝑖𝑛𝑑𝑘r\ll min(d,k)italic_r ≪ italic_m italic_i italic_n ( italic_d , italic_k ). During training, W0subscript𝑊0W_{0}italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is frozen and does not receive gradient updates, while A𝐴Aitalic_A and B𝐵Bitalic_B contain trainable parameters. It is worth mentioning that both W0subscript𝑊0W_{0}italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and ΔW=BAΔ𝑊𝐵𝐴\Delta W=BAroman_Δ italic_W = italic_B italic_A are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For h=W0xsubscript𝑊0𝑥h=W_{0}xitalic_h = italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_x, their modified forward pass yields: h=W0x+ΔWx=W0x+BAxsubscript𝑊0𝑥Δ𝑊𝑥subscript𝑊0𝑥𝐵𝐴𝑥h=W_{0}x+\Delta Wx=W_{0}x+BAxitalic_h = italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_x + roman_Δ italic_W italic_x = italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_x + italic_B italic_A italic_x. Usually a random Gaussian initialization is used for A𝐴Aitalic_A, and zero initialization for B𝐵Bitalic_B, so ΔW=BAΔ𝑊𝐵𝐴\Delta W=BAroman_Δ italic_W = italic_B italic_A is zero at the beginning of training. They then scale ΔWxΔ𝑊𝑥\Delta Wxroman_Δ italic_W italic_x by αr𝛼𝑟\alpha ritalic_α italic_r, where α𝛼\alphaitalic_α is a constant in r. This reparametrization is illustrated in Figure 34

Refer to caption
Figure 34: An illustration of LoRA reparametrizan. Only A𝐴Aitalic_A and B𝐵Bitalic_B trained during this process. Courtesy of [142].

It is worth mentioning that LoRA can be applied to any a subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module (Wqsubscript𝑊𝑞W_{q}italic_W start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , Wksubscript𝑊𝑘W_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, Wvsubscript𝑊𝑣W_{v}italic_W start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , Wosubscript𝑊𝑜W_{o}italic_W start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT), and two in the MLP module. Most of the time, LoRA is focused on adapting the attention weights only for downstream tasks, and freezes the MLP modules, so they are not trained in downstream tasks both for simplicity and parameter-efficiency.

III-I3 Knowledge Distillation

Knowledge distillation is the process of learning from a larger model [143]. Earlier days of best-performing models release have proven that this approach is very useful even if it is used in an API distillation approach. It is also referred to as an approach to distill the knowledge of not a single model but in fact multiple models into a smaller one. Creating smaller models by this approach yields smaller model sizes that can be used even on edge devices. Knowledge distillation as shown in Fig 35, illustrates a general setup of this training scheme.

Refer to caption
Figure 35: A generic knowledge distillation framework with student and teacher (Courtesy of [144]).

Knowledge can be transferred by different forms of learning: response distillation, feature distillation, and API distillation. Response distillation is concerned only with the outputs of the teacher model and tries to teach the student model how to exactly or at least similarly perform (in the sense of prediction) as the teacher. Feature distillation not only uses the last layer but also intermediate layers as well to create a better inner representation for the student model. This helps the smaller model to have a similar representation as the teacher model.

API distillation is the process of using an API (typically from an LLM provider such as OpenAI) to train smaller models. In the case of LLMs, it is used to train the model from the direct output of the larger model which makes it very similar to response distillation. Many concerns are raised by this type of distillation because in cases where the model itself is not openly available, a (usually) paid API is exposed for end users. On the other hand, while users pay for each call, how to use the predictions is limited, for example, OpenAI prohibits usage of its API to create LLMs that later will be used to compete with it. The main value in such case is training data.

III-I4 Quantization

deep learning in its core, is a set of mathematical functions applied to matrices, with a specific precision for model weights. Reducing the precision of the weights can be used to reduce the size of the model and also make it faster. As an example, Float-32 operations compared to Int-8 operations are slower. This process, which is called quantization, can be applied in different phases. Main approaches for model quantization can be categorized as: post training quantization and quantization-aware training. Post-training quantization is concerned with quantized trained models in two well-known methods: dynamic and static. Dynamic post-training quantization computes the range of quantization on the runtime and is slower compared to static. Quantization-aware training adds quantization criteria into training, and a quantized model is trained and optimized during training process. This approach ensures that the end model will have good performance and also does not need to be quantized after training.

IV How LLMs Are Used and Augmented

Once the LLMs are trained, we can use them to generate desired outputs for a variety of tasks. LLMs can be used directly through basic prompting. However, in order to exploit their full potential or to address some of the shortcomings, we need to augment the models through some external means. In this section we first provide a brief overview of the main shortcoming of LLMs, with a deeper look at the issue of hallucination. We then describe how prompting and some augmentation approaches can not only address those limitations but also be used to augment the capabilities of LLMs going as far as turning an LLM into a full-blown AI agent with the ability to interface with the external world.

Refer to caption
Figure 36: How LLMs Are Used and Augmented.

IV-A LLM limitations

It is important to remember that LLMs are trained to predict a token. While fine-tuning and alignment improves their performance and adds different dimensions to their abilities, there are still some important limitations that come up, particularly if they are used naively. Some of them include the following:

  • They don’t have state/memory. LLMs on their own cannot remember even what was sent to them in the previous prompt. That is an important limitation for many of the uses cases that require some form of state.

  • They are stochastic/probabilistic. If you send the same prompt to an LLM several times, you are likely to get different responses. While there are parameters, and in particular the temperature, to limit the variability in the response, this is an inherent property of their training that can create issues.

  • They have stale information and, on their own, don’t have access to external data. An LLM on its own does not even know about the current time or day and does not have access to any information that was not present in its training set.

  • They are generally very large. This means that many costly GPU machines are needed for training and serving. In some cases, largest models have poor SLAs, particularly in terms of latency.

  • They hallucinate. LLMs do not have a notion of ”truth” and they have usually been trained on a mix of good and bad content. They can produce very plausible but untruthful answers.

While the previous limitations can all become important for some applications, it is worth for us to dive a bit into the last one, hallucinations, since it has gathered a lot of interest over the past few months and it has also sparked many of the prompt approaches and LLM augmentation methods we will later describe.

Hallucination: In the realm of Large Language Models (LLMs), the phenomenon of ”hallucinations” has garnered significant attention. Defined in the literature, notably in the ”Survey of Hallucination in Natural Language Generation” paper [145], hallucination in an LLM is characterized as ”the generation of content that is nonsensical or unfaithful to the provided source.” This terminology, although rooted in psychological parlance, has been appropriated within the field of artificial intelligence.

Hallucinations in LLMs can be broadly categorized into two types:

  1. 1.

    Intrinsic Hallucinations: These directly conflict with the source material, introducing factual inaccuracies or logical inconsistencies.

  2. 2.

    Extrinsic Hallucinations: These, while not contradicting, are unverifiable against the source, encompassing speculative or unconfirmable elements.

The definition of ’source’ in LLM contexts varies with the task. In dialogue-based tasks, it refers to ’world knowledge’, whereas in text summarization, it pertains to the input text itself. This distinction plays a crucial role in evaluating and interpreting hallucinations. The impact of hallucinations is also highly context-dependent. For instance, in creative endeavors like poem writing, hallucinations might be deemed acceptable or even beneficial.

LLMs, trained on diverse datasets including the internet, books, and Wikipedia, generate text based on probabilistic models without an inherent understanding of truth or falsity. Recent advancements like instruct tuning and Reinforcement Learning from Human Feedback (RLHF) have attempted to steer LLMs towards more factual outputs, but the fundamental probabilistic nature and its inherent limitations remain. A recent study, “Sources of Hallucination by Large Language Models on Inference Tasks” [146], highlights two key aspects contributing to hallucinations in LLMs: the veracity prior and the relative frequency heuristic, underscoring the complexities inherent in LLM training and output generation.

Effective automated measurement of hallucinations in LLMs requires a combination of statistical and model-based metrics.

Statistical Metrics:

  • Metrics like ROUGE [147] and BLEU [148] are common for assessing text similarity, focusing on intrinsic hallucinations.

  • Advanced metrics such as PARENT [149], PARENT-T [150], and Knowledge F1 [151] are utilized when structured knowledge sources are available. These metrics, while effective, have limitations in capturing syntactic and semantic nuances.

Model-Based Metrics:

  • IE-Based Metrics: Utilize Information Extraction models to simplify knowledge into relational tuples, then compare these with the source.

  • QA-Based Metrics: Assess the overlap between generated content and the source through a question-answering framework (see [152]).

  • NLI-Based Metrics: Use Natural Language Inference datasets to evaluate the truthfulness of a generated hypothesis based on a given premise (see [153]).

  • Faithfulness Classification Metrics: Offer a refined assessment by creating task-specific datasets for a nuanced evaluation (see [154]).

Despite advances in automated metrics, human judgment remains a vital piece. It typically involves two methodologies:

  1. 1.

    Scoring: Human evaluators rate the level of hallucination within a predefined scale.

  2. 2.

    Comparative Analysis: Evaluators compare generated content against baseline or ground-truth references, adding an essential layer of subjective assessment.

FactScore [155] is a recent example of a metric that can be used both for human and model-based evaluation. The metric breaks an LLM generation into “atomic facts”. The final score is computed as the sum of the accuracy of each atomic fact, giving each of them equal weight. Accuracy is a binary number that simply states whether the atomic fact is supported by the source. The authors implement different automation strategies that use LLMs to estimate this metric.

Finally, mitigating hallucinations in LLMs is a multifaceted challenge, requiring tailored strategies to suit various applications. Those include:

  • Product Design and User Interaction Strategies such as use case design, structuring the input/output, or providing mechanisms for user feedback.

  • Data Management and Continuous Improvement. Maintaining and analyzing a tracking set of hallucinations is essential for ongoing model improvement.

  • Prompt Engineering and Metaprompt Design. Many of the advanced prompt techniques described in IV-B such as Retrieval Augmented Generation directly address hallucination risks.

  • Model Selection and Configuration for Hallucination Mitigation. For exemple, larger models with lower temperature settings usually perform better. Also, techniques such as RLHF or domain-sepcific fine-tuning can mitigate hallucination risks.

IV-B Using LLMs: Prompt Design and Engineering

A prompt in generative AI models is the textual input provided by users to guide the model’s output. This could range from simple questions to detailed descriptions or specific tasks. Prompts generally consist of instructions, questions, input data, and examples. In practice, to elicit a desired response from an AI model, a prompt must contain either instructions or questions, with other elements being optional. Advanced prompts involve more complex structures, such as ”chain of thought” prompting, where the model is guided to follow a logical reasoning process to arrive at an answer.

Prompt engineering is a rapidly evolving discipline that shapes the interactions and outputs of LLMs and other generative AI models. The essence of prompt engineering lies in crafting the optimal prompt to achieve a specific goal with a generative model. This process is not only about instructing the model but also involves some understanding of the model’s capabilities and limitations, and the context within which it operates.

Prompt engineering transcends the mere construction of prompts; it requires a blend of domain knowledge, understanding of the AI model, and a methodical approach to tailor prompts for different contexts. This might involve creating templates that can be programmatically modified based on a given dataset or context. For example, generating personalized responses based on user data might use a template that is dynamically filled with relevant user information.

Furthermore, prompt engineering is an iterative and exploratory process, akin to traditional machine learning practices such as model evaluation or hyperparameter tuning. The rapid growth of this field suggests its potential to revolutionize certain aspects of machine learning, moving beyond traditional methods like feature or architecture engineering. On the other hand, traditional engineering practices such as version control and regression testing need to be adapted to this new paradigm just like they were adapted to other machine learning approaches [156].

In the following paragraphs we detail some of the most interesting and popular prompt engineering approaches.

IV-B1 Chain of Thought (CoT)

The Chain of Thought (CoT) technique, initially described in the paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”[34] by Google researchers, represents a pivotal advancement in prompt engineering for Large Language Models (LLMs). This approach hinges on the understanding that LLMs, while proficient in token prediction, are not inherently designed for explicit reasoning. CoT addresses this by guiding the model through essential reasoning steps.

CoT is based on making the implicit reasoning process of LLMs explicit. By outlining the steps required for reasoning, the model is directed closer to a logical and reasoned output, especially in scenarios demanding more than simple information retrieval or pattern recognition.

CoT prompting manifests in two primary forms:

  1. 1.

    Zero-Shot CoT: This form involves instructing the LLM to “think step by step”, prompting it to deconstruct the problem and articulate each stage of reasoning.

  2. 2.

    Manual CoT: A more complex variant, it requires providing step-by-step reasoning examples as templates for the model. While yielding more effective results, it poses challenges in scalability and maintenance.

Manual CoT is more effective than zero-shot. However, the effectiveness of this example-based CoT depends on the choice of diverse examples, and constructing prompts with such examples of step by step reasoning by hand is hard and error prone. That is where automatic CoT [157] comes into play.

IV-B2 Tree of Thought (ToT)

The Tree of Thought (ToT) [158] prompting technique is inspired by the concept of considering various alternative solutions or thought processes before converging on the most plausible one. ToT is based on the idea of branching out into multiple ”thought trees” where each branch represents a different line of reasoning. This method allows the LLM to explore various possibilities and hypotheses, much like human cognitive processes where multiple scenarios are considered before determining the most likely one.

A critical aspect of ToT is the evaluation of these reasoning paths. As the LLM generates different branches of thought, each is assessed for its validity and relevance to the query. This process involves real-time analysis and comparison of the branches, leading to a selection of the most coherent and logical outcome.

ToT is particularly useful in complex problem-solving scenarios where a single line of reasoning might not suffice. It allows LLMs to mimic a more human-like problem-solving approach, considering a range of possibilities before arriving at a conclusion. This technique enhances the model’s ability to handle ambiguity, complexity, and nuanced tasks, making it a valuable tool in advanced AI applications.

IV-B3 Self-Consistency

Self-Consistency [159] utilizes an ensemble-based method, where the LLM is prompted to generate multiple responses to the same query. The consistency among these responses serves as an indicator of their accuracy and reliability.

The Self-Consistency approach is grounded in the principle that if an LLM generates multiple, similar responses to the same prompt, it is more likely that the response is accurate. This method involves asking the LLM to tackle a query multiple times, each time analyzing the response for consistency. This technique is especially useful in scenarios where factual accuracy and precision are paramount.

The consistency of responses can be measured using various methods. One common approach is to analyze the overlap in the content of the responses. Other methods may include comparing the semantic similarity of responses or employing more sophisticated techniques like BERT-scores or n-gram overlaps. These measures help in quantifying the level of agreement among the responses generated by the LLM.

Self-Consistency has significant applications in fields where the veracity of information is critical. It is particularly relevant in scenarios like fact-checking, where ensuring the accuracy of information provided by AI models is essential. By employing this technique, prompt engineers can enhance the trustworthiness of LLMs, making them more reliable for tasks that require high levels of factual accuracy.

IV-B4 Reflection

Reflection [160] involves prompting LLMs to assess and potentially revise their own outputs based on reasoning about the correctness and coherence of their responses. The concept of Reflection centers on the ability of LLMs to engage in a form of self-evaluation. After generating an initial response, the model is prompted to reflect on its own output, considering factors like factual accuracy, logical consistency, and relevance. This introspective process can lead to the generation of revised or improved responses.

A key aspect of Reflection is the LLM’s capacity for self-editing. By evaluating its initial response, the model can identify potential errors or areas of improvement. This iterative process of generation, reflection, and revision enables the LLM to refine its output, enhancing the overall quality and reliability of its responses.

IV-B5 Expert Prompting

Expert Prompting [161] enhances the capabilities of Large Language Models (LLMs) by simulating the responses of experts in various fields. This method involves prompting the LLMs to assume the role of an expert and respond accordingly, providing high-quality, informed answers. A key strategy within Expert Prompting is the multi-expert approach. The LLM is prompted to consider responses from multiple expert perspectives, which are then synthesized to form a comprehensive and well-rounded answer. This technique not only enhances the depth of the response but also incorporates a range of viewpoints, reflecting a more holistic understanding of the subject matter.

IV-B6 Chains

Chains refer to the method of linking multiple components in a sequence to handle complex tasks with Large Language Models (LLMs). This approach involves creating a series of interconnected steps or processes, each contributing to the final outcome. The concept of Chains is based on the idea of constructing a workflow where different stages or components are sequentially arranged. Each component in a Chain performs a specific function, and the output of one serves as the input for the next. This end-to-end arrangement allows for more complex and nuanced processing, as each stage can be tailored to handle a specific aspect of the task. Chains can vary in complexity and structure, depending on the requirements. In “PromptChainer: Chaining Large Language Model Prompts through Visual Programming” [162], the authors not only describe the main challenges in designing chains, but also describe a visual tool to support those tasks.

IV-B7 Rails

Rails in advanced prompt engineering refer to a method of guiding and controlling the output of Large Language Models (LLMs) through predefined rules or templates. This approach is designed to ensure that the model’s responses adhere to certain standards or criteria, enhancing the relevance, safety, and accuracy of the output. The concept of Rails involves setting up a framework or a set of guidelines that the LLM must follow while generating responses. These guidelines are typically defined using a modeling language or templates known as Canonical Forms, which standardize the way natural language sentences are structured and delivered.

Rails can be designed for various purposes, depending on the specific needs of the application:

  • Topical Rails: Ensure that the LLM sticks to a particular topic or domain.

  • Fact-Checking Rails: Aimed at minimizing the generation of false or misleading information.

  • Jailbreaking Rails: Prevent the LLM from generating responses that attempt to bypass its own operational constraints or guidelines.

IV-B8 Automatic Prompt Engineering (APE)

Automatic Prompt Engineering (APE) [163] focuses on automating the process of prompt creation for Large Language Models (LLMs). APE seeks to streamline and optimize the prompt design process, leveraging the capabilities of LLMs themselves to generate and evaluate prompts. APE involves using LLMs in a self-referential manner where the model is employed to generate, score, and refine prompts. This recursive use of LLMs enables the creation of high-quality prompts that are more likely to elicit the desired response or outcome.

The methodology of APE can be broken down into several key steps:

  • Prompt Generation: The LLM generates a range of potential prompts based on a given task or objective.

  • Prompt Scoring: Each generated prompt is then evaluated for its effectiveness, often using criteria like clarity, specificity, and likelihood of eliciting the desired response.

  • Refinement and Iteration: Based on these evaluations, prompts can be refined and iterated upon, further enhancing their quality and effectiveness.

IV-C Augmenting LLMs through external knowledge - RAG

One of the main limitations of pre-trained LLMs is their lack of up-to-date knowledge or access to private or use-case-specific information. This is where retrieval augmented generation (RAG) comes into the picture [164]. RAG, illustrated in figure 37, involves extracting a query from the input prompt and using that query to retrieve relevant information from an external knowledge source (e.g. a search engine or a knowledge graph, see figure 38 ). The relevant information is then added to the original prompt and fed to the LLM in order for the model to generate the final response. A RAG system includes three important components: Retrieval, Generation, Augmentation [165].

Refer to caption
Figure 37: An example of synthesizing RAG with LLMs for question answering application [166].
Refer to caption
Figure 38: This is one example of synthesizing the KG as a retriever with LLMs [167].
RAG-aware prompting techniques

Because of the importance of RAG to build advanced LLM systems, several RAG-aware prompting techniques have been developed recently. One such technique is Forward-looking Active Retrieval Augmented Generation (FLARE)

Forward-looking Active Retrieval Augmented Generation (FLARE) [168] enhances the capabilities of Large Language Models (LLMs) by iteratively combining prediction and information retrieval. FLARE represents an evolution in the use of retrieval-augmented generation, aimed at improving the accuracy and relevance of LLM responses.

FLARE involves an iterative process where the LLM actively predicts upcoming content and uses these predictions as queries to retrieve relevant information. This method contrasts with traditional retrieval-augmented models that typically retrieve information once and then proceed with generation. In FLARE, this process is dynamic and ongoing throughout the generation phase. In FLARE, each sentence or segment generated by the LLM is evaluated for confidence. If the confidence level is below a certain threshold, the model uses the generated content as a query to retrieve relevant information, which is then used to regenerate or refine the sentence. This iterative process ensures that each part of the response is informed by the most relevant and current information available.

For more details on RAG framework and its relevant works, we refer the readers to this survey of retrieval augmented generations [165].

IV-D Using External Tools

Retrieving information from an external knowledge source as described above is only one of the potential ways to augment an LLM. More generally, an LLM can access any number of external tools (e.g. an API to a service) to augment its functionality. In that regards, RAG can be seen as a specific instance of the broader category of the so called ”tools”.

Tools in this context are external functions or services that LLMs can utilize. These tools extend the range of tasks an LLM can perform, from basic information retrieval to complex interactions with external databases or APIs.

In the paper ”Toolformer: Language Models Can Teach Themselves to Use Tools” [169], the authors go beyond simple tool usage by training an LLM to decide what tool to use when, and even what parameters the API needs. Tools include two different search engines, or a calculator. In the following examples, the LLM decides to call an external Q&A tool, a calculator, and a Wikipedia Search Engine More recently, researchers at Berkeley have trained a new LLM called Gorilla [67] that beats GPT-4 at the use of APIs, a specific but quite general tool.

Tool-aware prompting techniques

Similarly to what was described with RAG, several tool-aware prompting approaches have been developed to make usage of tools more scalable. A popular technique is the so called Automatic Multi-step Reasoning and Tool-use (ART).

Automatic Multi-step Reasoning and Tool-use (ART) [170] is a prompt engineering technique that combines automated chain of thought prompting with the use of external tools. ART represents a convergence of multiple prompt engineering strategies, enhancing the ability of Large Language Models (LLMs) to handle complex tasks that require both reasoning and interaction with external data sources or tools.

ART involves a systematic approach where, given a task and input, the system first identifies similar tasks from a task library. These tasks are then used as examples in the prompt, guiding the LLM on how to approach and execute the current task. This method is particularly effective when tasks require a combination of internal reasoning and external data processing or retrieval.

IV-E LLM Agents

The idea of AI agents has been well-explored in the history of AI. An agent is typically an autonomous entity that can perceive the environment using its sensors, make a judgment based on the state it currently is, and accordingly act based on the actions that are available to it.

In the context of LLMs, an agent refers to a system based on a specialized instantiation of an (augmented) LLM that is capable of performing specific tasks autonomously. These agents are designed to interact with users and environment to make decisions based on the input and the intended goal of the interaction. Agents are based on LLMs equipped with the ability to access and use tools, and to make decisions based on the given input. They are designed to handle tasks that require a degree of autonomy and decision-making, typically beyond simple response generation.

The functionalities of a generic LLM-based agent include:

  • Tool Access and Utilization: Agents have the capability to access external tools and services, and to utilize these resources effectively to accomplish tasks.

  • Decision Making: They can make decisions based on the input, context, and the tools available to them, often employing complex reasoning processes.

As an example, an LLM that has access to a function (or an API) such as weather API, can answer any question related to the weather of the specific place. In other words, it can use APIs to solve problems. Furthermore, if that LLM has access to an API that allows to make purchases, a purchasing agent can be built to not only have capabilities to read information from the external world, but also act on it [171].

Refer to caption
Figure 39: HuggingGPT: An agent-based approach to use tools and planning [image courtesy of [171]]

Fig. 40 shows another example of LLM-based agents for conversational information seeking [36], where an LLM is augmented with a set of plug-and-play modules, including a working memory that tracks the dialog state, a policy that makes an execution plan for the task and selects next system action, an action executor that performs an action selected by the policy (consolidating evidence from external knowledge, or prompting the LLM to generate responses), and a utility that accesses the alignment of the LLM’s responses with user expectations or specific business requirements, and generate feedback to improve agent performance.

Refer to caption
Figure 40: A LLM-based agent for conversational information seeking. Courtesy of [36].

For more details on LLM-based AI agents see recent survey [172, 173, 174].

Prompt engineering techniques for agents

Like RAG and Tools, prompt engineering techniques that specifically address the needs of LLM-based agents have been developed. Three such examples are Reasoning without Observation (ReWOO), Reason and Act (ReAct), and Dialog-Enabled Resolving Agents (DERA).

Reasoning without Observation (ReWOO) [175] aims to decouple reasoning from direct observations. ReWOO operates by enabling LLMs to formulate comprehensive reasoning plans or meta-plans without immediate reliance on external data or tools. This approach allows the agent to create a structured framework for reasoning that can be executed once the necessary data or observations are available. In ReWOO, the LLM initially develops a plan (a series of steps) that outlines how to approach and solve a given problem. This meta-planning phase is crucial as it sets the stage for the agent to process information once it becomes available. The execution phase then involves integrating actual data or observations into the pre-specified plan, leading to coherent and contextually relevant responses. ReWOO offers significant advantages in terms of token efficiency and robustness to tool failure. It enables LLMs to handle tasks where immediate access to external data is not available, relying instead on a well-structured reasoning framework. This method is particularly advantageous in scenarios where data retrieval is costly, slow, or uncertain, allowing the LLM-based agent to maintain a high level of performance and reliability.

Reason and Act (ReAct)[176] prompts LLMs to generate not only verbal reasoning but also actionable steps, thus enhancing the model’s dynamic problem-solving capabilities. ReAct is grounded in the principle of integrating reasoning with action. In this approach, the LLM is prompted to alternate between generating reasoning traces (explanations) and taking actions (steps or commands) in an interleaved manner. This approach allows the model to dynamically reason about a problem, and propose and take concrete actions simultaneously.

Dialog-Enabled Resolving Agents (DERA) [177] are specialized AI agents that can engage in dialogue, resolve queries, and make decisions based on interactive exchanges. DERA is developed based on the idea of utilizing multiple agents within a dialog context, each with specific roles and functions. These agents can include Researchers, who gather and analyze information, and Deciders, who make final judgments based on the information provided. This division of roles allows for a well-organized and efficient approach to problem-solving and decision-making. DERA is particularly advantageous in scenarios requiring complex decision-making and problem-solving, such as those in medical diagnostics or customer service. The collaborative and interactive nature of DERA agents allows them to handle intricate queries with a level of depth and nuance that single-agent systems might struggle with. Moreover, this approach aligns well with human decision-making processes, making AI reasoning more relatable and trustworthy.

V Popular Datasets for LLMs

Large language models exhibit promising accomplishments, but the main question that arises is how effectively they function and how their performance can be assessed in specific tasks or applications.

The evaluation of LLMs poses particular challenges due to the evolving landscape of their applications. The original intent behind developing LLMs was to boost the performance of NLP tasks such as translation, summarization, question-answering, and so on [178]. However, it is evident today that these models are finding utility across diverse domains including code generation and finance. Moreover, the evaluation of LLMs encompasses several critical considerations such as fairness and bias, fact-checking, and reasoning. In this section, we outline the commonly used benchmarks for assessing LLMs. These benchmarks are categorized based on training or evaluating the LLM Capabilities.

Refer to caption
Figure 41: Dataset applications.
Refer to caption
Figure 42: Datasets licensed under different licenses.

V-A Datasets for Basic Tasks: language modeling/understanding/generation

This section provides an overview of the benchmarks and datasets suited to evaluate the basic abilities of LLMs.

  • Natural Questions [179] is a QA dataset that consists of real anonymized, aggregated queries submitted to the Google search engine as questions. An annotator is presented with a question along with a Wikipedia page from the top 5555 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present.

  • MMLU [180] is intended to evaluate the knowledge gained in zero-shot and few-shot scenarios. That means that MMLU assesses both the general knowledge and problem-solving ability of a model. It covers 57 subjects in STEM, humanities, social sciences, and other areas. The benchmark varies in complexity, ranging from elementary to advanced professional. It is worth mentioning that the main contribution of this dataset is for multi-task language understanding, question answering, and arithmetic reasoning.

  • MBPP [181] stands for “Mostly Basic Python Problems” and provides a benchmark for evaluating the performance of models designed for code generation. The benchmark encompasses 974974974974 short Python programs including a wide range of topics, including fundamental programming concepts and standard library usage, and more. Each challenge comprises a task description, a code solution, and three automated test cases.

  • HumanEval [182] is a dataset for code generation task. This dataset consists of 164164164164 hand-crafted programming challenges. Each challenge is accompanied by a function signature, docstring, code body, and multiple unit tests. The main intuition behind developing this dataset is to guarantee the exclusion of its contents from training datasets for code generation models.

  • APPS [183] is designed for code generation task focusing on the Python programming language. The APPS dataset contains a collection of 232,444232444232,444232 , 444 Python programs. Each program in the dataset has an average of 18181818 lines of Python code. Additionally, APPS offers access to a repository of 10,0001000010,00010 , 000 unique programming exercises, each with text-based problem descriptions. The final aspect to highlight is that the it includes test cases.

  • WikiSQL [184] is crafted for code generation task and it has 87,726 carefully labeled pairs of SQL queries and corresponding natural language questions from Wikipedia tables. The SQL queries comprise three subsets: test sets (17,2841728417,28417 , 284 examples), development (9,14591459,1459 , 145 examples), and training (61,2976129761,29761 , 297 examples).

  • TriviaQA [185] is designed for QA task. This dataset comprises more than 650,000650000650,000650 , 000 question-answer-evidence triples. There are 95,0009500095,00095 , 000 question-answer pairs in this dataset, each authored by trivia enthusiasts and supported by an average of six independently sourced evidence documents. These documents are automatically acquired from Wikipedia or broader web search results. The dataset is categorized into two segments, including those with authentic answers from Wikipedia and web domains, and verified sets embody the accurately answered questions along with their associated documents from both Wikipedia and online.

  • RACE [186] suits for reading comprehension task. This dataset is based on English tests completed by Chinese students from middle school and high school, aged 12121212 to 18181818, and it contains roughly 28,0002800028,00028 , 000 texts and 100,000100000100,000100 , 000 questions rigorously prepared by human specialists, primarily English instructors. This dataset contains a wide range of subjects that were purposefully chosen to assess students’ comprehension and reasoning abilities. This dataset is available in three subgroups: RACE-M, RACE-H, and RACE. RACE-M refers to the middle school examinations, whereas RACE-H denotes the high school tests. Finally, RACE is the synthesis of RACE-M and RACE-H.

  • SQuAD [187] stands for “Stanford Question Answering Dataset” and is a crowdsourced reading comprehension dataset based on Wikipedia articles. It has approximately 100,000100000100,000100 , 000 question-answer pairs connected to more than 500500500500 articles. The answers to these questions are typically text fragments or spans taken from the corresponding reading passages. The questions may be unanswerable in some cases. The dataset is divided into three sets: an 80%percent8080\%80 % training set, a 10%percent1010\%10 % development set, and a 10%percent1010\%10 % hidden test set.

  • BoolQ [188] is a yes/no question-answering dataset where the goal is reading comprehension task. BoolQ includes 15,9421594215,94215 , 942 examples. Each example is a triplet that includes a question, a relevant paragraph, and the solution. Although the main intuition behind this dataset is for reading comprehension, it can be used for reasoning, natural language inference, and question-answering tasks.

  • MultiRC [189] is another dataset that fits reading comprehension task. MultiRC contains brief paragraphs as well as multi-sentence questions that can be answered using the information in the paragraph. The paragraphs in this dataset come from a variety of sources, including news, fiction, historical texts, Wikipedia articles, discussions on society and law, elementary school science textbooks, and 9/11 reports. Each question has many response choices, with one or more of them being correct. Answering the questions requires reasoning across several sentences. MultiRC dataset encompasses around 6,00060006,0006 , 000 multi-sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five.

V-B Datasets for Emergent: ICL, reasoning (CoT), instruction following

This section centers on the benchmarks and datasets employed to evaluate the emergent abilities of LLMs.

  • GSM8K [190] is designed to evaluate the model’s ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistically diverse grade school math word problems written by humans. The dataset is split into two sets: a training set with 7.5K7.5𝐾7.5K7.5 italic_K problems, and a test set with 1111K problems. These problems need 2222 to 8888 steps to be solved. Solutions mainly are a series of elementary calculations using basic arithmetic operations.

  • MATH [191] enables to assess how well models can solve math problems. MATH dataset hast 12121212, 500500500