
Automatic Chain of Thought Prompting
in Large Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola

Shanghai Jiao Tong University, Amazon Web Services
Work done during an internship at Amazon Web Services. Correspondence to Zhuosheng Zhang <zhangzs@sjtu.edu.cn> and Aston Zhang <astonz@amazon.com>
Abstract

Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like “Let’s think step by step” to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one by one, i.e., let’s think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT.

It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations.

Code is available at https://github.com/amazon-research/auto-cot

1 Introduction

Large language models (LLMs) (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022) have performed impressively on complex reasoning tasks by decomposing the multi-step problems into intermediate steps before producing the answer. This reasoning process is elicited by a very recent technique: chain-of-thought (CoT) prompting (Wei et al., 2022a).

Figure 1: Zero-Shot-CoT (Kojima et al., 2022) (using the “Let’s think step by step” prompt) and Manual-CoT (Wei et al., 2022a) (using manually designed demonstrations one by one) with example inputs and outputs of an LLM.

CoT prompting can be categorized into two major paradigms. One adds a single prompt like “Let’s think step by step” after the test question to facilitate the reasoning chains in LLMs (Kojima et al., 2022). Since this prompting paradigm is task-agnostic and does not need input-output demonstrations, it is called Zero-Shot-CoT (left of Figure 1). With Zero-Shot-CoT, LLMs have been shown to be decent zero-shot reasoners. The other paradigm is few-shot prompting with manual reasoning demonstrations one by one (Wei et al., 2022a). Each demonstration has a question and a reasoning chain. A reasoning chain is composed of a rationale (a series of intermediate reasoning steps) and an expected answer. With all the demonstrations being manually designed, this paradigm is referred to as Manual-CoT (right of Figure 1).
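To make the difference between the two paradigms concrete, the following minimal sketch builds both kinds of prompt as plain strings. The function names and the exact "Q: / A:" layout are illustrative assumptions (the layout mirrors the examples shown later in Table 2); this is not the authors' released code.

```python
from __future__ import annotations


def zero_shot_cot_prompt(question: str, trigger: str = "Let's think step by step.") -> str:
    """Zero-Shot-CoT: a single trigger phrase is appended after the test question."""
    return f"Q: {question}\nA: {trigger}"


def manual_cot_prompt(demos: list[tuple[str, str]], question: str) -> str:
    """Manual-CoT: few-shot prompt built from hand-written demonstrations.

    Each demo is (question, reasoning chain), where the reasoning chain is a
    rationale followed by the expected answer.
    """
    blocks = [f"Q: {q}\nA: {chain}" for q, chain in demos]
    blocks.append(f"Q: {question}\nA:")  # the test question comes last
    return "\n\n".join(blocks)
```

The only task-specific input to `manual_cot_prompt` is the hand-written `demos` list, which is exactly the part that requires human effort.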

In practice, Manual-CoT has obtained stronger performance than Zero-Shot-CoT (Wei et al., 2022a; Kojima et al., 2022). However, this superior performance hinges on the hand-drafting of effective demonstrations. Specifically, the hand-drafting involves nontrivial efforts in designs of both questions and their reasoning chains for demonstrations.

Moreover, designing task-specific demonstrations demands even more human effort: different tasks, such as arithmetic (Roy and Roth, 2015) and commonsense reasoning (Talmor et al., 2019), require different kinds of demonstrations.

To eliminate such manual designs, we advocate another paradigm, Auto-CoT, to automatically construct demonstrations with questions and reasoning chains. Specifically, Auto-CoT leverages LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one by one, i.e., let’s think not just step by step, but also one by one. However, we find that this challenge cannot be effectively addressed by simple solutions. For example, given a test question of a dataset, retrieving semantically similar questions and invoking Zero-Shot-CoT to generate reasoning chains will fail.

Although LLMs are decent zero-shot reasoners, they are not perfect: Zero-Shot-CoT can still make mistakes in reasoning chains.

To mitigate the effect of reasoning chain mistakes from Zero-Shot-CoT, our analysis shows that diversity of demonstration questions is the key. Based on this insight, we propose an Auto-CoT method to automatically construct demonstrations. Auto-CoT consists of two main steps.

First, partition questions of a given dataset into a few clusters. Second, select a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with simple heuristics.

We evaluate Auto-CoT on ten benchmark reasoning tasks including: (i) arithmetic reasoning (MultiArith (Roy and Roth, 2015), GSM8K (Cobbe et al., 2021), AQUA-RAT (Ling et al., 2017), SVAMP (Patel et al., 2021)); (ii) commonsense reasoning (CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021)); (iii) symbolic reasoning (Last Letter Concatenation, Coin Flip) (Wei et al., 2022a). Experimental results show that with GPT-3, Auto-CoT consistently matches or exceeds the performance of Manual-CoT that requires manual designs. This indicates that LLMs can perform CoT reasoning by automatically constructing demonstrations.

2 Related Work

This section reviews two lines of research that form the basis of this work: chain-of-thought (CoT) prompting for multi-step reasoning and in-context learning for inducing LLMs to learn from demonstrations.

2.1 Chain-of-thought Prompting

CoT prompting is a gradient-free technique of inducing LLMs to produce intermediate reasoning steps that lead to the final answer. Wei et al. (2022a) formally studied the topic of CoT prompting in language models. This technique elicits LLMs to generate a coherent series of intermediate reasoning steps that lead to the final answer to a question.

Studies have shown that LLMs can perform CoT reasoning with zero-shot prompting (Zero-Shot-CoT) (Kojima et al., 2022) or manually written few-shot demonstrations (Manual-CoT) (Wei et al., 2022a).


Kojima et al. (2022) showed that LLMs are decent zero-shot reasoners whose generated rationales have already reflected the CoT reasoning. This finding inspires our work to leverage the self-generated rationales for demonstrations.

Generating rationales by LLMs was shown to be practical in a recent work (Zelikman et al., 2022). In their work, an LLM is prompted to generate rationales and those rationales that lead to the correct answer are selected. The selection requires a training dataset of questions with annotated answers.

In contrast, our work considers a more challenging scenario where only a set of test questions are given (without a training dataset), following CoT prompting studies by Wei et al. (2022a) and Kojima et al. (2022).

Manual-CoT.

Manual-CoT achieves stronger performance by eliciting the CoT reasoning ability with effective manual demonstrations. The demonstrations for the reasoning process are manually designed.
However, the human efforts in designs of both questions and their reasoning chains are nontrivial. Instead of addressing this limitation, recent studies mainly focus on hand-crafting more complex demonstrations or leveraging ensemble-like methods.
One trend is problem decomposition. In least-to-most prompting (Zhou et al., 2022), complex problems are reduced to sub-problems, and then the sub-problems are solved sequentially. The other trend is to vote over multiple reasoning paths for a test question. Wang et al. (2022a) introduced a self-consistency decoding strategy to sample multiple outputs of LLMs and then took a majority over the final answers. Wang et al. (2022b) and Li et al. (2022) introduced randomness in the input space to produce more diverse outputs for voting.
They used manually-designed demonstrations as the seed set and generated additional rationales: leave one question from the seed set and use the remaining demonstrations to generate rationales for this question by the LLM. Unlike the aforementioned research lines that rely on manually-designed demonstrations, our work intends to eliminate manual designs with competitive performance.

2.2 In-Context Learning

CoT prompting is closely related to in-context learning (ICL) (Radford et al., 2019; Brown et al., 2020). ICL enables LLMs to perform a target task by feeding a few prompted examples as part of the input. Without gradient update, ICL allows a single model to perform various tasks universally.
There are various research lines to improve the performance of ICL: (i) retrieving related demonstrations to the test instance where the popular practice is dynamically retrieving related training examples for a given test input (Rubin et al., 2022; Su et al., 2022); (ii) augmenting with fine-grained information, such as incorporating task instruction (Mishra et al., 2022; Wei et al., 2022b; Sanh et al., 2022); (iii) manipulating output probabilities of LLMs instead of directly computing the likelihood of target labels (Holtzman et al., 2021; Zhao et al., 2021; Min et al., 2022a).

Despite the success of ICL, studies (Liu et al., 2022a; Lu et al., 2022) have shown that the strength of ICL may vary widely depending on the choice of in-context demonstrations (Liu et al., 2022b). In detail, the formatting of the prompt, such as wording or order of demonstrations, may lead to performance fluctuations (Webson and Pavlick, 2022; Zhao et al., 2021). A recent work (Min et al., 2022b) even questioned the necessity of ground-truth input-output mapping: using incorrect labels in the examples only marginally lowers the performance.

However, the existing analysis of ICL is mainly based on standard classification and multi-choice datasets that only have simple <input $\rightarrow$ output> mappings. We discover that those findings may not be applicable to the CoT prompting scenario with more complex <input $\rightarrow$ rationale $\rightarrow$ output> mappings. For example, mistakes in either the <input $\rightarrow$ rationale> mapping or the <rationale $\rightarrow$ output> mapping will lead to a dramatic performance drop (Appendix A.1).

3 Challenge of Auto-CoT

As just discussed, the performance of ICL hinges on hand-crafted demonstrations. As reported in Manual-CoT (Wei et al., 2022a), using demonstrations written by different annotators brings up to 28.2% accuracy disparity in a symbolic reasoning task, while changing the order of demonstrations results in less than 2% changes in most tasks. This suggests that the key challenge of Auto-CoT lies in automatically constructing demonstrations with good questions and their reasoning chains.

Recall that Manual-CoT hand-crafts a few (e.g., 8) questions in demonstrations. With similarity-based retrieval methods being widely adopted for prompting LLMs (Rubin et al., 2022; Su et al., 2022), a promising candidate solution is to sample demonstration questions using similarity-based retrieval. We follow the more challenging assumption in CoT studies (Wei et al., 2022a; Kojima et al., 2022) that only a set of test questions are given (without a training dataset). Following Liu et al. (2022a), we use Sentence-BERT (Reimers and Gurevych, 2019) to encode questions. For each question $q^{\text{test}}$ in a test dataset, we sample demonstration questions $q^{\text{demo}}_{i}$ ($i=1,\ldots,k$) from the rest of the questions. We design a Retrieval-Q-CoT method to retrieve the top-$k$ (e.g., $k=8$) similar questions based on cosine similarity. To compare with this similarity-based method, we also test a relatively more diversity-based method: Random-Q-CoT, which randomly samples $k$ other test questions for each test question.
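The two sampling strategies can be sketched as follows. This is a minimal illustration that assumes the question embeddings have already been computed (the text uses Sentence-BERT; here they are plain lists of floats), and the function names `retrieval_q` and `random_q` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_q(test_idx, embeddings, k):
    """Retrieval-Q-CoT sampling: indices of the top-k questions most similar
    to the test question, excluding the test question itself."""
    others = [i for i in range(len(embeddings)) if i != test_idx]
    others.sort(key=lambda i: cosine(embeddings[test_idx], embeddings[i]), reverse=True)
    return others[:k]


def random_q(test_idx, n_questions, k, seed=0):
    """Random-Q-CoT sampling: k other test questions chosen uniformly at random.
    The fixed seed is only there to make the sketch reproducible."""
    others = [i for i in range(n_questions) if i != test_idx]
    return random.Random(seed).sample(others, k)
```

Both functions return indices into the same pool of test questions, so the only difference between the two methods at this stage is whether similarity to $q^{\text{test}}$ drives the selection.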

Both Retrieval-Q-CoT and Random-Q-CoT invoke Zero-Shot-CoT (Kojima et al., 2022) to generate the reasoning chain $c^{\text{demo}}_{i}$ (rationale and answer) for each sampled question $q^{\text{demo}}_{i}$, as LLMs are decent zero-shot reasoners (Kojima et al., 2022). We use GPT-3 (Brown et al., 2020) with 175B parameters (text-davinci-002) for the LLM unless otherwise stated. On a high level, both Retrieval-Q-CoT and Random-Q-CoT take the concatenation of $q^{\text{demo}}_{i}, c^{\text{demo}}_{i}$ pairs ($i=1,\ldots,k$) and $q^{\text{test}}$ as input to predict the reasoning chain for $q^{\text{test}}$, which contains the answer in the end (like right of Figure 1).

Table 1: Accuracy (%) of different sampling methods. Symbol † indicates using training sets with annotated reasoning chains.
Method            MultiArith   GSM8K    AQuA
Zero-Shot-CoT     78.7         40.7     33.5
Manual-CoT        91.7         46.9     35.8†
Random-Q-CoT      86.2         47.6†    36.2†
Retrieval-Q-CoT   82.8         48.0†    39.7†

To our surprise, Retrieval-Q-CoT underperforms Random-Q-CoT on the arithmetic dataset MultiArith (Roy and Roth, 2015) (Table 1). Note that the retrieval methods were originally proposed in tasks with annotated labels (Rubin et al., 2022; Su et al., 2022); however, invoking Zero-Shot-CoT does not guarantee entirely correct reasoning chains. Thus, we hypothesize that the inferior performance of Retrieval-Q-CoT is caused by incorrect reasoning chains generated by Zero-Shot-CoT. To test this hypothesis, we experiment with Retrieval-Q-CoT on two other datasets, GSM8K (Cobbe et al., 2021) and AQuA (Ling et al., 2017), that have training sets with annotated reasoning chains. The results are marked with † in Table 1. Under the setting with annotated reasoning chains, Retrieval-Q-CoT even outperforms Manual-CoT. The result indicates that Retrieval-Q-CoT is effective when human annotations are available.

Although human annotations are useful, such manual efforts are nontrivial. However, automatically generating reasoning chains via Zero-Shot-CoT underperforms Manual-CoT, especially when the challenge of question sampling is not addressed. To design more effective Auto-CoT, we need to understand its challenge better.

3.1 Retrieval-Q-CoT Fails due to Misleading by Similarity

Since Retrieval-Q-CoT uses a few prompting demonstrations like in Manual-CoT, Retrieval-Q-CoT is expected to perform competitively as well.

However, reasoning chains (both rationales and answers) in Retrieval-Q-CoT are generated by Zero-Shot-CoT: they may have mistakes that lead to wrong answers. Let us simply call demonstrations with wrong answers as wrong demonstrations. Intuitively, after similar questions to a test question are retrieved, wrong demonstrations caused by Zero-Shot-CoT may mislead the same LLM to reason similarly with a wrong answer (e.g., replicating mistakes) for the test question. We refer to this phenomenon as misleading by similarity. We will investigate whether misleading by similarity contributes to the inferior performance of Retrieval-Q-CoT.

Figure 2: Unresolving Rate. (Bar chart; x-axis: Retrieval-Q-CoT, Random-Q-CoT; y-axis: Rate (%).)

To begin with, we invoke Zero-Shot-CoT on all the 600 questions from the MultiArith dataset. Among them, we collect those 128 questions (denoted as $\mathcal{Q}$) where Zero-Shot-CoT generates wrong answers (error rate: $21.3\% = 128/600$). As we mentioned, with extra demonstrations, Retrieval-Q-CoT and Random-Q-CoT are expected to perform more competitively than Zero-Shot-CoT. Among $\mathcal{Q}$ where Zero-Shot-CoT fails, we call those where Retrieval-Q-CoT or Random-Q-CoT still fail as their unresolved questions. We divide the number of unresolved questions by 128 (number of questions in $\mathcal{Q}$) to calculate the unresolving rate. A higher unresolving rate means that a method more likely still makes mistakes like Zero-Shot-CoT. Figure 2 shows that the unresolving rate of Retrieval-Q-CoT (46.9%) is much higher than Random-Q-CoT (25.8%). It indicates that with similar questions being sampled for test questions, Retrieval-Q-CoT is negatively affected by misleading by similarity.
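The unresolving rate is a simple set computation, sketched below. The function name and the question ids are hypothetical. The unresolved counts used in the example (60 and 33) are not stated in the text; they are back-computed here as the only integers out of 128 that round to the reported 46.9% and 25.8%, so treat them as an inference.

```python
def unresolving_rate(zero_shot_wrong: set, method_wrong: set) -> float:
    """Fraction of Zero-Shot-CoT failures that a method still answers wrongly.

    zero_shot_wrong: ids of questions Zero-Shot-CoT got wrong (the set Q).
    method_wrong:    ids of questions the method (e.g. Retrieval-Q-CoT) got wrong;
                     ids outside Q are ignored by the intersection.
    """
    unresolved = zero_shot_wrong & method_wrong
    return len(unresolved) / len(zero_shot_wrong)


# Worked example with hypothetical ids: |Q| = 128 out of 600 questions.
q_set = set(range(128))
print(round(100 * len(q_set) / 600, 1))                                # Zero-Shot-CoT error rate
print(round(100 * unresolving_rate(q_set, set(range(60))), 1))         # inferred Retrieval-Q-CoT count
print(round(100 * unresolving_rate(q_set, set(range(33))), 1))         # inferred Random-Q-CoT count
```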

To show that unresolved questions of Retrieval-Q-CoT tend to be similar, we present a case study in Table 2. In the left part, the retrieved demonstration questions are similar to the test question and ask “how long will it take him to cook the rest?” The reasoning chains generated by Zero-Shot-CoT produce answers regarding “the total of” instead of “the rest”. Following the demonstrations, Retrieval-Q-CoT also fails by misunderstanding the meaning of “the rest”. In contrast, Random-Q-CoT interprets “the rest” correctly and does not replicate the mistakes in its demonstrations, thanks to relatively more diverse (random) demonstrations.

3.2 Errors Frequently Fall into the Same Cluster

Motivated by the observations in Table 2, we use $k$-means to partition all the 600 test questions into $k=8$ clusters, where each cluster contains similar questions.¹
¹ We use Sentence-BERT (Reimers and Gurevych, 2019) to encode questions and apply $k$-means for clustering.
With these clusters and reasoning chains generated by Zero-Shot-CoT (in Section 3.1), now we are curious if certain clusters contain questions where Zero-Shot-CoT frequently fails. Thus, we calculate the error rate (questions with wrong Zero-Shot-CoT answers / total questions) for each cluster.

Figure 3: Clusters of similar questions. (Bar chart; x-axis: clusters 1 to 8; y-axis: Error Rate (%).)

As shown in Figure 3, there exists a cluster (Cluster 2) with frequent Zero-Shot-CoT errors (52.3%). The phenomenon could be generic as Zero-Shot-CoT may lack some skills to solve some common problems in target tasks.²
² We observe similar phenomena when changing the cluster number or using other datasets (Appendix A.2).
For convenience of descriptions, let us call the cluster with the highest error rate as the frequent-error cluster (e.g., Cluster 2 in Figure 3).
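The per-cluster error rate and the frequent-error cluster can be computed as in the sketch below. It assumes each question already has a cluster id and a flag saying whether Zero-Shot-CoT answered it wrongly; the function names are illustrative.

```python
from collections import defaultdict


def cluster_error_rates(cluster_ids, is_wrong):
    """Error rate per cluster: wrong Zero-Shot-CoT answers / total questions.

    cluster_ids: cluster id of each question.
    is_wrong:    True where Zero-Shot-CoT gave a wrong answer (same order).
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for cluster, flag in zip(cluster_ids, is_wrong):
        totals[cluster] += 1
        wrong[cluster] += int(flag)
    return {cluster: wrong[cluster] / totals[cluster] for cluster in totals}


def frequent_error_cluster(rates):
    """The cluster with the highest error rate."""
    return max(rates, key=rates.get)
```

For instance, with cluster ids `[1, 1, 2, 2, 2, 3]` and flags `[False, True, True, True, False, False]`, cluster 2 has the highest error rate (2 wrong out of 3) and is returned by `frequent_error_cluster`.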

Therefore, the imperfect nature of generated reasoning chains in a zero-shot fashion poses risks of retrieving multiple similar questions inside a frequent-error cluster by using similarity-based methods. For the test question in the frequent-error cluster, Retrieval-Q-CoT more easily constructs demonstrations with multiple similar mistakes.

As a result, Retrieval-Q-CoT often makes similar mistakes like Zero-Shot-CoT, reiterated by its higher unresolving rate in Figure 2.

Table 2: Examples of Retrieval-Q-CoT and Random-Q-CoT for constructing demonstrations and prediction. The upper part contains the automatically constructed demonstrations, and the lower part contains test questions along with predicted reasoning chains.

Retrieval-Q-CoT (left column), demonstrations:
Q: A chef needs to cook 9 potatoes. He has already cooked 7. If each potato takes 3 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 7 potatoes. That means it has taken him 7 * 3 minutes to cook those 7 potatoes. That means it will take him 3 more minutes to cook each of the remaining 2 potatoes. That means it will take him a total of 7 * 3 + 2 * 3 = 21 minutes to cook all 9 potatoes. The answer is 21. ✗
Q: A chef needs to cook 13 potatoes. He has already cooked 5. If each potato takes 6 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 5 potatoes. That means it has taken him 30 minutes so far. There are 13 potatoes in total. 5 have been cooked so far. That means there are 8 left to cook. Each potato takes 6 minutes to cook. That means it will take the chef 48 minutes to cook the 8 remaining potatoes. In total, it will take the chef 78 minutes to cook all 13 potatoes. The answer is 78. ✗

Retrieval-Q-CoT (left column), test question and predicted reasoning chain:
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 8 potatoes. That means it has taken him 8 * 9 minutes to cook those 8 potatoes. That means it will take him 9 minutes to cook each of the remaining 7 potatoes. That means it will take him a total of 8 * 9 + 7 * 9 = 135 minutes to cook all 15 potatoes. The answer is 135. ✗

Random-Q-CoT (right column), demonstrations:
Q: Wendy uploaded 79 pictures to Facebook. She put 44 pics into one album and put the rest into 5 different albums. How many pictures were in each album?
A: Let’s think step by step. Wendy uploaded 79 pictures to Facebook. She put 44 pictures into one album. That means she put the rest of the pictures into 5 different albums. That means each album would have (79-44)/5 = 15 pictures in each album. The answer is 15. ✓
Q: At a restaurant each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?
A: Let’s think step by step. There are 8 kids, so that means there are 7 adults. Each adult meal costs $5, so that means that the 7 adults will cost $5*7=$35. Now we just need to add the two numbers together. $35+8=43$. So it would cost $43 for the group to eat. The answer is 43. ✗

Random-Q-CoT (right column), test question and predicted reasoning chain:
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef needs to cook 15 potatoes. He has already cooked 8. That means he has to cook 15-8=7 more potatoes. Each potato takes 9 minutes to cook. That means it will take him 9*7=63 minutes to cook the rest of the potatoes. The answer is 63. ✓

3.3 Diversity May Mitigate Misleading by Similarity

The analysis so far compellingly shows that LLMs are still not perfect zero-shot reasoners; thus, we aim to mitigate the effect of their Zero-Shot-CoT errors, especially to mitigate misleading by similarity in the design of Auto-CoT.

As we will show later (Section 5.5), presenting a small portion of mistakes (e.g., 1 or 2 wrong demonstrations out of 8) would not harm the overall reasoning performance for test questions. Suppose that questions of all the wrong demonstrations fall into the same frequent-error cluster; then sampling one question from every different cluster will lead to a higher than $7/8=87.5\%$ chance to construct all the 8 correct demonstrations. Since different clusters reflect diverse semantics of the questions, this clustering-based sampling method can be considered as diversity-based, which is in sharp contrast to similarity-based Retrieval-Q-CoT. On one hand, sampling questions with diversity may mitigate the effect of misleading by similarity (Section 3.1). On the other hand, if we took each demonstration as a kind of skill, diverse demonstrations seem to cover more alternative skills for solving target questions: even though there still exists a small portion (e.g., $1/8$) of mistakes in the demonstrations, the performance will not be negatively affected (to be shown in Section 5.5).

Nevertheless, the clustering-based sampling method may still construct a small portion of wrong demonstrations, such as from questions in the frequent-error cluster. As we will show later, some of these wrong demonstrations may be eliminated with heuristics. For example, wrong demonstrations often come with long questions and long rationales. Using simple and generic heuristics, such as only considering shorter questions with shorter rationales, further helps mitigate the effect of imperfect Zero-Shot-CoT capabilities (Appendix C.2).

4 Auto-CoT: Automatic Chain-of-Thought Prompting

Based on the observations and considerations in Section 3, we propose an Auto-CoT method to construct demonstrations with questions and reasoning chains automatically.
Auto-CoT consists of two main stages: (i) question clustering: partition questions of a given dataset into a few clusters; (ii) demonstration sampling: select a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with simple heuristics.
The overall procedure is illustrated in Figure 4.

Figure 4: Overview of the Auto-CoT method. Different from Manual-CoT in Figure 1, demonstrations (on the right) are automatically constructed one by one (total: $k$) using an LLM with the “Let’s think step by step” prompt.

4.1 Question Clustering

Since diversity-based clustering may mitigate misleading by similarity (Section 3.3), we perform cluster analysis for a given set of questions $\mathcal{Q}$. We first compute a vector representation for each question in $\mathcal{Q}$ by Sentence-BERT (Reimers and Gurevych, 2019). The contextualized vectors are averaged to form a fixed-size question representation. Then, the question representations are processed by the $k$-means clustering algorithm to produce $k$ clusters of questions. For questions in each cluster $i$, we sort them into a list $\mathbf{q}^{(i)}=[q_{1}^{(i)},q_{2}^{(i)},\ldots]$ in ascending order of the distance to the center of cluster $i$. This question clustering stage is summarized in Algorithm 1.
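The sorting step of the clustering stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the question embeddings, cluster labels, and cluster centers have already been computed elsewhere (for example by a Sentence-BERT encoder and a k-means routine), and the function name `sort_questions_by_cluster` is hypothetical.

```python
import math
from collections import defaultdict


def sort_questions_by_cluster(questions, embeddings, labels, centers):
    """Group questions by cluster and sort each group by distance to its center.

    questions:  list of question strings
    embeddings: list of equal-length float vectors, one per question (assumed precomputed)
    labels:     list of integer cluster ids, one per question (assumed from k-means)
    centers:    list of cluster-center vectors, indexed by cluster id
    """
    clusters = defaultdict(list)
    for question, vector, label in zip(questions, embeddings, labels):
        # Euclidean distance between the question vector and its cluster center.
        distance = math.dist(vector, centers[label])
        clusters[label].append((distance, question))
    # Ascending order: the question closest to the center comes first.
    return {
        label: [q for _, q in sorted(items, key=lambda pair: pair[0])]
        for label, items in clusters.items()
    }
```

The returned dictionary maps each cluster id to its sorted question list, which plays the role of $\mathbf{q}^{(i)}$ in the text.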

4.2 Demonstration Sampling

In the second stage, we need to generate reasoning chains for those sampled questions and then sample demonstrations that satisfy our selection criteria.

More concretely, we construct a demonstration $d^{(i)}$ (concatenation of a question, a rationale, and an answer) for each cluster $i$ ($i=1,\ldots,k$). For cluster $i$, we iterate over questions in the sorted list $\mathbf{q}^{(i)}=[q_{1}^{(i)},q_{2}^{(i)},\ldots]$ (obtained by Algorithm 1) until our selection criteria are satisfied. In other words, a question that is closer to the center of cluster $i$ is considered earlier. Say that the $j$-th closest question $q_{j}^{(i)}$ is being considered. A prompted input is formulated as: [Q: $q_{j}^{(i)}$. A: [P]], where [P] is a single prompt “Let’s think step by step”. This formed input is fed into an LLM using Zero-Shot-CoT (Kojima et al., 2022) to output the reasoning chain consisting of the rationale $r_{j}^{(i)}$ and the extracted answer $a_{j}^{(i)}$. Then, a candidate demonstration $d_{j}^{(i)}$ for the $i$-th cluster is constructed by concatenating the question, rationale, and answer: $[\text{Q: }q_{j}^{(i)},\text{A: }r_{j}^{(i)}\circ a_{j}^{(i)}]$.

Similar to the criteria of the hand-crafting demonstrations in Wei et al. (2022a), our selection criteria follow simple heuristics to encourage sampling simpler questions and rationales: set the selected demonstration $d^{(i)}$ as $d_{j}^{(i)}$ if it has a question $q_{j}^{(i)}$ with no more than $60$ tokens and a rationale $r_{j}^{(i)}$ with no more than $5$ reasoning steps.³

³ Because Zero-Shot-CoT often uses “\n” for separating the reasoning steps, the rule can be easily implemented by counting the “\n” tokens in the generated rationales.
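The selection criteria can be illustrated with a short check. This is a sketch under stated assumptions: tokens are approximated by whitespace-separated words (the text does not specify a tokenizer), reasoning steps are approximated by the non-empty lines of the rationale, and all names are hypothetical.

```python
MAX_QUESTION_TOKENS = 60  # threshold stated in the text
MAX_REASONING_STEPS = 5   # threshold stated in the text


def count_tokens(text):
    """Approximate the token count by whitespace splitting (an assumption)."""
    return len(text.split())


def count_steps(rationale):
    """Approximate the number of reasoning steps by counting non-empty lines."""
    return sum(1 for line in rationale.split("\n") if line.strip())


def satisfies_criteria(question, rationale):
    """Return True when both the question and the rationale are short enough."""
    return (
        count_tokens(question) <= MAX_QUESTION_TOKENS
        and count_steps(rationale) <= MAX_REASONING_STEPS
    )
```

Counting non-empty lines rather than raw “\n” characters is a design choice made here so that blank lines between steps do not inflate the count.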

Algorithm 1 Cluster
Require: A set of questions $\mathcal{Q}$ and the number of demonstrations $k$
Ensure: Sorted questions $\mathbf{q}^{(i)}=[q_{1}^{(i)},q_{2}^{(i)},\ldots]$ for each cluster $i$ ($i=1,\ldots,k$)
procedure Cluster($\mathcal{Q}$, $k$)
    for each question $q$ in $\mathcal{Q}$ do
        Encode $q$ by Sentence-BERT
    Cluster all the encoded question representations into $k$ clusters
    for each cluster $i=1,\ldots,k$ do
        Sort questions $\mathbf{q}^{(i)}=[q_{1}^{(i)},q_{2}^{(i)},\ldots]$ in the ascending order of the distance to the cluster center
    return $\mathbf{q}^{(i)}$ ($i=1,\ldots,k$)

Algorithm 2 Construct
Require: Sorted questions $\mathbf{q}^{(i)}=[q_{1}^{(i)},q_{2}^{(i)},\ldots]$ for each cluster $i$ ($i=1,\ldots,k$), empty demonstration list $\mathbf{d}$
Ensure: Demonstration list $\mathbf{d}=[d^{(1)},\ldots,d^{(k)}]$
procedure Construct($\mathbf{q}^{(1)},\ldots,\mathbf{q}^{(k)}$)
    for each cluster $i=1,\ldots,k$ do
        for each question $q_{j}^{(i)}$ in $\mathbf{q}^{(i)}$ do
            Generate rationale $r_{j}^{(i)}$ and answer $a_{j}^{(i)}$ for $q_{j}^{(i)}$ using Zero-Shot-CoT
            if $q_{j}^{(i)},r_{j}^{(i)}$ satisfy selection criteria then
                Add $d^{(i)}=[\text{Q: }q_{j}^{(i)},\text{A: }r_{j}^{(i)}\circ a_{j}^{(i)}]$ to $\mathbf{d}$
                break
    return $\mathbf{d}$

As summarized in Algorithm 2, after demonstration sampling for all the $k$ clusters, there will be $k$ constructed demonstrations $[d^{(1)},\ldots,d^{(k)}]$. The constructed demonstrations are used to augment a test question $q^{\text{test}}$ for in-context learning. Specifically, the input is the concatenation of all the demonstrations $[d^{(1)},\ldots,d^{(k)}]$ followed by [Q: $q^{\text{test}}$. A: [P]]. This input is fed to LLMs to obtain the reasoning chain with the answer at the end for $q^{\text{test}}$ (right of Figure 4).
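The demonstration construction loop and the final input assembly can be sketched as below. This is an illustration only: `zero_shot_cot` stands for any callable that returns a (rationale, answer) pair for a question and `is_acceptable` for the selection criteria; both are hypothetical parameters supplied by the caller, and the exact separators used when joining text are assumptions.

```python
PROMPT = "Let's think step by step."


def format_demo(question, rationale, answer):
    """Concatenate question, rationale, and answer into one demonstration string."""
    return f"Q: {question}\nA: {rationale} {answer}"


def construct_demos(sorted_clusters, zero_shot_cot, is_acceptable):
    """Pick at most one demonstration per cluster, trying central questions first.

    sorted_clusters: dict mapping cluster id -> questions sorted by distance to center
    zero_shot_cot:   callable(question) -> (rationale, answer); hypothetical LLM wrapper
    is_acceptable:   callable(question, rationale) -> bool; the selection criteria
    """
    demos = []
    for cluster_id in sorted(sorted_clusters):
        for question in sorted_clusters[cluster_id]:
            rationale, answer = zero_shot_cot(question)
            if is_acceptable(question, rationale):
                demos.append(format_demo(question, rationale, answer))
                break  # move on to the next cluster once a demonstration is found
    return demos


def build_test_input(demos, test_question):
    """Concatenate all demonstrations, followed by the test question and the prompt."""
    return "\n\n".join(demos + [f"Q: {test_question}\nA: {PROMPT}"])
```

As in Algorithm 2, a cluster in which no question satisfies the criteria contributes no demonstration in this sketch.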

5 Experiments

We briefly describe the experimental setup and present main experimental results. More experimental details and results can be found in the appendices.

5.1 Experimental setup

Tasks and Datasets.

Our method is evaluated on ten benchmark datasets from three categories of reasoning tasks: (i) arithmetic reasoning (MultiArith (Roy and Roth, 2015), GSM8K (Cobbe et al., 2021), AddSub (Hosseini et al., 2014), AQUA-RAT (Ling et al., 2017), SingleEq (Koncel-Kedziorski et al., 2015), SVAMP (Patel et al., 2021)); (ii) commonsense reasoning (CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021)); (iii) symbolic reasoning (Last Letter Concatenation, Coin Flip) (Wei et al., 2022a).

Implementation.

We use the public GPT-3 (Brown et al., 2020) of the text-davinci-002 version with 175B parameters for the LLM (Ouyang et al., 2022) unless otherwise stated. We select this LLM because it has the strongest CoT reasoning performance among public LLMs, as reported in Kojima et al. (2022) and Wei et al. (2022a). We also evaluate the Codex model (Chen et al., 2021) (code-davinci-002) as the LLM. Following Wei et al. (2022a), the number of demonstrations $k$ is 8 except for AQuA and Letter (4), CSQA (7), and StrategyQA (6).

Baselines.

We compare our methods with four baseline methods: Zero-Shot (Kojima et al., 2022), Zero-Shot-CoT (Kojima et al., 2022), Few-Shot (Wei et al., 2022a), and Manual-CoT (Wei et al., 2022a). Zero-Shot-CoT and Manual-CoT are illustrated in Figure 1. The Zero-Shot baseline concatenates a test question with the prompt “The answer is” as the LLM input. The Few-Shot baseline has the same LLM input as Manual-CoT except for removed rationales from all the demonstrations.
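The difference between the baseline inputs can be sketched as follows. This is a hedged illustration: the “Q:/A:” layout and separators are assumptions, each demonstration is represented as a hypothetical (question, rationale, answer) tuple, and the function names are not from the paper.

```python
def zero_shot_input(question):
    """Zero-Shot: the test question followed by the prompt "The answer is"."""
    return f"Q: {question}\nA: The answer is"


def zero_shot_cot_input(question):
    """Zero-Shot-CoT: the test question followed by the step-by-step prompt."""
    return f"Q: {question}\nA: Let's think step by step."


def manual_cot_input(demos, question):
    """Manual-CoT: demonstrations keep their rationales."""
    parts = [f"Q: {q}\nA: {r} {a}" for q, r, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def few_shot_input(demos, question):
    """Few-Shot: same demonstrations as Manual-CoT, with the rationales removed."""
    parts = [f"Q: {q}\nA: {a}" for q, _rationale, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```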

Table 3: Accuracy on ten datasets from three categories of reasoning tasks. Arithmetic: MultiArith, GSM8K, AddSub, AQuA, SingleEq, SVAMP. Commonsense: CSQA, Strategy. Symbolic: Letter, Coin.

| Model | MultiArith | GSM8K | AddSub | AQuA | SingleEq | SVAMP | CSQA | Strategy | Letter | Coin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot | 22.7 | 12.5 | 77.0 | 22.4 | 78.7 | 58.8 | 72.6 | 54.3 | 0.2 | 53.8 |
| Zero-Shot-CoT | 78.7 | 40.7 | 74.7 | 33.5 | 78.7 | 63.7 | 64.6 | 54.8 | 57.6 | 91.4 |
| Few-Shot | 33.8 | 15.6 | 83.3 | 24.8 | 82.7 | 65.7 | 79.5 | 65.9 | 0.2 | 57.2 |
| Manual-CoT | 91.7 | 46.9 | 81.3 | 35.8 | 86.6 | 68.9 | 73.5 | 65.4 | 59.0 | 97.2 |
| Auto-CoT | 92.0 | 47.9 | 84.8 | 36.5 | 87.0 | 69.5 | 74.4 | 65.4 | 59.7 | 99.9 |

Table 4: Accuracy using the Codex LLM.

| Method | MultiArith | GSM8K | AddSub |
| --- | --- | --- | --- |
| Zero-Shot-CoT | 64.8 | 31.8 | 65.6 |
| Manual-CoT | 96.8 | 59.4 | 84.6 |
| Auto-CoT | 93.2 | 62.8 | 91.9 |

5.2 Competitive Performance of Auto-CoT on Ten Datasets

Table 3 compares accuracy on ten datasets from three categories of reasoning tasks. The Zero-Shot and Zero-Shot-CoT results are taken from Kojima et al. (2022), the Few-Shot and Manual-CoT results are taken from Wei et al. (2022a), and the Auto-CoT results are averaged over three random runs. Overall, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Due to the cost of manual designs, Manual-CoT may design the same demonstrations for multiple datasets (e.g., $5/6$ of the arithmetic datasets). In contrast, Auto-CoT is more flexible and task-adaptive: every single dataset gets its own demonstrations that are automatically constructed.

5.3 Visualization of Question Clustering

Figure 5 visualizes question clustering (with PCA projection) in ten datasets. The illustration indicates that there exist generic patterns, where different patterns may be characterized by questions from different clusters.

We present the constructed demonstrations of Auto-CoT in Appendix D.

Figure 5: Question clustering on ten datasets of reasoning tasks. Stars denote cluster centers.

5.4 General Effectiveness Using the Codex LLM

To evaluate the general effectiveness of Auto-CoT using different LLMs, here we change the LLM to the Codex model (Chen et al., 2021). As in Table 4, the Codex LLM leads to performance improvement for Manual-CoT when compared with Table 3 that uses the GPT-3 (text-davinci-002) LLM. Nonetheless, using the Codex LLM, the overall performance of Auto-CoT is still competitive compared to Manual-CoT, providing additional empirical evidence for the effectiveness of Auto-CoT.

5.5 Effect of Wrong Demonstrations

Recall our discussions in Section 3.3 that there can be wrong demonstrations (whose answers are wrong). To see if diversity mitigates this effect, we design an In-Cluster Sampling baseline that constructs demonstrations by randomly sampling questions from the same cluster that contains a test question. Figure 6 compares accuracy with varying amounts of wrong demonstrations on MultiArith. Compared with In-Cluster Sampling, Auto-CoT (using diversity-based clustering) is less affected by wrong demonstrations: its performance still does not degrade significantly even when presented with 50% wrong demonstrations.
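The contrast between the two sampling strategies can be sketched as follows. This is a simplified illustration: the function and parameter names are hypothetical, excluding the test question itself from the in-cluster pool is an assumption, and the diversity-based variant here ignores the selection criteria and simply takes the first question of each cluster.

```python
import random


def in_cluster_demo_questions(test_question, cluster_of, clusters, k, rng):
    """In-Cluster Sampling: draw up to k questions at random from the test question's cluster.

    cluster_of: dict mapping question -> cluster id
    clusters:   dict mapping cluster id -> list of questions
    rng:        a random.Random instance, passed in for reproducibility
    """
    pool = [q for q in clusters[cluster_of[test_question]] if q != test_question]
    return rng.sample(pool, min(k, len(pool)))


def diverse_demo_questions(clusters):
    """Diversity-based sampling: take one question from every non-empty cluster."""
    return [questions[0] for _, questions in sorted(clusters.items()) if questions]
```

With In-Cluster Sampling every demonstration comes from one cluster, so a frequent-error cluster can contaminate all of them; with one question per cluster, at most one demonstration comes from any single cluster.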

Figure 6: Effect of wrong demonstrations. (Plot: accuracy (%) versus percentage of wrong demonstrations, from 12.5% to 50.0%, for In-Cluster Sampling and Auto-CoT.)

Figure 7: Bootstrapping for the streaming setting. (Plot: accuracy (%) per batch, batches 1 to 10, for Zero-Shot-CoT, Manual-CoT, and Auto-CoT*.)

5.6 More Challenging Streaming Setting

CoT studies commonly assume that a full dataset with test questions is given (Wei et al., 2022a; Kojima et al., 2022). Based on the given dataset, Auto-CoT samples questions to construct the demonstrations. Nonetheless, we now consider a more challenging streaming setting where a small batch of test questions (say $m$ questions) arrives at a time, as in data streams.

To address this challenge, we extend Auto-CoT to a bootstrapping version Auto-CoT*: (i) Initialize an empty set $\mathcal{M}_{0}$; (ii) When batch $1$ of questions $q_{1}^{(1)},\ldots,q_{m}^{(1)}$ arrives, invoke Zero-Shot-CoT (no clustering due to small $m$) for each $q_{i}^{(1)}$ to obtain its reasoning chain $c_{i}^{(1)}$. Add question-chain pairs $(q_{1}^{(1)},c_{1}^{(1)}),\ldots,(q_{m}^{(1)},c_{m}^{(1)})$ to $\mathcal{M}_{0}$ and call the new set $\mathcal{M}_{1}$; (iii) When batch $b$ ($b>1$) of questions $q_{1}^{(b)},\ldots,q_{m}^{(b)}$ arrives, construct demonstrations with existing questions and reasoning chains in $\mathcal{M}_{b-1}$ (like Auto-CoT) and use the demonstrations for in-context reasoning for each $q_{i}^{(b)}$. Add question-chain pairs $(q_{1}^{(b)},c_{1}^{(b)}),\ldots,(q_{m}^{(b)},c_{m}^{(b)})$ to $\mathcal{M}_{b-1}$ and call the new set $\mathcal{M}_{b}$.
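The bootstrapping procedure can be sketched as a loop over batches. This is a minimal sketch, not the authors' code: `zero_shot_cot`, `few_shot_cot`, and `build_demos` are hypothetical callables supplied by the caller (an LLM wrapper without demonstrations, an LLM wrapper with demonstrations, and an Auto-CoT-style demonstration builder over the stored pairs).

```python
def auto_cot_star(batches, zero_shot_cot, few_shot_cot, build_demos):
    """Process question batches in order, reusing earlier reasoning chains.

    batches:       iterable of lists of questions (each list is one batch)
    zero_shot_cot: callable(question) -> reasoning chain
    few_shot_cot:  callable(demos, question) -> reasoning chain
    build_demos:   callable(memory) -> demonstrations built from stored (question, chain) pairs
    """
    memory = []   # plays the role of M_0, then M_1, M_2, ...
    outputs = []  # reasoning chains per batch
    for batch_index, batch in enumerate(batches, start=1):
        if batch_index == 1:
            # First batch: nothing stored yet, so fall back to Zero-Shot-CoT.
            chains = [zero_shot_cot(q) for q in batch]
        else:
            # Later batches: demonstrations come only from earlier batches (M_{b-1}).
            demos = build_demos(memory)
            chains = [few_shot_cot(demos, q) for q in batch]
        memory.extend(zip(batch, chains))  # M_{b-1} becomes M_b
        outputs.append(chains)
    return outputs, memory
```

Demonstrations are built before the current batch is added to the memory, so a question never serves as its own demonstration.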

Figure 7 compares the accuracy on MultiArith at each batch ($m=30$) in this streaming setting (extended version: Figure 11 in the Appendix). As expected, for batch $1$, Auto-CoT* and Zero-Shot-CoT obtain equal accuracy. From batch $2$, Auto-CoT* performs comparably with Manual-CoT. This result indicates that our method is still effective in the more challenging streaming setting.

6 Conclusion

LLMs have shown reasoning capabilities with CoT prompting. The superior performance of Manual-CoT hinges on the hand-crafting of demonstrations. To eliminate such manual designs, we proposed Auto-CoT to automatically construct demonstrations.

It samples questions with diversity and generates reasoning chains to construct demonstrations.

Experimental results on ten public benchmark reasoning datasets showed that with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations.

References

  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
  • Thoppilan et al. [2022] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog applications, 2022. URL https://arxiv.org/abs/2201.08239.
  • Rae et al. [2021] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher, 2021. URL https://arxiv.org/abs/2112.11446.
  • Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.
  • Wei et al. [2022a] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022), 2022a. URL https://arxiv.org/abs/2201.11903.
  • Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022), 2022. URL