
Chain of Thoughtlessness? An Analysis of CoT in Planning

Kaya Stechly*
SCAI, Arizona State University
kstechl@asu.edu

Karthik Valmeekam*
SCAI, Arizona State University
kvalmeek@asu.edu

Subbarao Kambhampati
SCAI, Arizona State University
rao@asu.edu

Abstract

Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated with chain of thought prompting, a method of demonstrating solution procedures, with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in the prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and those improvements quickly deteriorate as the size $n$ of the query-specified stack grows past the size of the stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but depend on carefully engineering highly problem-specific prompts. This spotlights drawbacks of chain of thought, especially the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.

1 Introduction

While originally designed for text completion, Large Language Models (LLMs) have shown promise on a diverse set of unrelated tasks. While initial anecdotal results were unexpectedly impressive [8], follow-up systematic studies showed that, outside of limited, non-generalizable classes of problems, these models generally perform poorly on basic, multi-hop reasoning tasks [17] ranging from arithmetic [35] and logic puzzles [14] to constraint satisfaction [42, 2] and classical planning [47].

At the same time, the subfield of prompt engineering [36] has grown rapidly, promising improvements in performance without retraining. A core tenet of this subfield is that LLMs are capable of powerful in-context learning [12, 56], that is, capable of intelligently using additional context provided in a prompt to correctly respond to queries that would otherwise be answered incorrectly. Generally, this requires operationalizing algorithmic/procedural advice, and, in principle, learning such procedures includes being able to effectively apply them beyond syntactically similar instances.

The foundational method for inducing in-context learning is the chain of thought approach, which has been claimed to "unlock the reasoning abilities of LLMs" [50]. To create a chain of thought (CoT) prompt, a user annotates similar problems with intermediate reasoning steps and prepends them to the standard prompt. These annotations are meant as demonstrations, intended to teach a procedure applicable to both the examples and the new query. When prompted like this, the LLM is expected to output a similar series of reasoning steps prior to the new answer. Numerous studies have claimed that this procedure significantly enhances LLM performance in complex reasoning tasks [49, 54, 39, 56, 52, 43]. However, in general it is unclear how "similar" the examples need to be to the problem, how broadly any given chain of thought prompt will apply, and, most importantly, how much human effort is necessary to craft prompts specific to each problem subclass. Follow-up work has claimed that merely adding magic phrases ("let's think step by step") to every prompt is sufficient for some improvement [26]. While in some domains this technique has proven to be even more brittle than manual CoT, it has achieved the same performance increases in others, hinting that improvements observed with CoT may not indicate as much about LLMs' general in-context learning abilities as previously thought.

We are interested in the tradeoff between possible performance gains from chain of thought prompt engineering and the amount of human labor necessary to generate examples with useful reasoning traces. Ideally, a properly constructed prompt should teach the LLM how to robustly generalize a basic algorithmic procedure in order to increase performance on a large class of problems, thereby converting a modest amount of human teaching effort into a significant capability boost. Unfortunately, this only seems to be possible to a very limited extent [14].

In the current work, we examine the limits of chain of thought in solving classical planning problems. Test domains commonly used in previous chain of thought studies (e.g. GSM8K [10], CommonSense QA [44]) present two significant issues: (a) they lack a systematic method to scale instances, which is essential for evaluating whether LLMs can extend provided procedures to larger instances of the same type, and (b) due to their static nature, they are more likely to be well-represented on the web [51], increasing the chance that they were part of LLM training data, a factor which could obscure the true reasoning capabilities of LLMs. Planning is a well-studied kind of sequential decision-making which tasks an agent with devising a plan that takes a given initial state to a pre-specified goal state. New, diverse, and unique problem instances are easy to generate, but potentially hard to solve.

We focus on Blocksworld, a simple commonsense domain widely recognized and utilized in International Planning Competitions [23], where a set of blocks in an initial configuration must be rearranged step-by-step into a goal configuration. For a subset of our results, we simplify even further, and only consider problem instances where every block starts on the table and the goal is a single stack of blocks. These instances require very minimal reasoning: one need only figure out which block is on the bottom, and then stack the remaining blocks in the sequence directly defined in the goal. For $3 \leq n \leq 20$, we generate a variety of instances where the goal requires a stack of a specific height $n$, while providing examples of how to solve instances of height 2 and 3.
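To make concrete how mechanical these table-to-stack instances are, the plan can be read directly off the goal. The following Python sketch is our own illustration (not the paper's evaluation harness); it assumes the goal stack is given as a list of block names ordered bottom to top.

```python
def table_to_stack_plan(goal_stack):
    """goal_stack lists block names bottom-to-top, e.g. ["A", "B", "C"].
    Every block starts on the table, so the bottom block stays put and each
    remaining block is picked up and stacked on the block beneath it."""
    plan = []
    for below, above in zip(goal_stack, goal_stack[1:]):
        plan.append(f"pick up the {above} block")
        plan.append(f"stack the {above} block on top of the {below} block")
    return plan

# A height-4 instance: blocks A, B, C, D on the table; goal is D on C on B on A.
print(table_to_stack_plan(["A", "B", "C", "D"]))
```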

We consider different chain of thought prompts, where each is more specific, and provides more problem-specific knowledge, than the last: a zero-shot variant, a PDDL-general progression proof, a suboptimal algorithm specific to Blocksworld, a table-to-stack specific simplification of that algorithm, and a lexicographic version of the simplification. The most general could be applied to any problem, while the least general is specific to an easier version of the stacking problem. The three human-crafted prompts all teach algorithms which could, in principle, solve any of the instances they are tested on. We test on three state-of-the-art models: GPT-4 [3], Claude-3-Opus [5], and GPT-4-Turbo.
Our results reconfirm that LLMs are generally incapable of solving simple planning problems [47], and demonstrate that chain of thought approaches only improve performance when the hand-annotated examples are sufficiently similar to the current query. As goal stack size increases, accuracy drops drastically, regardless of the specificity of the chain of thought prompt. As the generality of the prompt increases, performance on even the smallest goal stacks also decreases, and often falls short of standard prompting. Even state-of-the-art extensions of CoT (like self-consistency [49]) show similar or sometimes even worse performance. Overall, this case study calls into question assumptions about the generalizable effectiveness of chain of thought, and suggests that LLMs do not learn new, general algorithms in context, but instead rely on some form of pattern matching to achieve prompt-design-specific performance increases. This in turn increases the burden on the humans giving advice.

To better compare to previous work, we construct scalable versions of three previously studied synthetic problems, Coin Flip, Last Letter Concatenation, and multi-step arithmetic [49, 50, 26, 48], and replicate reported chain of thought prompts. While these domains do not have a corresponding notion of prompt granularity, they do cover a range of difficulties. When testing on GPT-4-Turbo, we see a similar lack of generalization on these problem sets as we saw in Blocksworld.
In the rest of this paper, we first review related work, then describe the chain of thought approaches we have developed in the context of planning, analyze the overall effectiveness of chain of thought prompting on Blocksworld problems, and extend our results to three synthetic tasks well-represented in the CoT literature.
2 Related Work

Modifying text prompts to elicit intermediate problem-solving steps from LLMs originally took the form of scratchpads [33]. [50] proposed a similar prompt style in natural language, dubbing this approach chain of thought (CoT), and claiming that, with some human hand-annotation of examples, this not only boosts performance without retraining, but "allows reasoning abilities to emerge naturally". They argued that by merely interspersing intermediate reasoning steps in natural language into examples, they were inducing the LLM to "learn via a few examples", motivating this idea with anthropomorphizations ("Consider one's own thought process when solving a complicated reasoning task such as a multi-step math word problem"). [26] argued that some of the performance of CoT could be retained without providing any examples, and instead just appending the magic phrase "let's think step by step" to the end of a prompt. This has been called zero-shot CoT.

However, CoT has long been known to be imperfect and incomplete. Previous work has investigated improving the consistency of CoT through self-consistency [49], multi-agent debate [13], least-to-most prompting [55], deductive verification [28], and other approaches. Unfortunately, many of these involve prompting the LLM multiple times for a single problem, which can balloon the cost of inference. Other work has examined the possibility of reducing or removing the need for human annotation of examples by using LLMs to generate their own examples automatically [54, 9]. To avoid well-known issues with the brittleness of LLM self-verification and self-teaching [42, 22, 20, 19, 24], we restrict this paper's scope to manually written chains of thought.

Previous papers have analyzed CoT from multiple perspectives [15, 37], finding that there is only a loose relationship between the presented chain and the final answer [6], and that the correctness of provided annotations has little effect on resultant performance [38]. LLM-produced chains of thought are also known to be unfaithful to the underlying reasoning process [29, 25, 11]. In particular, the way the examples are presented can bias a model into giving some answer (e.g. if all the example answers are A, the model will be more likely to output A), but its CoT will not reflect this [45].

Motivated by claims that CoT prompts allow models to learn in context how to reason-that is, to learn how to execute human-specified algorithms-we focus on CoT prompting’s out-of-domain generalization. [14] previously showcased a lack of generalization in multiplication, puzzles, and a number sequence problem, even when the model was fine-tuned on CoT examples. However, they only examined one set of prompts, did not experiment with levels of prompt specificity, and were much more interested in local failures of compositionality arising from cumulating error. More broadly, previous work has examined generalization limits of LLMs in arithmetic tasks [35], formula simplification [34], and theorem proving [4].

While early accounts claimed LLMs, despite not being trained for it, were capable of reasoning and planning [8], later work showcased serious brittleness across these domains [47]. [50] claims that "standard prompting only provides a lower bound on the capabilities of large language models", with proper prompting allowing reasoning to "emerge naturally." Recent work seems to maintain this optimism [7]. In this paper, we examine the effectiveness of CoT in the context of classical planning problems, which have well-defined and algorithmically checkable ground truths, can be generated with arbitrary size and difficulty, and are unlikely to be in the training data. If CoT induces more than just pattern matching, and can in fact teach LLMs to perform generalizable, compositional reasoning, then we should expect that to be reflected in robust and maintainable improvements on a simple commonsense benchmark set like Blocksworld, and we should expect these results to hold for scaled variants of the very benchmarks tested in [50] and later CoT work.

3 Background

Classical planning problems task a planner with finding a sequence of actions that, when executed, will take an agent from a pre-specified initial state to a desired goal state. STRIPS planning is a discrete, deterministic formalism that encompasses this class. Problems are represented using the Planning Domain and Definition Language (PDDL) [30] and have long featured in various planning challenges and competitions. Our main experiments are all on the Blocksworld PDDL domain.
A PDDL specification consists of three components. The domain doesn’t change between problems and consists of a set of predicates-whose truth values describe the state of the world-and a set of actions-defined by their preconditions and effects-that the agent is allowed to take. The initial state is a list of predicates that are true at the outset of the specific problem (an example predicate: “Block A is on the table”). The goal is a boolean expression of predicates (a goal: “Block A is on Block B.”).
A plan is a sequence of actions. The solution to a PDDL problem is a plan in which the preconditions of every action are satisfied at execution time, and which arrives at a goal-satisfying final state. To verify a plan, follow the actions in order and check that these two desiderata are achieved. In this work, we convert natural language responses into PDDL [46] and evaluate them with VAL [21].
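The check VAL performs on a plan can be summarized as simulating the actions and testing the two desiderata above. The following is a minimal, simplified sketch of that idea (it is not the VAL implementation, and it treats the goal as a plain conjunction of predicates).

```python
from typing import FrozenSet, NamedTuple, Sequence

class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]  # predicates that must hold before execution
    add_effects: FrozenSet[str]    # predicates the action makes true
    del_effects: FrozenSet[str]    # predicates the action makes false

def validate_plan(initial_state: FrozenSet[str],
                  goal: FrozenSet[str],
                  plan: Sequence[Action]) -> bool:
    """A plan is valid iff every action's preconditions hold when it is
    executed and the final state satisfies the goal."""
    state = set(initial_state)
    for action in plan:
        if not action.preconditions <= state:
            return False  # precondition violated at execution time
        state -= action.del_effects
        state |= action.add_effects
    return goal <= state
```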

4 Chain of Thought Setups for Planning

We examine the influence of prompt selection on LLM performance within subsets of the Blocksworld domain. A formally specified problem instance can be translated into many possible prompts. The most basic of these is input/output (I/O) prompting: the problem is translated directly from PDDL into natural language and provided to the LLM [47]. While this directly tests the LLM’s ability to solve the problem, it is not always the most effective strategy for maximizing performance.
Drawing on metaphors of human learning, recent literature has claimed that LLMs are capable of in-context learning. The basic idea is that, by first presenting the model with examples of similar problems, it is possible to cause an LLM to acquire relevant new skills within the current context window. $n$-shot prompts operationalize this by prepending a number of relevant examples. Chain of thought [50] approaches take this further, presenting human-crafted "thoughts" which the LLM is intended to imitate in its response. Practitioners argue that, intuitively, these augmented examples teach the LLM how to solve problems in the given set.
However, this method relies on human labor [53] to provide task-specific knowledge and an (at least rough) algorithmic or procedural approach to the problem. The more general the provided knowledge is, the more problems it can be applied to, and the less human prompt-crafting it requires. On the other hand, the more granular and specific it is, the higher the performance that can be expected.

Figure 1: Target Distributions of Problems. This figure shows the levels of expected generality for each prompt.

In our experiments, we consider subsets of Blocksworld problems. We follow a prompt structure similar to that described in [47],² but include "thoughts" in our $n$-shot prompts. These thoughts are written to follow an algorithmic procedure for solving the example problem.

Not every procedure is applicable to every problem. From the point of view of a human handcrafting a chain of thought prompt, there is intuitively an expected target distribution on which the demonstrated algorithm generally works. For instance, a prompt designer detailing how to stack C on top of B on top of A will expect that a model that learns this procedure will also be capable of stacking B on top of A on top of C, but may not expect it to know how to first properly dismantle an existing tower of blocks to access a necessary block. However, this distribution often differs from the effective target distribution, that is, the actual set of problems on which the prompt gives robust improvements in performance. We explicitly describe the gap between these two distributions.
Zero-Shot Chain of Thought (Universal): This is the most general approach, and involves merely appending "let's think step by step" to the end of the prompt [26].
Progression Proof (Specific to PDDL): Versions of this CoT could, in principle, be prepended to any PDDL problem prompt, as the generation of annotated examples is easy to automate without knowledge of the specific PDDL domain [47]. This prompt includes (1) a meta-prompt explaining plan correctness and (2) an example where each action is annotated with the state prior to the action, the reason why the action is applicable in that state, and the resulting state after the action is applied. Examples start from an arbitrary block configuration and construct a single stack of blocks from it.
Blocksworld Universal Algorithm (Specific to the Domain): In Blocksworld, it is possible to reach any goal state from any initial state by simply unstacking all the blocks, placing them on the table, and then reassembling them into the required stacks (a minimal sketch of this procedure is given after these prompt descriptions). Resulting plans are not only executable and goal-reaching, but will never exceed twice the length of the optimal plan for any given instance [40]. This prompt demonstrates an annotated version of this approach, explaining and performing both the deconstruction and reconstruction steps of the algorithm. The same examples are used as in the previous prompt. The expected target distribution encompasses all Blocksworld problems.
Stacking Prompt (Specific to a Narrow Problem Class): Every example is a table-to-stack problem: every block starts on the table, and the goal is to create a single specific stack of blocks. This specificity simplifies the problem greatly, and allows near-direct pattern matching between the examples and the LLM's output; however, it is infeasible to specify prompts with this level of detail for every problem class. The expected target distribution comprises table-to-stack Blocksworld problems, as they are the only problems that can be solved by the described algorithm.
Lexicographic Stacking (Specific to Particular Syntactic Sequences): We simplify the problem further by focusing on a particular syntactic form of the goal. This prompt is very similar to the stacking prompt, but is specific to a subset of the target distribution: the goal state is always a lexicographic prefix (e.g., A, AB, ABC, etc.).
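As a concrete reference for the Blocksworld Universal Algorithm prompt described above, here is a minimal Python sketch of the unstack-everything-then-rebuild procedure that the prompt demonstrates in annotated natural language. The state representation (a block-to-support mapping) and the move phrasing are our own illustrative assumptions, not the prompt text itself.

```python
def universal_blocksworld_plan(on, goal_stacks):
    """on: dict mapping each block to the block it rests on, or "table".
    goal_stacks: goal configuration as a list of stacks, each bottom-to-top.
    Returns a plan that is always executable and goal-reaching, though in
    general suboptimal: put everything on the table, then rebuild."""
    on = dict(on)
    plan = []
    # Phase 1: dismantle. Repeatedly unstack any clear block not on the table.
    while any(support != "table" for support in on.values()):
        for block, support in list(on.items()):
            is_clear = all(other != block for other in on.values())
            if support != "table" and is_clear:
                plan.append(f"unstack the {block} block from the {support} block")
                plan.append(f"put down the {block} block")
                on[block] = "table"
    # Phase 2: rebuild each goal stack from the bottom up.
    for stack in goal_stacks:
        for below, above in zip(stack, stack[1:]):
            plan.append(f"pick up the {above} block")
            plan.append(f"stack the {above} block on top of the {below} block")
    return plan

# Example: initial state C on B on A, D on the table; goal: A on B and D on C.
# plan = universal_blocksworld_plan({"A": "table", "B": "A", "C": "B", "D": "table"},
#                                   [["B", "A"], ["C", "D"]])
```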

5 Blocksworld Results

We perform two parallel studies. The first tests each chain of thought prompt on its intended problem distribution, as explained in the previous section. Then, we focus on a specific subclass of Blocksworld problems and test every prompt on just that subclass. Together, we expect these two studies to give us a good picture of how effective LLMs are in applying advice beyond the specific instances.
| Prompt | GPT-4-Turbo | Claude-3-Opus | GPT-4 |
| :--- | :---: | :---: | :---: |
| Zero-shot | 4.07% (11/270) | 9.62% (26/270) | 3.33% (9/270) |
| Zero-shot CoT | 5.55% (15/270) | 8.51% (23/270) | 4.44% (12/270) |
| Domain-Specific $n$-shot | 7.4% (20/270) | 11.4% (31/270) | 7.4% (20/270) |
| Progression Proof CoT | 3.33% (9/270) | 1.11% (3/270) | 4.44% (12/270) |
| Domain-Specific $n$-shot | 7.4% (20/270) | 11.4% (31/270) | 7.4% (20/270) |
| Blocksworld Universal Algorithm | 11.8% (32/270) | 17.7% (48/270) | 28.8% (78/270) |
| Problem Class Specific $n$-shot | 18% (47/261) | 15.7% (41/261) | 8.81% (23/261) |
| Stacking Prompt | 40.6% (106/261) | 24.5% (64/261) | 59.3% (155/261) |
| Lexicographic Specific $n$-shot | 5.88% (1/17) | 58.8% (10/17) | 5.88% (1/17) |
| Lexicographic Stacking Prompt | 76.4% (13/17) | 100% (17/17) | 94.1% (16/17) |

Table 1: Accuracy across CoT types and prompting methods in Blocksworld.

Figure 2: Accuracy of GPT-4-Turbo, Claude-3-Opus and GPT-4 across chain of thought prompting methods in their intended problem distributions with increasing number of blocks.

5.1 Testing on Intended Problem Distributions

We evaluate the performance of GPT-4 and Claude-3-Opus on Blocksworld problems with both standard 2-shot prompts and chain of thought prompts of varying granularity. Each prompt is tested on its intended problem class, as discussed in the previous section.

As illustrated in Table 1, chain of thought does not meaningfully enhance performance except on the narrowest problem distributions. While providing this chain of thought advice becomes significantly harder as the level of specificity increases, it is necessary, as the LLM succeeds only when the problem is reduced to a level where basic pattern matching suffices: at each stage, stack the next letter on top; if that letter does not exist on the table, then stop.
A key advantage of planning domains is that they provide the ability to easily and systematically generate larger test sets, including arbitrarily more challenging instances. The difficulty of a Blocksworld instance scales with the number of blocks involved, allowing us to clearly assess the out-of-domain generalization achievable with and without chain of thought. As shown in Figure 2, chain of thought does not generalize beyond a handful of blocks. Note that sound planning systems (such as Fast Downward) have 100% accuracy on all problems tested.
| Prompt | GPT-4-Turbo | Claude-3-Opus | GPT-4 |
| :--- | :---: | :---: | :---: |
| Zero-shot | 19.1% | 9.96% | 3.83% |
| Zero-shot CoT | 21% | 10.34% | 4.98% |
| Domain-Specific $n$-shot | 13.7% | 16.4% | 6.13% |
| Progression Proof CoT | 15.3% | 4.59% | 6.89% |
| Domain-Specific $n$-shot | 13.7% | 16.4% | 6.13% |
| Blocksworld Universal Algorithm | 37.1% | 37.1% | 51.3% |
| Problem Class Specific $n$-shot | 18% | 15.7% | 8.81% |
| Stacking Prompt | 40.6% | 24.5% | 59.3% |

Table 2: Accuracy across CoT and example granularities over 261 instances in table-to-stack Blocksworld.

5.2 Testing only on Table-to-Stack

As mentioned before, a table-to-stack problem is any problem in the intended target distribution of the stacking prompt. The initial state has every block on the table, with a goal of arranging all the blocks into a single, pre-specified stack. While this is a simple problem, GPT-4's zero-shot performance over 261 instances is 3.8%. With the stacking CoT prompt, performance improves to 59.3%. Is this a result of the model learning in-context how to reason correctly over this type of problem? If so, we might expect it to perform the same when presented with a more general CoT prompt that demonstrates the same procedure, but is applicable to a greater variety of problems.
To check this, we evaluate performance of our prompts on table-to-stack problems with prompts of varying granularity: standard I/O prompting, general $n$-shot (drawn from arbitrary Blocksworld problems), goal-specific $n$-shot (drawn from table-to-stack problems), and three levels of CoT specificity. Table 2 shows the results: only the most specific and least applicable prompt retains anywhere near this performance improvement. Figure A.1.1 in the appendix further illustrates that none of the prompts provide robust stack-height generalizability. We also tested self-consistency [49] on these prompts, but found that performance dropped. Details can be found in Appendix A.2.

If chain of thought is meant to replicate human thinking or learning, it should generalize beyond the most direct pattern matches and allow for more robust reasoning across similar problems. However, our results show only a modest improvement in performance on some domains with sufficiently specific prompting strategies, and that improvement quickly deteriorates when the queried problems become slightly larger than those shown in the examples.

6 Extension to Scalable Synthetic Benchmarks

Previous work on CoT [26, 50, 55, 6] mainly constrained its evaluations to static test sets ranging from commonsense domains (Sports Understanding [41], StrategyQA [18], CommonSenseQA [44]) and few-hop math word problems (AsDiv [31], GSM8k [10], MAWPS [27]) to a number of basic "symbolic reasoning" tasks (CoinFlip [26], LastLetterConcatenation [26], Shuffled Objects [41]). Many of these benchmarks are difficult to scale, but a number of them can be modified to allow for the generation of arbitrary new instances which nevertheless have clear ground truths. We examine CoinFlip, LastLetterConcatenation, and a synthetic proxy for multi-step arithmetical reasoning. Exact prompt details can be found in Appendices A.7, A.8, and A.9. When possible, we used the manual CoT prompts found in [50] and the zero-shot CoT prompt described in [26]. The number of examples ranges from 0 to 3 for both CoT and direct prompts.
CoinFlip: Parity tests have a long history in machine learning [32]. CoinFlip is a natural language version of this task introduced in [50] to showcase the performance of CoT, though that paper only studies problems with at most four flips. An example prompt is "A coin is heads up. Craig flips the coin. Alice does not flip the coin. Is the coin still heads up?". The correct answer is "no". Note that chance performance on this domain is 50%, as there are only two possible answers. Our extension to the domain is detailed in Appendix A.3.
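Because the answer is determined entirely by the parity of the number of actual flips, instances of arbitrary length are trivial to generate and label. The sketch below is our own illustration of such a generator (the names and phrasing are assumptions, not the exact dataset construction in the appendix).

```python
import random

def make_coinflip_instance(names, num_steps, rng=random):
    """Build a CoinFlip question with num_steps flip/does-not-flip statements.
    The coin starts heads up, so the answer is "yes" iff the number of actual
    flips is even."""
    lines = ["A coin is heads up."]
    flips = 0
    for _ in range(num_steps):
        person = rng.choice(names)
        if rng.random() < 0.5:
            lines.append(f"{person} flips the coin.")
            flips += 1
        else:
            lines.append(f"{person} does not flip the coin.")
    lines.append("Is the coin still heads up?")
    answer = "yes" if flips % 2 == 0 else "no"
    return " ".join(lines), answer

question, answer = make_coinflip_instance(["Craig", "Alice"], num_steps=10)
```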
LastLetterConcatenation: Also introduced in [50], the LastLetterConcatenation task is a simple text processing task that asks for the concatenation of the last letters of a series of words. An example prompt is "Take the last letters of each word in 'Craig Alice' and concatenate them.", for which the correct answer is "ge". The set of possible answers on this task is much larger than in CoinFlip, but previous work has claimed significant performance increases on this kind of task with CoT. Modeling something similar to our Blocksworld granularity experiments, we create two other test sets, using the same underlying words in the same distribution, but which differ in what they ask the model to do. LastVowelConcatenation requires using only the last vowels of words. FoomLetterConcatenation requires using the first letter of the first word, the second letter of the second word, and so forth. If the $n$th word does not have an $n$th letter, the problem specifies that a 0 should be concatenated to the string instead.

Figure 3: Accuracy of GPT-4-Turbo with chain of thought prompting across variations of our synthetic datasets. "Direct" means direct prompting without any CoT.

| Prompt | CF | LLC | LVC | FLC | Arithmetic | AE |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Zero-Shot | 56.38% | 10.00% | 5.75% | 1.81% | 24.13% | 45.60% |
| Zero-Shot CoT | 95.71% | 52.54% | N/A | N/A | 56.12% | 42.76% |
| Manual CoT | 98.89% | 51.06% | 27.00% | 26.00% | 50.43% | 69.31% |
| Incorrect CoT | 96.76% | 48.15% | N/A | N/A | N/A | N/A |

Table 3: Accuracy across CoT types and problem variations over all instances in our synthetic datasets. CF is CoinFlip, LLC is LastLetterConcatenation, LVC is LastVowelConcatenation, FLC is FoomLetterConcatenation, Arithmetic is baseline single-digit Arithmetic, and AE is the same problems but with the explanation provided that all intermediate answers are single digit.
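Each variant's ground truth is computable with a few lines of string processing, which is what keeps arbitrarily long instances checkable. The following sketch reflects our reading of the task definitions above (it assumes, for LastVowelConcatenation, that every word contains a vowel).

```python
VOWELS = set("aeiou")

def last_letter_concat(words):
    """Concatenate the last letter of each word."""
    return "".join(word[-1] for word in words)

def last_vowel_concat(words):
    """Concatenate the last vowel of each word (assumes each word has one)."""
    return "".join([c for c in word.lower() if c in VOWELS][-1] for word in words)

def foom_letter_concat(words):
    """Concatenate the n-th letter of the n-th word (1-indexed),
    using "0" when the word has no such letter."""
    return "".join(word[i] if i < len(word) else "0" for i, word in enumerate(words))

# "Take the last letters of each word in 'Craig Alice' and concatenate them."
assert last_letter_concat(["Craig", "Alice"]) == "ge"
```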
Multi-step Arithmetic on Single-Digit Numbers: CoT is often tested on math word problems. However, many of these test sets only include problems which require very small numbers of reasoning steps. GSM8k was designed partly so that its problems would "require more steps to solve", but its problems only range from 2 to 8 steps [10], and, in fact, previous analyses have found that only 10% of those problems require more than five steps; the majority require 2, 3, or 4 [16].

To sidestep this issue, we construct a synthetic dataset that involves linearly simplifying parenthesized expressions that consist of repeated applications of the four basic arithmetical operations on one-digit numbers. An example prompt is "Simplify the following expression into a single number: 3 / (9 - (5 + (1)))", where the correct answer is 1. We filter our problems so that no operation ever results in a number outside the range 1 to 9.³ This can be seen as a deeply simplified variant of the arithmetical expression simplification dataset presented in [34], where no modular arithmetic, negative numbers, or non-linear nesting is required. However, we extend our maximum number of required reasoning steps much further and we construct prompts which are more specific and spell out every single step explicitly. More details on the dataset can be found in Appendix A.5.
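A generator in the spirit of this construction is sketched below. It is our own illustration under stated assumptions (purely right-nested expressions, integer-valued intermediate results): each step wraps the current expression in one more single-digit operation and rejection-samples until every intermediate value stays in the range 1 to 9.

```python
import random

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b,
       "/": lambda a, b: a / b}

def make_expression(num_steps, rng=random, max_tries=1000):
    """Return (expression_string, answer) for a linearly nested expression
    over single-digit numbers whose every intermediate value lies in 1..9."""
    value = rng.randint(1, 9)
    expr = str(value)
    for _ in range(num_steps):
        for _ in range(max_tries):
            op = rng.choice(list(OPS))
            left = rng.randint(1, 9)
            result = OPS[op](left, value)
            if result == int(result) and 1 <= result <= 9:
                expr = f"{left} {op} ({expr})"
                value = int(result)
                break
        else:
            raise RuntimeError("no valid continuation found")
    return expr, value

# e.g. make_expression(3) might produce ("3 / (9 - (5 + (1)))", 1)
```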

6.1 Results

Length Generalization: The only synthetic domain that shows any hints of generalization is CoinFlip. Using [50]'s prompt, performance is perfect for 1- through 4-step problems, starts to show the occasional mistake after that, and only dips below 90% at 31-step problems. However, the problems in this domain are very simple. Parallel to the lexicographic stacking case of Blocksworld, the domain does not require much reasoning beyond counting up to an average of half a given problem's step count.

LastLetterConcatenation and multi-step arithmetic show behavior almost identical to our main experiments. While sufficiently specific CoT prompts do increase performance on small instances, this performance increase quickly degrades as the number of necessary steps increases. Notably, the string-based nature of the LastLetterConcatenation problem allows us to examine what exact improvement CoT is inducing. We examine the data with different metrics and find that the only properties that do generalize with CoT are syntactic. In particular, while overall accuracy plummets back to that of direct prompting, CoT consistently improves the Levenshtein distance to the correct answer and ensures that the final response string contains exactly the right letters, just not in the right order or number. We take this as further evidence that CoT, rather than teaching algorithms or procedures, modifies the syntactic style of the LLM's output, and that this pattern matching is what leads to the observed increases in performance on smaller instances.
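The two syntactic properties referred to above can be measured with standard string metrics; the following is a minimal sketch of that analysis (our illustration of the metrics, not the exact evaluation script).

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def right_letters(response: str, answer: str) -> bool:
    """True when the response uses exactly the right set of letters,
    even if their order or multiplicity is wrong."""
    return set(response) == set(answer)
```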
Prompt Granularity and Problem Variation: Because of the simplicity of these problems, prompt granularity is much harder to examine than in Blocksworld; there is not enough variation in the possible problems. However, across the three types of letter concatenation and two types of arithmetic expression simplification that we test, we see very similar patterns as before: CoT's performance improvements are maintained much longer in easier cases, and take longer to collapse back to direct-prompting performance. There still seems to be a "sweet spot" where the problem is just barely hard enough that CoT makes a difference, but not so hard that this difference no longer matters.
Examining Intermediate Reasoning: The construction of our synthetic arithmetic task gives some hints as to what part of CoT may be failing. [14] argues that compositional reasoning fails because LLMs perform linearized subgraph matching and act as noisy estimators of intermediate functions (see, e.g., Proposition 4.2 in [14]), and that performance collapses follow from the fact that repeated application of any error-prone function estimator leads to exponentially accumulating error.

In our problem, it is possible to exhaustively check whether this is the case. There are exactly 118 possible one-digit binary arithmetic problems which result in a one-digit number. We tested GPT-4-Turbo, GPT-4, GPT-3.5-Turbo, Llama3-70b, and Llama3-8b on this dataset at various temperatures, and every single model scored 100%. However, despite perfect performance on application of the required intermediate function, CoT still does not lead to robust generalization to arbitrary-length problems. Therefore, at least on this problem set, the issue is not due to accumulating error. The problem must lie with the LLM's inability to learn the correct algorithm from contextual demonstrations, rather than with its inability to execute that algorithm.
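The count of 118 is easy to reproduce by brute force. The sketch below is our own check; it takes "one-digit" to mean that both operands and the result lie in 1 to 9, matching the dataset's filter.

```python
from itertools import product

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b,
       "/": lambda a, b: a / b}

problems = [(a, op, b)
            for a, b in product(range(1, 10), repeat=2)
            for op in OPS
            if OPS[op](a, b) == int(OPS[op](a, b)) and 1 <= OPS[op](a, b) <= 9]
print(len(problems))  # 118
```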

Overall, we see that our results on planning are not a fluke. These three synthetic domains showcase similar generalization failures, but these failures only become clear when the problems tested on require sufficiently many reasoning steps or when the minor modifications of the domain are studied. This illustrates the need for testing on benchmarks which can generate arbitrary new instances of increasing difficulty. Without such testing, conclusions drawn from static test sets of limited size are unlikely to be robust. We implore the community at large to adopt more rigorous evaluation mechanisms, especially when making claims about the poorly-understood yet much-hyped algorithmic reasoning abilities of black box models.

7 Conclusion

In this paper, we conducted a systematic evaluation of the effectiveness of chain of thought in large language models on a specific classical planning problem. Our case study indicates that, contrary to previous claims in the literature, providing examples of procedural reasoning does not induce in current state-of-the-art large language models a general ability to apply that procedure to novel instances. In fact, the performance improvements seen when prompting LLMs in this manner quickly vanish when queries differ in generality from the examples, despite the fact that the same algorithmic procedure applies to the larger or more general instance.
Very specific prompts are more likely to work, but they can require significantly more human labor to craft. Our results indicate that chain of thought prompts may only work consistently within a problem class if the problem class is narrow enough and the examples given are specific to that class. Both of these facts show that chain of thought approaches provide less generalization than previous claims seem to indicate, and hint that basic pattern matching, rather than in-context learning of general algorithmic procedures, may better explain the improvements seen from chain of thought.

References 参考资料

[1] The United States Social Security Administration I SSA - ssa.gov. ssa.gov.
[1] 美国社会保障局 I SSA - ssa.gov.ssa.gov.

[2] Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, and Besmira Nushi. Kitab: Evaluating llms on constraint satisfaction for information retrieval. arXiv preprint arXiv:2310.15511, 2023.
[2] Marah I Abdin、Suriya Gunasekar、Varun Chandrasekaran、Jerry Li、Mert Yuksekgonul、Rahee Ghosh Peshawaria、Ranjita Naik 和 Besmira Nushi。Kitab:ArXiv preprint arXiv:2310.15511, 2023.

[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Josh Achiam、Steven Adler、Sandhini Agarwal、Lama Ahmad、Ilge Akkaya、Florencia Leoni Aleman、Diogo Almeida、Janko Altenschmidt、Sam Altman、Shyamal Anadkat 等,Gpt-4 技术报告。ArXiv 预印本 arXiv:2303.08774, 2023。

[4] Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 35:38546-38556, 2022.
[4] Cem Anil、Yuhuai Wu、Anders Andreassen、Aitor Lewkowycz、Vedant Misra、Vinay Ramasesh、Ambrose Slone、Guy Gur-Ari、Ethan Dyer 和 Behnam Neyshabur。探索大型语言模型的长度泛化。神经信息处理系统进展》,35:38546-38556,2022 年。

[5] Anthropic. Introducing the next generation of claude, Mar 2024.
[5] Anthropic.介绍新一代克劳德,2024 年 3 月。

[6] Guangsheng Bao, Hongbo Zhang, Linyi Yang, Cunxiang Wang, and Yue Zhang. Llms with chain-of-thought are non-causal reasoners. arXiv preprint arXiv:2402.16048, 2024.
[6] 包广生、张洪波、杨林一、王存祥、张越。Llms with chain-of-thought are non-causal reasoners. arXiv preprint arXiv:2402.16048, 2024.

[7] Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, et al. Topologies of reasoning: Demystifying chains, trees, and graphs of thoughts. arXiv preprint arXiv:2401.14295, 2024.
[7] Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, et al. Topologies of reasoning:解密思想链、思想树和思想图。arXiv 预印本 arXiv:2401.14295, 2024.

[8] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
[8] Sébastien Bubeck、Varun Chandrasekaran、Ronen Eldan、Johannes Gehrke、Eric Horvitz、Ece Kamar、Peter Lee、Yin Tat Lee、Yuanzhi Li、Scott Lundberg 等:人工通用智能的火花:ArXiv preprint arXiv:2303.12712, 2023.

[9] Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. Chainlm: Empowering large language models with improved chain-of-thought prompting. arXiv preprint arXiv:2403.14312, 2024.
[9] Xiaoxue Cheng,Junyi Li,Wayne Xin Zhao,and Ji-Rong Wen.Chainlm:ArXiv preprint arXiv:2403.14312, 2024.

[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[10] Karl Cobbe、Vineet Kosaraju、Mohammad Bavarian、Mark Chen、Heewoo Jun、Lukasz Kaiser、Matthias Plappert、Jerry Tworek、Jacob Hilton、Reiichiro Nakano 等:《训练验证器解决数学单词问题》。ArXiv 预印本 arXiv:2110.14168, 2021。

[11] Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022.
[11] Antonia Creswell 和 Murray Shanahan.ArXiv preprint arXiv:2208.14271, 2022.

[12] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022.
[12] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui.arXiv preprint arXiv:2301.00234, 2022.

[13] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
[13] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch.通过多代理辩论提高语言模型的事实性和推理能力。arXiv 预印本 arXiv:2305.14325, 2023.

[14] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024.

[15] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2024.

[16] Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Ellen Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, and Tianlu Wang. Efficient tool use with chain-of-abstraction reasoning. arXiv preprint arXiv:2401.17464, 2024.

[17] Gaël Gendron, Qiming Bao, Michael Witbrock, and Gillian Dobbie. Large language models are not abstract reasoners. arXiv preprint arXiv:2305.19555, 2023.

[18] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346-361, 2021.

[19] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.

[20] Ruixin Hong, Hongming Zhang, Xinyu Pang, Dong Yu, and Changshui Zhang. A closer look at the self-verification abilities of large language models in logical reasoning. arXiv preprint arXiv:2311.07954, 2023.

[21] Richard Howey, Derek Long, and Maria Fox. VAL: Automatic plan validation, continuous effects and mixed initiative planning using PDDL. In 16th IEEE International Conference on Tools with Artificial Intelligence, pages 294-301. IEEE, 2004.

[22] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations, 2023.

[23] IPC. International planning competition, 1998.

[24] Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, and Daniel Khashabi. Self-[in] correct: Llms struggle with refining self-generated responses. arXiv preprint arXiv:2404.04298, 2024.

[25] Yeo Wei Jie, Ranjan Satapathy, Goh Siow Mong, Erik Cambria, et al. How interpretable are reasoning explanations from prompting large language models? arXiv preprint arXiv:2402.11863, 2024.
[26] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199-22213, 2022.

[27] Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. Mawps: A math word problem repository. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, pages 1152-1157, 2016.

[28] Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems, 36, 2024.

[29] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379, 2023.

[30] Drew McDermott, Malik Ghallab, Adele E. Howe, Craig A. Knoblock, Ashwin Ram, Manuela M. Veloso, Daniel S. Weld, and David E. Wilkins. Pddl-the planning domain definition language. 1998.

[31] Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers. arXiv preprint arXiv:2106.15772, 2021.

[32] Marvin Minsky and Seymour Papert. An introduction to computational geometry. MIT Press, Cambridge, MA, 1969.

[33] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.

[34] Flavio Petruzzellis, Alberto Testolin, and Alessandro Sperduti. Benchmarking gpt-4 on algorithmic problems: A systematic evaluation of prompting strategies. arXiv preprint arXiv:2402.17396, 2024.

[35] Jing Qian, Hong Wang, Zekun Li, Shiyang Li, and Xifeng Yan. Limitations of language models in arithmetic and symbolic induction. arXiv preprint arXiv:2208.05051, 2022.

[36] Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597, 2022.

[37] Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022.

[38] Rylan Schaeffer, Kateryna Pistunova, Samar Khanna, Sarthak Consul, and Sanmi Koyejo. Invalid logic, equivalent gains: The bizarreness of reasoning in language model prompting. arXiv preprint arXiv:2307.10573, 2023.

[39] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2022.

[40] John Slaney and Sylvie Thiébaux. Linear time near-optimal planning in the blocks world. In Proceedings of the National Conference on Artificial Intelligence, pages 1208-1214, 1996.

[41] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

[42] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. arXiv preprint arXiv:2402.08115, 2024.

[43] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging bigbench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

[44] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACLHLT, pages 4149-4158, 2019.

[45] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2024.

[46] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36, 2024.

[47] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems, 36, 2024.

[48] Jianing Wang, Qiushi Sun, Nuo Chen, Xiang Li, and Ming Gao. Boosting language models reasoning with chain-of-knowledge prompting. arXiv preprint arXiv:2306.06427, 2023.

[49] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022.

[50] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824-24837, 2022.

[51] Yiran Wu, Feiran Jia, Shaokun Zhang, Qingyun Wu, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, and Chi Wang. An empirical study on challenging math problem solving with gpt-4. arXiv preprint arXiv:2306.01337, 2023.

[52] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.

[53] JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1-21, 2023.

[54] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations, 2022.

[55] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
[56] Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, and Hanie Sedghi. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022.

  1. *Equal contribution.
  2. Prompt and response examples for each approach can be found in the Appendix.
  3. We exclude 0, since any number multiplied by zero is zero, and this would quickly lead to zero representing around 50% of correct answers for larger numbers of reasoning steps.