Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Eric Zelikman, Stanford University
Georges Harik, Notbad AI Inc
Yijia Shao, Stanford University
Varuna Jayasiri, Notbad AI Inc
Nick Haber, Stanford University
Noah D. Goodman, Stanford University
Abstract
When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting - ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9% → 10.9%) and CommonsenseQA (36.3% → 47.2%) and observe a perplexity improvement on difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way. "Life can only be understood backwards; but it must be lived forwards." — Søren Kierkegaard
1 Introduction
Much of the meaning of text is hidden between the lines: without understanding why statements appear in a document, a reader has only a shallow understanding. Moreover, this has been repeatedly shown to be true for LMs as well, in the contexts of tasks ranging from commonsense reasoning to theorem proving to programming (Wei et al., 2022b; Nye et al., 2021; Zelikman et al., 2022; 2023a; Kojima et al., 2022). Reasoning about implications of text to predict later text has consistently been shown to improve LM performance on a variety of tasks, but methods for allowing LMs to learn from their reasoning (e.g., Zelikman et al. 2022) have focused on solving individual tasks or predefined sets of tasks (e.g., Wei et al. 2021b). These works rely on carefully curated datasets to provide either specific reasoning tasks or, in some cases, the reasoning itself. We instead ask, if reasoning is implicit in all text, why shouldn't we leverage the task of language modeling to teach reasoning?
Figure 1: Quiet-STaR. We visualize the algorithm as applied during training to a single thought. We generate thoughts, in parallel, following all tokens in the text (think). The model produces a mixture of its next-token predictions with and without a thought (talk). We apply REINFORCE, as in STaR, to increase the likelihood of thoughts that help the model predict future text while discarding thoughts that make the future text less likely (learn).
In particular, the Self-Taught Reasoner (STaR, Zelikman et al. 2022) showed that LMs can bootstrap their reasoning ability on question-answering (QA) datasets by sampling rationales to attempt to answer questions, training on rationales if they led to a correct final answer, and then repeating this to iteratively solve more difficult problems. Yet, training from curated QA datasets limits the scale and generalizability of the rationales. QA datasets, especially high-quality ones, require thoughtful curation and will inherently only ever cover a subset of reasoning tasks. Thus, we extend STaR - instead of the LM learning to reason on particular tasks like mathematical QA, we train an LM to generate reasoning that helps it infer future text from a large internet text corpus. As a result, we allow the LM to learn from the diverse tasks present in language (Weber et al., 2021). This builds on an intuition essential to the current language modeling paradigm, namely, that "language models are unsupervised multitask learners" (Radford et al., 2019). Thus, as in STaR, we leverage the LM's pre-existing reasoning ability to generate rationales and train the LM on them with a REINFORCE-based reward (Williams, 1992). We refer to this technique as Quiet-STaR, as it can be understood as applying STaR "quietly", training the model to think before it speaks.
Broadly, Quiet-STaR proceeds by generating rationales after every token to explain future text (think), mixing the future-text predictions with and without rationales (talk), and then learning to generate better rationales using REINFORCE (learn). We apply Quiet-STaR to Mistral 7B (Jiang et al., 2023) using the web text datasets OpenWebMath (Paster et al., 2023) and Colossal Clean Crawled Corpus (C4, Raffel et al. 2020). We find that, even without dataset-specific fine-tuning, Quiet-STaR results in improvements to zero-shot direct-reasoning abilities on CommonsenseQA (36.3% → 47.2%) and GSM8K (5.9% → 10.9%), and that these improvements consistently increase with the number of tokens used in the LM's internal thoughts. Lastly, we qualitatively investigate patterns in the generated rationales.
In solving this task, we make the following contributions:
We generalize STaR to learn reasoning from diverse unstructured text data. To our knowledge, this is the first work explicitly training LMs to reason generally from text, rather than on curated reasoning tasks or collections of reasoning tasks.
We propose and implement a parallel sampling algorithm that makes our training procedure scalable, generating rationales from all token positions in a given string.
We introduce custom meta-tokens at the start and end of each thought to allow the LM to learn that it should be generating a rationale and when it should make a prediction based on that rationale.
We apply a mixing head to retrospectively determine how much to incorporate the next-token prediction from a given thought into the current next-token prediction.
We show that a non-myopic loss, including multiple tokens ahead for language modeling, improves the effect of thinking.
On multiple tasks, we demonstrate that thinking allows the LM to predict difficult tokens better than one trained on the same web text, improving with longer thoughts.
Figure 2: Generalization Results. We evaluate the extent to which the model trained with Quiet-STaR generalizes to directly answering problems that require reasoning. The left plot (a) shows the zero-shot accuracy on GSM8K, while the right plot (b) shows the zero-shot accuracy on CommonsenseQA, without any fine-tuning. In both plots, the x-axis represents training steps, and each line corresponds to a different number of thinking tokens used during Quiet-STaR training. The y-axis measures the zero-shot direct accuracy on the respective datasets. We also include an inference normalized version of this plot in Figure 6.
2 Related Work
2.1 Reasoning in Language Models
There have been many works on training and exploiting language models to solve difficult tasks by first training them to reason through them. For example, Rajani et al. (2019) demonstrated that a pre-trained language model fine-tuned to output human reasoning traces before answering multiple-choice commonsense reasoning questions outperformed one trained directly on answers. Shwartz et al. (2020) demonstrated that language models, when provided with some scaffolding, can generate these helpful chain-of-thought solutions without additional supervision. Later, Nye et al. (2021) demonstrated that "scratchpads" required less scaffolding when the language models were more capable, a result later reinforced by Wei et al. (2022b), emphasizing informal tasks, and further strengthened by Kojima et al. (2022), demonstrating this behavior could be accomplished zero-shot. Most recently, Wang & Zhou (2024) showed further that for commonsense-question answering, one could force a language model to leverage chain-of-thought reasoning by preventing it from emitting any valid answer tokens unless it was confident. However, once again, these approaches only work for a question-answer dataset, and Wang & Zhou (2024) relies on heuristics to identify when the model has output answer tokens. Somewhat like TRICE (Phan et al., 2023), we use the relative improvements in the log-likelihood of the target text across rationales as an estimate of quality, but we simply subtract the mean reward and do not incorporate more complex control variates.
2.2 Training Language Models to Reason
One direction that researchers have used to train language models to reason or improve their reasoning is training the language model on mined reasoning traces or reasoning-like data (Rajani et al., 2019; Wei et al., 2021a; Lewkowycz et al., 2022; Chung et al., 2022; Gunasekar et al., 2023). Although this approach has been demonstrated to be effective, it comes with drawbacks. It requires manual annotation, which is sensitive to the capability of the annotators and is off-policy for the language model (i.e., the distribution of reasoning is not text that the language model would otherwise likely have generated). This approach is also expensive, difficult to scale, and provides no clear path to solving problems harder than those that the annotators are capable of solving.
Another direction for teaching reasoning relies on a language model's own generated reasoning, which can be seen as building on a large body of literature on self-play (Silver et al., 2017; Anthony et al., 2017; Polu & Sutskever, 2020). These include methods such as the
Algorithm 1: Quiet Self-Taught Reasoner (Quiet-STaR)
Input: Language model \(\theta_{0}\), training steps num_steps, sequence length \(l\), thought length \(t\), learning rate \(\alpha\), batch size \(b\), number of thoughts \(n_{\text{thoughts}}\), number of ground truth tokens used for supervising each thought \(n_{\text{true}}\)
Output: Language model \(\theta\) that generates rationales to predict future text
for \(i=0\) to num_steps do
    Sample batch of sequences \(X\) of length \(l\)
    \(h^{\text{init}} \leftarrow \text{hidden\_states}_{\theta_{i}}(X)\)
    for \(j=1\) to \(l\) in parallel using attention mask do
        \(\log p_{j:j+n_{\text{true}}}^{\text{init}} \leftarrow \text{lm\_head}_{\theta_{i}}(h_{j:j+n_{\text{true}}}^{\text{init}})\)   // Predict next tokens
        \(T_{j} \leftarrow \text{generate\_tokens}_{\theta_{i}}([X_{:j}; <\text{start\_thought}>], t, n_{\text{thoughts}})\)   // Generate thought
        \(T_{j} \leftarrow [T_{j}; <\text{end\_thought}>]\)
        \(h_{j:j+n_{\text{true}}}^{\text{thought}} \leftarrow \text{hidden\_states}_{\theta_{i}}([X_{:j}; T_{j}; X_{j:j+n_{\text{true}}-1}])\)
        \(\log p_{j:j+n_{\text{true}}}^{\text{thought}} \leftarrow \text{lm\_head}_{\theta_{i}}(h_{j:j+n_{\text{true}}}^{\text{thought}})\)   // Predict next tokens w/ thought
        \(w_{j:j+n_{\text{true}}} \leftarrow \text{mixing\_head}_{\theta_{i}}(h_{j:j+n_{\text{true}}}^{\text{thought}}, h_{j:j+n_{\text{true}}}^{\text{init}})\)
        \(\log p_{j:j+n_{\text{true}}}^{\text{talk}} \leftarrow w_{j:j+n_{\text{true}}} \cdot \log p_{j:j+n_{\text{true}}}^{\text{init}} + (1-w_{j:j+n_{\text{true}}}) \cdot \log p_{j:j+n_{\text{true}}}^{\text{thought}}\)   // Mix logits
        \(\mathcal{L}_{j}^{\text{NLL}} \leftarrow -\log p_{j:j+n_{\text{true}}}^{\text{talk}}(X_{j+1:j+n_{\text{true}}+1})\)
        \(r_{j} \leftarrow \log p_{j:j+n_{\text{true}}}^{\text{talk}}(X_{j+1:j+n_{\text{true}}+1}) - \log \bar{p}_{j:j+n_{\text{true}}}^{\text{talk}}(X_{j+1:j+n_{\text{true}}+1})\)
        \(\nabla_{\theta} \mathcal{L}_{j}^{\text{REINFORCE}} \leftarrow -r_{j}\, \mathbb{1}[r_{j}>0] \cdot \nabla_{\theta} \log p_{\theta_{i}}(T_{j} \mid [X_{:j}; <\text{start\_thought}>])\)
        \(\nabla_{\theta} \mathcal{L}_{j} \leftarrow \nabla_{\theta} \mathcal{L}_{j}^{\text{NLL}} + \nabla_{\theta} \mathcal{L}_{j}^{\text{REINFORCE}}\)
    \(\theta_{i+1} \leftarrow \theta_{i} - \alpha \sum_{j=1}^{l} \nabla_{\theta} \mathcal{L}_{j}\)   // Update model parameters
return \(\theta_{\text{num\_steps}}\)
Self-Taught Reasoner (Zelikman et al., 2022), which demonstrated that a language model iteratively trained on its reasoning that led to correct answers could solve increasingly difficult problems. Later work aimed to leverage additional information or assumptions, such as Huang et al. (2022), which demonstrated that the algorithm proposed in STaR could still work if one assumed that the majority-vote answer was correct (although this has a lower ultimate performance). Further work has generalized the results of Zelikman et al. (2022), such as Uesato et al. (2022), which demonstrated additional usefulness of "process-based" supervision where incorrect reasoning traces were filtered; more recently, V-STaR (Hosseini et al., 2024) demonstrates that training a verifier to guide generation also improves performance, and TRICE (Hoffman et al., 2024) maximizes the marginal likelihood of the correct answer given several reasoning traces per problem. Finally, related work has also explored learning intermediate reasoning in the constrained setting of making mathematical statements, where statements in the model's intermediate reasoning could be constrained to only be valid mathematical statements (Poesia et al., 2023). We include further discussion of related reasoning works in Appendix F.
2.3 Meta-tokens
Recently, a growing body of work has demonstrated the usefulness of custom tokens optimized to perform specific functions in the context of a neural network - for this reason, they have also been referred to as "function vectors" (Todd et al., 2023). One of the original instantiations of this was prompt-tuning (Lester et al., 2021) (and relatedly prefix-tuning (Li & Liang, 2021)), where the embeddings corresponding to the tokens of a prompt could be optimized to better accomplish a task. Others have applied meta-tokens to compress long prompts (Li et al., 2023; Jung & Kim, 2023) for efficiency. Most relevant to this work, Mu et al. (2024) optimized a token such that, when the tokens after it could not attend to the tokens before it (i.e., a context compression token), it would provide sufficient information to future tokens. Although we do not focus on compression, we share the problem of learning a
token that affects attention and controls complex downstream behavior. In one related work, Goyal et al. (2023) show that learning a single "pause" token (essentially representing each token as two tokens) improves LM performance. However, unlike the thought tokens in our work, this pause token does not initialize a thought - instead, it can be seen as acting as the entirety of the thought. We find that reasoning in language is significantly more helpful.
3 Problem Statement
In this work, we introduce an auxiliary 'rationale' variable between each pair of observed tokens of the sequence. We then aim to optimize a language model with parameters \(\theta\) with the capacity to generate intermediate thoughts (or rationales) such that its predictions of the remaining sequence, made after each generated rationale, become more likely.
Note that, in principle, this provides no advantage over an optimal language model that already correctly models the language's distribution over strings. Yet, in practice, extensive prior work has shown that language models benefit from intermediate rationales on reasoning tasks (Nye et al., 2021; Zelikman et al., 2022; Wei et al., 2022b). Some work has aimed to explain the effects of chain-of-thought reasoning, namely attributing it to "locality of experience" (Prystawski et al., 2024). More broadly, reasoning allows a model to decompose a challenging computation into smaller steps. In effect, we train the model to learn which decomposition and planning steps are effective in predicting future text. Also note that we formulate the objective as accurately predicting the remaining sequence, rather than only the next token. Once again, for an optimal LM these would be equivalent. However, we find that the non-myopic formulation leads to a more effective loss for learning rationales.
4 Quiet-STaR
4.1 Overview
Quiet-STaR operates with three main steps (Figure 1):
Parallel rationale generation (think, Subsection 4.2): In parallel across \(n\) tokens \(x_i\) in an input sequence \(x_{0:n}\), we generate \(r\) rationales of length \(t\): \(c_i = (c_{i1}, \ldots, c_{it})\), resulting in \(n \times r\) rationale candidates. We insert learned <|startofthought|> and <|endofthought|> tokens to mark each rationale's start and end.
Mixing post-rationale and base predictions (talk, Subsection 4.3): From the hidden state output after each rationale, we train a "mixing head" - a shallow MLP producing a weight that determines how much of the post-rationale next-token predicted logits should be incorporated compared to the base language model's predicted logits. This approach eases the distribution shift introduced by rationales early in finetuning.
Optimizing rationale generation (learn, Subsection 4.4): We optimize the rationale generation parameters (start/end tokens and LM weights) to increase the likelihood of rationales that make future text more probable. We use REINFORCE to provide a learning signal to rationales based on their impact on future-token prediction. To reduce variance, we apply a teacher-forcing trick to include in the loss the likelihood of predicting not only the token after the thought but also later tokens.
4.2 Parallel Generation
A key challenge in Quiet-STaR is efficiently generating rationales at each token position in the input sequence. Naively, this would require a separate forward pass for each token, which becomes computationally intractable for long sequences.
We allow for highly parallel generation by first observing that an inference pass of a language model produces a probability distribution over the next tokens for all input tokens. Naturally, this allows us to sample one next token from each token in the input. If one has generated a successor from each token, it is not possible to simply continue with the original sequence. For example, imagine predicting the next token after each token of "<bos> the cat sat": one might generate "yes orange saw down" - each successor by itself is a reasonable next token
Figure 3: Parallel Generation. By constructing an attention mask that allows all thought tokens to pay attention to themselves, all preceding thought tokens within the same thought, and the preceding text, we can generate continuations of all of the thoughts in parallel. Each inference call is used to generate one additional thought token for all text tokens.
to a prefix of the sequence, but the list of tokens is a set of "counterfactual" continuations of these prefixes. We can, however, leverage these continuations to generate hidden thoughts for each observed token.
To do this efficiently, we cache each forward pass and concatenate a diagonal attention mask to the previous attention mask: each generated token now attends to all of the tokens that were used to generate it, as well as to itself (but not to tokens on other "counterfactual" paths). Moreover, this parallelized next-token sampling procedure can be repeated arbitrarily many times (or at least, until one runs out of memory). We visualize this procedure in Figure 3 and highlight additional ways to make this algorithm faster in Appendix B.
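To make the mask structure concrete, the following is a minimal PyTorch sketch (our own, with illustrative function and variable names) of the attention mask at one generation step: the thought growing after text position i sees the causal text prefix up to i plus only its own previously generated thought tokens, laid out as diagonal blocks appended to the text mask.

import torch

def parallel_thought_mask(seq_len: int, thoughts_so_far: int) -> torch.Tensor:
    # Boolean mask of shape (seq_len, seq_len + seq_len * thoughts_so_far).
    # Row i is the thought token currently being generated after text position i.
    # Columns [0, seq_len) are the original text tokens; each later block of
    # seq_len columns holds the thought tokens produced at one earlier step.
    text_part = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal prefix
    diag = torch.eye(seq_len, dtype=torch.bool)
    # A diagonal block per prior step: position i only sees its own thought tokens,
    # never those on other "counterfactual" paths.
    thought_part = diag.repeat(1, thoughts_so_far)
    return torch.cat([text_part, thought_part], dim=1)

mask = parallel_thought_mask(seq_len=5, thoughts_so_far=3)
print(mask.shape)  # torch.Size([5, 20])

In practice this mask is concatenated onto the cached mask from the previous step, so each additional thought token for all text positions costs one inference call, as in Figure 3.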
4.3 "Mixing" (Residual) Heads 4.3 “混合”(残余)头
When starting with a pre-trained model, thoughts will initially be out of distribution, and hence harm language modeling performance. To smooth the transition to thinking, we introduce a learned interpolation between the LM predictions with and without thoughts. Given the end-of-thought token's hidden state and the hidden state of the original text token, the mixing head outputs a weight that determines the extent to which the post-thought prediction logits will be used. We use a shallow multi-layer perceptron for this head, outputting a scalar for each token. We include implementation details in Appendix A.
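A minimal sketch of such a mixing head in PyTorch; the depth and hidden width below are illustrative choices rather than the exact configuration from Appendix A.

import torch
import torch.nn as nn

class MixingHead(nn.Module):
    # Shallow MLP producing, per token, a scalar weight in [0, 1] that interpolates
    # between the base (no-thought) and post-thought next-token predictions.
    def __init__(self, hidden_size: int, mlp_size: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * hidden_size, mlp_size),  # concat of the two hidden states
            nn.ReLU(),
            nn.Linear(mlp_size, 1),
            nn.Sigmoid(),  # weight applied to the base predictions, as in Algorithm 1
        )

    def forward(self, h_thought: torch.Tensor, h_base: torch.Tensor) -> torch.Tensor:
        # h_thought: hidden state after the <|endofthought|> token
        # h_base: hidden state of the original text token
        return self.net(torch.cat([h_thought, h_base], dim=-1))

# The mixed prediction then follows Algorithm 1:
#   log_p_talk = w * log_p_base + (1 - w) * log_p_thought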
4.4 Optimizing Rationale Generation
4.4.1 Optimizing Start-of-Thought and End-of-Thought Tokens
The <|startofthought|> and <|endofthought|> tokens serve as learned meta-tokens that control the model's rationale generation. Optimizing the representation of these tokens, especially the <|startofthought|> token, is crucial but challenging due to the discrete nature of the rationale tokens. We initialize the start and end token embeddings to the embedding corresponding to the em dash, "—", which often appears in text data to denote a pause or thought. This leverages the language model's preexisting knowledge. In addition, to allow these embeddings to be optimized more quickly, we apply a (hyperparameter) weight to the gradients of these embeddings during the update step. Intuitively, the start thought token can be understood as putting the model into a "thinking mode" and the end thought token can be understood as telling the model when it's done thinking.
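A sketch of how the two meta-tokens could be added, initialized from the em-dash embedding, and given upweighted gradients; the model identifier and the weight value are illustrative, not values specified by this paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Add the two thought-delimiting meta-tokens and resize the embedding matrix.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|startofthought|>", "<|endofthought|>"]}
)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
start_id, end_id = tokenizer.convert_tokens_to_ids(["<|startofthought|>", "<|endofthought|>"])

# Initialize both new embeddings from the em dash, which already denotes a pause;
# average over pieces in case "—" is split into several subword tokens.
em_dash_ids = tokenizer.encode("—", add_special_tokens=False)
with torch.no_grad():
    init = emb[em_dash_ids].mean(dim=0)
    emb[start_id] = init.clone()
    emb[end_id] = init.clone()

# Upweight the gradient on the two new rows so they optimize faster.
# The factor is a hyperparameter; 100 here is purely illustrative.
META_GRAD_WEIGHT = 100.0

def scale_meta_token_grads(grad: torch.Tensor) -> torch.Tensor:
    grad = grad.clone()
    grad[start_id] *= META_GRAD_WEIGHT
    grad[end_id] *= META_GRAD_WEIGHT
    return grad

model.get_input_embeddings().weight.register_hook(scale_meta_token_grads)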
4.4.2 Non-myopic Scoring and Teacher-forcing
Because we do not expect thoughts to be useful in predicting every token, we would prefer the model's reward to depend less on the exact next word in the text following the thought and more on the following semantic content. There are two primary challenges here. First, unlike in typical language modeling with transformers, only the thoughts corresponding to
Figure 4: Forward Pass and Teacher Forcing. We visualize a single forward pass of our algorithm. Solid lines denote language model computation, while dashed lines indicate tokens are inserted via teacher forcing, and the mixer represents the mixing head. In particular, we visualize predicting three tokens ahead. Thought generation is shown in more detail in Figure 1 and Figure 3.
a given next-token prediction receive a gradient from that prediction - a consequence of our parallel sampling strategy. We could address this by adding loss terms for future tokens, sampling the tokens before them. However, this would result in much higher entropy for language modeling in general and lower-quality generated text, because it would train the LM to partially disregard its preceding tokens. Instead, we use the parallel attention mask to compute the log probabilities of the true next tokens, applying teacher forcing by assuming the model selected the correct next ground-truth token (as implicit in normal language modeling with transformers). Note that the loss for each future token also depends on a mixing weight computed from the end thought token and the previous observed token. The number of future tokens included in the loss is a hyper-parameter. We apply the same teacher-forcing technique to insert the start and end tokens. We visualize this procedure in Figure 4.
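A small sketch of the non-myopic scoring for a single position, assuming the ground-truth continuation has already been fed in via teacher forcing (so the logits at the n_true positions after the thought are available); function and variable names are ours.

import torch
import torch.nn.functional as F

def mixed_target_logprob(
    logits_base: torch.Tensor,     # (n_true, vocab) base predictions at positions j .. j+n_true-1
    logits_thought: torch.Tensor,  # (n_true, vocab) post-thought predictions at the same positions
    w: torch.Tensor,               # (n_true,) mixing weights from the mixing head
    targets: torch.Tensor,         # (n_true,) the true tokens X_{j+1 .. j+n_true}
) -> torch.Tensor:
    # Teacher forcing: the true tokens were used as inputs, and here we simply score them.
    lp_base = F.log_softmax(logits_base, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    lp_thought = F.log_softmax(logits_thought, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Mix the two log-probabilities per future token, then sum over the n_true tokens,
    # mirroring the log p^talk term in Algorithm 1.
    return (w * lp_base + (1.0 - w) * lp_thought).sum()

The number of supervised future tokens (n_true) is the hyper-parameter mentioned above; summing over several tokens is what makes the score non-myopic.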
4.4.3 Objective
We use REINFORCE to optimize the likelihoods of the rationales based on their usefulness: the log-likelihood of the \(n_{\text{true}}\) true next tokens \(X_{j+1:j+n_{\text{true}}+1}\) under the language model given previous observed tokens and a particular rationale (\(p_{j:j+n_{\text{true}}}^{\text{talk}}\) as shorthand for the mixed prediction probabilities after thinking, see Algorithm 1). To reduce variance, we generate multiple rationale continuations for each token in the input sequence (loosely inspired by TRICE, Phan et al. (2023)). We thus define the reward \(r_j\) for each rationale \(T_j\) as the difference between \(p_{j:j+n_{\text{true}}}^{\text{talk}}\) and the average across rationales for that token (\(\bar{p}_{j:j+n_{\text{true}}}^{\text{talk}}\)):

\(r_{j} = \log p_{j:j+n_{\text{true}}}^{\text{talk}}(X_{j+1:j+n_{\text{true}}+1}) - \log \bar{p}_{j:j+n_{\text{true}}}^{\text{talk}}(X_{j+1:j+n_{\text{true}}+1})\)
We then use this reward in a REINFORCE loss term to update the language model parameters \(\theta\) to increase the likelihood of rationales that perform better than the average:

\(\nabla_{\theta} \mathcal{L}_{j}^{\text{REINFORCE}} = -r_{j} \cdot \nabla_{\theta} \log p_{\theta}(T_{j} \mid [X_{:j}; <|\text{startofthought}|>])\)
We found it useful to exclude the negative reward from the REINFORCE loss term, as it led to more stable training, though it may introduce some bias.
This loss term encourages the model to generate rationales that improve its predictions of future tokens compared to the average prediction across all generated rationales for that token. The gradients from this loss are used to update both the LM parameters and the start-of-thought and end-of-thought token embeddings, with a (hyperparameter) weight applied to the gradients of the start-of-thought and end-of-thought token embeddings to accelerate their optimization. By iteratively optimizing these parameters, Quiet-STaR trains the model to generate more useful rationales throughout training. Lastly, we also include a log-likelihood loss, \(\mathcal{L}_{j}^{\text{NLL}}\), to ensure that the LM learns to optimize the talking heads and also receives a next-token prediction signal for the base LM head.
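A minimal sketch of the learn step for a single position j, assuming the mixed log-likelihoods of the true next tokens and the rationale log-probabilities have already been computed for each of the sampled rationales; the function and argument names are ours.

import torch

def quiet_star_position_loss(
    talk_logp: torch.Tensor,     # (n_thoughts,) log p^talk of the n_true true tokens, per rationale
    thought_logp: torch.Tensor,  # (n_thoughts,) log p_theta(T_j | X_:j, <|startofthought|>), per rationale
) -> torch.Tensor:
    # Reward: how much better each rationale did than the average rationale at this position.
    reward = (talk_logp - talk_logp.mean()).detach()

    # REINFORCE term; negative rewards are excluded (clamped to zero) for stability,
    # matching the 1[r_j > 0] factor in Algorithm 1.
    reinforce_loss = -(reward.clamp(min=0.0) * thought_logp).mean()

    # Standard language-modeling (NLL) term on the mixed prediction, which also
    # trains the mixing head and the base LM head.
    nll_loss = -talk_logp.mean()

    return nll_loss + reinforce_loss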
5 Experiments and Results
Intuitively, not all tokens require equal amounts of thought. For example, consider the sentence "the person is run-": although there is inevitably some probability of the token being something other than "ing", as a standalone sentence without context, additional thinking is unlikely to improve a well-trained model's prediction. Indeed, we conjecture that for most chunks of most online text, additional thought has little to no impact. In fact, early in our exploration we observed that Quiet-STaR does not benefit all tokens equally. Thus, we design our experiments to investigate whether our approach is useful in predicting tokens that do require thought. We evaluate 1) whether Quiet-STaR improves a language model's ability to directly predict answers in datasets that require reasoning; and, 2) the distribution of impacts resulting from thinking tokens. We conduct all of our experiments starting with the base version of Mistral 7B (Jiang et al., 2023).
We perform most of our experiments by training on OpenWebMath (Paster et al., 2023), a crawl that emphasizes more technical webpages. We selected OpenWebMath because we anticipated that it would have a higher density of tokens that benefit from reasoning, which our experiments support. We also evaluate Quiet-STaR on C4 (Raffel et al., 2020), a widely used LM pretraining corpus with more diverse text, and again show significant albeit smaller benefits.
5.1 Downstream Performance
In this subsection, we evaluate the extent to which Quiet-STaR improves the zero-shot reasoning capabilities of the language model on CommonsenseQA (Talmor et al., 2018) and GSM8K (Cobbe et al., 2021). On CommonsenseQA, we find that Quiet-STaR improves performance by 10.9% compared to the base language model. As shown in Figure 2, this improvement consistently increases with the number of tokens used in the model's rationales, indicating that more thorough reasoning through the thought tokens is translating to better direct question-answering performance. Similarly, on GSM8K, Quiet-STaR results in a 5.0% boost over the base model, and once again, performance scales with the length of the rationales generated during Quiet-STaR training. For reference, in Figure 2, we include a baseline corresponding to training the same model on the same dataset without thought tokens. We observe that in multiple curves, performance appears to eventually deteriorate - we anticipate that this is because we are not training on these downstream tasks, so the roles of the thought tokens may change over time. We also find a benefit of our non-myopic objective, which we discuss in Appendix D.
We find that training with Quiet-STaR on C4 (Raffel et al., 2020) also improves performance on GSM8K (5.9% → 8.1%) and CommonsenseQA (36.3% → 42.6%), but by a smaller margin. Specifically, for our C4 evaluation, we train Mistral 7B with 16 thought tokens and 4 true tokens ahead and otherwise the same setup.
We can compare these improvements to those offered by pause tokens (Goyal et al., 2023), which can be seen as a constrained version of Quiet-STaR where each token is represented by two tokens and the second "pause" token acts as the entirety of the thought. In particular, our setup is most comparable to their pause token fine-tuning, as we also finetune a pretrained model. Their results indicate that pause token fine-tuning also provides minor gains over the base model on CommonsenseQA: they observed an improvement from 26.9% to 28.8%. On GSM8K, Goyal et al. (2023) found that pause token fine-tuning harms performance. Moreover, on both tasks (and the majority of their evaluated tasks), they observed that additional thought tokens harmed performance, and they discuss the "lukewarm effect of pause-finetuning a standard-pretrained model" (Goyal et al., 2023). This suggests that allowing the model to generate multi-token rationales leads to more effective reasoning compared to the single-token "pauses". Note, however, that unlike Goyal et al. (2023), we do not fine-tune on the downstream tasks.
Overall, these downstream results validate that training a language model to predict the subtext between the lines of general text data can substantially improve its reasoning
Figure 5: Zero-shot performance on Quiet-STaR applied to chain-of-thought on GSM8K. We visualize how using a Quiet-STaR trained Mistral model can improve chain-of-thought performance. We use an 8-thought-token-trained model and use its internal thoughts to improve the tokens in a zero-shot chain-of-thought (Kojima et al., 2022).
capabilities, even on datasets it was not explicitly trained on. The fact that longer rationales consistently lead to better outcomes, and that Quiet-STaR outperforms the constrained pause token approach, supports the notion that Quiet-STaR is successfully teaching the model to leverage its own generated thoughts to reason more thoroughly about the input.
5.2 Improvement Distribution
As visualized in Appendix Figure 7, we find that on average there is little improvement in the LM's ability to predict arbitrary tokens. But, when we visualize the distribution of relative improvements, there is a disproportionate improvement on more difficult tokens. This reflects the idea that some text tokens are substantially harder and benefit more from careful thought.
In Appendix Figure 8, we aim to provide some insight into the kinds of tokens where the improvements occur. Namely, while thinking appears to help for many tokens in the example, inspection suggests it disproportionately helps to predict tokens where recalling relevant information is useful, such as the name of an applicable theorem or the start of the next step in a proof. Notably, this would align well with the framing proposed by Prystawski et al. (2024).
5.3 Quiet-STaR and Chain-of-Thought
While there are natural parallels between chain-of-thought prompting and our approach, they are orthogonal and complementary. In zero-shot chain-of-thought, a user actively prompts the model to think 'out loud', otherwise using its ordinary production distribution (Kojima et al., 2022); Quiet-STaR instead allows a model to think quietly at every token, with a distribution trained to be useful. We investigate using silent Quiet-STaR rationales while generating explicit CoT reasoning. Because our goal is generalist reasoning that requires no task-specific input at all, we used a zero-shot prompt ("Let's think step by step.") without in-context examples. Our experiments indicate that internal rationales allow the model to generate more structured and coherent chains of thought, shown in Appendix E and visualized in Figure 5. The majority vote accuracy over 8 samples (cot-maj@8) increases from 40.6% to 47.7% with Quiet-STaR, as evaluated on a sample of 128 GSM8K test items. Note that each chain-of-thought solution is sampled with temperature 0.7.
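For concreteness, the following is a sketch of the cot-maj@8 metric as described above: sample 8 chain-of-thought completions at temperature 0.7, extract a final answer from each, and take the most common one. The answer-extraction rule (last number in the completion) and the `generate` callable are our own illustrative assumptions, not the paper's evaluation code.

from collections import Counter
import re

def cot_majority_vote(question: str, generate, n_samples: int = 8, temperature: float = 0.7) -> str:
    # `generate` is any callable mapping a prompt to sampled text, e.g. a wrapper
    # around the Quiet-STaR model's sampling loop.
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=temperature)
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        if numbers:
            answers.append(numbers[-1])  # treat the last number as the final answer
    return Counter(answers).most_common(1)[0][0] if answers else ""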
5.4 Examples
While there is no explicit regularization in Quiet-STaR for thoughts to be human-interpretable, they are generated from the same transformer trained to model language, and hence are likely to be at least partially understandable. We discuss why this design choice benefits the training stability in Appendix I. For reference, we include examples of thoughts generated that were helpful to the model in predicting future tokens in OpenWebMath. First, in one case, recalling that one should start with magnesium to produce magnesium nitride allows it to better predict that the first step of the procedure involves heating magnesium.
'<s> # Magnesium reacts with nitrogen to form magnesium nitride. The chemical
formula for this reaction is Mg+N_2-> MgN_2. What is the product, or what
are the products, of this reaction?\n\nJan 12, 2016\n\nThe formula for
magnesium nitride is $M {g}_{3} {N}_{2}$.\n\n#### Explanation:\n\nAs do
many active metals, magnesium nitride can be<|startofthought|> 1 --, so the
equation of the reaction that forms magnesium nitride is\n\n$Mg + N_2 \\to
<|endofthought|> formed by heating the metal (fier'
In some cases, the most useful thoughts appear to be near-continuations that correspond more closely to the target text, e.g.,
An integer $n$ is odd if $n = 2k+1$ for some integer $k$.\n\nTo prove that $A =
B$, we must show that $A \\subseteq B$ and $B \\subseteq A$. The first of
these tends to<|startthought|> in some sense - to be the more difficult<|
endthought|> trickiest for students
Lastly, we include an example from answering CommonsenseQA. Notably, this thought occurs while reading the question and hence was not used to predict the final answer.
'<s> Q: Talking to the same person about the same thing over and over again is
\(<\mid s t a r t o f\) thought \(\mid>\backslash n \backslash n(a)\) a one-to-one correlation \(\backslash n \backslash n(b)\) a one-to<|
endofthought|> something someone can what?'
6 Limitations
This work proposes a new framework for learning to reason, and in doing so explores solutions to a variety of meta-learning challenges. However, to solve these challenges, certain simplifications were necessary. For example, it would be valuable to understand whether these techniques work when a model is trained from scratch. We have also only applied Quiet-STaR to a 7 billion parameter model, albeit a powerful one. The same techniques applied to a better model would likely yield disproportionately better results, as has often been observed for gains from reasoning (Wei et al., 2022a).
Quiet-STaR results in a substantial overhead, generating many tokens before generating every additional token. (See Appendix C for compute adjusted performance results.) However, this can also be seen as an advantage: typically, a language model can generate the next token based on the current context, and while there are techniques to improve sampling quality, there is no general way to leverage additional compute to enhance next-token prediction. In the current implementation we do not support dynamically predicting when to generate, or end, a rationale. However, this would be a natural extension. For instance, if the mixing head was a prediction from the base language model, before any thought, rather than after the thought, one could apply a threshold to prevent generating thoughts that would not be incorporated. We expect that this is a more difficult task, as predicting the usefulness of a thought is simpler when one has already generated the thought.
7 Conclusion
Quiet-STaR represents a step towards language models that can learn to reason in a general and scalable way. By training on the rich spectrum of reasoning tasks implicit in diverse web text, rather than narrowly specializing for particular datasets, Quiet-STaR points the way to more robust and adaptable language models. Our results demonstrate the promise of this approach, with Quiet-STaR improving downstream reasoning performance while generating qualitatively meaningful rationales. We believe this also opens many potential future directions - for example, one may aim to ensemble thoughts in order to further improve the predictions for future tokens. Moreover, if the language model can predict when thought will be useful, for example by putting the mixing head before the prediction, then the predicted mixing weight could be used to dynamically allocate compute during generation. Future work can build on these insights to further close the gap between language model and human-like reasoning capabilities.
Ethics Statement
This work raises some important ethical questions, many of which also apply to STaR. For example, it is impossible to know that the reasoning expressed by the model in language accurately represents the internal processing of the model (i.e., faithfulness). In addition, regardless of faithfulness, there are no safeguards against harmful or biased reasoning patterns if the model finds them useful. Relatedly, we note that CommonsenseQA is known to have many biased questions and low-quality answers (Geva et al., 2019), but we use it in line with prior work (Zelikman et al., 2022; Goyal et al., 2023). Thus, aside from improving language modeling, it is unclear in what capacity the rationales themselves should be used.
Acknowledgements
We particularly thank Xindi Wu, Michael Li, and Qian Huang for their helpful and detailed comments, as well as Xuechen Li, Jan-Philipp Fränken, Yuhuai Wu, Gabriel Poesia, Winnie Xu, Omar Shaikh, Fan-Yun Sun, Joy He-Yueya, Omar Khattab, and William Yin for useful discussions. In addition, we would like to acknowledge that this work was supported by NSF Grant #2302701.
References
Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems, 30, 2017. 托马斯·安东尼,郑天,和大卫·巴伯。利用深度学习和树搜索进行快速与慢速思考。《神经信息处理系统进展》,30,2017。
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instructionfinetuned language models. arXiv preprint arXiv:2210.11416, 2022. 郑亨元, 侯乐, 谢恩·朗普雷, 巴雷特·佐夫, 田毅, 威廉·费杜斯, 李云轩, 王学志, 穆斯塔法·德赫加尼, 西达尔塔·布拉赫马, 等. 扩展指令微调语言模型. arXiv 预印本 arXiv:2210.11416, 2022.
"Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems. arXiv, 2021. -eprint: 2110.14168 卡尔·科贝,维尼特·科萨拉朱,穆罕默德·巴瓦里安,马克·陈,许宇俊,卢卡斯·凯泽,马蒂亚斯·普拉普特,杰瑞·特沃雷克,雅各布·希尔顿,中野礼一郎,克里斯托弗·赫塞,约翰·舒尔曼。训练验证者解决数学文字问题。arXiv,2021。-eprint: 2110.14168
Kanishk Gandhi, Dorsa Sadigh, and Noah D Goodman. Strategic reasoning with language models. arXiv preprint arXiv:2305.19165, 2023. 卡尼什·甘地,多尔萨·萨迪赫,诺亚·D·古德曼。使用语言模型进行战略推理。arXiv 预印本 arXiv:2305.19165,2023。
Mor Geva, Yoav Goldberg, and Jonathan Berant. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. arXiv preprint arXiv:1908.07898, 2019. Mor Geva, Yoav Goldberg 和 Jonathan Berant. 我们是在建模任务还是标注者?对自然语言理解数据集中标注者偏见的调查. arXiv 预印本 arXiv:1908.07898, 2019.
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2023. Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, 和 Vaishnavh Nagarajan. 三思而后言:使用暂停标记训练语言模型。arXiv 预印本 arXiv:2310.02226, 2023.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SaRj2ka1XZ3.
John Hewitt, John Thickstun, Christopher D Manning, and Percy Liang. Backpack language models. arXiv preprint arXiv:2305.16765, 2023.
Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071, 2022.
Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A Saurous. Training chain-of-thought via latent-variable inference. Advances in Neural Information Processing Systems, 36, 2024.
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-STaR: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457, 2024.
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024, 2022.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2022. URL https://arxiv.org/abs/2205.11916.
Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329, 2022.
Jack Lanchantin, Shubham Toshniwal, Jason Weston, Sainbayar Sukhbaatar, et al. Learning to reason and memorize with self-notes. Advances in Neural Information Processing Systems, 36, 2024.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843-3857, 2022.
Michael Y Li, Emily B Fox, and Noah D Goodman. Automated statistical model discovery with language models. arXiv preprint arXiv:2402.17879, 2024.
Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, et al. Explanations from large language models make small reasoners better. arXiv preprint arXiv:2210.06726, 2022.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
Yucheng Li, Bo Dong, Chenghua Lin, and Frank Guerin. Compressing context to enhance inference efficiency of large language models. arXiv preprint arXiv:2310.06201, 2023.
Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, and Asli Celikyilmaz. Crystal: Introspective reasoners reinforced with self-feedback. arXiv preprint arXiv:2310.04921, 2023.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback, 2023.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937. PMLR, 2016.
Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36, 2024.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
Alexander Pan, Erik Jones, Meena Jagadeesan, and Jacob Steinhardt. Feedback loops with language models drive in-context reward hacking. arXiv preprint arXiv:2402.06627, 2024.
Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. OpenWebMath: An open dataset of high-quality mathematical web text. arXiv preprint arXiv:2310.06786, 2023.
Du Phan, Matthew Douglas Hoffman, Sholto Douglas, Tuan Anh Le, Aaron T Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A Saurous, et al. Training chain-of-thought via latent-variable inference. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, and Noah D Goodman. Certified reasoning with language models. arXiv preprint arXiv:2306.04031, 2023.
Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. CoRR, abs/2009.03393, 2020. URL https://arxiv.org/abs/2009.03393.
Ben Prystawski, Michael Li, and Noah Goodman. Why think step by step? Reasoning emerges from the locality of experience. Advances in Neural Information Processing Systems, 36, 2024.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.
Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932-4942, 2019.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Tal Schuster, Ashwin Kalyan, Alex Polozov, and Adam Tauman Kalai. Programming puzzles. In Thirty-fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=fe_hCc4RBrg.
Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366, 2023.
Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unsupervised commonsense question answering with self-talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4615-4629, 2020.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. arXiv preprint arXiv:2310.15213, 2023.
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. Neural Information Processing Systems (NeurIPS 2022) Workshop on MATH-AI, 2022.
Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D Goodman. Hypothesis search: Inductive reasoning with language models. arXiv preprint arXiv:2309.05660, 2023.
Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting. arXiv preprint arXiv:2402.10200, 2024.
Lucas Weber, Jaap Jumelet, Elia Bruni, and Dieuwke Hupkes. Language modelling as a multi-task problem. arXiv preprint arXiv:2101.11287, 2021.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021a.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021b.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, October 2022a. URL http://arxiv.org/abs/2206.07682. arXiv:2206.07682 [cs].
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models, 2022b. URL https://arxiv.org/abs/2201.11903.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. International Conference on Learning Representations (ICLR 2023), 2022.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476-15488, 2022.
Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, and Nick Haber. Parsel: Algorithmic reasoning with language models by composing decompositions, 2023a.
Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Tauman Kalai. Self-taught optimizer (STOP): Recursively self-improving code generation. arXiv preprint arXiv:2310.02304, 2023b.
Hugh Zhang and David C Parkes. Chain-of-thought reasoning is a policy improvement operator. arXiv preprint arXiv:2309.08589, 2023.
Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, and Uri Alon. In-context principle learning from mistakes. arXiv preprint arXiv:2402.05403, 2024.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
Wenting Zhao, Justin T Chiu, Claire Cardie, and Alexander M Rush. Hop, union, generate: Explainable multi-hop reasoning without rationale supervision. arXiv preprint arXiv:2305.14237, 2023.
Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, and Hanie Sedghi. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022.
Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. Large language models can learn rules. arXiv preprint arXiv:2310.07064, 2023.
Appendix
A Hyperparameter Choices
Optimization and Evaluation For optimization, we use the AdamW optimizer with a warmup of 20 steps, a learning rate of 1e-6, a weight decay of 0.001, and a batch size of 8 (along with any necessary gradient accumulation to keep this fixed across runs). Moreover, our <|startofthought|> and <|endofthought|> embedding gradient weight is 1e2 and our policy weight is 1e6. We sample with temperature $T=1$ during training and use greedy decoding for the thoughts during evaluation. We treat our samples as importance samples by computing the REINFORCE loss at temperature $T=3$. Because we do not prompt the model with any examples, we directly compute the probability of the correct answer, conditioned on generating an answer - for example, for multiple choice questions between $A \cdots E$, we compute the accuracy over the logits for tokens corresponding to $A \cdots E$. Lastly, for our training, we select a random span of 256 tokens from each sample (or pad if there are fewer than 256 tokens).
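To make the answer-scoring procedure concrete, the following is a minimal sketch in PyTorch assuming a Hugging Face-style causal LM and tokenizer; the helper name and the treatment of each answer letter as a single token are simplifying assumptions for illustration, not the evaluation code used in the paper.

```python
import torch

def multiple_choice_accuracy(model, tokenizer, prompt, gold, options=("A", "B", "C", "D", "E")):
    # Score a multiple-choice question zero-shot: restrict the next-token
    # distribution to the answer-letter tokens and pick the most likely one.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]   # logits for the next token
    option_ids = [tokenizer.convert_tokens_to_ids(o) for o in options]
    prediction = options[int(next_token_logits[option_ids].argmax())]
    return float(prediction == gold)
```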
Mixing Head For our mixing head, we use a three-layer MLP with ReLU activations, taking in a vector of two times the size of the hidden state of the language model (as we concatenate the two predictions to determine their weights), and outputting a scalar. This scalar is then used to weight the logits from the LM head with and without thinking to make a prediction from a given token.
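As a rough illustration of this interpolation, here is a minimal sketch of such a mixing head in PyTorch; the intermediate layer widths and the sigmoid used to keep the weight in [0, 1] are assumptions for illustration rather than details stated above.

```python
import torch
import torch.nn as nn

class MixingHead(nn.Module):
    """Three-layer MLP mapping the concatenated with-thought and
    without-thought hidden states to a scalar interpolation weight."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_base, hidden_thought, logits_base, logits_thought):
        # One scalar weight per position, applied to the LM-head logits
        # computed with and without the preceding thought.
        w = torch.sigmoid(self.mlp(torch.cat([hidden_base, hidden_thought], dim=-1)))
        return w * logits_thought + (1.0 - w) * logits_base
```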
Computation We train all of our models on a single node of eight 80GB H100s.
B Faster Parallel Sampling
In this section, we highlight some simple ways to further accelerate the parallel generation algorithm. For example, note that one can reduce the attention's memory cost by computing the diagonal attention simply as elementwise (rather than pairwise) dot products. That is, given two input embedding sequences of shapes $(b, t, l, d)$ and $(b, 1, l, d)$, where $t$ is the number of timesteps ahead, $b$ is batch size, $l$ is sequence length, and $d$ is embedding dimension, we do not need to compute their pairwise attention of shape $(b, t, l, l)$; we only need to compute the attention for the paired elements along the diagonal, of shape $(b, t, l)$. Additionally, to avoid generating continuations for all of the tokens (for example, if one wanted to apply a value function to determine where thoughts would be most useful), one can index into this generated attention mask. Notably, however, this also requires manipulation of the other inputs during the forward pass, such as the positional embeddings.
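The following is a minimal sketch of this elementwise (diagonal) score computation in PyTorch; the function name and the 1/sqrt(d) scaling are illustrative assumptions.

```python
import torch

def diagonal_attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q: (b, t, l, d) queries for t lookahead steps; k: (b, 1, l, d) keys.

    Rather than the full pairwise scores of shape (b, t, l, l), compute only
    the paired (diagonal) scores of shape (b, t, l): an elementwise dot
    product, with k broadcast across the lookahead dimension t.
    """
    d = q.shape[-1]
    return (q * k).sum(dim=-1) / d ** 0.5  # shape (b, t, l)
```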
C Compute-Adjusted Plots
We also provide a version of Figure 2 in which performance is normalized by the number of thought and talk tokens used for training.
D Measuring the Impact of Multiple Thoughts Per Sequence and Multiple Tokens Ahead
We perform a simple ablation on our 12-thought-token-4-ahead baseline, namely asking whether sampling multiple thoughts per sequence is necessary. We find that although simply computing the reward as the difference between the losses with and without thought proves to be a strong baseline, using multiple thoughts consistently outperformed it (by roughly 0.5% on GSM8K generalization and 3% on CommonsenseQA generalization). However, the exact number of thoughts had little impact: varying between 2, 3, and 4 thoughts per sequence appeared to result in a consistent improvement with additional thoughts, but additional thoughts per sequence beyond two improved performance by only 0.1-0.3% on both GSM8K and CommonsenseQA. Similarly, we found that predicting more than one token ahead helped by 0.3% on GSM8K and 3.1% on CommonsenseQA, but with 12 thought tokens, we did not find that additional tokens ahead, beyond two, improved performance. However, qualitatively, we found that the rationales appeared more coherent with additional tokens-ahead of supervision. This suggests a need for more specific evaluations of generation quality in future work.
Figure 6: Compute-Normalized Generalization Results. We visualize the performance curves normalized by the number of inference calls used.
E Rationale-Augmented Rationales
Because the language model can insert thoughts anywhere, we perform a preliminary study where we investigate the reasoning generated by the language model when it has access to internal rationales while it generates external rationales (i.e. chains-of-thought) to solve a problem. Because these chains of thought are themselves generated by chains of thought, one might refer to this as a net-of-thought (or NoT). We compare five solutions from the model to a fixed random question in GSM8K, looking at outputs of the base Mistral 7B, the outputs of Mistral fine-tuned for the same number of steps on OpenWebMath (Paster et al., 2023), and the rationales of our best Quiet-STaR model. We sample from all of them directly, at temperature $t=1$.
E.1 Base Mistral 7B
First Response:
A: Let's think step by step. How many fresh eggs remains after the three her eat for breakfast and four goes into the muffin mix for her friends. We can calculate it by 16-3-4. The amount equals 9 fresh eggs. So, the dollars it makes daily equals what multiplied by $2. 12 × 2. The answer is $24.
Second Response:
66. Janet sells an average of 12 fresh duck eggs daily on the farmers' market. If she sells them for $2 per egg how much does she make per week, assuming she sells at the farmers' market most every day? A: If Janet sells on average 12 fresh duck eggs daily on the farmers' market. This is equal to 12 × 7, so 12 × 7 = 84. The total amount she would then make would be 84 × 2 dollars, or 168 per week. So Janet makes $168 selling duck eggs per week.
Third Response:
Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step.
How many eggs does she have left every day?
This is important because it will help us find how many eggs she sells each day.
What can we do to answer the question?
If we write the relationship between how many duck eggs Janet has left, how many she eats, and how many she bakes for the muffins, then we can see how many eggs she has left to sell at the farmers' market.
Let's write an equation to solve our dilemma.
Subtract an equation for eggs eaten and an equation for eggs baked to find the eggs remaining. Let's see what we can find.
We can see that Janet has 6 eggs leftover each day. $6 × $2 = $12 per day. Therefore Janet receives an additional $12 per day for her eggs at the farmer's market.
Remember, this is just for one day, so now she makes $12 per day in additional income, because she sells her eggs at the farmer's market each day.
#### Answer
Fourth Response:
Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step. First, you want to find out the number of eggs a total that Janet collects each day. Take this many dollars a three and subtract from the money she gets per egg, which is $2. The answer to that question is $5.
Fifth Response
Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step.
For the first step, let's think of the day as a whole, using some variables to keep track of the numbers of eggs. Let $n$ = the number of eggs that Janet's ducks lay per day. Then the number of eggs that she eats per day will be $(3)$ =.
Next, she makes $m$ = muffins. For each muffin, she uses $4$ = eggs, so we can take from the eggs Janet's ducks lay to represent the number she uses: $4(m)$ =. Now, to get her back to our starting number of eggs, we need to add these eggs back on, giving us . The eggs Janet sells each day brings her $2$ $(n)$ =. This brings us to our next step.
Now we will break down all the eggs from our starting number, $n$ =, into the pieces that we set up previously. For the beginning number, we have: This brings us back to our starting number, and we can simplify this expression.
Moving from the top line to the second line, we multiply the numbers within each parenthetical group. Moving from the second line to the third line, we add the products together to find the total. Finally, we can distribute and determine the final product: $= 360.$
So the total number of eggs that Janet's ducks lay per day is $360$. Since Janet sells each egg she has for $2$, her total earnings at the farmers' market is: $\boxed{720}$.
E.2 OpenWebMath-Trained Mistral 7B
First Response:
<s> Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step. How much does she make every day? Well per the question, that is $2 per a duck egg. Now, how many duck eggs per day does she sell? Well, that's 16 minus the eggs she eats minus the eggs she uses to make muffins.
She eats 3 eggs every morning, so $16-3=13$.
She makes 4 muffins, which uses 4 eggs. $16-3-4=5$.
Therefore, the number of duck eggs she is selling is $5$.
Clearly, 1 duck egg costs $2. Therefore, the total amount she makes selling the eggs is $10$ $2$. Now, this can be written as a fraction as, "ten two-dollars," so $10 2 bucks. $
Converting to a fraction, or "a tenth of ten," so $102 - bucks$ = $1$ $10 - bucks$
[reveal-answer q="fs-id1165135856725"]Show Solution[/reveal-answer]
[hidden-answer a="fs-id1165135856725"]
Second Response:
<s> Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step. On what basis can we answer the question affer considering the question? What do we know given and how can we use what we're given to arrive at an answer. 3 is given, 4 is given, 16 is given, does anyone have any thoughts?
A: Well 16 minus 3 minus 4 equals 9, and then I subtract 1 from 9 and that is 4 extra so five more eggs.
M: And we know that 2 times 5 is how much?
A: Ten
M: Yes.
A: If she eats 3 duck eggs for breakfast, in the morning equals (3 × 2 = 6). If she eat 4 duck eggs baking muffin for her friends equals (4 × 2 = 8), so total of money and that she got is $6 plus $8 which equals 14. The weight gain is 5 because it is 5 is that what you put up there so 5 times 2 ducks and subtract that from that from 16 and you got 7 no what did we write down? Are you able to figure out now what the question is, and what is what the question really is. You got 14 the weight loss you, you subtracted that from 16 which equals 2, $14 minus 2, remainder is what?
A: 12.
M: The question is what is Janets ducks lay 16 eggs, she eats 12 for sale. You subtract 3 plus 4 minus 5 so 5 when you eat the muffin equals 5, times two, the 12, minus the money so 5 times 2, $-12$. So (2 × 9 = 18) minus $10, so 4 more eggs… Still times $2, total comes to to $20.
A: I understand there is the ones day one 3 plus 4 eats who makes and so on and so on.
M: But maybe in order to answer it correctly you have to look a little close. Typically in the procedures you have that they give you first, I can tell you how here we go.
M: Now if you increase the 16, now you have that you would need to determine 16 minus 3 minus 4 and that adds up to the remainder, which is 9. Then subtract 1 and but you have 5. The next answer then, which you do is now 5 times 2, which is 10.
M: Which is $10, and then 14, which is is the remainder
M: Is $6 and $8 and that is a total of $14, and now that you subtract 2, and that means that you now have 16 minus 5, which is 11.
M: The next thing is to the yield for the second situation you have got 16 minus 3 plus 4 and that comes to 9. This is very easy to do instantaneously even the first part, 16, 3, 4 and 9, and you happen to know what lives, and you just count and you have five and five then go ahead. 5 times 2 comes to 10, so $10 then so actually you have just added 4.5, now your total is 16 plus $2. But now to this is $10 plus $2 is $12, so how much you sell from 16 and then minus 12 comes to 4.
M: So what is the remainder of 16 minus 3 plus 4 minus 5? Matters you see three is missing. Four you have. Five eggs. So 5 times 2, so $10 remains at $10, so reduce, so what happens now to what happened was added with subtracted the next the question is 16 minuses 3 plus 4, and presumably you've done this before, if not then you add up 16 minus 3 plus 4 gives us 9 and what is the remainder after 5? So what is 5 times 2? So 10. 2, 3, plus 2 = 5, and so your answer so far is 16 minus 3 plus 4, and give us the basket minus 2 remains 9 and now you have to determine now what is the total for the second part? You have
Third Response:
Fourth Response:
<s> Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step. First, Jane's ducks lay 16 eggs perday. So there are 16 eggs Monday, 16 Tuesday, 16 Wednesday, 16 Thursday, 16 Friday, and so on. Second, Janet eats 3 of the 16 eggs that the ducklay ap day for breakfast. She only eats the duck eggs, not her baked goods (muffins) or other food. Therefore, she eats 3 Monday, 3 Tuesday, 3 Wednesday, 3 Thursday, and so on. Third, she only bakes the first 4 non-eaten eggs into muffins. Therefore$ she sells the remaining 12 noneaten eggs Monday, 12 Tuesday, 12 Wednesday, 12 Thursday, and so on. Fourth, Janet sells duck eggs Monday, Tuesday, Wednesday, Thursday, Friday at the 2 each. Third, she eats 3 Monday, 3 Tuesday, 3 Wednesday, 3 Thursday, Friday at the farmers' market. Lastly, she makes 12 Monday, 12 Tuesday, 12 Wednesday, 12 Thursday, and so on in dollars every day at the farmers' market. The worksheet below summarizes these steps. The number formed by the intersection of corresponding row and column terms will be the correct answer. The product of the (3, J) term and the (J, 8) term is 1,150. So, Janet makes $1,150 per day!
Note: Step 3 and Step 8 are repeating, but the (3, 7) term in Step 8 is not the same as the (3, 7) term in Step 3.)
Fifth Response
<s> Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step.
What do we know?
Janet's ducks lay 16 eggs per day.
She eats three for breakfast every morning and bakes muffins for her friends every day with four.
From these two facts, we can conclude at least that Janet needs at least seven eggs for consumption per day.
And what do we need to find exactly?
We need to find how much in dollars Janet makes every day at the farmers' market.
Let's write an equation and solve it to find the answer. We know: - \(16-3-4=9\)
So our daily profit is 9 times $2, or \(9 \times \$2 = \$18\)
Hence her profits at market on a given day are \($18\).
Answer: \($18\).
E.3 Quiet-STaR-Trained Mistral 7B
First Response:
<s> Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step.
The ducks lay 16 eggs per day.
This is the total number of eggs.
3 of the eggs are eaten for breakfast.
This is the number of eggs that are eaten for breakfast.
4 of the eggs are used to make muffins.
This is the number of eggs that are used to make muffins.
16-3-4=9
9 is the number of eggs that are sold at the farmers' market.
9 of the eggs are sold at the farmers' market.
$2
The price of each egg is $2.
$2 × 9 = 18
18 is the total price of the eggs.
The total price of the eggs is $18.
Second Response:
A: Let's think step by step.
The ducks lay 16 eggs per day.
She eats 3 for breakfast every morning.
She bakes muffins for her friends every day with 4.
She sells the remainder at the farmers' market daily for $2 per fresh duck egg.
The number of eggs she sells at the farmers' market is 16-3-4.
The number of eggs she sells at the farmers' market is 16 - 3-4.
The number of eggs she sells at the farmers' market is 16-3-4.
The number of eggs she sells at the farmers' market is 16-3-4.
The number of eggs she sells at the farmers' market is 16-3-4.
The number of eggs she sells at the farmers' market is 16-3-4.
The number of
Third Response:
<s> Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step.
The ducks lay 16 eggs per day.
This is the total number of eggs.
3 is the number of eggs that Janet eats for breakfast every morning.
4 is the number of eggs that she uses in the muffins that she bakes for her friends every day.
16-3-4=9
9 is the number of eggs that she sells at the farmers' market daily for $2 per fresh duck egg.
9$2=9$2=9$2=9$2=18
So, 18 is the amount in dollars that she makes every day at the farmers' market.
The amount in dollars that she makes every day at the farmers' market is 18.
A: 18</s>
Fourth Response:
A: Let's think step by step.
The number of eggs that Janet's ducks lay per day is 16.
Of these, she eats 3 for breakfast every morning.
This is 3.
Of these, she bakes muffins for her friends with 4.
This is 4.
The number of eggs that are left over is 16-3-4.
This is 16-7.
This is 9.
The number of eggs that she sells at the farmers' market daily for $2 per egg is 9.
The total in dollars that she makes every day at the farmers' market is 9$2.
This is 92.
This is 18.
The total in dollars that she makes every day at the farmers' market is 18.
A: 18
Fifth Response
A: Let's think step by step.
The ducks lay 16 eggs per day.
This is the total number of eggs.
3+4+16=23
3 eggs are eaten for breakfast every morning.
4 eggs are used to make muffins for her friends every day.
16 eggs are sold at the farmers' market for $2 each every day.
The 16 eggs that are sold at the farmers' market are the remaining eggs.
16-3-4=9
9 eggs are sold at the farmers' market for $2 each every day.
9(2)=18
18 is the total in dollars that she makes every day at the farmers' market.
18 is the answer.
A: 18
F Additional Works on Learning to Reason
For completeness, we highlight that many other works have explored teaching language models to reason (often with algorithms similar to or inspired by STaR), either from their own rationales, from interaction with an environment, or from a hand-constructed dataset. For example, works explore this in the context of multihop question answering (Zhao et al., 2023), math (Luo et al., 2023; Uesato et al., 2022), and machine translation (Gulcehre et al., 2023). Several works investigate teaching language model agents to reason in planning (Chen et al., 2023; Gandhi et al., 2023; Qiao et al., 2024), or to use specific tools or memory (Yao et al., 2022; Lanchantin et al., 2024; Schick et al., 2024), while others investigate how one may distill the reasoning from a large language model into a smaller language model (Ho et al., 2022; Li et al., 2022; Hsieh et al., 2023). Notably, however, Pan et al. (2024) demonstrates that these feedback loops may result in reward hacking. Zelikman et al. (2023b) shows how a bootstrapping loop can be implemented where a model repeatedly improves a code-improver using the same code-improver, and Haluptzok et al. (2023) shows how language models can bootstrap their programming ability with programming puzzles (Schuster et al., 2021). Other works have employed a similar strategy for using language models to solve inductive reasoning tasks or to model real-world systems (Wang et al., 2023; Qiu et al., 2023; Zhu et al., 2023; Li et al., 2024).
Some works have investigated how models can learn from their reasoning mistakes in-context (Shinn et al., 2023; Madaan et al., 2023; Zhang et al., 2024; Liu et al., 2023). Many studies have also focused on the ability of LMs to learn from in-context reasoning examples (Lampinen et al., 2022; Zhou et al., 2022); correspondingly, Khattab et al. (2022) and Khattab et al. (2023) show how the sets of examples used to prompt a model to reason can be optimized in the context of a multi-step reasoning pipeline. Furthermore, Zhang et al. (2022) demonstrated that one can improve zero-shot question-answering in language models by using a variety of zero-shot prompts for reasoning.
G Distribution of Improvements
We also visualize the distribution of improvements across tokens in the training set.
Figure 7: Distribution of changes in log probability. We visualize the distribution of changes in log probability resulting from the generated thoughts across the evaluation dataset. We visualize the density of changes in log probability relative to the LM without thoughts, colored based on log density. The distribution is skewed, with most tokens unaffected, while a small fraction of hard tokens see substantial improvements from the thoughts. This matches our intuition that most tokens in general web text do not require significant reasoning to predict, but thoughts are disproportionately beneficial for challenging tokens.
H Contribution Visualization
[Figure 8 reproduces an example passage, a forum exchange about expanding sin(5θ) via De Moivre's theorem, with per-token highlighting.]
Figure 8: Contribution Visualization. We provide an example text where we visualize the degree to which introducing thoughts helped at tokens throughout a text. Green indicates that the thought before that token made that token easier to predict, while yellow indicates that it made it harder to predict. More impactful thoughts have higher opacity.
I Handling Instability
Several aspects of this task have the potential to introduce instability. First, and perhaps most importantly, the utility of a generated thought (or thought token) is a function of the mapping from the thought to its contribution to language prediction; however, the mapping from the thoughts to this contribution is learned based on the thoughts themselves. This means that, even if one were to generate a thought that allowed the perfect prediction of the next token, the loss could receive no signal from it if the mixing head's weight on that generation was 0. One solution we explored was to use the Gumbel-Softmax trick with a straight-through estimator (Jang et al., 2016), but with many consecutive softmax operations we observed vanishing gradients. This introduces an exploration-exploitation trade-off, a fundamental challenge in reinforcement learning. Approaches like DQN (Mnih et al., 2013), PPO (Schulman et al., 2017), and A3C (Mnih et al., 2016) often resolve these trade-offs by learning a state value function, which estimates the expected future reward from each state. However, the reward functions associated with this environment are unstable (as noted earlier, due to the also-changing mixing heads); consequently, our preliminary explorations with these techniques were not promising. While we are far from the first to note that optimizing rationales is a reinforcement-learning task (Zelikman et al., 2022; Zhang & Parkes, 2023; Phan et al., 2023), the need for the rationale to avoid harming the base model performance introduces additional complexity. Essentially, the more complex the mapping from LM output to the next-token prediction, the more instability we observed. On the other hand, when we trained without any interpolation, i.e. ablating the mixing head and only using the language model prediction after thoughts, the model quickly learned to simply ignore the thoughts (and we saw no generalization to any downstream tasks).
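For concreteness, the following is a minimal sketch of the straight-through Gumbel-Softmax sampling referenced above, using PyTorch's built-in helper; the function name and the temperature value are illustrative assumptions rather than the configuration we explored.

```python
import torch.nn.functional as F

def sample_thought_token(logits, tau=1.0):
    # hard=True returns a one-hot sample in the forward pass while the backward
    # pass uses the soft relaxation (the straight-through estimator of
    # Jang et al., 2016), letting gradients flow through a discrete token choice.
    return F.gumbel_softmax(logits, tau=tau, hard=True)
```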
We explored the use of separate heads for thinking and talking (here, we use talking to refer to directly outputting a hidden state or logits, rather than a mixing weight). In particular, we explored both linear layers from the hidden states and MLPs, initialized to contribute 0 residually to the base language model outputs, in order to generate thoughts and next-token predictions similar to what the language model would have otherwise generated. However, we observed that, in all instances, the previously-mentioned instability prevented learning. Consequently, we aimed to remove or minimize all components that could transform the language model's outputs, both with and without its rationales. We also note that our choice to use a language model to output a weight combining multiple states (as done by our mixing head) is essentially an attention mechanism allowing the model to attend to its thinking. This has similarity to the approach taken in Backpack language models (Hewitt et al., 2023), which also learn to predict weights to apply to summed input embeddings to model future text, rather than allowing the language model to output arbitrary embeddings. Despite this constraint, Backpack language models appear to have comparable performance to traditional language models (Hewitt et al., 2023).
^1 Due to our linear mixing, equivalent to shifting the mixing weight toward the base prediction.
^2 For example, in this very text, the token following "run" is "-".
Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step. She eats 3 daily for breakfast. She bakes 4 daily into muffins for her friends. That leaves 9. She sells 9 for \(2\), so \(9 \times 2 = \$18\). So Janet makes $18 selling duck eggs each day.
<s> Q: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
A: Let's think step by step. The "how much in dollars" part means that one egg is worth one dollar. So, twelve ducks lay 16 eggs, which means that Janet nets $20 from selling them. $16 muffins will give us 4 eggs, which nets us $8. So, we paid $108 for our ingredients to make 16 muffins. Hence, we lost 96 dollars. Hello, heroine, we have to do better than that! Look into averages. Let's say that we pay 5 dollars for one egg. Now, $12 is our product and $5 is our investment. Our net wealth will be $12 - 5*12 = 24 dollars. Is that correct?