
ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation

Zhicheng Lee$^{1}$, Shulin Cao$^{1}$, Jinxin Liu$^{1}$, Jiajie Zhang$^{1}$, Weichuan Liu$^{2}$, Xiaoyin Che$^{2}$, Lei Hou$^{1}$, Juanzi Li$^{1}$

$^{1}$Tsinghua University, $^{2}$Siemens AG
https://github.com/THU-KEG/ReaRAG

Abstract

Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy. While recent works adopt reinforcement learning (RL) training to integrate reasoning with retrieval, such methods are often complex and resource-heavy. Instead, our approach achieves comparable performance via strategic distillation, without the need for costly RL training. In this paper, we propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations. Our solution includes a novel data construction framework with an upper bound on the reasoning chain length. Specifically, we first leverage an LRM to generate deliberate thinking, then select an action from a predefined action space (Search and Finish). For the Search action, a query is executed against the RAG engine, and the result is returned as an observation to guide subsequent reasoning steps. This process iterates until a Finish action is chosen. Benefiting from ReaRAG's strong reasoning capabilities, our approach outperforms existing baselines overall. Further analysis highlights its strong reflective ability to recognize errors and refine its reasoning trajectory. Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG).

1 Introduction

Large Reasoning Models (LRMs) such as OpenAI's o1 (Jaech et al., 2024), Qwen's QwQ-32B$^{1}$ (Team, 2024), GLM-Z1$^{2}$ and DeepSeek-R1 (DeepSeek-AI et al., 2025) demonstrate impressive reasoning capabilities on complex tasks (Xu et al., 2025). However, their reliance on parametric knowledge during reasoning limits performance on question answering (QA) tasks that require factual answers, where reasoning beyond the model's parametric knowledge is required.

Figure 1: Unlike LRMs, ReaRAG iteratively constructs knowledge-guided reasoning chains for factual answers.
To enhance LRMs' factuality, Retrieval-Augmented Generation (RAG) (Lewis et al., 2020; Shi et al., 2024; Guu et al., 2020) offers a promising solution by integrating external knowledge but faces challenges in retrieving relevant documents, which requires formulating precise search queries (Chan et al., 2024). Prior research has explored iterative retrieval strategies (Press et al., 2023; Shao et al., 2023), which construct reasoning chains of sub-queries and sub-answers to solve multi-hop QA. However, these methods suffer from error propagation, where mistakes in earlier steps mislead subsequent retrieval and reasoning, ultimately degrading the overall answer quality (Cao et al., 2023).
To address this limitation, Search-o1 (Li et al., 2025) prompts an LRM to iteratively retrieve documents and generate sub-answers via a Reason-in-Documents module, relying heavily on the model's inherent reasoning and instruction-following capabilities. This leads to three key issues: (1) unreliable retrieval token generation, (2) information extraction failures and hallucinations during document reasoning, and (3) overthinking in multi-hop QA (Chen et al., 2024). Effectively integrating reasoning with retrieval remains challenging.
In response to this challenge, recent works such as R1-Searcher (Song et al., 2025) and ReSearch (Chen et al., 2025) adopt reinforcement learning (RL) to incentivize Large Language Models (LLMs) to perform reasoning before retrieval. However, such policy optimization approaches involve complex training paradigms and require extensive computational resources.
In this paper, we propose ReaRAG, a factuality-enhanced reasoning model for RAG that achieves comparable performance without relying on complex RL-based training. Instead of resource-heavy policy optimization, we strategically distill the reasoning capabilities of LRMs to iteratively construct knowledge-guided reasoning chains, which are then curated into a dedicated dataset with restricted chain lengths to fine-tune ReaRAG under the Thought-Action-Observation paradigm. During inference, ReaRAG iteratively performs the search action and dynamically decides when to trigger the finish action, avoiding excessive retrieval. Guided by external knowledge, ReaRAG reflects on its reasoning trajectory, detects errors, and realigns its reasoning toward the correct path, leading to improved performance with greater training efficiency and practical implementation.
To validate our approach, we conduct experiments on six benchmarks covering both multi-hop and single-hop QA, where ReaRAG achieves the highest overall performance. In summary, our contributions are as follows:
  • Enhancing LRMs' factuality through knowledge-guided reasoning chains. We propose ReaRAG-9B, a model fine-tuned on a dedicated dataset to perform knowledge-guided reasoning, enabling reliable access to external knowledge. Analyses show that ReaRAG reflects on prior steps, strategically uses external knowledge to identify errors and refines its reasoning, showcasing robust multi-step reasoning capabilities.
  • Effectively combining strong reasoning with RAG. Compared to Search-o1, our model avoids redundant retrieval in multi-hop QA. Compared to RL-based methods, we offer a more efficient and robust solution by strategically distilling reasoning capabilities from an LRM, achieving comparable performance even without complex RL training.
  • Enhanced benchmark performance. Compared to RL-based methods, ReaRAG performs on par with or surpasses them on MuSiQue (Trivedi et al., 2022), HotpotQA (Yang et al., 2018), IIRC (Ferguson et al., 2020) and Natural Questions (NQ) (Kwiatkowski et al., 2019), while outperforming them on harder tasks such as FRAMES (Krishna et al., 2024) and FanOutQA (Zhu et al., 2024).

2 Related Work

Reasoning-enhanced LLMs. Numerous works have investigated how to elicit the reasoning abilities of LLMs. Early approaches, such as Chain-of-Thought (CoT) (Wei et al., 2022), ReAct (Yao et al., 2023b) and Tree of Thought (ToT) (Yao et al., 2023a), prompt LLMs to generate human-like step-by-step reasoning chains, though they still struggle on complex reasoning tasks. Recent advances in LRMs have scaled up CoT through RL (Kumar et al., 2024), enabling models to generate long CoT before providing a final answer. Notably, LRMs including OpenAI's o1 (Jaech et al., 2024), Qwen's QwQ-32B (Team, 2024), GLM-Z1 and DeepSeek-R1 (DeepSeek-AI et al., 2025) demonstrate impressive performance in complex reasoning tasks. Despite these advancements, most LRMs lack the ability to interact with external knowledge, limiting their capacity to generate factual responses.
Retrieval-Augmented Generation. RAG has emerged as a promising paradigm for improving LLMs' factuality. Early methods rely on a single retrieval step (Lewis et al., 2020; Borgeaud et al., 2022; Izacard et al., 2023), which often falls short for multi-hop QA tasks due to limited retrieval quality. To mitigate noisy retrieval (Shi et al., 2023), Self-RAG (Asai et al., 2024) and CRAG (Yan et al., 2024) introduce reflection mechanisms on the retrieved documents, yet single-step retrieval remains insufficient. To address these limitations, iterative methods such as Iter-RetGen (Shao et al., 2023) and Self-Ask (Press et al., 2023) progressively retrieve relevant documents to gather sufficient information for the final answer. Building on this, SearChain (Xu et al., 2024) first generates a complete reasoning chain, then verifies the answer of each node in the chain based on external knowledge. However, these approaches lack a strong reflection mechanism to recover from earlier reasoning errors.
Reasoning-enhanced RAG. Recent studies in scaling token generation at test time (OpenAI, 2024; Muennighoff et al., 2025; Snell et al., 2024) to enhance LLMs' reasoning capabilities have spurred interest in reasoning-enhanced RAG. RAG-Star (Jiang et al., 2024) leverages Monte Carlo Tree Search to iteratively decompose multi-hop questions, guided by a reward model during the tree expansion.

Figure 2: Overview of our approach to develop a factuality-enhanced reasoning model, ReaRAG. To equip ReaRAG with knowledge-guided reasoning ability, we propose an automated data construction approach (Algorithm 1). Next, we fine-tune ReaRAG on the constructed dataset to conduct reasoning iteratively, following the Thought-Action-Observation paradigm to solve complex queries. Pseudocode for the inference stage is provided in Algorithm 2.

Search-o1 (Li et al., 2025) introduces a document reasoning module and prompts QwQ-32B to generate special tokens that trigger knowledge retrieval. However, it heavily depends on the base model's instruction-following and inherited reasoning capabilities, leading to three key challenges: unreliable generation of specialized tokens, failure to extract information from retrieved documents, and overthinking in multi-hop QA. To integrate reasoning with retrieval, R1-Searcher (Song et al., 2025) and ReSearch (Chen et al., 2025) adopt an RL-based method that incentivizes LLMs to reason before retrieving. However, while these RL-based training paradigms are complex, our approach achieves comparable performance through strategic distillation, offering greater training efficiency and practical ease of implementation.

3 Methodology

In this section, we first formalize the task, then present our novel approach for developing ReaRAG as illustrated in Figure 2.

3.1 Task formulation

We focus on the multi-hop QA task, where iterative reasoning improves answer accuracy. Given a question $x$, our goal is to construct a knowledge-guided reasoning chain $\mathcal{C}$ to enhance the factual correctness of the generated answer $\hat{y}$. Specifically, the reasoning chain is formulated as a sequence of $N$ steps, where each step consists of a reasoning thought $\tau_{t}$, an action $\alpha_{t}$, and an observation $o_{t}$:

$$\mathcal{C}=\left\{\left(\tau_{t}, \alpha_{t}, o_{t}\right)\right\}_{t=1}^{N}, \quad 1 \leq t \leq T_{\max}$$

The number of reasoning steps $N$ is dynamically determined by the model but is constrained by an upper limit $T_{\max}$ to prevent indefinite iterations, i.e., $N \leq T_{\max}$. To guide the reasoning process with external knowledge, we define the action space as $\mathcal{A}=\{\operatorname{search}(), \operatorname{finish}()\}$, where search takes a search query as input, while the input to finish represents the derived answer. At each step $t$, the model refines a search query based on the reasoning thought $\tau_{t}$ and executes the search action to retrieve relevant information from the RAG engine $\mathcal{R}$. The process continues until the model selects the finish action, at which point the final answer $\hat{y}$ is derived from all prior reasoning steps. This ensures that the answer is grounded in retrieved knowledge through an iterative and structured reasoning process, thereby enhancing the factual reliability of LRMs.
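To make the formulation above concrete, the following is a minimal Python sketch of how the action space $\mathcal{A}$ and one step $(\tau_{t}, \alpha_{t}, o_{t})$ of the chain $\mathcal{C}$ could be represented. The class and field names are illustrative assumptions, not the paper's released code.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Action space A = {search(), finish()}: "search" carries a query,
# "finish" carries the derived answer (both stored in "argument").
@dataclass
class Action:
    type: Literal["search", "finish"]  # sampled from the action space A
    argument: str                      # search query, or the final answer

# One step (tau_t, alpha_t, o_t) of the knowledge-guided reasoning chain C.
@dataclass
class ReasoningStep:
    thought: str                # tau_t: reflection on prior actions/observations
    action: Action              # alpha_t: expressed as a JSON dictionary in the paper
    observation: Optional[str]  # o_t: feedback from the RAG engine (None for finish)

T_MAX = 10  # illustrative upper bound on chain length, so that N <= T_max
```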
Algorithm 1 Data Construction
Input: Seed Dataset \(\mathcal{D}_{\text{seed}}\), Large Reasoning Model \(\mathcal{M}_{LRM}\), Instruction prompt \(\mathcal{P}_{d}\), Max iterations \(T_{\text{max}}\), RAG engine \(\mathcal{R}\)
Output: Dataset \(\mathcal{D}_{\text {reason }}\) with reasoning chains
    Initialize \(\mathcal{D}_{\text {reason }} \leftarrow \emptyset\)
    for each \(\left(x_{i}, d o c_{i}\right) \sim \mathcal{D}_{\text {seed }}\) do
            \(\triangleright\) Sample question and gold documents
        \(t \leftarrow 0 \quad \triangleright\) Iteration counter
        \(\mathcal{C}_{i} \leftarrow[] \quad \triangleright\) Reasoning chain
        while \(t<T_{\text {max }}\) do
            \(y_{t}^{\prime} \leftarrow \mathcal{M}_{L R M}\left(\left[\mathcal{P}_{d} \oplus x_{i} \oplus \mathcal{C}_{i}\right]\right)\)
                    \(\triangleright\) Generate response
            \(\left(\tau_{t}, \alpha_{t}\right) \leftarrow \operatorname{parse}\left(y_{t}^{\prime}\right)\)
                \(\triangleright\) Extract thought \(\tau_{t}\) and action \(\alpha_{t}\)
            Get action type \(\alpha_{t_{\text {type }}} \leftarrow \alpha_{t}\) ['type']
            if finish \(\in \alpha_{t_{\text {type }}}\) then
                Append \(\left(\tau_{t}, \alpha_{t}\right)\) to \(\mathcal{C}_{i}\)
                break
            else if search \(\in \alpha_{t_{\text {type }}}\) then
                \(q_{s} \leftarrow \alpha_{t}\) ['query']
                \(o_{t} \leftarrow \mathcal{R}\left(q_{s}, d o c_{i}\right)\)
                    \(\triangleright\) Get \(o_{t}\) from RAG engine
                Append \(\left(\tau_{t}, \alpha_{t}, o_{t}\right)\) to \(\mathcal{C}_{i}\)
            end if
            \(t \leftarrow t+1\)
        end while
        Append \(\mathcal{C}_{i}\) to \(\mathcal{D}_{\text {reason }}\)
    end for
    return \(\mathcal{D}_{\text {reason }}\)

3.2 Knowledge-guided reasoning chain generation

While existing LRMs demonstrate strong reasoning capabilities, they often fail to ground their reasoning process in factual knowledge. To make external knowledge accessible, we design a structured reasoning step where each step consists of a reasoning thought $\tau_{t}$, an action $\alpha_{t}$, and an observation $o_{t}$.

  • Reasoning thought $\tau_{t}$: Represents the model's thought process, where it reflects on prior actions and observations before deciding on an action and its input parameter.
  • Action $\alpha_{t}$: A JSON dictionary containing an action sampled from the action space $\mathcal{A}$ along with the corresponding input parameter.
  • Observation $o_{t}$: Feedback received after executing action $\alpha_{t}$, guiding subsequent reasoning.
To equip ReaRAG with the ability to construct reasoning chains guided by external knowledge, we propose an automated data construction approach, as detailed in Algorithm 1.
Data construction. Given a multi-hop question $x_{i}$ sampled from the seed dataset, we prompt the LRM $\mathcal{M}_{LRM}$ with the instruction prompt $\mathcal{P}_{d}$ (see Appendix B) to collect reasoning thoughts and actions. Next, the search query is extracted and executed against the RAG engine $\mathcal{R}$ to obtain an observation $o_{t}$. The process iterates until either the model decides on a finish action or the iteration count exceeds the maximum number of iterations $T_{\max}$.
Data filtering. Previous studies have shown that the performance of LLMs heavily depends on the quality of fine-tuning data (Gunasekar et al., 2023). To ensure high-quality reasoning chains, we apply data filtering by comparing the final answer $\hat{y}_{i}$ derived from the reasoning chain $\mathcal{C}_{i}$ against the ground-truth answer $y_{i}$ using the F1 metric. Reasoning chains with an F1 score of 0 are discarded to maintain data integrity.
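As a rough illustration of this filtering rule, the sketch below keeps a constructed chain only if the token-level F1 between its final answer and the ground truth is non-zero. The helper names are assumptions and the whitespace tokenization is deliberately simplistic compared to standard QA evaluation scripts.

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def keep_chain(final_answer: str, gold_answer: str) -> bool:
    # Chains whose final answer scores F1 = 0 against the gold answer are discarded.
    return f1_score(final_answer, gold_answer) > 0.0
```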

3.3 Factuality-enhanced LRMs: ReaRAG

Fine-tuning. To incorporate knowledge-guided reasoning ability into the model, we perform supervised fine-tuning (SFT) on the dataset constructed in the previous section. Each sample is a conversation chain $\mathcal{S}=\left\{\mathcal{P}, x_{i},\left\{\left(\tau_{t}, \alpha_{t}, o_{t}\right)\right\}_{t=1}^{N}\right\}$, where $\mathcal{P}$ denotes the instruction prompt (see Appendix B). We fine-tune our factuality-enhanced LRM, ReaRAG ($\mathcal{M}_{\text{ReaRAG}}$), using the loss function below:
$$L=-\sum_{j} \mathbf{1}\left(s_{j}\right) \times \log \mathcal{M}_{\text{ReaRAG}}\left(s_{j} \mid s_{<j}\right)$$

where $s_{j}$ represents the textual tokens of the input sequence $\mathcal{S}$, and $\mathbf{1}(\cdot)$ is a loss mask indicator, set to True on thought $\tau_{t}$ and action $\alpha_{t}$ tokens, ensuring that the loss is computed only over the tokens contributing to the reasoning thoughts and actions, rather than the entire sequence.
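The loss mask can be realized with the usual `ignore_index` trick: label positions belonging to the prompt, question, and observation tokens are set to -100 so that cross-entropy is computed only on thought and action tokens. This is a minimal PyTorch sketch under that assumption, not the authors' training code.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss (prompt, question, observations)

def masked_lm_loss(logits: torch.Tensor,
                   input_ids: torch.Tensor,
                   loss_mask: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); input_ids: (B, T); loss_mask: (B, T) bool,
    True only on thought tau_t and action alpha_t tokens."""
    labels = input_ids.clone()
    labels[~loss_mask] = IGNORE_INDEX  # masked positions contribute no gradient
    # Shift so that tokens s_{<j} predict token s_j, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1),
                           ignore_index=IGNORE_INDEX)
```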
Inference. After fine-tuning, ReaRAG is equipped with advanced reasoning capabilities to solve multi-hop QA. Given an instruction prompt $\mathcal{P}$ and a question $x$, the model first generates a reasoning thought $\tau_{0}$ and an initial action $\alpha_{0}$, typically a search action. The search query $q_{s}$ is extracted and executed over the RAG engine $\mathcal{R}$, which returns an observation $o_{0}$.
Algorithm 2 Inference
Input: Input question \(x\), documents \(doc\), ReaRAG \(\mathcal{M}_{\text{ReaRAG}}\), Answer LLM \(\mathcal{M}_{\text{Ans}}\), Instruction prompt \(\mathcal{P}\), Answer prompt \(\mathcal{P}_{\text{ans}}\), Max iterations \(T_{\text{max}}\), RAG engine \(\mathcal{R}\)
Output: Final answer \(\hat{y}\)
    \(t \leftarrow 0, \mathcal{C} \leftarrow[], Sum \leftarrow[] \quad \triangleright\) Initialization
    while \(t<T_{\text{max}}\) do
        \(y_{t}^{\prime} \leftarrow \mathcal{M}_{\text{ReaRAG}}([\mathcal{P} \oplus x \oplus \mathcal{C}])\)
                \(\triangleright\) Generate response for iteration \(t\)
        \(\left(\tau_{t}, \alpha_{t}\right) \leftarrow \operatorname{parse}\left(y_{t}^{\prime}\right)\)
                \(\triangleright\) Extract thought \(\tau_{t}\) and action \(\alpha_{t}\)
        Get action type \(\alpha_{t_{\text {type }}} \leftarrow \alpha_{t}\) ['type']
        if finish \(\in \alpha_{t_{\text {type }}}\) then
            \(\hat{y} \leftarrow \mathcal{M}_{\text {Ans }}\left(\left[\mathcal{P}_{\text {ans }} \oplus x \oplus S u m\right]\right)\)
            return final answer \(\hat{y}\)
        else if search \(\in \alpha_{t_{\text {type }}}\) then
            \(q_{s} \leftarrow \alpha_{t}\) ['query']
            \(o_{t} \leftarrow \mathcal{R}\left(q_{s}, d o c\right)\)
                    \(\triangleright\) Get \(o_{t}\) from RAG engine
            Append ( \(\tau_{t}, \alpha_{t}, o_{t}\) ) to \(\mathcal{C}\)
            Append ( \(q_{s}, o_{t}\) ) to Sum
        end if
        \(t \leftarrow t+1\)
    end while
The triplet $(\tau_{t}, \alpha_{t}, o_{t})$ is appended to the reasoning chain $\mathcal{C}$, while the pair $(q_{s}, o_{t})$ is appended to the summary chain $Sum$. This process iterates until ReaRAG decides on a finish action at step $N$. Upon this action, we concatenate all gathered information in $Sum$ and prompt an answer model $\mathcal{M}_{\text{Ans}}$ to generate the final answer $\hat{y}$ using the prompt $\mathcal{P}_{\text{ans}}$ described in Appendix B. The pseudocode for the inference stage is provided in Algorithm 2.
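For readers who prefer code over pseudocode, below is a minimal Python rendering of the Algorithm 2 loop. The `rearag_generate`, `parse_thought_action`, `rag_engine`, and `answer_model` callables are placeholders for the fine-tuned model, the thought/action parser, the RAG engine $\mathcal{R}$, and the answer LLM $\mathcal{M}_{\text{Ans}}$; their interfaces are assumptions for illustration, not released APIs.

```python
def rearag_inference(question, docs, rearag_generate, parse_thought_action,
                     rag_engine, answer_model, t_max=10):
    """Iterative Thought-Action-Observation inference (cf. Algorithm 2)."""
    chain, summary = [], []  # reasoning chain C and summary chain Sum
    for _ in range(t_max):
        response = rearag_generate(question, chain)        # y'_t
        thought, action = parse_thought_action(response)   # (tau_t, alpha_t)
        if action["type"] == "finish":
            # Concatenate the gathered (query, observation) pairs and ask M_Ans for y_hat.
            return answer_model(question, summary)
        if action["type"] == "search":
            query = action["query"]
            observation = rag_engine(query, docs)           # o_t
            chain.append((thought, action, observation))
            summary.append((query, observation))
    # Fallback if T_max is reached without a finish action.
    return answer_model(question, summary)
```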

4 Experiments

4.1 Experimental setup

Dataset and metrics. We evaluate our approach on multi-hop QA benchmarks requiring reasoning across multiple documents, including MuSiQue (MQ) (Trivedi et al., 2022), HotpotQA (HP) (Yang et al., 2018) and IIRC (Ferguson et al., 2020), as well as recently proposed benchmarks, including FRAMES (FR) (Krishna et al., 2024) and FanOutQA (FQ) (Zhu et al., 2024). To evaluate single-hop reasoning capabilities, we additionally include NQ (Kwiatkowski et al., 2019). Since these datasets require open-ended answers, traditional metrics such as exact match (EM) may fail to capture semantically equivalent responses (Yin et al., 2024). Hence, we include the LLM-as-a-Judge metric ($\mathrm{ACC}_{L}$) (Zheng et al., 2023) based on GPT-4o for more accurate evaluation, using the prompt $\mathcal{P}_{judge}$ described in Appendix B. For FanOutQA, we follow the metric used in the original benchmark and report the Rouge-L (R-L) score instead of EM. We randomly sample 100 validation examples each from MuSiQue, HotpotQA, and NQ, 200 from IIRC, and 396 from FRAMES. For FanOutQA, we evaluate on the full development set, which contains 310 examples.
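As a sketch of the two answer-level metrics, the snippet below computes exact match after light normalization and frames $\mathrm{ACC}_{L}$ as a binary verdict from a judge model. The judge prompt here is a simplified stand-in for the $\mathcal{P}_{judge}$ prompt in Appendix B, and `judge_llm` is an assumed thin wrapper around a GPT-4o call.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation/articles/extra whitespace (common QA normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def acc_l(question: str, prediction: str, gold: str, judge_llm) -> int:
    """LLM-as-a-Judge accuracy: 1 if the judge deems the prediction correct."""
    prompt = (f"Question: {question}\nGold answer: {gold}\n"
              f"Predicted answer: {prediction}\n"
              "Is the predicted answer semantically correct? Reply Yes or No.")
    verdict = judge_llm(prompt)  # e.g., a GPT-4o call behind the wrapper
    return int(verdict.strip().lower().startswith("yes"))
```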
Baselines. We compare our approach against multiple baselines, categorized based on their access to external knowledge. These include in-context retrieval, where the corpus is directly appended to the language model's context; vanilla RAG, which performs a single retrieval based on the original input question; and state-of-the-art advanced RAG methods proposed recently.

In the in-context settings, we use GLM-4-9B and GLM-4-32B (Zeng et al., 2024), both with a 128k context window. The vanilla RAG baselines employ the same models, along with QwQ-32B (Team, 2024). For advanced RAG baselines, we include Self-RAG (Asai et al., 2024), which fine-tunes Llama-2-7B to retrieve documents on demand and filter noisy evidence; SearChain (Xu et al., 2024), which generates a Chain-of-Query structure enabling multi-turn retrieval with verification; and Search-o1 (Li et al., 2025), which proposes a framework for LRMs to perform iterative knowledge retrieval. We further consider R1-Searcher (Song et al., 2025) and ReSearch (Chen et al., 2025), which incentivize LLMs to perform reasoning prior to retrieval through RL.

4.2 Implementation details

Data construction and fine-tuning. The seed dataset described in Algorithm 1 is derived from the training sets of MuSiQue, HotpotQA, and NQ, with QwQ-32B as the LRM. To preserve the model's general capabilities, we fine-tune GLM-4-9B (Zeng et al., 2024) on the constructed dataset (roughly 20k filtered samples), as well as the general SFT dataset from GLM-4 (Zeng et al., 2024).
Evaluation. For MuSiQue and HotpotQA, we use the original corpora provided by the respective authors. For NQ, we follow the corpus setup from "Lost in the middle" (Liu et al., 2024). To increase the difficulty, particularly for comparisons with long-context LLMs, we scale up the number of distractor documents, resulting in corpora with 48k to 58k token lengths. Thus, this design demands high-quality queries to enhance retrieval quality.
| Model | MQ ACC$_L$ | MQ EM | HP ACC$_L$ | HP EM | IIRC ACC$_L$ | IIRC EM | FR ACC$_L$ | FR EM | FQ ACC$_L$ | FQ R-L | NQ ACC$_L$ | NQ EM | Avg ACC$_L$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *In-context* |  |  |  |  |  |  |  |  |  |  |  |  |  |
| GLM-4-9B (128k) | 23.5 | 15.0 | 58.0 | 47.0 | - | - | - | - | - | - | 45.5 | 26.0 | - |
| GLM-4-32B (128k) | 33.5 | 17.0 | 65.5 | 50.0 | - | - | - | - | - | - | 52.5 | 24.0 | - |
| *Vanilla RAG* |  |  |  |  |  |  |  |  |  |  |  |  |  |
| GLM-4-9B (128k) | 25.5 | 14.0 | 68.0 | 52.0 | 27.3 | 22.0 | 11.0 | 7.3 | 4.2 | 15.6 | 49.0 | 32.0 | 30.8 |
| GLM-4-32B (128k) | 29.0 | 17.0 | 67.5 | 52.0 | 26.0 | 14.5 | 13.9 | 7.6 | 3.9 | 12.3 | $\underline{53.0}$ | **39.0** | 32.2 |
| QwQ-32B | 36.0 | 20.0 | 67.0 | 47.0 | 32.8 | 25.0 | 23.4 | 13.9 | 9.7 | 14.2 | 48.0 | 26.0 | 36.1 |
| *Advanced RAG* |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Self-RAG-7B | 24.0 | 13.0 | 45.5 | 31.0 | 21.8 | 12.0 | 8.8 | 5.6 | 3.9 | 14.2 | 40.0 | 28.0 | 24.0 |
| SearChain-7B | 8.5 | 3.0 | 23.0 | 17.0 | 25.8 | 18.5 | 16.0 | 10.6 | 4.8 | $\underline{17.7}$ | 28.0 | 13.0 | 17.7 |
| Search-o1-32B | 40.5 | 32.0 | 55.5 | 38.0 | 32.8 | 29.5 | 27.8 | 19.2 | **12.3** | 14.9 | 43.0 | 28.0 | 35.3 |
| R1-Searcher-7B-Base | 64.0 | **52.5** | 79.5 | **68.0** | **42.5** | **35.5** | 26.9 | 18.7 | 5.8 | 15.6 | 48.5 | 35.0 | 44.5 |
| ReSearch-Qwen-7B-Instruct | 65.0 | $\underline{49.0}$ | 72.5 | 59.0 | $\underline{41.8}$ | 33.5 | $\underline{29.4}$ | $\underline{21.0}$ | 6.8 | 15.8 | 53.0 | $\underline{37.0}$ | 44.7 |
| *Ours* |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ReaRAG-9B | **67.0** | 46.0 | **81.5** | 58.0 | 41.0 | 27.5 | **34.2** | **21.2** | $\underline{11.6}$ | **24.3** | **55.5** | 32.0 | **48.5** |

Table 1: Main experimental results across six benchmarks (MQ: MuSiQue, HP: HotpotQA, FR: FRAMES, FQ: FanOutQA). Bold and underline indicate the best and second-best results. We report the traditional EM scores as well as $\mathrm{ACC}_{L}$, a metric based on the LLM-as-a-Judge framework using GPT-4o. For FanOutQA, we follow the original benchmark and report the Rouge-L (R-L) score. Our model, ReaRAG-9B, achieves the best overall performance, demonstrating the effectiveness of our strategic distillation approach, even when compared to recent RL-based baselines that integrate reasoning with retrieval.

For IIRC, FRAMES, and FanOutQA, where official corpora are incomplete or unavailable, we retrieve the top 5 most relevant Wikipedia snippets via the Google Serper API. These benchmarks are excluded from the in-context settings evaluations due to the lack of official corpora. All baselines are evaluated using their official implementations with our RAG engine and corpus setup.
RAG engine. Our RAG engine consists of two main components: retrieval and generation. For retrieval, we utilize the embedding model embedding-3 from Zhipu's API$^{3}$, along with a reranker based on the GLM3 architecture to enhance retrieval quality. This setup is applied only to benchmarks with self-constructed corpora, while for IIRC, FRAMES, and FanOutQA, documents are retrieved directly via Google. For generation, we use GLM-4-32B with a 128k context window to generate responses based on the retrieved documents.
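The retrieve-rerank-generate flow of the RAG engine can be summarized as in the sketch below: embed and retrieve candidate chunks, rerank them, and let a generator answer the sub-query from the top passages. The `embed`, `rerank`, and `generate` callables stand in for the embedding model, the GLM3-based reranker, and GLM-4-32B; their signatures are assumptions for illustration only.

```python
import numpy as np

def rag_engine(query, corpus_chunks, embed, rerank, generate,
               top_k_retrieve=20, top_k_rerank=5):
    """Retrieve, rerank, then generate an observation for a search query."""
    # 1) Dense retrieval: cosine similarity between query and chunk embeddings
    #    (in practice chunk embeddings would be precomputed and indexed).
    q_vec = np.asarray(embed(query))
    chunk_vecs = np.asarray([embed(c) for c in corpus_chunks])
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8)
    candidates = [corpus_chunks[i] for i in np.argsort(-sims)[:top_k_retrieve]]

    # 2) Rerank candidates with a cross-encoder-style relevance scorer.
    scored = sorted(candidates, key=lambda c: rerank(query, c), reverse=True)
    top_passages = scored[:top_k_rerank]

    # 3) Generate the observation o_t from the retrieved passages.
    context = "\n\n".join(top_passages)
    return generate(f"Answer the query using the context.\n"
                    f"Context:\n{context}\n\nQuery: {query}")
```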

4.3 Main results

Table 1 presents our main results across six benchmarks. Overall, our approach ReaRAG-9B achieves the highest average $\mathrm{ACC}_{L}$ (48.5) across all evaluated methods. Despite the recent momentum around RL in RAG, particularly with models such as R1-Searcher and ReSearch, ReaRAG-9B, trained solely with SFT to generate knowledge-guided reasoning chains, still matches or exceeds their performance on MuSiQue, HotpotQA, IIRC, and NQ, and outperforms RL-based baselines on FRAMES and FanOutQA (by 5.8%-7.3% $\mathrm{ACC}_{L}$), suggesting that SFT can achieve equally powerful performance for open-ended QA tasks. Although designed for multi-hop tasks, ReaRAG also performs strongly on single-hop QA, achieving 55.5 $\mathrm{ACC}_{L}$ on NQ, demonstrating ReaRAG's capabilities beyond multi-hop reasoning.

However, discrepancies between EM and $\mathrm{ACC}_{L}$, e.g., on NQ, where ReaRAG achieves a higher $\mathrm{ACC}_{L}$ than ReSearch (55.5 vs. 53.0) but yields a lower EM score (32.0 vs. 37.0), suggest that EM may fail to capture contextually valid answers, a phenomenon consistently observed across the other benchmarks, necessitating the use of the $\mathrm{ACC}_{L}$ metric.
Comparing vanilla RAG with the in-context settings across different GLM-4 backbone scales, we find that vanilla RAG generally performs better, suggesting that long-context models may struggle with distractor-heavy corpora. The only exception is MuSiQue with the GLM-4-32B backbone, where the in-context setting slightly outperforms vanilla RAG (33.5 vs. 29.0). Under the vanilla RAG setting, QwQ-32B significantly outperforms GLM-4-32B on complex multi-hop benchmarks such as FRAMES and FanOutQA, highlighting the advantage of LRMs with strong reasoning capabilities.

Notably, the original SearChain paper reports results using GPT-3.5, a proprietary large-scale model. For a fair comparison, we replace it with Qwen2.5-7B-Instruct (Yang et al., 2024). Under this setting, SearChain underperforms across all benchmarks. Similarly, Self-RAG, despite aiming to improve retrieval quality, lacks multi-turn retrieval strategies and shows weak performance. Search-o1, although leveraging QwQ-32B, a strong reasoning LRM, to perform multi-turn retrieval, only performs on par with the vanilla RAG setting. We further analyze these findings in Section 4.5.

4.4 Ablation

Closed-book performance. We conduct a closed-book experiment to evaluate the parametric knowledge of the language models. The results, presented in Table 3, show that QwQ-32B outperforms GLM-4 on multi-hop benchmarks requiring strong reasoning, except for HotpotQA and FanOutQA. Nevertheless, their parametric knowledge remains insufficient compared to the results in Table 1.
Advantage of strong reasoning. To evaluate the impact of strong reasoning capabilities, we fine-tune a model that lacks such abilities under the same Thought-Action-Observation paradigm. This variant, denoted as w/o reasoning in Table 4, shares the same backbone architecture as ReaRAG-9B and follows the data construction process outlined in Algorithm 3. However, instead of leveraging a strong reasoning model like QwQ-32B for data generation, we employ GLM-4-9B, which lacks robust reasoning abilities. Unlike the previous data construction approach in Algorithm 1, which used only multi-hop questions as input, we now provide GLM-4-9B with additional information, including ground-truth decompositions and ground-truth answers. The instruction prompt $\mathcal{P}_{\text{ablat}}$ used to generate its reasoning chain is detailed in Appendix B.

Table 4 shows that ReaRAG-9B with enhanced reasoning capabilities (w/ reasoning) consistently outperforms its counterpart without reasoning, achieving a notable gain of 6-12% $\mathrm{ACC}_{L}$ on the multi-hop benchmarks and a 3.5% gain on single-hop NQ. The smaller gain on NQ suggests that strong reasoning offers limited benefit for single-hop tasks, consistent with the assumption that such questions require less compositional reasoning.

4.5 Analysis

|  | MQ | HP | IIRC | FR | FQ | NQ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Invalid (%) | 19.0 | 28.0 | 18.0 | 18.9 | 30.0 | 25.0 |

Table 2: Invalid generation rates of special tokens in QwQ-32B, leading to retrieval failures in Search-o1.

4.5.1 Performance against strong baseline

We conduct an in-depth analysis and compare our approach against the strong baselines, including Search-o1, R1-Searcher and ReSearch. Below, we highlight key factors affecting their performance.
Token generation failure. QwQ-32B struggles to follow the instruction prompt, failing to generate special tokens (e.g., |begin_search_query|) essential for Search-o1 to retrieve external knowledge. This limitation forces Search-o1 into a closed-book setting, significantly impairing its performance. Table 2 quantifies this issue, revealing invalid generation rates of 18-30%.
Information extraction failure. Search-o1 introduces the Reason-in-Documents module, leveraging QwQ-32B for in-depth reasoning over retrieved documents to generate responses as search results. However, this module has a key limitation: it may incorrectly conclude that no useful information is available (Table 5). Our analysis identifies the primary cause: the module attempts to answer the original multi-hop question based on the search query, but the retrieved information is insufficient. For example, as shown in Table 5, the module searches for information related to "Hannibal and Scipio book" in its first search, but the retrieved content only includes the book's author, lacking information about the place of education. This flaw weakens Search-o1, as it continuously searches for non-existent information, causing reasoning paths to diverge and ultimately hitting iteration limits.
Hallucination in Reason-in-Documents module. The Reason-in-Documents module is prone to hallucination (Table 6). For instance, when searching for the members of Bruce Lee Band, the module fails to find relevant information and fabricates "Less Than Records" based on parametric knowledge rather than the provided corpus. This hallucination propagates through subsequent reasoning steps, degrading the final answer quality.

Figure 3: Comparison of chain length across all benchmarks. We measure the number of retrievals needed for the baselines to achieve a full $\mathrm{ACC}_{L}$ score. Search-o1 consistently requires more steps than ReaRAG, highlighting the tendency of the underlying LRM to overthink in multi-hop QA tasks.

Poor generalization on harder benchmarks. R1-Searcher and ReSearch show weaker performance on FRAMES and FanOutQA, despite matching ReaRAG on other benchmarks. Case studies in Tables 7 and 8 reveal that both models struggle to reason effectively in the presence of noisy information. R1-Searcher, in particular, proceeds to generate final answers despite lacking key information. These findings suggest that RL-based models lack the ability to reflect and revise earlier reasoning, an ability essential for robust multi-hop reasoning. In contrast, ReaRAG successfully realigns its reasoning path, even after early-stage errors, demonstrating greater robustness on complex multi-hop tasks.
Overthinking in multi-hop QA. Recent studies have shown that LRMs tend to overthink by generating overly long reasoning chains, causing redundancy in multi-hop QA (Chen et al., 2024; Team et al., 2025). We analyze this by comparing the number of retrievals needed to achieve a full $\mathrm{ACC}_{L}$ score. Compared to Search-o1, ReaRAG requires fewer steps across all benchmarks except FanOutQA, demonstrating efficiency in multi-hop reasoning while achieving better performance. Case studies in Tables 9 and 10 further highlight this difference. However, compared to baselines such as ReSearch and R1-Searcher, which are optimized for retrieval, ReaRAG consistently takes more steps.

4.5.2 Strength of ReaRAG

This section showcases ReaRAG’s advanced reasoning capabilities. Table 11 shows that ReaRAG initially mistakenly identified “Anne of Austria” as the grandmother of “Philippe” rather than his mother. However, ReaRAG later detected this mistake, verified the information, and corrected it. This self-correction mechanism helps prevent errors from propagating to later reasoning steps.
Table 12 shows how ReaRAG resolves ambiguity in a multi-hop question via multiple retrievals. The query involves "The Hard Easy", which refers to both a film and a TV series. At the sixth reasoning step, ReaRAG also faces conflicting information but successfully disambiguates and provides the correct answer.
Table 13 provides another example of ReaRAG handling ambiguity in a multi-hop question while resolving a knowledge conflict. Its parametric knowledge incorrectly states that Sonic is voiced by “Roger Craig Smith” instead of “Jim Cummings”. ReaRAG detects and corrects this inconsistency, ultimately reaching the correct answer. This case further highlights its robust reasoning abilities.
These examples highlight ReaRAG's ability to iteratively perform knowledge-guided reasoning. Compared to existing baselines, our approach better integrates the reasoning model with external knowledge, enhancing factual accuracy.

5 Conclusion

In this study, we introduce ReaRAG, a factuality-enhanced reasoning model capable of performing knowledge-guided reasoning. ReaRAG iteratively plans and reflects on reasoning steps, leveraging external knowledge to validate each step in the reasoning chain. Through comprehensive evaluation across six benchmarks, ReaRAG achieves the best overall performance, outperforming RL-based methods on challenging benchmarks such as FRAMES and FanOutQA, while matching them on easier tasks. Unlike RL-based approaches requiring complex policy optimization and high computational cost, our method strategically distills reasoning capabilities from LRMs, offering a more efficient alternative. Further analysis shows its robustness in handling complex multi-hop questions while mitigating the overthinking behavior seen in Search-o1. These results underscore the effectiveness of strategic reasoning distillation as a practical alternative to RL-based RAG methods.

Limitations

Limited action space While ReaRAG demonstrates strong performance in the QA task, its action space is currently limited to only search and finish in this study. Consequently, it is restricted to processing local knowledge sources and cannot perform actions such as leveraging a code compiler for coding tasks, executing mathematical calculations, or conducting real-time web searches. Expanding its action space could enhance its adaptability across diverse problem domains.
Data construction efficiency To equip ReaRAG with a structured reasoning process, we fine-tune ReaRAG using structured responses generated by the LRM. However, this approach relies on the LRM’s strong instruction-following ability, and a substantial portion of the data is discarded due to validity issues, leading to computational inefficiency and resource waste. Improving data augmentation techniques could mitigate this limitation.
Inference latency ReaRAG solves questions iteratively, requiring multiple reasoning steps to reach the final answer. While this enhances accuracy, it also increases inference time compared to models that generate answers in a single pass. This trade-off between reasoning depth and efficiency may limit its practicality in real-time applications or scenarios with strict latency constraints.

Acknowledgments

This work is supported by National Natural Science Foundation of China (62476150), Beijing Natural Science Foundation (L243006), Tsinghua University Initiative Scientific Research Program and Tsinghua University (Department of Computer Science and Technology)-Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things (JCIIOT).
This paper involved the use of AI-assisted tools (e.g., ChatGPT) for language refinement and editing. All content was reviewed and verified by the authors.

References

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 2206-2240. PMLR.
Shulin Cao, Jiajie Zhang, Jiaxin Shi, Xin Lv, Zijun Yao, Qi Tian, Lei Hou, and Juanzi Li. 2023. Probabilistic tree-of-thought reasoning for answering knowledge-intensive complex questions. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 12541-12560. Association for Computational Linguistics.
Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: learning to refine queries for retrieval augmented generation. CoRR, abs/2404.00610.
Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. 2025. ReSearch: Learning to reason with search for LLMs via reinforcement learning. CoRR, abs/2503.19470.
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2024. Do NOT think that much for 2+3=? On the overthinking of o1-like LLMs. CoRR, abs/2412.21187.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang,
R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, and S. S. Li. 2025. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. CoRR, abs/2501.12948.
James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 1137-1147. Association for Computational Linguistics.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks are all you need. CoRR, abs/2306.11644.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 3929-3938. PMLR.
Gautier Izacard, Patrick S. H. Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. J. Mach. Learn. Res., 24:251:1-251:43.
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit,
Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian Osband, Ignasi Clavera Gilaberte, and Ilge Akkaya. 2024. Openai o1 system card. CoRR, abs/2412.16720.
Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, and Tao Zhang. 2024. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. CoRR, abs/2412.12881.
Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. 2024. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. CoRR, abs/2409.12941.
Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. 2024. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452-466.
Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366.
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics, 12:157-173.
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.
OpenAI. 2024. Learning to reason with LLMs.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 5687-5711. Association for Computational Linguistics.
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 9248-9274. Association for Computational Linguistics.
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 31210-31227. PMLR.
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024. REPLUG: retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 8371-8384. Association for Computational Linguistics.
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, abs/2408.03314.
Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and JiRong Wen. 2025. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. CoRR, abs/2503.05592.
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang,
Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, and Zonghan Yang. 2025. Kimi k1.5: Scaling reinforcement learning with LLMs. CoRR, abs/2501.12599.
Qwen Team. 2024. Qwq: Reflect deeply on the boundaries of the unknown.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Musique: Multihop questions via single-hop question composition. Trans. Assoc. Comput. Linguistics, 10:539-554.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. 2025. Towards large reasoning models: A survey of reinforced reasoning with large language models. CoRR, abs/2501.09686.
Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng, and Tat-Seng Chua. 2024. Search-in-the-chain: Interactively enhancing large language models with search for knowledge-intensive tasks. In Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, pages 1362-1373. ACM.
Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. 2024. Corrective retrieval augmented generation. CoRR, abs/2401.15884.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2024. Qwen2.5 technical report. CoRR, abs/2412.15115.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and
Christopher D. Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2369-2380. Association for Computational Linguistics.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023b. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Raghavi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2024. Agent lumos: Unified and modular training for open-source language agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 12380-12403. Association for Computational Linguistics.
Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. 2024. Chatglm: A family of large language models from GLM-130B to GLM-4 all tools. CoRR, abs/2406.12793.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023.
Andrew Zhu, Alyssa Hwang, Liam Dugan, and Chris Callison-Burch. 2024. Fanoutqa: A multi-hop, multidocument question answering benchmark for large
language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Short Papers, Bangkok, Thailand, August 11-16, 2024, pages 18-37. Association for Computational Linguistics.

Appendix

A Ablation

| Model | MuSiQue (ACC_L / EM) | HotpotQA (ACC_L / EM) | IIRC (ACC_L / EM) | FRAMES (ACC_L / EM) | FanOutQA (ACC_L / R-L) | NQ (ACC_L / EM) | Avg. ACC_L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GLM-4-9B(128k) | 3.5 / 0.0 | 29.5 / 23.0 | 17.0 / 15.0 | 5.2 / 3.5 | 7.1 / 18.0 | 27.50 / 16.00 | 15.0 |
| GLM-4-32B(128k) | 6.5 / 1.0 | 40.0 / 28.0 | 17.0 / 11.5 | 10.5 / 4.6 | 11.9 / 22.5 | 44.5 / 25.0 | 21.7 |
| QwQ-32B | 11.0 / 2.0 | 35.0 / 10.0 | 21.3 / 14.5 | 14.4 / 9.1 | 11.0 / 13.2 | 37.5 / 12.0 | 21.7 |
Table 3: Closed-book performance of language models on multi-hop and single-hop benchmarks. These models perform better on single-hop benchmarks but score significantly lower on multi-hop benchmarks, highlighting the limitations of relying solely on parametric knowledge for these benchmarks.
| Model | MuSiQue (ACC_L / EM) | HotpotQA (ACC_L / EM) | IIRC (ACC_L / EM) | FRAMES (ACC_L / EM) | FanOutQA (ACC_L / R-L) | NQ (ACC_L / EM) | Avg. ACC_L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ReaRAG-9B w/o reasoning | 57.5 / 38.0 | 69.5 / 50.0 | 31.8 / 22.0 | 26.6 / 19.2 | 5.5 / 21.6 | 52.0 / 29.0 | 40.5 |
| ReaRAG-9B w/ reasoning | 67.0 / 46.0 | 81.5 / 58.0 | 41.0 / 27.5 | 34.2 / 21.2 | 11.6 / 24.3 | 55.5 / 32.0 | 48.5 |
Table 4: Performance comparison of models with and without strong reasoning capabilities. w/ reasoning consistently outperforms w/o reasoning across all benchmarks, demonstrating the effectiveness of our fine-tuning process, which enables ReaRAG-9B to inherit the strong reasoning abilities of LRM.
Algorithm 3: Data construction to fine-tune ReaRAG w/o reasoning

Input: seed dataset D_seed, large language model M_LLM, instruction prompt P_ablat
Output: dataset D_ablat

    Initialize D_ablat ← ∅
    for each (x_i, y_i, decomp_i) ~ D_seed do    ▷ sample question x_i, ground-truth answer y_i, and golden decomposition decomp_i
        r'_i ← M_LLM([P_ablat ⊕ x_i ⊕ y_i ⊕ decomp_i])    ▷ generate response
        C_i = [{τ_t, α_t, o_t}_{t=1..N}] ← parse(r'_i)    ▷ parse thoughts τ_t, actions α_t, and observations o_t into reasoning chain C_i
        Append C_i to D_ablat
    end for
    return D_ablat
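For readers who prefer code, the loop in Algorithm 3 can be sketched roughly as follows. This is a minimal illustration rather than the released pipeline: `call_llm` stands in for M_LLM, the regex-based parser is only one possible realization of parse(·), and responses that cannot be parsed into a valid chain are simply dropped, mirroring the validity filtering discussed in the limitations.

```python
# Minimal sketch of Algorithm 3 (not the authors' implementation).
# `call_llm` is an assumed wrapper around the LLM used for data construction;
# the regex parser is a hypothetical stand-in for parse(r'_i).
import re

STEP_PATTERN = re.compile(
    r"Thought (\d+):(.*?)Action \1:(.*?)Observation \1:(.*?)(?=Thought \d+:|\Z)",
    re.DOTALL,
)

def parse_reasoning_chain(response: str) -> list[dict]:
    """Extract (thought, action, observation) triples from a structured response."""
    return [
        {"thought": t.strip(), "action": a.strip(), "observation": o.strip()}
        for _, t, a, o in STEP_PATTERN.findall(response)
    ]

def build_ablation_dataset(seed_dataset, call_llm, prompt_ablat):
    dataset = []  # D_ablat <- empty set
    for question, answer, decomposition in seed_dataset:  # (x_i, y_i, decomp_i) ~ D_seed
        prompt = (f"{prompt_ablat}\n\nQuestion:\n{question}\n\n"
                  f"Ground-truth answer:\n{answer}\n\nDecompositions:\n{decomposition}")
        response = call_llm(prompt)              # r'_i <- M_LLM([P_ablat + x_i + y_i + decomp_i])
        chain = parse_reasoning_chain(response)  # C_i <- parse(r'_i)
        if chain:                                # discard responses that fail to parse
            dataset.append({"question": question, "chain": chain})
    return dataset
```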

B Prompts

Instruction prompts P_d for data construction to fine-tune ReaRAG

Your task is to solve a question answering task. To improve your solving accuracy, please conduct reasoning processes following this sequence: Thought, Action, Observation steps. Thought can reason about the current situation, and Action is in the form of a function. There are two available function types:

Available Functions:

(1) Search
{
"name": "search",
"description": "It can help you find useful information through the internet or local
    knowledge base. You can use this tool to access external knowledge.",
"parameters": {
    "type": "object",
    "properties": {
        "query": {
            "description": "what you want to search"
        }
    },
    "required": ["query"]
}
}
(2) Finish
{
    "name": "finish",
    "description": "You can use this function to make a conclusion from the reasoning process
        and give the final answer. The reasoning process is completed after this 'finish'
        function is called",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "description": "the final answer"
            }
        },
        "required": ["answer"]
    }
}

Some important rules you must follow:

(1) Please follow the function calling format above strictly.

(2) A set of ‘Thought’, ‘Action’, and ‘Observation’ is considered as one reasoning step. Add numbering after each ‘Thought’, ‘Action’, and ‘Observation’ to indicate the sequence of the reasoning steps.

(3) Please give your ‘Thought’ first, then the ‘Action’, and finally the ‘Observation’, follow the format as shown in the in-context examples below.

(4) In your ‘Thought’, you should perform reflection when necessary to ensure the correctness of your reasoning process, such as: “Wait! Maybe I made some mistakes! I need to rethink from scratch”, “Alternatively, we can…”, “Hold on, let’s try another approach”, etc.

(5) Give your ‘Action’ in the form of function call, as shown in in-context examples below.

(6) You should not provide information based on your own knowledge, only use the information provided in the context.

Some example of reflection text:

“There is no enough information from the previous steps. I need to plan my query again.”

“No context found from observation. Let me restart the reasoning process.”

“Missing information. Let me restructure my query.”

“Wait! Maybe I made some mistakes! I need to rethink from scratch.”

“I think I need to take a step back and reconsider my approach.”

“I need to reevaluate my reasoning process. Let’s start over.”

“I need to reflect on my reasoning. Let’s try a different approach.”

{More examples of reflection text. Simplified for readability}

In-Context Example:

{Some in-context examples}

Instruction prompt P for fine-tuning and inference with ReaRAG

Your task is to solve a question answering task. To improve your solving accuracy, please conduct reasoning process interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action are in the form of function, there are two types:
Available Functions:
(1) Search
{
    "name": "search",
    "description": "It can help you find useful information through the internet or local
        knowledge base. You can use this tool to access external knowledge.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "description": "what you want to search"
            }
        },
        "required": ["query"]
    }
}
(2) Finish
{
    "name": "finish",
    "description": "You can use this function to make a conclusion from the reasoning process
        and give the final answer. The reasoning process is completed after this 'finish'
        function is called",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "description": "the final answer"
            }
        },
        "required": ["answer"]
    }
}
Please follow the format strictly.
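To make the control flow concrete, the sketch below shows one way the Thought/Action/Observation loop driven by this prompt could be run at inference time: each step's output is parsed for a search or finish call, search queries are executed against a retrieval backend, and the result is fed back as the next observation. `generate`, `rag_search`, and `MAX_STEPS` are assumed placeholders, not names from the released code, and the final short answer would in practice be derived with the answer prompt P_ans listed next; both simplifications are deliberate.

```python
# Hedged sketch of the iterative inference loop (assumptions: generate() wraps the
# fine-tuned model, rag_search() wraps the RAG engine, MAX_STEPS is an assumed cap).
import ast
import re

MAX_STEPS = 10  # assumed upper bound on the reasoning chain length

def extract_action(step_output: str) -> dict:
    """Parse the trailing `Action t: {...}` function call emitted by the model."""
    match = re.search(r"Action \d+:\s*(\{.*\})", step_output, re.DOTALL)
    return ast.literal_eval(match.group(1)) if match else {}

def answer_question(question: str, instruction: str, generate, rag_search) -> str:
    context = f"{instruction}\n\nQuestion: {question}\n"
    for step in range(1, MAX_STEPS + 1):
        step_output = generate(context)            # model emits Thought t and Action t
        action = extract_action(step_output)
        if action.get("function") == "finish":     # reasoning chain is complete
            return action["parameters"]["answer"]
        if action.get("function") == "search":     # execute the query against the RAG engine
            observation = rag_search(action["parameters"]["query"])
        else:                                      # malformed call: feed back an error message
            observation = "Invalid action. Please follow the function calling format."
        context += f"{step_output}\nObservation {step}: {observation}\n"
    return "No final answer within the step limit."
```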

Answer prompt P_ans to derive final answer

<|system|>

You are a QA assistant. Always return a short answer. Output ONLY the answer with no extra words.<|user|>

You have conducted multiple searches to seek for informations to answer the Question.

Searches

{Previous search query and search results}

Instruction

Answer the Question based on the searches. Only output the final answer. Do NOT add any explanation, punctuation, or extra words.

Question: {question}
Answer: <|assistant|>

Judgement prompt P_judge for ACC_L metrics on all benchmarks except for FanOutQA

You are asked to evaluate the quality of the AI assistant’s answer to user questions as an impartial judge, and your evaluation should take into account factors including correctness (high priority), and comprehensiveness (whether the assistant’s answer covers all points).

Read the AI assistant’s answer and compare against the reference answer, and give an overall integer rating in 1, 2, 3 (1 = wrong or irrelevant, 2 = partially correct, 3 = correct and comprehensive) based on the above principles, strictly in the following format: “[[rating]]”, e.g. “[[2]]”.

[Reference answer]
{Ground truth}
[Assistant’s answer]
{Prediction}
Rating:
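As an assumed post-processing step (not specified in the paper), the rating can be recovered from the `[[rating]]` marker that the judge prompt above requests, for example:

```python
import re

def parse_rating(judge_output: str) -> int | None:
    """Return the integer inside the last [[...]] marker, e.g. '[[2]]' -> 2."""
    matches = re.findall(r"\[\[([123])\]\]", judge_output)
    return int(matches[-1]) if matches else None  # None if the judge ignored the format
```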

Judgement prompt P_judge for ACC_L metrics on FanOutQA benchmarks

[BEGIN DATA]
**************
[Question]: {Question}
*************
[Expert]: {Ground truth}
**************
[Submission]: {Prediction}
**************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.

The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. First, write out your reasoning in a step-by-step manner to carefully evaluate the factual content. Make sure your conclusion is supported by detailed reasoning. Avoid simply stating the correct choice immediately. After completing your reasoning, output only the final selected letter (A, B, C, D, E, or F). Format your output exactly as follows: wrap the letter inside <answer> and </answer> tags. Important: Do not include any extra words, quotation marks, punctuation, or explanations - only the letter inside the tags.
Here are the available choices:

(A) The submitted answer is a subset of the expert answer and is fully consistent with it.

(B) The submitted answer is a superset of the expert answer and is fully consistent with it.

(C) The submitted answer contains all the same details as the expert answer.

(D) There is a disagreement between the submitted answer and the expert answer.

(E) The answers differ, but these differences don’t matter from the perspective of factuality.

(F) The submitted answer does not answer the question or is otherwise invalid.
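Similarly, for the FanOutQA judge above, the selected letter can be pulled out of the `<answer>...</answer>` tags. The snippet below is only an assumed extraction helper; how letters map to ACC_L scores is not stated in the prompt, so that mapping is deliberately left out.

```python
import re

def parse_choice(judge_output: str) -> str | None:
    """Return the letter (A-F) wrapped in <answer>...</answer>, or None if absent."""
    match = re.search(r"<answer>\s*([A-F])\s*</answer>", judge_output)
    return match.group(1) if match else None
```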

Instruction prompts P_ablat for data construction to fine-tune ReaRAG w/o reasoning

You are given Question, Ground-truth answer, and Decompositions, your task is to give reasoning process interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action are in the form of function, there are two types:
(1) search
{
    "name": "search",
    "description": "It can help you find useful information through the internet or local
        knowledge base. You can use this tool to access external knowledge.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "description": "what you want to search"
            }
        },
        "required": ["query"]
    }
}
(2) finish
{
    "name": "finish",
    "description": "You can use this function to make a conclusion from the reasoning process
        and give the final answer. The reasoning process is completed after this 'finish'
        function is called",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "description": "the final answer"
            }
        },
        "required": ["answer"]
    }
}
After ‘finish’ function being called, the reasoning process is completed, then marked with <|completed|>. Please follow the format strictly.
Here are in-context examples:

Question:
What was the wettest year in the second largest city in the state where Yuma’s Library District is located?

Ground-truth answer:

1905
Decompositions:
[
    {"question": "Yuma located in the administrative territorial entity",
        "answer": "Yuma County",
        "context": "Some context" # Simplified for readability},
    ... # Simplified for readability
    {"question": "Tucson's wettest year?",
        "answer": "1905",
        "context": "Some context" # Simplified for readability}
]
Reasoning process with function call:

{Reasoning process example}

C Case study

# Case Study 1: Where was the author of Hannibal and Scipio educated at?
Ground truth: Exeter College
Search-o1

Alright, I've got this question: ... Simplified here.
So, I'll do a search for "Hannibal and Scipio book."
<|begin_search_query|>Hannibal and Scipio book<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|>
Hmm, that didn't turn up anything. Maybe the book has a different title. Let me think differently.
... Simplified here. Search-o1 reasoning diverges and hence unable to find related information.
Only search queries and results are listed below for simplicity.
<|begin_search_query|>Punic Wars book<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|>
<|begin_search_query|>Hannibal vs. Scipio book<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|>
<|begin_search_query|>Adrian Goldsworthy books<|end_search_query|>
<|begin_search_result|>No helpful information found. Thoughts: After carefully reviewing the three web pages provided, it is evident that none of them ... Simplified here. <|end_search_result|>
<|begin_search_query|>Robert F. Pennell education<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|>
... Simplified here. Search-o1 tries many unrelated search queries, demonstrating divergent reasoning.
I've reached the maximum number of searches. Based on the information I have, I need to make an educated guess. ... Simplified here. Maximum number of searches reached.
Search-o1 tries to make a guess when there is not enough information to answer the multi-hop question.
Considering that Oxford and Cambridge are traditional centers for classical studies in the UK, I'll go with Oxford.
Therefore, my answer is:
Oxford
ReaRAG-9B

Thought 1: So I need to find out where the author of "Hannibal" and "Scipio" was educated. First, I need to know who the author is.
Action 1: {'function': 'search', 'parameters': {'query': 'Who is the author of Hannibal and Scipio?' }}
Observation 1: The author of "Hannibal and Scipio" is Thomas Nabbes.

Thought 2: Okay, now that I know the author is Thomas Nabbes, I need to find out where he was educated.
Action 2: {'function': 'search', 'parameters': {'query': 'Where was Thomas Nabbes educated?' }}
Observation 2: Thomas Nabbes was educated at Exeter College, Oxford.

Thought 3: Great, so Thomas Nabbes studied at Exeter College in Oxford. I should make sure that this the correct information.
Action 3: {'function': 'search', 'parameters': {'query': 'Confirm the educational background of Thomas Nabbes'}}
Observation 3: Thomas Nabbes was educated at Exeter College, Oxford in 1621. He left the university without taking a degree.

Thought 4: So, he studied at Exeter College in Oxford, but didn't finish his degree. That's interesting. I think this is the information I need.
Action 4: {'function': 'finish', 'parameters': {'answer': 'Thomas Nabbes was educated at Exeter College, Oxford.' }}
Final Answer: Exeter College, Oxford
Table 5: We compare the generations of Search-o1 and ReaRAG. During the initial searches, the Reason-in-Documents module of Search-o1 returns "No helpful information found" because it attempts to answer the multi-hop question using information retrieved based on the search query rather than the original question. Hence, the retrieved content lacks the full answer, misleading Search-o1's reasoning. Ultimately, it reaches its search limit and fails to provide the correct answer. In contrast, ReaRAG demonstrates logical reasoning and correctly answers the query.
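To make the control flow in these transcripts concrete, here is a minimal sketch of the Thought-Action-Observation loop they illustrate. It is not the authors' implementation: the names `generate_step`, `search_fn`, and the `MAX_STEPS` bound are hypothetical placeholders standing in for the reasoning model, the RAG engine, and the upper bound on the reasoning chain length.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 10  # assumed upper bound on the reasoning chain length


def answer_question(
    question: str,
    generate_step: Callable[[List[str]], Tuple[str, Dict]],  # hypothetical: returns (thought, action)
    search_fn: Callable[[str], str],                          # hypothetical RAG engine: returns an observation
) -> str:
    """Iterate Thought -> Action -> Observation until a finish action is chosen."""
    history: List[str] = [f"Question: {question}"]
    for step in range(1, MAX_STEPS + 1):
        thought, action = generate_step(history)
        history.append(f"Thought {step}: {thought}")
        history.append(f"Action {step}: {action}")

        if action["function"] == "finish":
            # The finish action carries the final answer and terminates the loop.
            return action["parameters"]["answer"]

        # A search action queries the RAG engine; the result is fed back as an
        # observation that conditions the next reasoning step.
        observation = search_fn(action["parameters"]["query"])
        history.append(f"Observation {step}: {observation}")

    return "No answer found within the step budget."
```

The contrast with the Search-o1 trace above is that each observation answers only the current sub-query, so the next thought can build on it instead of trying to resolve the full multi-hop question from a single retrieval.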
# Case Study 2: What record label did the person who is part of The Bruce Lee Band start?
Ground truth: Asian Man Records
Search-o1

Alright, so I need to find out what record label the person who is part of The Bruce Lee Band ... Simplified here.
So, I'll do a search for "The Bruce Lee Band."
<|begin_search_query|>The Bruce Lee Band<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|>
Hmm, that's strange. Maybe the band name is different or not well-known. Perhaps I should try searching for "Bruce Lee tribute band" or something like that.
... Simplified here. Search-o1 tries different searches, but Reason-in-Documents returns "No helpful information"
Only search queries and search results are listed here for simplicity
<|begin_search_query|>Bruce Lee tribute band<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|>
<|begin_search_query|>Bruce Lee band<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|>
... Simplified here.
<|begin_search_query|>The Bruce Lee Band members<|end_search_query|>
<|begin_search_result|>Less Than Records, started by a member of Less Than Jake, who is part of The Bruce Lee Band. Explanation: Based on the information from Web Page 1, The Bruce Lee Band features Less Than Jake as part of its backing band. Less Than Jake is known for starting their own record label, Less Than Records.
... Simplified here. Search-o1 fabricates the name "Less Than Records"
<|begin_search_query|>Less Than Jake record label<|end_search_query|>
<|begin_search_result|>No helpful information found. Explanation: After reviewing the three web pages provided for the search query "Less Than Jake record label," none of them contained specific information about which member of Less Than Jake started Less Than Records. ... Simplified here. <|end_search_result|>
... Simplified here. Less Than Records was started by a member of Less Than Jake, who is also part of The Bruce Lee Band. Therefore, the record label started by the person who is part of The Bruce Lee Band is Less Than Records.
Final Answer
Less Than Records
ReaRAG-9B

Thought 1: I need to find out which record label the person who is part of The Bruce Lee Band started. First, I should figure out who is part of The Bruce Lee Band.
Action 1: {'function': 'search', 'parameters': {'query': 'Who is part of The Bruce Lee Band?' }}
Observation 1: Mike Park and his backing band, which has included Less Than Jake and the Rx Bandits, are part of The Bruce Lee Band.

Thought 2: So, Mike Park is involved with The Bruce Lee Band. I need to find out if he started a record label.
Action 2: {'function': 'search', 'parameters': {'query': 'Did Mike Park start a record label?' }}
Observation 2: Yes, Mike Park started a record label.

Thought 3: Great, now I need to find out the name of the record label he started.
Action 3: {'function': 'search', 'parameters': {'query': 'What record label did Mike Park start?' } }
Observation 3: Mike Park started the record label Asian Man Records.

Thought 4: So, Asian Man Records is the label Mike Park started. But I should make sure that this is correct and see if there's any other relevant information.
Action 4: {'function': 'search', 'parameters': {'query': 'Confirm if Asian Man Records was started by Mike Park.' }}
Observation 4: Yes, Asian Man Records was started by Mike Park.

Thought 5: Alright, I have confirmed that Mike Park started Asian Man Records. Since he is part of The Bruce Lee Band, this should be the answer.
Action 5: {'function': 'finish', 'parameters': {'answer': 'Asian Man Records'}}
Final Answer: Asian Man Records
Table 6: We compare the generations of Search-o1 and ReaRAG. Search-o1's Reason-in-Documents module fabricates non-existent information ("Less Than Records", highlighted in red), compromising factual accuracy and leading to a hallucinated response. In contrast, ReaRAG successfully conducts knowledge-guided reasoning to generate a factual answer.
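The transcripts log each ReaRAG step as an `Action N: {...}` line whose payload is a Python-style dict with a `function` drawn from the two-action space (`search`, `finish`) and its `parameters`. As a hedged illustration only (the regular expression and helper name below are assumptions, not the paper's decoding code), such a line could be parsed and dispatched like this:

```python
import ast
import re

# Hypothetical parser for "Action N: {...}" lines as printed in the case studies.
ACTION_RE = re.compile(r"Action\s+\d+:\s*(\{.*\})\s*$", re.DOTALL)


def parse_action(line: str) -> dict:
    """Return the action dict, e.g. {'function': 'search', 'parameters': {...}}."""
    match = ACTION_RE.search(line)
    if match is None:
        raise ValueError(f"No action payload found in: {line!r}")
    # literal_eval safely evaluates the single-quoted dict literal without exec.
    return ast.literal_eval(match.group(1))


line = "Action 3: {'function': 'search', 'parameters': {'query': 'What record label did Mike Park start?'}}"
action = parse_action(line)
assert action["function"] == "search"
print(action["parameters"]["query"])  # -> What record label did Mike Park start?
```

Anything outside the `search`/`finish` action space can then be treated as a decoding error rather than silently executed, which is one way to keep the reasoning chain well-formed.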
# Case Study 3: What is the scientific name of an amphibian that is listed as endangered by the Canadian government (as of 2024), and its only population in Canada occurs on an island which is the southernmost inhabited part of Canada?
Ground truth: Ambystoma texanum
R1-Searcher
To solve this question, I need to find the scientific name of an amphibian that is listed as endangered by the Canadian government and has its only population in Canada on an island that is the southernmost inhabited part of Canada.
Pelee, Ontario:
Pelee is the southernmost inhabited part of Canada and is situated on two major migratory bird routes, the Atlantic Flyway and the Mississippi flyway. The
Extreme points of Canada:
The southernmost point is Middle Island, in Lake Erie, Ontario (41°41′N 82°40′W); the southernmost water point lies just south of the island, on the Ontario-
Middle Island (Lake Erie):
Middle Island is a small island, just 18.5 hectares (46 acres) in area. [1] It is the southernmost point of land in Canada, located at 41°41′N, 82°41′W.
Extreme points of Canadian provinces:
British Columbia, Yukon and Northwest Territories border ... Simplified here.
Southern Ontario:
Southern Ontario is a primary region of the Canadian province of Ontario. ... Simplified here. <|end_of_documents|>
The keyword "southernmost" appears in multiple documents, but only Pelee, Ontario is correct as the southernmost inhabited part of Canada. R1-Searcher fails to identify it accurately.
From the search results, I can see that Middle Island is the southernmost inhabited part of Canada.
**Step 2: Determine which amphibian is listed as endangered by the Canadian government and has its only population on
I'll use a query to find out:
<|begin_of_query|>amphibian listed as endangered by the Canadian government with its only population on Middle
Island, Ontario<|end_of_query|> Wrong query due to earlier mistakes. Middle Island is not the correct place.
<|begin_of_documents|>
List of amphibians of Canada:
List of amphibians of Canada ... Simplified here.
List of Wildlife Species at Risk (Canada):
As of November 2024, ... Simplified here.
List of endangered amphibians:
Of all evaluated amphibian species, ... Simplified here.
[PDF] Midland Painted Turtle - Species at risk public registry:
Painted Turtle is recognized as a species of Least Concern ... Simplified here.
American burying beetle (Nicrophorus americanus): COSEWIC :
The American Burying Beetle ... Simplified here. <|end_of_documents|>
From the search results, I can see that there is no specific amphibian listed as endangered by the Canadian government with its only population on Middle Island, Ontario. However, I will focus on amphibians that are endangered and have populations in Canada.</think>
ReSearch

<think> To find the scientific name of the amphibian that is listed as endangered by the Canadian government and is only found on the southernmost inhabited island of Canada, I need to first identify the island and then find the amphibian species that is endangered there. Let's start by identifying the southernmost inhabited island of Canada.</think>
<search> search query: What is the southernmost inhabited island of Canada?</search> <result>

Extreme points of Canada:

The southernmost point is Middle Island, in Lake Erie, Ontario (41°41′N 82°40′W); the southernmost water point lies just south of the island, on the Ontario

Middle Island (Lake Erie):

Middle Island is a small island, just 18.5 hectares (46 acres) in area. [1] It is the southernmost point of land in Canada, located at 41°41′N, 82°41′W.

Pelee, Ontario:

At 42 km² (16 sq mi), Pelee Island is the largest island in Lake Erie and the southernmost populated point in Canada. An Ontario Historical Plaque was erected

Ontario:

Ontario is the southernmost province of Canada. … Simplified here.

Newfoundland (island):

Newfoundland is a large island within the Canadian province … Simplified here.

Similar to R1-Searcher, ReSearch mistakenly identified Middle Island as the southernmost inhabited island, leading to a subsequent incorrect query.

</result><think> From the information provided, the southernmost inhabited island of Canada is Middle Island in Lake Erie, Ontario. Now, I need to find out which amphibian species is listed as endangered on this island.</think>

<search> search query: Which amphibian is endangered on Middle Island, Ontario?</search> <result>

List of amphibians of Canada:

List of amphibians of Canada ; EX - Extinct, EW - Extinct in the Wild ; CR - Critically Endangered, EN - Endangered, VU - Vulnerable ; NT - Near Threatened, LC -

List of endangered amphibians:

As of September 2021, the International Union for Conservation of Nature (IUCN) lists 1085 endangered amphibian species.

List of Wildlife Species at Risk (Canada): Threatened amphibians • Ambystoma jeffersonianum — Jefferson salamander - Bufo fowleri - Fowler’s toad • Desmognathus ochrophaeus - Allegheny Mountain dusky

… Simplified here. No related information about endangered species on Middle Island.

</result><think> The information provided does not directly mention the amphibian species that is endangered on Middle Island. I need to narrow down the search to amphibians that are specifically listed as endangered in Canada and are found on Middle Island.</think>

<search> search query: Endangered amphibian on Middle Island, Ontario, Canada</search> <result>

List of amphibians of Canada:

This is a list of amphibians of Canada. Conservation status - IUCN Red List of Threatened Species: EX - Extinct, EW Extinct in the Wild: CR - Critically

List of endangered amphibians:

As of September 2021, the International Union for Conservation of Nature (IUCN) lists 1085 endangered amphibian species.

Middle Island (Lake Erie):

Middle Island is a small island, just 18.5 hectares (46 acres) in area. [1] It is the southernmost point of land in Canada, located at 41°41′N, 82°41′W.

… Simplified here. No related information about endangered species on Middle Island.

</result><think> After reviewing the information, I realize that the specific amphibian species that is listed as endangered on Middle Island, Ontario, Canada, is the Jefferson salamander (Ambystoma jeffersonianum). This amphibian is the only one that fits the criteria of being endangered and only found on this specific island.</think>

<think> Now that I have identified the amphibian species, I can provide the scientific name.</think>

<answer> Ambystoma jeffersonianum </answer>

ReaRAG-9B

Thought 1: So I need to find the scientific name of an endangered amphibian in Canada, and its only population in Canada is on an island that’s the southernmost inhabited part of Canada. Okay, first things first, I need to know what the southernmost inhabited island in Canada is.

Action 1: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘What is the southernmost inhabited island in Canada?’ }}

Observation 1: The southernmost inhabited island in Canada is Pelee Island, Ontario.

Thought 2: Alright, so the island is Pelee Island in Ontario. Now, I need to find an endangered amphibian that’s only found there in Canada. Maybe there are some lists of endangered species in Canada that I can look at.


Action 2: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘List of endangered amphibians in Canada’} }

Observation 2: I am sorry, but I cannot provide a specific list of endangered amphibians in Canada based on the given context. The context provided does not include a list of endangered amphibians in Canada.

Thought 3: Hmm, it seems like I need to find out which amphibians are endangered in Canada and see if any of them are only found on Pelee Island. Maybe I should look for information on endangered species in Ontario, since Pelee Island is in Ontario.

Action 3: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘Endangered amphibians in Ontario’}}

Observation 3: The endangered amphibians in Ontario mentioned in the context are: 1. Jefferson salamander (Ambystoma jeffersonianum) … 2. Northern dusky salamander (Desmognathus fuscus) …

Both species are actually not found on Pelee Island. ReaRAG later verifies this information.

Thought 4: So, there are a couple of endangered amphibians in Ontario, but I need to find out if any of them are only found on Pelee Island. Maybe I should look into the distribution of these species.

Action 4: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘Distribution of Jefferson salamander in Ontario’ } }

Observation 4: The distribution of the Jefferson salamander in Ontario is primarily associated with the Niagara Escarpment and Carolinian forest regions.

Thought 5: Pelee Island is part of the Lake Erie Islands, which are different from the Niagara Escarpment and Carolinian forest regions. So, perhaps the Jefferson salamander isn’t found there. Maybe I should check the distribution of the Northern dusky salamander.

Action 5: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘Distribution of Northern dusky salamander in Ontario’}}

Observation 5: The Northern dusky salamander (Desmognathus fuscus) is restricted to a small area in Ontario. Specifically, it can be found in the Niagara Gorge.

Thought 6: Again, Niagara Gorge is different from Pelee Island. Maybe there’s another endangered amphibian in Ontario that’s only found on Pelee Island. Perhaps I need to look for amphibians that are specific to Pelee Island.

Now, ReaRAG performs a more specific search.

Action 6: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘Endangered amphibians specific to Pelee Island, Ontario’}}

Observation 6: Small-mouth salamander.

Thought 7: So, the small-mouth salamander is found on Pelee Island, but I need to confirm if it’s endangered in Canada.

Action 7: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘Is the small-mouth salamander endangered in Canada?’ }}

Observation 7: Yes, the small-mouth salamander is listed as endangered in Canada.

Thought 8: Great, so now I have the scientific name, Ambystoma maculatum, which is the small-mouth salamander. But wait, I need to make sure that it’s only found on Pelee Island in Canada.

Action 8: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘Distribution of small-mouth salamander in Canada’}}

Observation 8: The distribution of the small-mouth salamander (Ambystoma texanum) in Canada is restricted solely to Pelee Island. The entire Canadian range of this species is only on Pelee Island.

Thought 9: Wait a minute, in my earlier steps, I thought it was Ambystoma maculatum, but now the search is showing Ambystoma texanum. There might be some confusion here.

Despite earlier mistakes, ReaRAG manages to resolve the confusion by taking the correct search action.

Action 9: {‘function’: ‘search’, ‘parameters’: {‘query’: ‘Scientific name of small-mouth salamander’}}

Observation 9: The scientific name of the small-mouth salamander is Ambystoma texanum.

Thought 10: Okay, so it’s Ambystoma texanum, not Ambystoma maculatum. I must have misremembered earlier. So, to confirm, Ambystoma texanum is the small-mouth salamander, and it’s endangered in Canada, and its only population in Canada is on Pelee Island.
Action 10: {‘function’: ‘finish’, ‘parameters’: {‘answer’: ‘Ambystoma texanum.’}}

Final Answer: Ambystoma texanum

Table 7: This case study illustrates the poor generalization of R1-Searcher on FRAMES and FanOutQA. Both R1-Searcher and ReSearch fail to extract the correct information from noisy retrieved documents, mistakenly identifying "Middle Island" as Canada's southernmost inhabited area instead of "Pelee, Ontario", leading to errors in subsequent steps. In contrast, despite early mistakes, ReaRAG successfully realigns its reasoning path, demonstrating greater robustness.

# Case Study 4: In what year was each member of Blackpink born?

Ground truth: Jisoo - 1995, Jennie - 1996, Rosé - 1997, Lisa - 1997

R1-Searcher
To solve this question, I need to find out the birthdates of each member of Blackpink.

Step 1: Identify the members of Blackpink.

I’ll use a query to search for information about the members of Blackpink:

<|begin_of_query|>Blackpink members<|end_of_query|>

<|begin_of_documents|>
Blackpink:
The group is composed of four members: Jisoo, Jennie, Rosé, and Lisa. … Simplified here.

<|end_of_documents|>
From the search results, I can see that the members of Blackpink are Jisoo, Jennie, Rosé, and Lisa.

Step 2: Find the birthdates of each member.

I’ll use a query to search for the birthdates of Jisoo, Jennie, Rosé, and Lisa:

<|begin_of_query|>Jisoo birthdate, Jennie birthdate, Rosé birthdate, Lisa birthdate<|end_of_query|> <|begin_of_documents|>

Blackpink:
Blackpink is a South Korean girl group formed by YG Entertainment. … Simplified here.

Jisoo:
Jisoo ; Born. Kim Ji-soo. (1995-01-03) January 3, 1995 (age 30). … Simplified here.

Rosé (singer):
Roseanne Park (born 11 February 1997), known mononymously as Rosé, is a New Zealand … Simplified here.

Jennie (singer):
Jennie ; Born. Jennie Kim. (1996-01-16) January 16, 1996 … Simplified here.

Blackpink - Simple English Wikipedia, the free encyclopedia:

The group consists of members Jisoo, Jennie, Rosé, and Lisa. They are the highest-charting female K-pop act on both Billboard Hot 100 and Billboard 200.

<|end_of_documents|>
From the search results, I can see the following birthdates:
  • Jisoo: January 3, 1995
  • Jennie: January 16, 1996
  • Rosé: February 11, 1997
  • Lisa: Unknown (only age mentioned)
Therefore, the year each member of Blackpink was born is:
  • Jisoo: 1995
  • Jennie: 1996
  • Rosé: 1997
  • Lisa: Unknown </think>
Although Lisa’s birthdate is missing, R1-Searcher proceeds to generate a final answer.

<answer> 1995, 1996, 1997, Unknown </answer>

ReSearch
<think> To find the birth years of each member of Blackpink, I will first search for the individual members of the group and then look for their birth years. I will use the Wikipedia search tool to find this information.</think>

<search> search query: Members of Blackpink</search> <result>

Blackpink:  Blackpink:
The group is composed of four members: Jisoo, Jennie, Rosé, and Lisa. Regarded by various publications as the “biggest girl group in the world”, they are

… Simplified here.
</result><think> Now that I have the names of the members, I will search for their birth years.</think>

<search> search query: Jisoo birth year</search> <result>

Jisoo:
Jisoo ; Born. Kim Ji-soo. (1995-01-03) January 3, 1995 (age 30). Gunpo, South Korea ; Education, School of Performing Arts Seoul ; Occupations. Singer; actress.


Jisoo - Simple English Wikipedia, the free encyclopedia:

Jisoo ; Born, (1995-01-03) January 3, 1995 (age 30). South Korea ; Occupations. Singer; actress; model ; Origin, Seoul, South Korea ; Genres. K-pop • dance-pop

Ji Soo:
Kim Ji-soo (born 30 March 1993), professionally known as Ji Soo, is a South Korean actor currently based in the Philippines.

Kim Ji-soo (actress, born 1972):

Kim Ji-soo is a South Korean actress. Kim Ji-soo. Kim Ji-soo at the 5th World Ceramic Biennale press conference in Korea (2009). Born. Yang Sung-yoon.

Kim Ji-soo (singer, born 1990):

Kim Ji-soo (singer, born 1990) ; (1990-02-05) February 5, 1990 (age 35) South Korea • Pop; folk • Singer-songwriter; musician.

</result><think> I have found the birth year of Jisoo. Now I will search for the birth years of the other members.</think>

<search> search query: Jennie birth year</search> <result>

Jennie (singer):
Jennie ; Born. Jennie Kim. (1996-01-16) January 16, 1996 (age 29). Seongnam, Gyeonggi, South Korea ; Other names, Jennie Ruby Jane ; Occupations. Singer; rapper

Jennie (singer) - Simple English Wikipedia, the free encyclopedia:

Jennie Kim (born January 16, 1996), known as Jennie is a South Korean singer and rapper. She is a member of the South Korean girl

Jennie Baranczyk:

Jennie Lillis Baranczyk is an American basketball coach who is the current head coach of the University of Oklahoma women’s basketball team. Jennie

Jennie:
Jennie (film), a 1940 American drama film • Jennie (musical), a 1963 Broadway production • Jennie (novel), a 1994 science fiction thriller by Douglas Preston

Jennie Garth:

Jennie Garth ; Garth in 2023 ; Born. Jennifer Eve Garth. (1972-04-03) April 3, 1972 (age 53). Urbana, Illinois, U.S. ; Occupation(s), Actress, fashion designer.

</result><think> I have found the birth year of Jennie. Now I will search for the birth years of the other members.</think>

<search> search query: Rosé birth year</search> <result>

Rosé (singer):

Rosé. MBE ; Born. Roseanne Park. (1997-02-11) 11 February 1997 (age 28). Auckland, New Zealand ; Other names, Park Chae-young ; Citizenship. South Korea • New

Rosé (singer) - Simple English Wikipedia, the free encyclopedia:

Rosé was born in New Zealand on 11 February 1997. At the age of seven, she moved to Australia. In 2012, she passed a YG Entertainment audition at the age of

Rose (French singer):

Keren Meloul (born May 24, 1978), better known as Rose, is a French singer, songwriter, author and composer. Rose. Rose performing in 2008.

Rosé discography:

Rosé began her career as a member of girl group Blackpink in August 2016. She released her debut solo single album R through YG Entertainment and Interscope

Rose Falcon:

Rose Falcon (born May 2, 1984) is an American singer and songwriter. Falcon’s songs have been recorded by Faith Hill, Lady A, Day of Fire, Eric Paslay,

</result><think> I have found the birth year of Rosé. Now I will search for the birth year of the last member.</think> <search> search query: Lisa birth year</search> <result>

Lisa (rapper):

Lisa in 2024 ; Born. Pranpriya Manobal. (1997-03-27) March 27, 1997 (age 28). Buriram, Thailand ; Occupations. Rapper; singer; dancer; actress; songwriter.

Lisa:

Lisa (Japanese musician, born 1974), stylized “LISA”; Lisa, stagename of Japanese singer Lisa Komine (born 1978); Lisa (South Korean singer) (born 1980); Lisa

Lisa Ajax:
Birth name, Lisa Kristina Ajax ; Born, (1998-08-13) 13 August 1998 (age 26) Järfälla, Sweden ; Genres, Pop ; Years active, 2012-present ; Labels, Capitol Records

Lisa Collier Cool:
Lisa Collier Cool (born July 6, 1952) is an American journalist and author who writes on health-related topics. Lisa Collier Cool. Born, July 6, 1952.

Lisa (Japanese musician, born 1987):

Born. Risa Oribe. (1987-06-24) June 24, 1987 (age 37). Seki, Gifu, Japan; Occupations. Singer; songwriter; Years active, 2005-present; Spouse. Tatsuhisa Suzuki.

</result><think> I have found the birth year of Lisa. Now I can compile the birth years of all the members of Blackpink.</think>

<think> Having found the birth years of each member, I can now provide the answer.</think>

ReSearch fails to reason over previous steps and provide an accurate answer due to noisy information.

<answer> 2018 </answer>
ReaRAG-9B
Thought 1: I need to find out the birth years of all four members of Blackpink. I should probably look up their names individually to get their birthdates.

Action 1: {'function': 'search', 'parameters': {'query': 'Blackpink members and their birth years'}}

Observation 1: Blackpink members and their birth years are as follows: - Jisoo: 1995
  • Jennie: 1996
  • Rosé: 1997
  • Lisa: 1997
Thought 2: Wait, I should verify this information because sometimes birthdates can be confusing, especially if there are different sources with different dates.

Action 2: {'function': 'search', 'parameters': {'query': 'Confirm birth years of Blackpink members'}}

Observation 2: The birth years of the Blackpink members are as follows: - Jisoo: 1995
  • Jennie: 1996
  • Rosé: Not provided in the context
  • Lisa: 1997
Thought 3: Hmm, it seems like there’s a gap in the information for Rosé. I need to find her birth year specifically.
思考 3:嗯,似乎 Rosé的信息还有些缺失。我需要具体找出她的出生年份。

Action 3: {'function': 'search', 'parameters': {'query': 'Rosé Blackpink birth year'}}

Observation 3: 1997
Thought 4: Okay, so Rosé was born in 1997. Now, I should make sure that all the birth years are correct and consistent across different sources.

Action 4: {'function': 'search', 'parameters': {'query': 'Cross-verify birth years of Blackpink members'}}

Observation 4: The birth years of the Blackpink members are as follows: - Jisoo: January 3, 1995
  • Jennie: January 16, 1996
  • Rosé: February 11, 1997
  • Lisa: March 27, 1997
Thought 5: Great, now I have the birth years along with the birthdates for each member. This should be sufficient to answer the question.

Action 5: {'function': 'finish', 'parameters': {'answer': 'Jisoo was born in 1995, Jennie in 1996, Rosé in 1997, and Lisa in 1997.'}}
Final Answer: Jisoo: 1995, Jennie: 1996, Rosé: 1997, Lisa: 1997

Table 8: This case study illustrates the poor generalization of R1-Searcher and ReSearch on the FRAMES and FanOutQA benchmarks. Specifically, R1-Searcher fails to resolve missing information about Lisa’s birthdate and does not take additional steps to complete the reasoning process. ReSearch, on the other hand, constructs a noisy reasoning chain, leading to an incorrect answer. In contrast, despite also facing missing information (Rosé’s birthdate), ReaRAG performs additional verification and successfully recovers the correct answer, demonstrating greater robustness.
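For readers who want to map the ReaRAG-9B transcript above onto the control flow it follows, the sketch below shows one minimal way the Thought-Action-Observation loop could be driven. The helper names (`model.generate_step`, `rag_engine.search`) and the value of `MAX_STEPS` are illustrative assumptions; only the two-action space (search / finish) and the existence of an upper bound on the reasoning chain length follow the paper.

```python
# Minimal sketch of the Thought -> Action -> Observation loop shown in the
# ReaRAG-9B transcripts. `model.generate_step` and `rag_engine.search` are
# hypothetical stand-ins; only the action space {"search", "finish"} and the
# cap on reasoning-chain length mirror the paper's description.

MAX_STEPS = 10  # illustrative upper bound on the reasoning chain length


def answer_question(question, model, rag_engine):
    history = [{"role": "question", "content": question}]
    for _ in range(MAX_STEPS):
        # The model emits a free-form thought plus a structured action,
        # e.g. {'function': 'search', 'parameters': {'query': '...'}}.
        thought, action = model.generate_step(history)
        history.append({"role": "assistant", "thought": thought, "action": action})
        if action["function"] == "finish":
            return action["parameters"]["answer"]
        # Otherwise the query is run against the RAG engine and the result is
        # appended as the next observation, guiding later reasoning steps.
        observation = rag_engine.search(action["parameters"]["query"])
        history.append({"role": "observation", "content": observation})
    return None  # chain length exhausted without a finish action
```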
# Case Study 5: In what county is William W. Blair's birthplace located?
Ground truth: Orleans County
Search-o1

... Simplified here. Maybe I can search for "Holley, New York county" to find out which county it's part of.
<|begin_search_query|>Holley, New York county<|end_search_query|>
<|begin_search_result|>Holley is located in Orleans County, New York.<|end_search_result|>
... Simplified here. But wait, let me double-check to make sure that Holley is indeed in Orleans County. Sometimes place names can be ambiguous, and there might be multiple Holleys in different counties or states.
Let me search for "Holley, New York county" again to confirm.
<|begin_search_query|>Holley, New York county<|end_search_query|>
<|begin_search_result|>You have searched this query. Please refer to previous results.<|end_search_result|>
Alright, since the previous search already indicated that Holley is in Orleans County, New York, and no other information contradicts that, I can confidently say that William W. Blair was born in Orleans County.
However, to be thorough, perhaps I should verify if there are any other places named Holley in New York or elsewhere that might be confused with the one where William W. Blair was born.
Let me search for "Holley, New York" to see if there's only one such place.
<|begin_search_query|>Holley, New York<|end_search_query|>
<|begin_search_result|>Holley is a village ... Simplified here. <|end_search_result|>
But just to cover all bases, maybe there's another Holley in New York state that I should consider. Let me search for "New York state places named Holley" to see if there are multiple locations with that name.
<|begin_search_query|>New York state places named Holley<|end_search_query|>
<|begin_search_result|>Holley is a village ... Simplified here. <|end_search_result|>
Search-o1 continues to execute excessive and unnecessary searches. Only search queries are listed here for simplicity.
<|begin_search_query|>William W. Blair historian<|end_search_query|>
<|begin_search_query|>William W. Blair Latter Day Saints<|end_search_query|>
<|begin_search_query|>William W. Blair birth certificate Orleans County<|end_search_query|>
<|begin_search_query|>William W. Blair biography<|end_search_query|>
Perfect, so this biography confirms that he was born in Holley, Orleans County, New York. This aligns with all the previous information I've found.
Therefore, I can confidently say that William W. Blair's birthplace is located in Orleans County. ... Simplified here.
Final Answer
Orleans County
ReaRAG-9B

Thought 1: I need to find out in which county William W. Blair was born. First, I should figure out where he was born, meaning his birthplace city or town.
Action 1: {'function': 'search', 'parameters': {'query': 'Where was William W. Blair born? '}}
Observation 1: William W. Blair was born in Holley, New York.

Thought 2: Okay, so now I know that his birthplace is Holley, New York. Next, I need to determine which county Holley is located in.
Action 2: {'function': 'search', 'parameters': {'query': 'Which county is Holley, New York in?' } }
Observation 2: Holley, New York is in Orleans County.

Thought 3: So, if Holley is in Orleans County, then that should be the answer. But I should double-check if there's any ambiguity or if there are multiple places named Holley in New York.
Action 3: {'function': 'search', 'parameters': {'query': 'Are there multiple Holleys in New York State?' }}
Observation 3: Based on the provided context, there is only one Holley mentioned, which is in Orleans County, New York. The context does not provide information about other Holleys in New York State. Therefore, based on the given context, I cannot confirm whether there are multiple Holleys in New York State.

Thought 4: Given that, and considering the information I have, I think it's safe to conclude that William W. Blair's birthplace, Holley, is in Orleans County, New York.
Action 4: {'function': 'finish', 'parameters': {'answer': 'Orleans County'} }
Final Answer: Orleans County
Table 9: We compare the generations of Search-o1 and ReaRAG. Words highlighted in red indicate instances where Search-o1 overthinks and performs excessive verification. Words highlighted in green show that the correct answer was already found in earlier steps. ReaRAG finds the correct answer in just two searches, whereas Search-o1 requires eleven, generating overly redundant queries that introduce unnecessary computations to multi-hop QA.
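The "You have searched this query" message in the Search-o1 trace above suggests that repeated queries are short-circuited by a cache rather than prevented from being issued at all. A minimal sketch of such de-duplication is given below; the `CachedSearch` wrapper and its interface are assumptions made for illustration, not the documented implementation of either system.

```python
# Illustrative query cache of the kind hinted at by Search-o1's
# "You have searched this query" message; not either system's actual code.

class CachedSearch:
    def __init__(self, rag_engine):
        self.rag_engine = rag_engine  # hypothetical engine exposing .search(query)
        self.cache = {}

    def search(self, query: str) -> str:
        key = query.strip().lower()
        if key in self.cache:
            # A repeated query is answered from the cache instead of
            # triggering another retrieval call.
            return "You have searched this query. Please refer to previous results."
        result = self.rag_engine.search(query)
        self.cache[key] = result
        return result
```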
# Case Study 6: Where is Ulrich Walter's employer headquartered?
Ground truth: Cologne
Search-o1

Alright, I have this question: "Where is Ulrich Walter's employer ... Simplified here.
<|begin_search_query|>Who is Ulrich Walter?<|end_search_query|>
<|begin_search_result|>Ulrich Walter is a German astronaut who was selected in 1987 to join the German astronaut team and trained at the German Aerospace Center (DLR). His employer, the German Aerospace Center, is headquartered in Cologne, Germany.<|end_search_result|>
Hmm, according to this search result, Ulrich Walter is a German astronaut who was selected in 1987 and trained at the German Aerospace Center (DLR). It also says that his employer, the DLR, is headquartered in Cologne, Germany.
... Simplified here. <|begin_search_query|>Headquarters of German Aerospace Center<|end_search_query|>
<|begin_search_result|>The German Aerospace Center (DLR) is headquartered in Cologne, Germany.<|end_search_result|>
So, according to this, the DLR is headquartered in Cologne, Germany.
But just to be thorough, let me see if there is any other information about Ulrich Walter's employer.
<|begin_search_query|>Ulrich Walter workplace<|end_search_query|>
<|begin_search_result|>Ulrich Walter's employer is the German Aerospace Center (DLR), headquartered in Cologne, Germany.<|end_search_result|>
... Simplified here.
I should probably confirm if Ulrich Walter is still active or if he has retired. ... Simplified here.
<|begin_search_query|>Ulrich Walter retirement<|end_search_query|>
<|begin_search_result|>No helpful information found.<|end_search_result|> ... Simplified here.
Alternatively, perhaps he has taken up an academic position at a university.
<|begin_search_query|>Ulrich Walter university affiliation<|end_search_query|>
<|begin_search_result|>Ulrich Walter's employer, the German Aerospace Center (DLR), is headquartered in Cologne, Germany.<|end_search_result|>
... Simplified here. Maybe I should look for his professional profile or LinkedIn page to see his current employer.
<|begin_search_query|>Ulrich Walter LinkedIn<|end_search_query|>
... Simplified here. the answer to the question "Where is Ulrich Walter's employer headquartered?" is Cologne, Germany.
Final Answer
Cologne, Germany

ReaRAG-9B
Thought 1: I need to find out where Ulrich Walter's employer is headquartered. First, I should figure out who Ulrich Walter is and where he works.
Action 1: {'function': 'search', 'parameters': {'query': 'Who is Ulrich Walter and where does he work?' }}
Observation 1: Ulrich Walter is a German astronaut who worked at the German Aerospace Center.

Thought 2: So, Ulrich Walter works at the German Aerospace Center. Now, I need to find out where this center is headquartered.
Action 2: {'function': 'search', 'parameters': {'query': 'Where is the German Aerospace Center headquartered?' }}
Observation 2: The German Aerospace Center (DLR) is headquartered in Cologne, Germany.

Thought 3: Alright, so if DLR is headquartered in Cologne, Germany, then that should be the answer. But maybe I should double-check if Ulrich Walter works there still or if there's any update.
Action 3: {'function': 'search', 'parameters': {'query': 'Is Ulrich Walter still working at DLR?' } }
Observation 3: I cannot provide real-time information or updates on whether Ulrich Walter is still working at the German Aerospace Center (DLR) since my training data only goes up until 2023. To get the most current information on Ulrich Walter's employment status, you would need to check the latest sources or contact the German Aerospace Center directly.

Thought 4: Given that, I should probably proceed with the information I have, which is that Ulrich Walter works at DLR, which is headquartered in Cologne, Germany.
Action 4: {'function': 'finish', 'parameters': {'answer': 'Cologne, Germany'}}
Final Answer: Cologne, Germany
Table 10: We compare the generations of Search-o1 and ReaRAG. Words highlighted in red indicate instances where Search-o1 overthinks and performs excessive verification. Words highlighted in green show that the correct answer was already found in earlier steps. As shown above, ReaRAG arrives at the correct answer in just three searches, while Search-o1 requires eight, producing excessive queries that are redundant in multi-hop QA.
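The search counts reported in Tables 9-10 can be read directly off the traces. As a small illustration, the snippet below counts retrieval calls in a run logged as a list of action dicts in the same format as the ReaRAG transcripts; the trace literal reproduces the Case Study 6 actions, and the logging format itself is an assumption made for illustration.

```python
# Count retrieval calls in a logged trace of action dicts
# (format mirrors the ReaRAG transcripts above; assumed for illustration).

def count_searches(trace):
    return sum(1 for action in trace if action.get("function") == "search")


rearag_case6 = [
    {"function": "search", "parameters": {"query": "Who is Ulrich Walter and where does he work?"}},
    {"function": "search", "parameters": {"query": "Where is the German Aerospace Center headquartered?"}},
    {"function": "search", "parameters": {"query": "Is Ulrich Walter still working at DLR?"}},
    {"function": "finish", "parameters": {"answer": "Cologne, Germany"}},
]

print(count_searches(rearag_case6))  # 3, versus 8 searches for Search-o1 in Table 10
```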

# Case Study 7: Who is the grandmother of Philippe, Duke of Orléans?

Ground truth: Marie de’ Medici

ReaRAG-9B
Thought 1: So I need to find out who the grandmother of Philippe, Duke of Orléans, is. I don’t know much about French royalty, so I’ll need to look this up.

Action 1: {'function': 'search', 'parameters': {'query': 'Who is the grandmother of Philippe, Duke of Orléans?'}}

Observation 1: The grandmother of Philippe, Duke of Orléans (Philippe I) is Anne of Austria. Wrong answer!

Thought 2: Wait, let me make sure about this. I have found that his grandmother is Anne of Austria, but I should verify this information from multiple sources to ensure accuracy.

Action 2: {'function': 'search', 'parameters': {'query': 'Philippe, Duke of Orléans family tree'}}

Observation 2: Philippe, Duke of Orléans, was the younger son of Louis XIII of France and Anne of Austria. His older brother was Louis XIV, famously known as the “Sun King.” Philippe became Duke of Orléans upon the death of his uncle Gaston in 1660. In 1661, he also received the dukedoms of Valois and Chartres.

Thought 3: So, according to this, his grandmother is Anne of Austria. But let me see if there’s any confusion here. Philippe’s parents were Louis XIII and Anne of Austria, so she would be his mother, not grandmother. Wait, that doesn’t make sense. Maybe I need to clarify the generational relationships. ReaRAG notices the mistakes.

Action 3: {'function': 'search', 'parameters': {'query': 'Genealogy of the House of Orléans'}}

Observation 3: The House of Orléans is a cadet branch of the House of Bourbon, which is one of the most prominent royal houses in European history. The Orléans branch originated with Philippe I, Duke of Orléans, who was the younger son of Louis XIII of France and Anne of Austria. Philippe I was the brother of Louis XIV, the “Sun King.” The genealogy of the House of Orléans can be traced through various descendants of Philippe I, including: 1. Philippe II, Duke of Orléans, who was the father of Louise Élisabeth d’Orléans mentioned in Document 351. Philippe II was also known as Philippe I’s grandson, as he was the son of Philippe I’s son, Louis XIV. 2. Louis Philippe, Duke of Orléans, who became King of the French after the July Revolution of 1830. He was the father of François d’Orléans, Prince of Joinville, mentioned in Document 393. These are just a few examples of the descendants of Philippe I, Duke of Orléans, and the House of Orléans continues to have living members today.

Thought 4: Hmm, I think I’m getting confused here. Philippe I, Duke of Orléans, was the son of Louis XIII and Anne of Austria. So, Anne of Austria would be his mother, not his grandmother. His grandmother would be the mother of his parents, which would be Louis XIII’s mother. Let me check who that was. ReaRAG realigns its reasoning direction.
Action 4: {'function': 'search', 'parameters': {'query': 'Who was the mother of Louis XIII of France?'}}

Observation 4: The mother of Louis XIII of France was Marie de’ Medici.

Thought 5: So, if Marie de’ Medici was the mother of Louis XIII, and Anne of Austria was his wife and the mother of Philippe I, then Anne of Austria would be Philippe I’s mother, not grandmother. So, his grandmother would be Marie de’ Medici. That makes more sense. But to be thorough, I should confirm the relationships again. Louis XIII married Anne of Austria, and they had children, including Philippe I and Louis XIV. So, Anne of Austria is Philippe I’s mother, and Marie de’ Medici is his grandmother. I think that’s correct, but I’ll double-check with another search.

Action 5: {'function': 'search', 'parameters': {'query': 'Philippe, Duke of Orléans family tree'}}

Observation 5: Philippe, Duke of Orléans (1640-1701) was the younger son of Louis XIII of France and Anne of Austria. His older brother was Louis XIV, the famous “Sun King.” Philippe was styled Duke of Anjou from birth and became Duke of Orléans upon the death of his uncle Gaston in 1660. He also received the dukedoms of Valois and Chartres in 1661. Philippe had several children, including Louise Élisabeth d’Orléans, who was one of his seven daughters. Louise Élisabeth was known as Élisabeth and was a “princesse du sang” as a member of the reigning House of Bourbon.

Thought 6: Okay, so this confirms that Anne of Austria was Philippe I’s mother. Therefore, his grandmother would be Marie de’ Medici, who was Louis XIII’s mother. I think that’s the correct answer. But just to be absolutely sure, let’s look up the genealogy again. Louis XIII was the son of Henry IV and Marie de’ Medici, so Marie de’ Medici would be his mother, and Anne of Austria would be his wife and Philippe I’s mother. I think that’s consistent across all the information I’ve found. So, the grandmother of Philippe, Duke of Orléans, is Marie de’ Medici. I’ll go ahead and provide that as the answer.

Action 6: {‘function’: ‘finish’, ‘parameters’: {‘answer’: ‘Marie de’ Medici’} }

Final Answer: Marie de’ Medici

Table 11: In the first reasoning step, ReaRAG makes an incorrect deduction, mistakenly identifying Anne of Austria as Philippe’s grandmother instead of his mother. However, it quickly recognizes the error, then verifies its previous reasoning steps, and recalibrates its logic. By refining its approach, ReaRAG ultimately arrives at the correct answer, demonstrating its adaptive reasoning ability.
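Case Study 7 hinges on a simple genealogical consistency check: a grandmother must be the mother of a parent, so a retrieved claim that names the subject's own mother cannot be correct. The toy check below makes that relation explicit; the dictionaries and the helper function are purely illustrative and are not ReaRAG's internal mechanism, which carries out this reflection in natural language.

```python
# Toy consistency check behind Case Study 7: a grandmother is the mother of a
# parent. The facts below reproduce what the trace retrieved; the check itself
# is an illustration, not ReaRAG's internal mechanism.

parents = {
    "Philippe I, Duke of Orléans": ["Louis XIII of France", "Anne of Austria"],
}
mother_of = {
    "Louis XIII of France": "Marie de' Medici",
}


def grandmothers_of(person):
    return {mother_of[p] for p in parents.get(person, []) if p in mother_of}


# Anne of Austria appears among the parents, not the grandmothers, so the
# first retrieved claim fails this check, which is what prompts the
# corrective searches seen in Thoughts 3-5 above.
print(grandmothers_of("Philippe I, Duke of Orléans"))  # {"Marie de' Medici"}
```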
# Case Study 8: who is the younger brother of The episode guest stars of The Hard Easy
Ground truth: Bill Murray
ReaRAG-9B

Thought 1: I need to find out who the younger brother is of the episode guest stars of "The Hard Easy." First, I should figure out what "The Hard Easy" is. It sounds like a TV show or a movie.
Action 1: {'function': 'search', 'parameters': {'query': 'What is The Hard Easy?' }}
Observation 1: The Hard Easy is a 2006 American action crime drama film directed by Ari Ryan and written by Jon Lindstrom and Tom Schanley. It stars Henry Thomas, Vera Farmiga, David Boreanaz, Bruce Dern, Peter Weller, Gary Busey, and Nick Lachey. The film was released straight-to-DVD in the United States on May 22, 2007.

Thought 2: So, "The Hard Easy" is a 2006 film with several guest stars. I need to identify the guest stars and then find out who the younger brother of one of them is. But the question seems a bit unclear. Maybe it's asking for the younger brother of one of the guest stars in a specific episode, but since it's a film, not a TV series, perhaps there are multiple episodes.
Action 2: {'function': 'search', 'parameters': {'query': 'Are there multiple episodes of The Hard Easy?' }}
Observation 2: No, there is only one episode of "The Hard Easy" as part of the "Adventure Time" series.

Thought 3: Wait, now I'm getting confused. Earlier, I thought "The Hard Easy" was a 2006 film, but now it's saying it's an episode of "Adventure Time." Maybe there's some confusion here. I need to clarify what "The Hard Easy" refers to.
ReaRAG actively performs searches to resolve the confusion.
Action 3: {'function': 'search', 'parameters': {'query': 'Is The Hard Easy a film or an episode? ' }}
Observation 3: The Hard Easy is both a film and an episode. The film version was released straight-to-DVD in 2007, while the episode is part of the animated television series "Adventure Time" and aired on Cartoon Network in 2012.

Thought 4: Okay, so there's a 2006 film called "The Hard Easy" and an episode with the same name in the "Adventure Time" series. The question might be referring to the episode in "Adventure Time." I need to find out the guest stars of that episode and then find out who the younger brother of one of them is.
Action 4: {'function': 'search', 'parameters': {'query': 'Who are the guest stars in the Adventure Time episode "The Hard Easy"? '}}
Observation 4: The guest stars in the Adventure Time episode "The Hard Easy" are Brian Doyle-Murray as Prince Huge and Jonathan Katz as the Mudscamp elder.

Thought 5: So, the guest stars are Brian Doyle-Murray and Jonathan Katz. Now, I need to find out who the younger brother of one of them is. But I don't know their family relationships off the top of my head. I'll need to look that up.
Action 5: {'function': 'search', 'parameters': {'query': 'Who is the younger brother of Brian Doyle-Murray? '}}
Observation 5: The younger brother of Brian Doyle-Murray is Bill Murray.

Thought 6: Wait, that can't be right. Bill Murray is a famous actor, but I don't think he's the younger brother of Brian Doyle-Murray. Maybe I need to check their relationship again.
ReaRAG encounters conflicts with its parametric knowledge.
Action 6: {'function': 'search', 'parameters': {'query': 'Are Bill Murray and Brian Doyle-Murray brothers? '}}
Observation 6: Yes, Bill Murray and Brian Doyle-Murray are brothers.

Thought 7: So, if the question is asking for the younger brother of one of the guest stars in the Adventure Time episode "The Hard Easy," and the guest stars are Brian Doyle-Murray and Jonathan Katz, and the younger brother of Brian Doyle-Murray is Bill Murray, then the answer should be Bill Murray.
Action 7: {'function': 'finish', 'parameters': {'answer': 'Bill Murray'}}
Final Answer: Bill Murray
Table 12: During the first three iterations, ReaRAG attempts to resolve ambiguity in the original multi-hop question, actively conducting searches to eliminate confusion. Additionally, at the sixth iteration, it encounters knowledge conflicts. Despite these challenges, ReaRAG successfully derives the correct answer, showcasing its robust reasoning abilities.
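Case Study 8 above reaches a finish action only after seven Thought/Action/Observation iterations. The following sketch shows how such a loop can be driven with an explicit cap on the number of iterations; generate_step and dispatch_action are hypothetical callables (the latter as in the earlier sketch), and the default cap of 10 is an assumed value rather than a figure taken from the paper.

```python
def reasoning_loop(question, generate_step, dispatch_action, max_iterations=10):
    """Iterate Thought -> Action -> Observation until a finish action or the cap is hit.

    generate_step(context)      -> (thought, action_str): next step from the reasoning model.
    dispatch_action(action_str) -> (kind, payload): executes search/finish, as sketched earlier.
    max_iterations: assumed upper bound on the reasoning chain length.
    """
    context = [f"Question: {question}"]
    for step in range(1, max_iterations + 1):
        thought, action_str = generate_step("\n".join(context))
        context.append(f"Thought {step}: {thought}")
        context.append(f"Action {step}: {action_str}")
        kind, payload = dispatch_action(action_str)
        if kind == "answer":
            return payload                      # corresponds to "Final Answer" in the tables
        context.append(f"Observation {step}: {payload}")
    return None                                 # iteration budget exhausted without a finish
```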
# Case Study 9: This singer of A Rather Blustery Day also voiced what hedgehog?
Ground truth: Sonic
ReaRAG-9B

Thought 1: I need to find out which singer performed "A Rather Blustery Day" and then find out what hedgehog they voiced.
Action 1: {'function': 'search', 'parameters': {'query': 'Who sang "A Rather Blustery Day"?' }}
Observation 1: Jim Cummings sang "A Rather Blustery Day".

Thought 2: Now that I know Jim Cummings sang that song, I need to find out what hedgehog he voiced.
Action 2: {'function': 'search', 'parameters': {'query': 'What hedgehog did Jim Cummings voice?' }}
Observation 2: Jim Cummings voiced Dr. Robotnik from "Sonic the Hedgehog".

Thought 3: Wait, but Dr. Robotnik is a robot, not a hedgehog. Maybe I need to check again.
Action 3: {'function': 'search', 'parameters': {'query': 'What characters did Jim Cummings voice in Sonic the Hedgehog? '}}
Observation 3: Jim Cummings voiced Dr. Robotnik in the "Sonic the Hedgehog" series.

Thought 4: It seems like there might be confusion here. Dr. Robotnik is a robot, but the question mentions a hedgehog. Maybe the hedgehog is a different character.
The original question is confusing, but ReaRAG manages to resolve it.
Action 4: {'function': 'search', 'parameters': {'query': 'List of characters voiced by Jim Cummings in Sonic the Hedgehog. '}}
Observation 4: Jim Cummings is known for voicing Dr. Robotnik in the "Sonic the Hedgehog" series.

Thought 5: Maybe I need to look for other hedgehog characters that Jim Cummings has voiced.
Action 5: {'function': 'search', 'parameters': {'query': 'Hedgehog characters voiced by Jim Cummings. '}}
Observation 5: Jim Cummings is known for voicing Dr. Robotnik from "Sonic the Hedgehog".

Thought 6: Perhaps the hedgehog in question is Sonic himself, who is a hedgehog, but I already know that Sonic is voiced by Roger Craig Smith, not Jim Cummings.
ReaRAG encounters knowledge conflicts but mitigates them through verification.
Action 6: {'function': 'search', 'parameters': {'query': 'Who voices Sonic in the Sonic the Hedgehog series?' } }
Observation 6: Jim Cummings voices Sonic the Hedgehog in the Sonic the Hedgehog series.

Thought 7: Wait, that contradicts what I thought earlier. I might be confusing the voice actors.
Action 7: {'function': 'search', 'parameters': {'query': 'Voice actor for Sonic the Hedgehog. ' }}
Observation 7: Jim Cummings

Thought 8: This is getting confusing. I need to verify the voice actor for Sonic the Hedgehog.
Action 8: {'function': 'search', 'parameters': {'query': 'Who voices Sonic in the Sonic the Hedgehog games and cartoons? '}}
Observation 8: Jim Cummings voices Sonic the Hedgehog in the Sonic the Hedgehog games and cartoons.

Thought 9: So, Jim Cummings voices Sonic the Hedgehog, who is a hedgehog. Therefore, the answer should be Sonic the Hedgehog.
Action 9: {'function': 'finish', 'parameters': {'answer': 'Sonic the Hedgehog'}}
Final Answer: Sonic the Hedgehog
Table 13: This example illustrates how ReaRAG successfully resolves ambiguity in the original question and addresses conflicts between retrieved knowledge and its parametric knowledge. Despite these challenges, ReaRAG effectively mitigates them and remains robust throughout the reasoning process.
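Each case study pairs a ground-truth answer with the model's Final Answer. For readers reproducing such comparisons, a normalized exact-match check in the style commonly used for open-domain QA (not necessarily the exact metric used in this work) can be written as follows:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> bool:
    return normalize(prediction) == normalize(ground_truth)


# The apostrophe and capitalization are normalized away:
assert exact_match("Marie de' Medici", "marie de medici")
```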

1. For simplicity, all mentions of QwQ-32B in this paper refer to QwQ-32B-Preview.

2. https://open.bigmodel.cn/dev/api/Reasoning-models/glm-z1