
Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning

Linhao Luo, Yuan-Fang Li & Gholamreza Haffari

Monash University
Australia
{linhao.luo,yuanfang.li,Gholamreza.Haffari}@monash.edu

Shirui Pan
Griffith University
Australia
s.pan@griffith.edu.au
Corresponding author.
Abstract

Large language models (LLMs) have demonstrated impressive reasoning abilities in complex tasks. However, they lack up-to-date knowledge and experience hallucinations during reasoning, which can lead to incorrect reasoning processes and diminish their performance and trustworthiness. Knowledge graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. Nevertheless, existing KG-based LLM reasoning methods only treat KGs as factual knowledge bases and overlook the importance of their structural information for reasoning. In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to enable faithful and interpretable reasoning. Specifically, we present a planning-retrieval-reasoning framework, where RoG first generates relation paths grounded by KGs as faithful plans. These plans are then used to retrieve valid reasoning paths from the KGs for LLMs to conduct faithful reasoning. Furthermore, RoG not only distills knowledge from KGs to improve the reasoning ability of LLMs through training but also allows seamless integration with arbitrary LLMs during inference. Extensive experiments on two benchmark KGQA datasets demonstrate that RoG achieves state-of-the-art performance on KG reasoning tasks and generates faithful and interpretable reasoning results. (Code and data are available at: https://github.com/RManLuo/reasoning-on-graphs.)

1 Introduction

Large language models (LLMs) have shown great performance in many NLP tasks (Brown et al., 2020; Bang et al., 2023). What’s especially striking is their ability to handle complex tasks through reasoning (Wei et al., 2022; Huang & Chang, 2023). To further unleash LLMs’ reasoning ability, the plan-and-solve paradigm (Wang et al., 2023c) has been proposed, in which LLMs are prompted to generate a plan and execute each reasoning step. In this way, LLMs decompose complex reasoning tasks into a series of sub-tasks and solve them step by step (Khot et al., 2022).

Despite their success, LLMs are still limited by the lack of knowledge and prone to hallucinations during reasoning, which can lead to errors in reasoning processes (Hong et al., 2023; Wang et al., 2023b). For example, as shown in Figure 1, LLMs do not have the latest knowledge and hallucinate an incorrect reasoning step: “has a daughter”. These issues largely diminish the performance and trustworthiness of LLMs in high-stakes scenarios, such as legal judgment and medical diagnosis.

To tackle these issues, knowledge graphs (KGs) have been incorporated to improve the reasoning ability of LLMs (Pan et al., 2023; Luo et al., 2023). KGs capture abundant factual knowledge in a structured format, which provides a faithful knowledge source for reasoning. As a typical reasoning task, knowledge graph question answering (KGQA) aims to obtain answers based on knowledge from KGs (Sun et al., 2019; Hu et al., 2023). Previous works that jointly use KGs and LLMs for KGQA reasoning can be broadly divided into two categories: 1) semantic parsing methods (Lan & Jiang, 2020; Ye et al., 2022), which use LLMs to convert questions into logical queries that are executed on KGs to obtain answers; and 2) retrieval-augmented methods (Li et al., 2023; Pan et al., 2022), which retrieve triples from KGs as knowledge context and use LLMs to obtain the final answers.

Although semantic parsing methods can generate more accurate and interpretable results by leveraging reasoning on KGs, the generated logical queries can often be non-executable and yield no answers, due to syntax and semantic limitations (Yu et al., 2022a). Retrieval-augmented methods are more flexible and exploit the ability of LLMs for reasoning. However, they only treat KGs as factual knowledge bases and overlook the importance of their structural information for reasoning (Jiang et al., 2022). For instance, as shown in Figure 1, a relation path (a sequence of relations) such as "child_of → has_son" can be used to obtain answers to the question "Who is the brother of Justin Bieber?". Therefore, it is essential to enable LLMs to directly reason on KGs to achieve faithful and interpretable reasoning.

In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to conduct faithful and interpretable reasoning. To address the issues of hallucinations and lack of knowledge, we present a planning-retrieval-reasoning framework, where RoG first generates relation paths grounded by KGs as faithful plans via the planning module. These plans are then used to retrieve valid reasoning paths from KGs to conduct faithful reasoning by the retrieval-reasoning module. In this way, we not only retrieve the latest knowledge from KGs but also consider the guidance of KG structure for reasoning and explanations. Moreover, the planning module of RoG can be plug-and-play with different LLMs during inference to improve their performance. Based on this framework, RoG is optimized by two tasks: 1) planning optimization, where we distill knowledge from KGs into LLMs to generate faithful relation paths as plans; and 2) retrieval-reasoning optimization, where we enable LLMs to conduct faithful reasoning based on retrieved paths and generate interpretable results. We conduct extensive experiments on two benchmark KGQA datasets, and the results demonstrate that RoG achieves state-of-the-art performance on KG reasoning tasks and generates faithful and interpretable reasoning results.

Figure 1: The issues of lack of knowledge and hallucination in LLM reasoning, and how they can be addressed by triples and relation paths from KGs.

2 Related Work

LLM Reasoning Prompt. Many studies have been proposed to harness the reasoning ability of LLMs to handle complex tasks through prompting (Wei et al., 2022; Wang et al., 2022; Yao et al., 2023; Besta et al., 2023). Plan-and-solve (Wang et al., 2023c) prompts LLMs to generate a plan and conduct reasoning based on it. DecomP (Khot et al., 2022) prompts LLMs to decompose the reasoning task into a series of sub-tasks and solve them step by step. However, the problems of hallucination and lack of knowledge affect the faithfulness of LLMs' reasoning. ReACT (Yao et al., 2022) treats LLMs as agents that interact with the environment to get the latest knowledge for reasoning. To explore faithful reasoning, FAME (Hong et al., 2023) introduces Monte-Carlo planning to generate faithful reasoning steps. RR (He et al., 2022) and KD-CoT (Wang et al., 2023b) further retrieve relevant knowledge from KGs to produce faithful reasoning plans for LLMs.

Knowledge Graph Question Answering (KGQA). Conventional embedding-based methods represent the entities and relations in embedding space and design special model architectures (e.g., key-value memory networks, sequential models, and graph neural networks) to reason answers (Miller et al., 2016; He et al., 2021; Yasunaga et al., 2021). To integrate LLMs for KGQA, retrieval-augmented methods aim to retrieve relevant facts from the KGs to improve the reasoning performance (Li et al., 2023; Karpukhin et al., 2020). Recently, UniKGQA (Jiang et al., 2022), which unifies the graph retrieval and reasoning processes into a single model with LLMs, achieved state-of-the-art performance. Semantic parsing methods use LLMs to convert the question into a structured query (e.g., SPARQL), which can be executed by a query engine to reason the answers on KGs (Sun et al., 2020; Lan & Jiang, 2020). However, these methods heavily rely on the quality of the generated queries: if a query is not executable, no answers will be generated. DECAF (Yu et al., 2022a) combines semantic parsing and LLM reasoning to jointly generate answers, which also achieves strong performance on KGQA tasks.

3 Preliminary

Knowledge Graphs (KGs) contain abundant factual knowledge in the form of a set of triples: $\mathcal{G}=\{(e,r,e') \mid e,e'\in\mathcal{E}, r\in\mathcal{R}\}$, where $\mathcal{E}$ and $\mathcal{R}$ denote the set of entities and relations, respectively.

Relation Paths are a sequence of relations: $z=\{r_1,r_2,\dots,r_l\}$, where $r_i\in\mathcal{R}$ denotes the $i$-th relation in the path and $l$ denotes the length of the path.

Reasoning Paths are the instances of a relation path $z$ in KGs: $w_z=e_0\xrightarrow{r_1}e_1\xrightarrow{r_2}\ldots\xrightarrow{r_l}e_l$, where $e_i\in\mathcal{E}$ denotes the $i$-th entity and $r_i$ denotes the $i$-th relation in the relation path $z$.

Example 1.

Given a relation path $z=\texttt{marry\_to}\to\texttt{father\_of}$, a reasoning path instance could be $w_z=\text{Alice}\xrightarrow{\texttt{marry\_to}}\text{Bob}\xrightarrow{\texttt{father\_of}}\text{Charlie}$, which denotes that "Alice" is married to "Bob" and "Bob" is the father of "Charlie".
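To make these definitions concrete, the following is a minimal Python sketch (ours, not from the paper) of how triples, relation paths, and reasoning paths can be represented; all names are illustrative.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]   # (head entity, relation, tail entity)
RelationPath = List[str]        # z = [r_1, ..., r_l]
ReasoningPath = List[str]       # [e_0, r_1, e_1, ..., r_l, e_l], alternating entities/relations

# The KG fragment from Example 1 as a list of triples.
kg: List[Triple] = [
    ("Alice", "marry_to", "Bob"),
    ("Bob", "father_of", "Charlie"),
]

# A relation path and one of its instances (a reasoning path) in the KG.
z: RelationPath = ["marry_to", "father_of"]
w_z: ReasoningPath = ["Alice", "marry_to", "Bob", "father_of", "Charlie"]
```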

Knowledge Graph Question Answering (KGQA) is a typical reasoning task based on KGs. Given a natural language question $q$ and a KG $\mathcal{G}$, the task aims to design a function $f$ to predict answers $a\in\mathcal{A}_q$ based on knowledge from $\mathcal{G}$, i.e., $a=f(q,\mathcal{G})$. Following previous works (Sun et al., 2019; Jiang et al., 2022), we assume the entities $e_q\in\mathcal{T}_q$ mentioned in $q$ and the answers $a\in\mathcal{A}_q$ are labeled and linked to the corresponding entities in $\mathcal{G}$, i.e., $\mathcal{T}_q,\mathcal{A}_q\subseteq\mathcal{E}$.

4 Approach

In this section, we introduce our method: reasoning on graphs (RoG). We present a novel planning-retrieval-reasoning framework that synergizes LLMs and KGs to conduct faithful and interpretable reasoning for KGQA. The overall framework of RoG is illustrated in Figure 2.

Figure 2: The overall framework of reasoning on graphs (RoG). 1) Given a question, we first prompt LLMs to generate several relation paths that are grounded by KGs as plans. 2) Then, we retrieve reasoning paths from KGs using the plans. 3) Finally, we conduct faithful reasoning based on the retrieved reasoning paths and generate answers with interpretable explanations. The orange and red rectangles denote the entities mentioned in the question and answer, respectively.

4.1 Reasoning on Graphs: Planning-Retrieval-Reasoning

Recently, many techniques have been explored to improve the reasoning ability of LLMs through planning, which first prompts LLMs to generate a reasoning plan and then conducts reasoning based on it (Wang et al., 2023c). However, LLMs are known to suffer from hallucination issues, making them prone to generating incorrect plans that lead to wrong answers (Ji et al., 2023). To address this issue, we present a novel planning-retrieval-reasoning framework, which grounds the reasoning plans in KGs and then retrieves faithful reasoning paths for LLM reasoning.

Relation paths, which capture semantic relations between entities, have been utilized in many reasoning tasks on KGs (Wang et al., 2021; Xu et al., 2022). Moreover, compared to dynamically updated entities, the relations in KGs are more stable (Wang et al., 2023a). By using relation paths, we can always retrieve the latest knowledge from KGs for reasoning. Therefore, relation paths can serve as faithful plans for reasoning answers to KGQA tasks.

Example 2.

Given the question "Who is the child of Alice?", we can generate a relation path as the plan: $z=\texttt{marry\_to}\to\texttt{father\_of}$. This relation path expresses the plan: 1) find the person that "Alice" is married to; 2) find the child of that person. We can execute the plan (relation path) by retrieving a reasoning path from KGs: $w_z=\text{Alice}\xrightarrow{\texttt{marry\_to}}\text{Bob}\xrightarrow{\texttt{father\_of}}\text{Charlie}$. Finally, we can answer the question based on the reasoning path, which gives "Charlie".

By treating relation paths as plans, we can make sure the plans are grounded by KGs, which enables LLMs to conduct faithful and interpretable reasoning on graphs. In a nutshell, we formulate our RoG as an optimization problem that aims to maximize the probability of reasoning the answer from a knowledge graph $\mathcal{G}$ w.r.t. the question $q$ by generating relation paths $z$ as the plan:

$P_\theta(a|q,\mathcal{G})=\sum_{z\in\mathcal{Z}}P_\theta(a|q,z,\mathcal{G})\,P_\theta(z|q),$  (1)

where $\theta$ denotes the parameters of LLMs, $z$ denotes the relation paths (plans) generated by LLMs, and $\mathcal{Z}$ denotes the set of possible relation paths. The latter term $P_\theta(z|q)$ is the probability of generating a faithful relation path $z$ grounded by the KG, given the question $q$, which is realized by the planning module. The former term $P_\theta(a|q,z,\mathcal{G})$ is the probability of reasoning an answer $a$ given the question $q$, relation path $z$, and KG $\mathcal{G}$, which is computed by the retrieval-reasoning module.
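Operationally, the factorization in equation 1 corresponds to a three-step pipeline. The sketch below is our illustration, not the paper's released code: `llm.plan`, `llm.reason`, `question_entities`, and `retrieve_reasoning_paths` are hypothetical interfaces standing in for the planning and retrieval-reasoning modules described in the following sections.

```python
def answer(question, kg, llm, k=3):
    """Plan -> retrieve -> reason, mirroring equation 1.
    All callee names are hypothetical placeholders for the modules below."""
    plans = llm.plan(question, k=k)  # top-k relation paths z ~ P_theta(z | q)
    reasoning_paths = []
    for z in plans:
        # Ground each plan in the KG: find path instances starting from question entities.
        reasoning_paths += retrieve_reasoning_paths(kg, question_entities(question), z)
    # Condition the LLM on the retrieved paths to produce the answer: P_theta(a | q, z, G).
    return llm.reason(question, reasoning_paths)
```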

4.2 Optimization Framework

Despite the advantage of generating relation paths as plans, LLMs have zero knowledge of the relations contained in KGs. Therefore, LLMs cannot directly generate relation paths grounded by KGs as faithful plans. Moreover, LLMs might not correctly understand the reasoning paths or be able to conduct reasoning based on them. To address these issues, we design two instruction-tuning tasks: 1) planning optimization, which distills the knowledge from KGs into LLMs to generate faithful relation paths as plans; 2) retrieval-reasoning optimization, which enables LLMs to reason based on the retrieved reasoning paths.

The objective function in equation 1 can be optimized by maximizing the evidence lower bound (ELBO) (Jordan et al., 1999), which is formulated as

$\log P(a|q,\mathcal{G}) \geq \mathbb{E}_{z\sim Q(z)}[\log P_\theta(a|q,z,\mathcal{G})] - D_{\mathrm{KL}}(Q(z)\,\|\,P_\theta(z|q)),$  (2)

where $Q(z)$ denotes the posterior distribution of faithful relation paths grounded by KGs. The latter term minimizes the KL divergence between the posterior and the prior, which encourages LLMs to generate faithful relation paths (planning optimization). The former term maximizes the expectation that the retrieval-reasoning module generates correct answers based on the relation paths and KGs (retrieval-reasoning optimization).

Planning optimization. In planning optimization, we aim to distill the knowledge from KGs into LLMs to generate faithful relation paths as plans. This can be achieved by minimizing the KL divergence with the posterior distribution of faithful relation paths $Q(z)$, which can be approximated by the valid relation paths in KGs.

Given a question $q$ and answer $a$, we could find the path instances $w_z(e_q,e_a)=e_q\xrightarrow{r_1}e_1\xrightarrow{r_2}\ldots\xrightarrow{r_l}e_a$ connecting $e_q$ and $e_a$ in KGs. The corresponding relation path $z=\{r_1,r_2,\dots,r_l\}$ can be considered valid and serve as a faithful plan for answering the question $q$. Therefore, the posterior distribution $Q(z)$ can be formally approximated as

$Q(z) \simeq Q(z|a,q,\mathcal{G}) = \begin{cases} 1, & \exists\, w_z(e_q,e_a)\in\mathcal{G}, \\ 0, & \text{otherwise}, \end{cases}$  (3)

where $\exists\, w_z(e_q,e_a)\in\mathcal{G}$ denotes the existence of a path instance connecting the question entity $e_q$ and the answer entity $e_a$ in $\mathcal{G}$. To reduce the number of valid relation paths, we only consider the shortest paths between $e_q$ and $e_a$ in KGs (Zhang et al., 2022). Therefore, the KL divergence can be calculated as

$\begin{aligned} \mathcal{L}_{\text{plan}} = D_{\mathrm{KL}}(Q(z)\,\|\,P_\theta(z|q)) &= D_{\mathrm{KL}}(Q(z|a,q,\mathcal{G})\,\|\,P_\theta(z|q)) \\ &= \mathbb{E}_{z\sim Q(z|a,q,\mathcal{G})}[\log Q(z|a,q,\mathcal{G}) - \log P_\theta(z|q)] \\ &= -\mathbb{E}_{z\sim Q(z|a,q,\mathcal{G})}[\log P_\theta(z|q)] + \text{CONST} \\ &= -\sum_{z\in Q(z|a,q,\mathcal{G})}\log P_\theta(z|q). \end{aligned}$  (4)

By optimizing equation 4, we maximize the probability of LLMs generating faithful relation paths by distilling the knowledge from KGs.
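As a concrete sketch of how the shortest-path approximation in equation 3 can be instantiated, the snippet below (our illustration, assuming the KG fits in memory as a triple list) enumerates relation paths along shortest paths between question and answer entities using networkx.

```python
import itertools
import networkx as nx

def valid_relation_paths(triples, e_q, e_a):
    """Approximate Q(z): relation sequences along shortest e_q -> e_a paths in the KG."""
    g = nx.MultiDiGraph()
    for head, relation, tail in triples:
        g.add_edge(head, tail, relation=relation)
    paths = set()
    try:
        for nodes in nx.all_shortest_paths(g, e_q, e_a):
            # Each hop may carry several parallel relations; enumerate all combinations.
            hop_relations = [
                [data["relation"] for data in g.get_edge_data(u, v).values()]
                for u, v in zip(nodes, nodes[1:])
            ]
            paths.update(itertools.product(*hop_relations))
    except nx.NetworkXNoPath:
        pass
    return paths  # set of relation tuples, e.g. {("marry_to", "father_of")}

# valid_relation_paths([("Alice", "marry_to", "Bob"), ("Bob", "father_of", "Charlie")],
#                      "Alice", "Charlie")  ->  {("marry_to", "father_of")}
```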

Retrieval-reasoning optimization. In retrieval-reasoning optimization, we aim to enable LLMs to conduct reasoning based on the retrieved reasoning paths. For the retrieval-reasoning module, we follow the FiD framework (Izacard & Grave, 2021), which allows reasoning on multiple retrieved reasoning paths, formulated as

$P_\theta(a|q,\mathcal{Z},\mathcal{G})=\prod_{z\in\mathcal{Z}}P_\theta(a|q,z,\mathcal{G}).$  (5)

By approximating the expectation with $K$ sampled plans $\mathcal{Z}_K$, the objective function of reasoning optimization can be written as

$\mathcal{L}_{\text{reason}}=\mathbb{E}_{z\sim Q(z|a,q,\mathcal{G})}[\log P_\theta(a|q,z,\mathcal{G})]=\log P_\theta(a|q,\mathcal{Z}_K,\mathcal{G}).$  (6)

This maximizes the probability of LLMs generating correct answers based on the retrieved reasoning paths.

The final objective function of RoG is the combination of the planning optimization and retrieval-reasoning optimization, which can be formulated as

$\mathcal{L}=\log\underbrace{P_\theta(a|q,\mathcal{Z}_K,\mathcal{G})}_{\text{Retrieval-reasoning}}+\underbrace{\sum_{z\in Q(z|a,q,\mathcal{G})}\log P_\theta(z|q)}_{\text{Planning}}.$  (7)

From equation 7, we can see that we adopt the same LLM for both planning and reasoning, jointly trained on the two instruction-tuning tasks (planning and retrieval-reasoning). We will discuss the implementation details of these two tasks in the following subsections.

4.3 Planning Module

The planning module aims to generate faithful relation paths as plans for answering the question. To utilize the instruction-following ability of LLMs (Wei et al., 2021), we design a simple instruction template that prompts LLMs to generate relation paths:

Planning Prompt Template:
Please generate a valid relation path that can be helpful for answering the following question: <Question>

where <Question> indicates the question $q$. The question together with the instruction template is fed into LLMs to generate the relation paths, which are structurally formatted as a sentence:

$z =$ <PATH> $r_1$ <SEP> $r_2$ <SEP> $\dots$ <SEP> $r_l$ </PATH>

where <PATH>, <SEP>, and </PATH> are special tokens indicating the start, separator, and end of the relation path, respectively. (The relation name $r_*$ could be split into multiple tokens; for example, "born_in" could be split into "born" and "_in" by the tokenizer. In this way, we can fully utilize the semantic information in relation names and generalize to different KGs.)
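This structured format is straightforward to produce and parse. A minimal sketch (ours, not the released code):

```python
PATH_START, PATH_SEP, PATH_END = "<PATH>", "<SEP>", "</PATH>"

def serialize_relation_path(relations):
    """Format a relation path as the structured sentence generated by the LLM."""
    sep = f" {PATH_SEP} "
    return f"{PATH_START} {sep.join(relations)} {PATH_END}"

def parse_relation_path(text):
    """Recover the relation list from a generated path string; None if malformed."""
    text = text.strip()
    if not (text.startswith(PATH_START) and text.endswith(PATH_END)):
        return None
    body = text[len(PATH_START):-len(PATH_END)]
    return [r.strip() for r in body.split(PATH_SEP) if r.strip()]

# serialize_relation_path(["marry_to", "father_of"])
# -> '<PATH> marry_to <SEP> father_of </PATH>'
```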

Therefore, the optimization of $\mathcal{L}_{\text{plan}}$ can be achieved as

$\operatorname*{arg\,max}_\theta \sum_{z\in Q(z|a,q,\mathcal{G})}\log P_\theta(z|q)=\sum_{z\in Q(z|a,q,\mathcal{G})}\log\prod_{i=1}^{|z|}P_\theta(r_i|r_{<i},q),$  (8)

where $P_\theta(z|q)$ denotes the prior distribution of generating a faithful relation path $z$, and $P_\theta(r_i|r_{<i},q)$ denotes the probability of each token in $z$ generated by LLMs.
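In practice, the token-level probability $P_\theta(r_i|r_{<i},q)$ in equation 8 can be scored with any causal LM. The sketch below is our illustration using the HuggingFace transformers and PyTorch APIs; the prompt string follows the planning template above, and `model`/`tokenizer` are assumed to be a loaded causal LM pair.

```python
import torch
import torch.nn.functional as F

def path_log_prob(model, tokenizer, question, path_text, device="cpu"):
    """Compute log P_theta(z | q) for a serialized relation path under a causal LM."""
    prompt = ("Please generate a valid relation path that can be helpful for "
              f"answering the following question: {question}\n")
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    target_ids = tokenizer(path_text, add_special_tokens=False,
                           return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Next-token prediction: the logit at position i scores the token at i+1,
    # so shift by one and keep only the positions that predict the target tokens.
    target_logits = logits[0, prompt_ids.size(1) - 1 : -1]
    log_probs = F.log_softmax(target_logits, dim=-1)
    token_log_probs = log_probs.gather(1, target_ids[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.sum()
```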

4.4 Retrieval-Reasoning Module

Retrieval. Given a question $q$ and a relation path $z$ as the plan, the retrieval module aims to retrieve the reasoning paths $w_z$ from the KG $\mathcal{G}$. The retrieval process can be conducted by finding paths in $\mathcal{G}$ that start from the question entities $e_q$ and follow the relation path $z$, formulated as

$\mathcal{W}_z=\{w_z(e_q,e_*) \mid w_z(e_q,e_*)=e_q\xrightarrow{r_1}e_1\xrightarrow{r_2}\ldots\xrightarrow{r_l}e_*,\ w_z(e_q,e_*)\in\mathcal{G}\}.$  (9)

We adopt a constrained breadth-first search to retrieve the reasoning paths $w_z$ from KGs. In experiments, all retrieved paths are used for reasoning. The detailed retrieval algorithm can be found in Section A.2, and a sketch is shown below.
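A minimal Python sketch of this constrained search (ours; the pseudocode in Section A.2 is authoritative), assuming the KG is indexed as a hypothetical mapping from (entity, relation) pairs to neighbor entities:

```python
from collections import deque

def retrieve_reasoning_paths(kg_index, question_entities, relation_path):
    """Constrained BFS: expand hop by hop, following only the planned relation
    at each depth. `kg_index` is an assumed dict: (entity, relation) -> [neighbors]."""
    frontier = deque((e, [e]) for e in question_entities)
    for relation in relation_path:
        next_frontier = deque()
        while frontier:
            entity, path = frontier.popleft()
            for neighbor in kg_index.get((entity, relation), []):
                next_frontier.append((neighbor, path + [relation, neighbor]))
        frontier = next_frontier
    # Every surviving path instantiates the full relation path in the KG.
    return [path for _, path in frontier]
```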

Although we could use the retrieved reasoning paths to directly obtain answers via majority voting, the retrieved reasoning paths can be noisy and irrelevant to the question, which leads to incorrect answers (He et al., 2021; Zhang et al., 2022). Therefore, we propose a reasoning module that exploits the ability of LLMs to identify the important reasoning paths and answer the question based on them.

Reasoning. The reasoning module takes the question $q$ and a set of reasoning paths $\mathcal{W}_z$ to generate answers $a$. Similarly, we design a reasoning instruction prompt to guide LLMs to conduct reasoning based on the retrieved reasoning paths $\mathcal{W}_z$:

Reasoning Prompt Template:
Based on the reasoning paths, please answer the given question. Please keep the answer as simple as possible and return all the possible answers as a list.
Reasoning Paths: <Reasoning Paths>
Question: <Question>

where <Reasoning Paths> denotes the retrieved reasoning paths $\mathcal{W}_z$, which are also formatted as a series of structured sentences. The details of the prompts can be found in Section A.9.
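Concretely, constructing this prompt amounts to rendering each retrieved path as a structured sentence and filling the template. A sketch (ours; the exact path rendering in the released code may differ):

```python
REASONING_TEMPLATE = (
    "Based on the reasoning paths, please answer the given question. "
    "Please keep the answer as simple as possible and return all the "
    "possible answers as a list.\n"
    "Reasoning Paths:\n{paths}\n"
    "Question: {question}"
)

def build_reasoning_prompt(question, reasoning_paths):
    # Render each path [e_0, r_1, e_1, ...] as "e_0 -> r_1 -> e_1 -> ...", one per line.
    lines = [" -> ".join(map(str, path)) for path in reasoning_paths]
    return REASONING_TEMPLATE.format(paths="\n".join(lines), question=question)
```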

The optimization of $\mathcal{L}_{\text{reason}}$ can be written as

$\operatorname*{arg\,max}_\theta \log P_\theta(a|q,\mathcal{Z}_K,\mathcal{G})=\log\sum_{z\in\mathcal{Z}_K}\sum_{w_z\in\mathcal{W}_z}\prod_{i=1}^{|a|}P_\theta(t_i|t_{<i},q,w_z),$  (10)

where $P_\theta(a|q,\mathcal{Z}_K,\mathcal{G})$ denotes the probability of reasoning the correct answer $a$ based on the $K$ relation paths $\mathcal{Z}_K$, and $t_*$ denotes the tokens of the answer $a$.

5 Experiment

In our experiments, we aim to answer the following research questions:

  • RQ1: Can RoG achieve state-of-the-art performance on the KGQA tasks?

  • RQ2: Can the planning module of RoG be integrated with other LLMs to improve their performance?

  • RQ3: Can RoG be finetuned and effectively transferred to other knowledge graphs?

  • RQ4: Can RoG conduct faithful reasoning and generate interpretable reasoning results?

5.1 Experiment Settings

Datasets. We evaluate the reasoning ability of RoG on two benchmark KGQA datasets:

Table 1: Statistics of datasets.

| Datasets | #Train | #Test | Max #hop |
|---|---|---|---|
| WebQSP | 2,826 | 1,628 | 2 |
| CWQ | 27,639 | 3,531 | 4 |

WebQuestionSP (WebQSP) (Yih et al., 2016) and Complex WebQuestions (CWQ) (Talmor & Berant, 2018), which contain up to 4-hop questions. The statistics of the datasets are given in Table 1. Freebase (Bollacker et al., 2008) is the background knowledge graph for both datasets; it contains around 88 million entities, 20 thousand relations, and 126 million triples. The details of the datasets are described in Section A.3.

Baselines. We compare RoG with 21 baselines grouped into 5 categories: 1) embedding-based methods, 2) retrieval-augmented methods, 3) semantic parsing methods, 4) LLMs, and 5) LLMs+KGs methods. The details of each baseline are described in Section A.4.

Evaluation Metrics. Following previous works, we use Hits@1 and F1 as the evaluation metrics. Hits@1 measures the proportion of questions whose top-1 predicted answer is correct. Since a question may correspond to multiple answers, F1 considers the coverage of all answers, which balances the precision and recall of the predicted answers.
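For reference, both metrics are simple to compute from predicted and gold answer sets; a sketch of the standard definitions (ours, matching the description above):

```python
def hits_at_1(top_prediction, gold_answers):
    """Hits@1: 1 if the top-ranked prediction is among the gold answers, else 0."""
    return int(top_prediction in set(gold_answers))

def answer_f1(predictions, gold_answers):
    """Set-level F1 between predicted and gold answer sets."""
    pred, gold = set(predictions), set(gold_answers)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```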

Implementations. For RoG, we use LLaMA2-Chat-7B (Touvron et al., 2023) as the LLM backbone, which is instruction-finetuned on the training splits of WebQSP and CWQ as well as Freebase for 3 epochs. We generate the top-3 relation paths for each question using beam search. Since UniKGQA (Jiang et al., 2022) and DECAF (Yu et al., 2022a) are state-of-the-art methods, we directly use their reported results and those of the other baselines from their papers for comparison. For LLMs, we use zero-shot prompting to conduct KGQA. The detailed settings are described in Section A.5.
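As an illustration of the beam-search setup, the snippet below (ours) uses the HuggingFace generate API to produce top-3 candidate plans. The base LLaMA2-Chat-7B checkpoint is shown for illustration; the paper's planning module is its instruction-finetuned version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base checkpoint shown here; the paper fine-tunes this model as the planning module.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Please generate a valid relation path that can be helpful for answering "
          "the following question: Who is the brother of Justin Bieber?")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=3,              # beam search, as in the paper's setup ...
    num_return_sequences=3,   # ... keeping the top-3 relation paths
    max_new_tokens=64,
)
plans = [tokenizer.decode(seq[inputs.input_ids.shape[1]:], skip_special_tokens=True)
         for seq in outputs]
```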

Table 2: Performance comparison with different baselines on the two KGQA datasets.

| Type | Methods | WebQSP Hits@1 | WebQSP F1 | CWQ Hits@1 | CWQ F1 |
|---|---|---|---|---|---|
| Embedding | KV-Mem (Miller et al., 2016) | 46.7 | 34.5 | 18.4 | 15.7 |
| | EmbedKGQA (Saxena et al., 2020) | 66.6 | - | 45.9 | - |
| | NSM (He et al., 2021) | 68.7 | 62.8 | 47.6 | 42.4 |
| | TransferNet (Shi et al., 2021) | 71.4 | - | 48.6 | - |
| | KGT5 (Saxena et al., 2022) | 56.1 | - | 36.5 | - |
| Retrieval | GraftNet (Sun et al., 2018) | 66.4 | 60.4 | 36.8 | 32.7 |
| | PullNet (Sun et al., 2019) | 68.1 | - | 45.9 | - |
| | SR+NSM (Zhang et al., 2022) | 68.9 | 64.1 | 50.2 | 47.1 |
| | SR+NSM+E2E (Zhang et al., 2022) | 69.5 | 64.1 | 49.3 | 46.3 |
| Semantic Parsing | SPARQA (Sun et al., 2020) | - | - | 31.6 | - |
| | QGG (Lan & Jiang, 2020) | 73.0 | 73.8 | 36.9 | 37.4 |
| | ArcaneQA (Gu & Su, 2022) | - | 75.3 | - | - |
| | RnG-KBQA (Ye et al., 2022) | - | 76.2 | - | - |
| LLMs | Flan-T5-xl (Chung et al., 2022) | 31.0 | - | 14.7 | - |
| | Alpaca-7B (Taori et al., 2023) | 51.8 | - | 27.4 | - |
| | LLaMA2-Chat-7B (Touvron et al., 2023) | 64.4 | - | 34.6 | - |
| | ChatGPT | 66.8 | - | 39.9 | - |
| | ChatGPT+CoT | 75.6 | - | 48.9 | - |
| LLMs+KGs | KD-CoT (Wang et al., 2023b) | 68.6 | 52.5 | 55.7 | - |
| | UniKGQA (Jiang et al., 2022) | 77.2 | 72.2 | 51.2 | 49.1 |
| | DECAF (DPR+FiD-3B) (Yu et al., 2022a) | 82.1 | 78.8 | - | - |
| | RoG | 85.7 | 70.8 | 62.6 | 56.2 |

5.2 RQ1: KGQA Performance Comparison

Main Results. In this section, we compare RoG with other baselines on KGQA tasks. The results are shown in Table 2. Our method achieves the best performance on both datasets across most metrics. Specifically, compared to the SOTA method DECAF (Yu et al., 2022a) on WebQSP, our method improves Hits@1 by 4.4%. On the CWQ dataset, which is more challenging due to multi-hop questions, our method improves both Hits@1 and F1 by 22.3% and 14.4% against the SOTA model UniKGQA (Jiang et al., 2022). These results demonstrate the superior reasoning ability of our method in KGQA.

Among other methods, retrieval-augmented approaches outperform conventional embedding-based methods by retrieving relevant subgraphs from KGs, which reduces reasoning complexity. Furthermore, SR+NSM and SR+NSM+E2E adopt relation-path-based retrieval and achieve better performance, highlighting the importance of relation paths. Semantic parsing methods perform better than retrieval methods on WebQSP but worse on CWQ, due to the complexity of generating logical queries for the complex questions in CWQ. Although LLM-based methods achieve comparable performance, they are limited by hallucinations and lack of knowledge, as shown in Section 5.5. The LLMs+KGs methods achieve the second-best performance, which demonstrates the effectiveness of unifying KGs and LLMs for reasoning.

Ablation Study. We conduct an ablation study to analyze the effectiveness of the planning module and reasoning module in our method (RoG). We compare four variants: 1) w/o planning, where we remove the planning module and perform reasoning without retrieved reasoning paths; 2) w/o reasoning, where we remove the reasoning module and use all answers from retrieved reasoning paths as results; 3) w/ random plans, where we randomly retrieve reasoning paths from KGs and feed them into the reasoning module; 4) w/ vote reasoning, where we adopt majority voting to select the top-5 answers from retrieved reasoning paths. The results are shown in Table 3.

Table 3: Ablation studies of RoG.

| Method | WebQSP Precision | WebQSP Recall | WebQSP F1 | CWQ Precision | CWQ Recall | CWQ F1 |
|---|---|---|---|---|---|---|
| RoG | 74.77 | 75.84 | 70.81 | 57.69 | 58.19 | 56.17 |
| RoG w/o planning | 57.26 | 50.16 | 49.69 | 35.35 | 34.77 | 33.76 |
| RoG w/o reasoning | 46.90 | 79.85 | 49.56 | 18.88 | 67.89 | 22.26 |
| RoG w/ random plans | 38.66 | 38.31 | 35.24 | 38.99 | 39.29 | 37.64 |
| RoG w/ vote reasoning | 54.80 | 60.44 | 47.96 | 22.92 | 47.98 | 26.52 |

From the results, it is evident that without the planning module, our method degenerates to a conventional LLM that relies solely on the question as input, suffering from the lack-of-knowledge issue. Although removing the reasoning module leads to high recall due to an increased number of answers, precision drops significantly because of noise in the retrieved paths. This demonstrates the effectiveness of the reasoning module in identifying important reasoning paths and filtering out noise. Furthermore, using random plans performs worse than removing the planning module, highlighting the importance of a planning module that generates faithful reasoning plans. Using simple majority-vote reasoning improves the results, which also demonstrates the necessity of the reasoning module.

5.3 RQ2: Plug-and-Play RoG Planning Module

Table 4: Effects of integrating the planning module of RoG with different LLMs for reasoning.

| Methods | WebQSP Hits@1 | WebQSP Recall | CWQ Hits@1 | CWQ Recall |
|---|---|---|---|---|
| ChatGPT | 66.77 | 49.27 | 39.90 | 35.07 |
| ChatGPT + RoG Planning | 81.51 | 71.60 | 52.68 | 48.51 |
| Alpaca-7B | 51.78 | 33.65 | 27.44 | 23.62 |
| Alpaca-7B + RoG Planning | 56.16 | 74.20 | 44.04 | 38.46 |
| LLaMA2-Chat-7B | 64.37 | 44.61 | 34.60 | 29.91 |
| LLaMA2-Chat-7B + RoG Planning | 74.20 | 56.16 | 56.41 | 51.99 |
| Flan-T5-xl | 30.95 | 17.08 | 14.69 | 12.25 |
| Flan-T5-xl + RoG Planning | 67.87 | 44.93 | 37.81 | 32.57 |

In this section, we evaluate the effectiveness of integrating the planning module of RoG with different LLMs during inference to improve their performance. Specifically, we first adopt the planning module of RoG to generate relation paths and feed the retrieved reasoning paths as context into different LLMs for reasoning. The results are presented in Table 4. Because it is difficult to extract the exact number of answers from an LLM's output, we only report the Hits@1 and Recall metrics.

From the results, we can see that the performance of all LLMs is substantially improved by integrating the planning module of RoG. Specifically, the Hits@1 of ChatGPT, Alpaca, LLaMA2, and Flan-T5 improves by 22.1%, 8.5%, 15.3%, and 119.3%, respectively (computed from Table 4). This demonstrates that the planning module of RoG can be seamlessly integrated with other LLMs to improve their performance without retraining.

Table 5: Performance of RoG on MetaQA-3hop.

| Strategies | Hits@1 | F1 |
|---|---|---|
| RoG (train from scratch) | 84.81 | 41.32 |
| RoG (transfer from Freebase) | 88.98 | 50.68 |

5.4 RQ3: Transferability to Other KGs

We also evaluate the transferability of RoG to other KGs. We select the MetaQA-3hop dataset (Zhang et al., 2018), which is based on the Wiki-Movies KG (https://research.fb.com/downloads/babi). We select 1,000 samples from the training split and use two training strategies to finetune RoG: 1) training from scratch, where we directly optimize RoG with the 1,000 samples; 2) transfer from Freebase, where we further finetune the RoG model trained on Freebase. The results are shown in Table 5. From the results, we can see that transfer from Freebase achieves better performance than training from scratch, which demonstrates the transferability of RoG to other KGs.

Figure 3: Faithfulness of the top-$K$ generated relation paths. Green bars denote the average number of retrieved reasoning paths, solid lines denote the answer coverage of the retrieved paths, and the dashed line denotes the answer coverage of the reasoning module based on the retrieved reasoning paths.
Table 6: Example of lack of knowledge for ChatGPT + CoT and interpretable predictions of RoG.

Question: What type of government is used in the country with Northern District?
Answer: Parliamentary system
ChatGPT + CoT: To determine what type of government is used in a country with a Northern District, we need more specific information. The term "Northern District" does not specify the name of any particular country or provide enough context to identify its government type. Could you please provide additional details or clarify the specific country you are referring to?
Reasoning Paths: Northern District → location.administrative_division.first_level_division_of → Israel → government.form_of_government.countries → Parliamentary system
RoG: The type of government used in the country with the Northern District is a Parliamentary system. Explanations: 1. Northern District → location.administrative_division.first_level_division_of → Israel: The Northern District is a part of Israel. 2. Israel → government.form_of_government.countries → Parliamentary system: Israel has a Parliamentary system of government. So, by following the reasoning paths provided, we can conclude that the country with the Northern District, which is Israel, uses a Parliamentary system of government.
Table 7: Example of the hallucination of ChatGPT + CoT and faithful reasoning of RoG.

Question: Where is the home stadium of the team who won the 1946 World Series championship?
Answer: Busch Stadium
ChatGPT + CoT: Step 1: Identify the team that won the 1946 World Series championship. The team that won the 1946 World Series championship was the St. Louis Cardinals. Step 2: Determine the home stadium of the St. Louis Cardinals during that time. The St. Louis Cardinals' home stadium during the 1946 season was Sportsman's Park. Therefore, the possible answer to the question "Where is the home stadium of the team who won the 1946 World Series championship?" is: - Sportsman's Park.
Reasoning Paths: 1946 World Series → sports.sports_team.championships → St. Louis Cardinals → sports.sports_team.arena_stadium → Busch Stadium; 1946 World Series → sports.sports_team.championships → St. Louis Cardinals → sports.sports_team.arena_stadium → Roger Dean Stadium
RoG: Based on the reasoning paths provided, the home stadium of the team who won the 1946 World Series championship is Busch Stadium. This conclusion is based on the following path: 1. 1946 World Series → sports.sports_team.championships → St. Louis Cardinals → sports.sports_team.arena_stadium → Busch Stadium. The reasoning path leads us to the St. Louis Cardinals as the team that won the 1946 World Series, and Busch Stadium is the stadium associated with the St. Louis Cardinals. Therefore, Busch Stadium is the home stadium of the team that won the 1946 World Series championship.

5.5 RQ4: Faithful Reasoning and Interpretable Results

Quantitative Results. To evaluate the faithfulness of the relation paths, we present the quantitative results in Figure 3. In the experiments, we sweep the number of top-$K$ relation paths generated by RoG. From the results, we can see that the number of retrieved reasoning paths increases with $K$, which also leads to more answers being covered (recall). This demonstrates that the relation paths are faithful, since following them retrieves the answers. However, more retrieved reasoning paths also introduce more noise and retrieval time (shown in Section A.6.1), which decreases precision and contributes little to the final results (reasoning F1). Therefore, we set $K=3$ in the experiments.

Case studies. We also present two case studies in Table 6 and Table 7. In Table 6, we can see that ChatGPT+CoT suffers from the lack-of-knowledge issue and cannot answer the question. On the contrary, RoG can generate faithful relation paths and retrieve valid reasoning paths from KGs for reasoning. Besides, RoG can provide interpretable explanations based on the reasoning paths. In Table 7, we can see that ChatGPT+CoT suffers from hallucination and generates an incorrect answer. In contrast, although the retrieved reasoning paths contain noise, the reasoning module can identify the correct reasoning paths and conduct faithful reasoning. These results demonstrate the effectiveness of RoG in conducting faithful reasoning and generating interpretable results. More cases can be found in Sections A.7 and A.8.

6 Conclusion

In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to conduct faithful and interpretable reasoning. To address the issues of hallucinations and lack of knowledge, we present a planning-retrieval-reasoning framework, which allows LLMs to access the latest knowledge while reasoning based on faithful plans on graphs. RoG not only enhances the reasoning capability of LLMs by distilling knowledge from KGs through training but also enables seamless integration with any LLMs during inference. Extensive experiments on two benchmark KGQA datasets demonstrate the superiority of RoG in reasoning ability and interpretability.

References

  • Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
  • Besta et al. (2023) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023.
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp.  1247–1250, 2008.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  • Creswell & Shanahan (2022) Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022.
  • Gu & Su (2022) Yu Gu and Yu Su. Arcaneqa: Dynamic program induction and contextualized encoding for knowledge base question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pp.  1718–1731, 2022.
  • He et al. (2021) Gaole He, Yunshi Lan, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. Improving multi-hop knowledge base question answering by learning intermediate supervision signals. In Proceedings of the 14th ACM international conference on web search and data mining, pp.  553–561, 2021.
  • He et al. (2022) Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language model inference. arXiv preprint arXiv:2301.00303, 2022.
  • Hong et al. (2023) Ruixin Hong, Hongming Zhang, Hong Zhao, Dong Yu, and Changshui Zhang. Faithful question answering with monte-carlo planning. ACL2023, 2023.
  • Hu et al. (2023) Nan Hu, Yike Wu, Guilin Qi, Dehai Min, Jiaoyan Chen, Jeff Z Pan, and Zafar Ali. An empirical study of pre-trained language models in simple knowledge graph question answering. World Wide Web, pp.  1–32, 2023.
  • Huang & Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. ACL 2023, 2023.
  • Izacard & Grave (2021) Gautier Izacard and Édouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 874–880, 2021.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
  • Jiang et al. (2022) Jinhao Jiang, Kun Zhou, Xin Zhao, and Ji-Rong Wen. Unikgqa: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph. In The Eleventh International Conference on Learning Representations, 2022.
  • Jordan et al. (1999) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37:183–233, 1999.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  6769–6781, 2020.
  • Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, 2022.
  • Lan & Jiang (2020) Yunshi Lan and Jing Jiang. Query graph generation for answering multi-hop complex questions from knowledge bases. Association for Computational Linguistics, 2020.
  • Li et al. (2023) Shiyang Li, Yifan Gao, Haoming Jiang, Qingyu Yin, Zheng Li, Xifeng Yan, Chao Zhang, and Bing Yin. Graph reasoning for question answering with triplet retrieval. In Findings of the Association for Computational Linguistics: ACL 2023, 2023.
  • Luo et al. (2023) Linhao Luo, Jiaxin Ju, Bo Xiong, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Chatrule: Mining logical rules with large language models for knowledge graph reasoning. arXiv preprint arXiv:2309.01538, 2023.
  • Miller et al. (2016) Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.  1400–1409, 2016.
  • Pan et al. (2023) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. Unifying large language models and knowledge graphs: A roadmap. arXiv preprint arXiv:2306.08302, 2023.
  • Pan et al. (2022) Xiaoman Pan, Wenlin Yao, Hongming Zhang, Dian Yu, Dong Yu, and Jianshu Chen. Knowledge-in-context: Towards knowledgeable semi-parametric language models. In The Eleventh International Conference on Learning Representations, 2022.
  • Saxena et al. (2020) Apoorv Saxena, Aditay Tripathi, and Partha Talukdar. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp.  4498–4507, 2020.
  • Saxena et al. (2022) Apoorv Saxena, Adrian Kochsiek, and Rainer Gemulla. Sequence-to-sequence knowledge graph completion and question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2814–2828, 2022.
  • Shi et al. (2021) Jiaxin Shi, Shulin Cao, Lei Hou, Juanzi Li, and Hanwang Zhang. Transfernet: An effective and transparent framework for multi-hop question answering over relation graph. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  4149–4158, 2021.
  • Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  4231–4242, 2018.
  • Sun et al. (2019) Haitian Sun, Tania Bedrax-Weiss, and William Cohen. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2380–2390, 2019.
  • Sun et al. (2020) Yawei Sun, Lingling Zhang, Gong Cheng, and Yuzhong Qu. Sparqa: skeleton-based semantic parsing for complex questions over knowledge bases. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  8952–8959, 2020.
  • Tafjord et al. (2022) Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Entailer: Answering questions with faithful and truthful chains of reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  2078–2093, 2022.
  • Talmor & Berant (2018) Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  641–651, 2018.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: an instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. (2021) Hongwei Wang, Hongyu Ren, and Jure Leskovec. Relational message passing for knowledge graph completion. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp.  1697–1707, 2021.
  • Wang et al. (2023a) Jiapu Wang, Boyue Wang, Meikang Qiu, Shirui Pan, Bo Xiong, Heng Liu, Linhao Luo, Tengfei Liu, Yongli Hu, Baocai Yin, et al. A survey on temporal knowledge graph completion: Taxonomy, progress, and prospects. arXiv preprint arXiv:2308.02457, 2023a.
  • Wang et al. (2023b) Keheng Wang, Feiyu Duan, Sirui Wang, Peiguang Li, Yunsen Xian, Chuantao Yin, Wenge Rong, and Zhang Xiong. Knowledge-driven cot: Exploring faithful reasoning in llms for knowledge-intensive question answering. arXiv preprint arXiv:2308.13259, 2023b.
  • Wang et al. (2023c) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023c.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Xu et al. (2022) Xiaohan Xu, Peng Zhang, Yongquan He, Chengpeng Chao, and Chaoyang Yan. Subgraph neighboring relations infomax for inductive link prediction on knowledge graphs. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022.
  • Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  • Yasunaga et al. (2021) Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. Qa-gnn: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  535–546, 2021.
  • Ye et al. (2022) Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  6032–6043, 2022.
  • Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  201–206, 2016.
  • Yu et al. (2022a) Donghan Yu, Sheng Zhang, Patrick Ng, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Yiqun Hu, William Yang Wang, Zhiguo Wang, and Bing Xiang. Decaf: Joint decoding of answers and logical forms for question answering over knowledge bases. In The Eleventh International Conference on Learning Representations, 2022a.
  • Yu et al. (2022b) Donghan Yu, Chenguang Zhu, Yuwei Fang, Wenhao Yu, Shuohang Wang, Yichong Xu, Xiang Ren, Yiming Yang, and Michael Zeng. Kg-fid: Infusing knowledge graph in fusion-in-decoder for open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  4961–4974, 2022b.
  • Zhang et al. (2022) Jing Zhang, Xiaokang Zhang, Jifan Yu, Jian Tang, Jie Tang, Cuiping Li, and Hong Chen. Subgraph retrieval enhanced model for multi-hop knowledge base question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  5773–5784, 2022.
  • Zhang et al. (2021) Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. Greaselm: Graph reasoning enhanced language models. In International conference on learning representations, 2021.
  • Zhang et al. (2018) Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander Smola, and Le Song. Variational reasoning for question answering with knowledge graph. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

Appendix A Appendix

A.1 Detailed Related Work

A.1.1 LLM Reasoning Prompt

Many studies have been proposed to harness the reasoning ability of LLMs to handle complex tasks through prompting (Wei et al., 2022; Wang et al., 2022; Yao et al., 2023; Besta et al., 2023). Chain-of-Thought (Wei et al., 2022) enables LLMs to generate a reasoning chain that can be helpful for reasoning. Tree of Thoughts (Yao et al., 2023) expands the reasoning chain into a tree structure to explore more reasoning paths. Graph of Thoughts (Besta et al., 2023) further models the reasoning chain as a graph with an aggregation operation to synergize the reasoning paths. Plan-and-solve (Wang et al., 2023c) prompts LLMs to generate a plan and execute it. DecomP (Khot et al., 2022) prompts LLMs to decompose the reasoning task into a series of sub-tasks and solve them step by step. However, the problems of hallucination and lack of knowledge affect the faithfulness of the reasoning. ReACT (Yao et al., 2022) treats LLMs as agents that interact with the environment to get the latest knowledge for reasoning. To explore faithful reasoning, Entailer (Tafjord et al., 2022) introduces a verifier to validate the reasoning steps generated by LLMs. Creswell & Shanahan (2022) present a framework including two LLMs that are used for selecting and generating reasoning steps, respectively. FAME (Hong et al., 2023) introduces Monte-Carlo planning to generate faithful reasoning steps. RR (He et al., 2022) and KD-CoT (Wang et al., 2023b) aim to retrieve relevant knowledge from KGs to produce faithful reasoning plans for LLMs.

A.1.2 Knowledge Graph Question Answering

Embedding-based methods model the entities and relations in an embedding space and design special model architectures to reason about answers. KV-Mem (Miller et al., 2016) adopts a key-value memory network to store triples for reasoning. EmbedKGQA (Saxena et al., 2020) and NSM (He et al., 2021) utilize sequential models to mimic the multi-hop reasoning process. QA-GNN (Yasunaga et al., 2021) and GreaseLM (Zhang et al., 2021) further adopt graph neural networks to capture the graph structure for reasoning. However, these methods require designing different model architectures, which limits their flexibility and generalizability.

Retrieval-augmented methods aim to retrieve relevant facts from KGs to improve reasoning performance. Early works adopt PageRank or random-walk algorithms to retrieve subgraphs from KGs for reasoning (Sun et al., 2018; 2019). However, they ignore the semantic information in questions, which leads to noisy retrieval results. Zhang et al. (2022) propose a relation-path-based subgraph retrieval, resulting in better retrieval and QA performance. Other lines of study retrieve triples from KGs via BM25 (Li et al., 2023) or DPR (Karpukhin et al., 2020; Yu et al., 2022b) to improve the performance of LLMs. These methods discard the structural information in KGs, which leads to suboptimal results. Recently, UniKGQA (Jiang et al., 2022) unifies the graph retrieval and reasoning process into a single model with LLMs, achieving state-of-the-art performance on KGQA tasks.

Semantic parsing methods parse the question into a structured query (e.g., SPARQL) that can be executed by a query engine to obtain answers (Sun et al., 2020; Lan & Jiang, 2020). ArcaneQA (Gu & Su, 2022) dynamically generates the query based on results from previous steps. RnG-KBQA (Ye et al., 2022) first enumerates all possible queries and then ranks them to get the final output. These methods heavily rely on the quality of the generated queries: if a query is not executable, no answers will be generated. DECAF (Yu et al., 2022a) combines semantic parsing and LLM reasoning to jointly generate answers, which also achieves strong performance on KGQA tasks.

A.2 Retrieval Algorithm

Given a question $q$ and a relation path $z$ as the plan, we adopt a constrained breadth-first search to retrieve the reasoning paths. The pseudocode is presented in Algorithm 1.

We first initialize a queue of current reasoning paths $\mathcal{Q}$ with the entities in the question $\mathcal{T}_q$ (lines 3-5). Then, we iteratively expand each reasoning path in $\mathcal{Q}$ by adding the triples that connect to the entities in the queue following the corresponding relation in the relation path (lines 11-19). A reasoning path is expanded until its length equals that of the relation path; completed paths are added to the result set $\mathcal{W}_z$ (lines 8-10).

Input: Question $q$, relation path $z = \{r_1, r_2, \dots, r_l\}$, KG $\mathcal{G}$.
Output: Reasoning paths $\mathcal{W}_z$.

1:  $\mathcal{W}_z \leftarrow \emptyset$;
2:  $\mathcal{Q} \leftarrow$ Queue();
3:  foreach $e_q \in \mathcal{T}_q$ do
4:      $\mathcal{Q}$.append(($e_q$, []));          // Initialize queue with question entities.
5:  end foreach
6:  while $\mathcal{Q} \neq \emptyset$ do
7:      $(s, w_z) \leftarrow \mathcal{Q}$.pop();
8:      if len($w_z$) = len($z$) then
9:          $\mathcal{W}_z$.append($w_z$);          // A complete path is added to the results.
10:     end if
11:     if len($w_z$) < len($z$) then
12:         $r \leftarrow z[\text{len}(w_z) + 1]$;  // Get relation for next step.
13:         foreach $(s, r', t) \in \mathcal{G}$ do
14:             if $r' = r$ then
15:                 $w'_z \leftarrow w_z + [(s, r, t)]$;  // Expand the reasoning path.
16:                 $\mathcal{Q}$.append(($t$, $w'_z$));
17:             end if
18:         end foreach
19:     end if
20: end while
21: return $\mathcal{W}_z$;

Algorithm 1: Retrieve reasoning paths based on relation paths.
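
For concreteness, the constrained breadth-first search can be sketched in Python as follows. This is a minimal sketch with our own function and variable names (not the released implementation), assuming the KG is given as a list of (head, relation, tail) triples:

from collections import deque

def retrieve_reasoning_paths(question_entities, relation_path, kg_triples):
    # Index triples by (head, relation) so each expansion step is a lookup.
    index = {}
    for h, r, t in kg_triples:
        index.setdefault((h, r), []).append(t)

    results = []
    queue = deque((e, []) for e in question_entities)  # (entity, path so far)
    while queue:
        entity, path = queue.popleft()
        if len(path) == len(relation_path):
            results.append(path)              # Path matches the full plan.
            continue
        r = relation_path[len(path)]          # Relation for the next hop.
        for t in index.get((entity, r), []):
            queue.append((t, path + [(entity, r, t)]))
    return results

For example, with the triples from Table 15, calling retrieve_reasoning_paths(["Lou Seal"], ["sports.mascot.team", "sports.sports_championship_event.champion"], kg) would return the single path Lou Seal \to San Francisco Giants \to 2014 World Series.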

A.3 Datasets

We adopt two benchmark KGQA datasets in this work: WebQuestionSP (WebQSP) (Yih et al., 2016; https://www.microsoft.com/en-us/download/details.aspx?id=52763) and Complex WebQuestions (CWQ) (Talmor & Berant, 2018; https://www.tau-nlp.sites.tau.ac.il/compwebq). We follow previous works (Sun et al., 2018; Jiang et al., 2022) and use the same train and test splits for fair comparison. The statistics of answer numbers and reasoning hops are presented in Table 8 and Table 9, respectively.

To evaluate the transferability of RoG to other KGs, we further select the MetaQA-3hop dataset (Zhang et al., 2018), which is based on the Wiki-Movies KG (https://research.fb.com/downloads/babi). We select 1,000 samples from the training split. The statistics of the dataset are presented in Table 10.

Both WebQSP and CWQ can be answered over the Freebase KG (Bollacker et al., 2008; https://github.com/microsoft/FastRDFStore). To reduce the size of the KG, following previous works (He et al., 2021; Jiang et al., 2022), we construct a subgraph of Freebase by extracting all triples within the maximum reasoning hops of the question entities in WebQSP and CWQ. Similarly, we construct a subgraph of the Wiki-Movies KG for MetaQA-3hop. The statistics of the constructed KGs are presented in Table 11.
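
As an illustration of this preprocessing step, a hop-limited subgraph extraction can be sketched as follows. This is a simplified, directed-only sketch with our own function names; the actual pipeline follows He et al. (2021) and Jiang et al. (2022):

def extract_subgraph(kg_triples, question_entities, max_hops):
    # Hop-limited frontier expansion starting from the question entities.
    by_head = {}
    for h, r, t in kg_triples:
        by_head.setdefault(h, []).append((h, r, t))

    subgraph, frontier = set(), set(question_entities)
    for _ in range(max_hops):
        next_frontier = set()
        for e in frontier:
            for triple in by_head.get(e, []):
                if triple not in subgraph:
                    subgraph.add(triple)
                    next_frontier.add(triple[2])  # Tail entity joins the frontier.
        frontier = next_frontier
    return subgraph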

Table 8: Statistics of the number of answers for questions in WebQSP and CWQ.
Dataset | #Ans = 1 | 2 \leq #Ans \leq 4 | 5 \leq #Ans \leq 9 | #Ans \geq 10
WebQSP  | 51.2%    | 27.4%              | 8.3%               | 12.1%
CWQ     | 70.6%    | 19.4%              | 6%                 | 4%
Table 9: Statistics of the hops of questions in WebQSP and CWQ.
Dataset | 1 hop  | 2 hop  | \geq 3 hop
WebQSP  | 65.49% | 34.51% | 0.00%
CWQ     | 40.91% | 38.34% | 20.75%
Table 10: Statistics of the MetaQA-3hop dataset.
Dataset     | #Train | #Test  | #hop
MetaQA-3hop | 1,000  | 14,274 | 3
Table 11: Statistics of constructed knowledge graphs.
KG         | #Entities | #Relations | #Triples
Freebase   | 2,566,291 | 7,058      | 8,309,195
Wiki-Movie | 43,234    | 9          | 133,582

A.4 Baselines

We compare RoG with 21 baselines grouped into 5 categories: 1) Embedding-based methods, 2) Retrieval-augmented methods, 3) Semantic parsing methods, 4) LLMs, and 5) LLMs+KGs methods. The details of each baseline are described as follows.

Embedding-based methods.

  • KV-Mem (Miller et al., 2016) adopts a Key-Value memory network to store triples and perform multi-hop reasoning by iterative operating on the memory.

  • EmbedKGQA (Saxena et al., 2020) models reasoning on KGs as a sequential link prediction problem using the embeddings of entities and questions.

  • NSM (He et al., 2021) utilizes the sequential model to mimic the multi-hop reasoning process.

  • TransferNet (Shi et al., 2021) adopts a graph neural network to capture the relevance between entities and questions for reasoning.

  • KGT5 (Saxena et al., 2022) finetunes a sequence-to-sequence framework on KGs and generates answers based on the input question.

Retrieval-augmented methods.

  • GraftNet (Sun et al., 2018) retrieves relevant subgraphs from KGs with entity linking.

  • PullNet (Sun et al., 2019) trains a retrieval model composed of an LSTM and a graph neural network to retrieve a question-specific subgraph.

  • SR+NSM (Zhang et al., 2022) proposes a relation-path retrieval to retrieve subgraphs for multi-hop reasoning.

  • SR+NSM+E2E (Zhang et al., 2022) further adopts an end-to-end training strategy to jointly train the retrieval and reasoning modules of SR+NSM.

Semantic parsing methods.

  • SPARQA (Sun et al., 2020) presents a novel skeleton grammar to represent the high-level structure of a complex question with language models.

  • QGG (Lan & Jiang, 2020) generates a query graph for a question by simultaneously adding constraints and extending relation paths.

  • ArcaneQA (Gu & Su, 2022) dynamically generates the query based on results from previous steps.

  • RnG-KBQA (Ye et al., 2022) first enumerates all possible queries and then ranks them to get the final output.

Large language models (LLMs).

  • Flan-T5 (Chung et al., 2022) is an enhanced version of T5 that is instruction-finetuned on a mixture of tasks.

  • Alpaca (Taori et al., 2023) is based on LLaMA and finetuned on an instruction-following dataset.

  • LLaMA2-Chat (Touvron et al., 2023) is a large language model that is optimized for dialogue purposes.

  • ChatGPT (https://openai.com/blog/chatgpt) is a powerful closed-source LLM that can follow instructions to conduct complex tasks.

  • ChatGPT+CoT (Wei et al., 2022) uses a Chain-of-Thought prompt to improve the reasoning ability of ChatGPT.

LLMs+KGs methods.

  • KD-CoT (Wang et al., 2023b) retrieves relevant knowledge from KGs to generate faithful reasoning plans for LLMs.

  • UniKGQA (Jiang et al., 2022) unifies the graph retrieval and reasoning process into a single model with LLMs, which achieves state-of-the-art performance on KGQA tasks.

  • DECAF (Yu et al., 2022a) combines semantic parsing and LLM reasoning to jointly generate answers, which also achieves strong performance on KGQA tasks.

A.5 Implementation Settings

For RoG, we use LLaMA2-Chat-7B (Touvron et al., 2023) as the LLM backbone, which is instruction-finetuned on the training splits of WebQSP and CWQ as well as Freebase for 3 epochs. The batch size is set to 4 and the learning rate to 2e-5. We use the cosine learning rate scheduler with the warmup ratio set to 0.03. Training is conducted on 2 A100-80G GPUs for 38 hours. During inference, we first adopt the LLM to generate the top-$K$ relation paths with the highest probability as plans. Then, we adopt Algorithm 1 to retrieve reasoning paths, which are fed into the LLM to reason the final answers.
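
As a hedged sketch of this inference-time planning step, beam search with the Hugging Face transformers API can return the top-$K$ highest-probability relation paths; the exact decoding configuration of the released code may differ:

from transformers import AutoModelForCausalLM, AutoTokenizer

K = 3  # number of relation-path plans to decode

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Planning prompt from Appendix A.9, with an example question from Table 14.
prompt = ("Please generate a valid relation path that can be helpful for "
          "answering the following question: what does jamaican people speak?")
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search keeps the K most probable sequences as candidate plans.
outputs = model.generate(**inputs, max_new_tokens=64,
                         num_beams=K, num_return_sequences=K)
# Note: the decoded text includes the prompt prefix.
plans = tokenizer.batch_decode(outputs, skip_special_tokens=True)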

For LLM baselines, we use zero-shot prompting to conduct KGQA, which directly asks LLMs to answer the question. For other baselines, we directly copy the results reported in UniKGQA (Jiang et al., 2022) and DECAF (Yu et al., 2022a) for comparison.

A.6 Additional Experiment Results

A.6.1 Retrieval Costs

We present the retrieval time and the number of retrieved reasoning paths in Figure 4. From the results, we can see that the retrieval time increases with the number of top-$K$ relation paths. Therefore, we should select a proper $K$ to balance the retrieval time and the number of retrieved reasoning paths. In experiments, we set $K=3$.

Figure 4: Average retrieval time and average number of retrieved reasoning paths w.r.t. the number of top-$K$ relation paths.

A.6.2 Performance on Different Hops

We present the performance of RoG and its variants on questions with different numbers of hops in Table 12. From the results, we can see that RoG outperforms its variants across all hop counts, especially on questions requiring 3 or more hops. This demonstrates the importance of relation paths for improving the reasoning performance of LLMs on complex questions.

Table 12: F1 scores of RoG and its variants for different hops of questions.
Methods           | WebQSP: 1 hop | 2 hop | \geq 3 hop | CWQ: 1 hop | 2 hop | \geq 3 hop
RoG               | 77.03         | 64.86 | -          | 62.88      | 58.46 | 37.82
RoG w/o reasoning | 57.06         | 25.49 | -          | 17.06      | 34.25 | 17.07
RoG w/o planning  | 50.33         | 51.66 | -          | 31.04      | 33.93 | 23.29

A.6.3 Performance on Different Answer Numbers

We also present the performance of RoG and its variants on questions with different numbers of answers in Table 13. From the results, we can see that RoG outperforms its variants across all answer-count groups. Specifically, as the number of answers increases, the performance of RoG w/o planning drops significantly due to the lack of knowledge from KGs. Although RoG w/o reasoning can retrieve more candidate answers, which improves its performance on questions with many answers, it is still inferior to RoG because it lacks the reasoning ability to filter out noise.

Table 13: F1 scores of RoG and its variants for questions with varying numbers of answers.
Methods           | WebQSP: #Ans = 1 | 2 \leq #Ans \leq 4 | 5 \leq #Ans \leq 9 | #Ans \geq 10 | CWQ: #Ans = 1 | 2 \leq #Ans \leq 4 | 5 \leq #Ans \leq 9 | #Ans \geq 10
RoG               | 67.89            | 79.39              | 75.04              | 58.33        | 56.90         | 53.73              | 58.36              | 43.62
RoG w/o reasoning | 33.49            | 52.80              | 58.05              | 66.01        | 16.61         | 27.06              | 40.10              | 34.45
RoG w/o planning  | 55.03            | 51.08              | 44.81              | 27.00        | 34.08         | 34.16              | 31.67              | 25.21

A.7 Case Studies: Relation Paths

We illustrate several examples of relation paths generated by RoG in Table 14.

A.8 Case Studies: Interpretable Results

We illustrate several examples of interpretable reasoning results generated by RoG in Table 15.

A.9 Prompts

The planning module aims to generate faithful relation paths as plans for answering the question. The instruction template is presented as follows:

Planning Prompt Template:
Please generate a valid relation path that can be helpful for answering the following question: <Question>

where <Question> indicates the question.

The reasoning module takes the question $q$ and a set of reasoning paths $\mathcal{W}_z$ to generate the answers $a$. The instruction template is presented as follows:

Reasoning Prompt Template:
Based on the reasoning paths, please answer the given question. Please keep the answer as simple as possible and return all the possible answers as a list.
Reasoning Paths: <Reasoning Paths>
Question: <Question>

where <Reasoning Paths> denotes the retrieved reasoning paths $\mathcal{W}_z$, which are formatted as a series of structural sentences:

$e_0 \to r_1 \to e_1 \to \dots \to r_l \to e_l$

$\dots$

$e_0 \to r_1 \to e_1 \to \dots \to r_l \to e_l$.
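
A small helper that renders retrieved paths into these structural sentences might look like this (our own sketch; the released code may format paths differently):

def format_reasoning_paths(paths):
    # Each path is a non-empty list of (head, relation, tail) triples.
    lines = []
    for path in paths:
        tokens = [path[0][0]]        # Start from the first head entity.
        for _, r, t in path:
            tokens += [r, t]         # Alternate relation and entity.
        lines.append(" -> ".join(tokens))
    return "\n".join(lines)

# format_reasoning_paths([[("Lou Seal", "sports.mascot.team", "San Francisco Giants")]])
# -> "Lou Seal -> sports.mascot.team -> San Francisco Giants"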

To exploit the explanation ability of RoG, we design a new instruction template for the reasoning module to generate interpretable results. The instruction template is presented as follows:

Explanation Prompt Template:
Based on the reasoning paths, please answer the given question and explain why. Here are some examples: <Examples>
Reasoning Paths: <Reasoning Paths>
Question: <Question>

where <Examples> denotes a few human-annotated examples that demonstrate the explanation process.

Table 14: Examples of the generated relation paths (question and top-3 relation paths).

Question: what does jamaican people speak?
  z_1: location.country.languages_spoken
  z_2: language.human_language.countries_spoken_in
  z_3: location.country.official_language

Question: where is jamarcus russell from?
  z_1: location.location.people_born_here
  z_2: people.person.place_of_birth
  z_3: sports.sports_league_draft_pick.player \to sports.sports_league_draft_pick.location

Question: where did edgar allan poe died?
  z_1: people.deceased_person.place_of_death
  z_2: people.cause_of_death.people
  z_3: people.person.place_of_birth

Question: what highschool did harper lee go to?
  z_1: people.person.education \to education.educational_institution.students_graduates
  z_2: education.education.student \to education.educational_institution.students_graduates
  z_3: people.person.education \to education.education.institution

Question: what are the songs that justin bieber wrote?
  z_1: music.recording.artist
  z_2: music.composition.composer
  z_3: music.composer.compositions

Question: what are the religions practiced in indonesia?
  z_1: people.person.nationality \to people.person.religion
  z_2: location.statistical_region.religions \to location.religion_percentage.religion
  z_3: location.country.languages_spoken \to religion.religion.languages

Question: Lou Seal is the mascot for the team that last won the World Series when?
  z_1: sports.mascot.team \to sports.sports_championship_event.champion
  z_2: sports.mascot.team \to sports.sports_team.championships
  z_3: sports.sports_championship_event.championship

Question: What country in the Caribbean contains Saint Michael Parish?
  z_1: location.administrative_division.first_level_division_of
  z_2: location.location.containedby
  z_3: location.location.contains

Question: What type of government is used in the country with Northern District?
  z_1: location.administrative_division.first_level_division_of \to government.form_of_government.countries
  z_2: location.administrative_division.first_level_division_of \to location.country.form_of_government
  z_3: administrative_division.first_level_division_of \to government.form_of_government.countries

Question: The people from the country that contains Nord-Ouest Department speak what languages today?
  z_1: location.administrative_division.first_level_division_of \to language.human_language.countries_spoken_in
  z_2: location.administrative_division.first_level_division_of \to location.country.languages_spoken
  z_3: base.aareas.schema.administrative_area.administrative_parent \to location.country.languages_spoken

Question: What stadium does the team with mascot named Hank play at?
  z_1: sports.mascot.team \to sports.sports_facility.teams
  z_2: sports.sports_team.team_mascot \to sports.sports_facility.teams
  z_3: sports.mascot.team \to sports.sports_team.arena_stadium

Question: Which popular sports team in Spain, that won the 2014 Eurocup Finals championship?
  z_1: sports.sports_team.championships \to sports.sports_team_location.teams
  z_2: sports.sports_team.location \to sports.sports_team_location.teams
  z_3: sports.sports_team_location.teams

Question: What educational institution with the mascot named Washington University Bear did Tennessee Williams go to?
  z_1: education.educational_institution.mascot
  z_2: people.person.nationality \to location.location.containedby
  z_3: education.education.student \to education.education.institution

Question: Who is the current head coach of the NFL squad owned by the Rooney family?
  z_1: sports.professional_sports_team.owner_s \to american_football.football_coach.current_team_head_coached
  z_2: sports.professional_sports_team.owner_s \to american_football.football_team.current_head_coach
  z_3: american_football.football_coach.current_team_head_coached

Question: What is the home field of the sports team whose mascot is named Fredbird?
  z_1: sports.mascot.team \to sports.sports_facility.teams
  z_2: sports.sports_team.team_mascot \to sports.sports_facility.teams
  z_3: sports.mascot.team \to sports.sports_team.arena_stadium
Table 15: Examples of faithful and interpretable reasoning of RoG.

Question: Lou Seal is the mascot for the team that last won the World Series when?
Answer: 2014
Reasoning Paths:
  Lou Seal \to sports.mascot.team \to San Francisco Giants \to sports.sports_championship_event.champion \to 2014 World Series
RoG: Based on the provided reasoning paths, we can determine when the team for which Lou Seal is the mascot last won the World Series: 1. Lou Seal is associated with the San Francisco Giants. 2. The San Francisco Giants are associated with winning the 2014 World Series. Therefore, the San Francisco Giants, the team for which Lou Seal is the mascot, last won the World Series in 2014.

Question: what is my timezone in louisiana?
Answer: Central Time Zone
Reasoning Paths:
  Louisiana \to location.location.partially_containedby \to Tangipahoa River \to geography.river.origin \to Osyka \to location.location.time_zones \to Central Time Zone
  Louisiana \to location.location.containedby \to Alexandria \to location.location.containedby \to United States of America \to location.location.time_zones \to Central Time Zone
RoG: Based on the reasoning paths provided, it can be determined that Louisiana is in the Central Time Zone. Here's the explanation: 1. The first reasoning path tells us that Louisiana is partially contained by the Tangipahoa River, which is a geographical feature. However, it doesn't directly mention the time zone. 2. The second reasoning path establishes that Louisiana is contained by Alexandria, which is in turn contained by the United States of America. The United States of America is in the Central Time Zone. Therefore, based on these reasoning paths, the time zone in Louisiana is the Central Time Zone.

Question: Which child of Walt Disney died from lung cancer?
Answer: Sharon Mae Disney
Reasoning Paths:
  Lung cancer \to people.deceased_person.cause_of_death \to Sharon Mae Disney
  Lung cancer \to people.deceased_person.cause_of_death \to Walt Disney
  Walt Disney \to people.deceased_person.cause_of_death \to Lung cancer
  Walt Disney \to people.person.children \to Diane Disney Miller
  Walt Disney \to people.cause_of_death.people \to Circulatory collapse
  Walt Disney \to people.person.parents \to Flora Call Disney
  Walt Disney \to people.person.parents \to Elias Disney
  Walt Disney \to people.person.parents \to Sharon Mae Disney
RoG: Based on the reasoning paths provided, it appears that Sharon Mae Disney is the child of Walt Disney who died from lung cancer. The reasoning path "Walt Disney \to people.deceased_person.cause_of_death \to Lung cancer" directly connects Walt Disney to lung cancer as a cause of death, and since Sharon Mae Disney is also linked to lung cancer as a cause of death through the path "Lung cancer \to people.deceased_person.cause_of_death \to Sharon Mae Disney," it can be concluded that Sharon Mae Disney is the child of Walt Disney who died from lung cancer.