Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning
Abstract
Large language models (LLMs) have demonstrated impressive reasoning abilities in complex tasks. However, they lack up-to-date knowledge and experience hallucinations during reasoning, which can lead to incorrect reasoning processes and diminish their performance and trustworthiness. Knowledge graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. Nevertheless, existing KG-based LLM reasoning methods only treat KGs as factual knowledge bases and overlook the importance of their structural information for reasoning.
In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to enable faithful and interpretable reasoning. Specifically, we present a planning-retrieval-reasoning framework, where RoG first generates relation paths grounded by KGs as faithful plans. These plans are then used to retrieve valid reasoning paths from the KGs for LLMs to conduct faithful reasoning. Furthermore, RoG not only distills knowledge from KGs to improve the reasoning ability of LLMs through training but also allows seamless integration with any arbitrary LLMs during inference. Extensive experiments on two benchmark KGQA datasets demonstrate that RoG achieves state-of-the-art performance on KG reasoning tasks and generates faithful and interpretable reasoning results. Code and data are available at: https://github.com/RManLuo/reasoning-on-graphs.
1 Introduction
Large language models (LLMs) have shown great performance in many NLP tasks (Brown et al., 2020; Bang et al., 2023). What’s especially striking is their ability to handle complex tasks through reasoning (Wei et al., 2022; Huang & Chang, 2023). To further unleash LLMs’ reasoning ability, the plan-and-solve paradigm (Wang et al., 2023c) has been proposed, in which LLMs are prompted to generate a plan and execute each reasoning step. In this way, LLMs decompose complex reasoning tasks into a series of sub-tasks and solve them step by step (Khot et al., 2022).
Despite their success, LLMs are still limited by the lack of knowledge and prone to hallucinations during reasoning, which can lead to errors in reasoning processes (Hong et al., 2023; Wang et al., 2023b). For example, as shown in Figure 1, LLMs do not have the latest knowledge and hallucinate an incorrect reasoning step: “has a daughter”. These issues largely diminish the performance and trustworthiness of LLMs in high-stakes scenarios, such as legal judgment and medical diagnosis.
To tackle these issues, knowledge graphs (KGs) have been incorporated to improve the reasoning ability of LLMs (Pan et al., 2023; Luo et al., 2023). KGs capture abundant factual knowledge in a structured format, which provides a faithful knowledge source for reasoning. As a typical reasoning task, knowledge graph question answering (KGQA) aims to obtain answers based on knowledge from KGs (Sun et al., 2019; Hu et al., 2023). Previous works that jointly use KGs and LLMs for KGQA reasoning can be broadly divided into two categories: 1) semantic parsing methods (Lan & Jiang, 2020; Ye et al., 2022), which use LLMs to convert questions into logical queries that are executed on KGs to obtain answers; and 2) retrieval-augmented methods (Li et al., 2023; Pan et al., 2022), which retrieve triples from KGs as knowledge context and use LLMs to obtain the final answers.
Although semantic parsing methods can generate more accurate and interpretable results by leveraging reasoning on KGs, the generated logical queries can often be non-executable and yield no answers, due to syntax and semantic limitations (Yu et al., 2022a). Retrieval-augmented methods are more flexible and exploit the ability of LLMs for reasoning. However, they only treat KGs as factual knowledge bases and overlook the importance of their structural information for reasoning (Jiang et al., 2022). For instance, as shown in Figure 1, a relation path (a sequence of relations) such as “child_of → has_son” can be used to obtain answers to the question “Who is the brother of Justin Bieber?”. Therefore, it is essential to enable LLMs to directly reason on KGs to achieve faithful and interpretable reasoning.
In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to conduct faithful and interpretable reasoning. To address the issues of hallucinations and lack of knowledge, we present a planning-retrieval-reasoning framework, where RoG first generates relation paths grounded by KGs as faithful plans via the planning module. These plans are then used to retrieve valid reasoning paths from KGs to conduct faithful reasoning by the retrieval-reasoning module. In this way, we not only retrieve the latest knowledge from KGs but also consider the guidance of KG structure for reasoning and explanations. Moreover, the planning module of RoG can be plug-and-play with different LLMs during inference to improve their performance.
Based on this framework, RoG is optimized by two tasks: 1) planning optimization, where we distill knowledge from KGs into LLMs to generate faithful relation paths as plans; and 2) retrieval-reasoning optimization, where we enable LLMs to conduct faithful reasoning based on retrieved paths and generate interpretable results. We conduct extensive experiments on two benchmark KGQA datasets, and the results demonstrate that RoG achieves state-of-the-art performance on KG reasoning tasks and generates faithful and interpretable reasoning results.
![Figure 1](/html/2310.01061/assets/x1.png)
2 Related Work
LLM Reasoning Prompt. Many studies have been proposed to harness the reasoning ability of LLMs to handle complex tasks through prompting (Wei et al., 2022; Wang et al., 2022; Yao et al., 2023; Besta et al., 2023). Plan-and-solve (Wang et al., 2023c) prompts LLMs to generate a plan and conduct reasoning based on it. DecomP (Khot et al., 2022) prompts LLMs to decompose the reasoning task into a series of sub-tasks and solve them step by step. However, hallucinations and the lack of knowledge affect the faithfulness of LLMs’ reasoning. ReACT (Yao et al., 2022) treats LLMs as agents that interact with the environment to get the latest knowledge for reasoning. To explore faithful reasoning, FAME (Hong et al., 2023) introduces Monte-Carlo planning to generate faithful reasoning steps. RR (He et al., 2022) and KD-CoT (Wang et al., 2023b) further retrieve relevant knowledge from KGs to produce faithful reasoning plans for LLMs.
Knowledge Graph Question Answering (KGQA). Conventional embedding-based methods represent the entities and relations in embedding space and design special model architectures (e.g., key-value memory networks, sequential models, and graph neural networks) to reason answers (Miller et al., 2016; He et al., 2021; Yasunaga et al., 2021). To integrate LLMs for KGQA, retrieval-augmented methods aim to retrieve the relevant facts from the KGs to improve the reasoning performance (Li et al., 2023; Karpukhin et al., 2020). Recently, UniKGQA (Jiang et al., 2022), which unifies the graph retrieval and reasoning process into a single model with LLMs, achieves state-of-the-art performance. Semantic parsing methods convert the question into a structural query (e.g., SPARQL) via LLMs, which can be executed by a query engine to reason the answers on KGs (Sun et al., 2020; Lan & Jiang, 2020). However, these methods heavily rely on the quality of the generated queries: if a query is not executable, no answers will be generated. DECAF (Yu et al., 2022a) combines semantic parsing and LLM reasoning to jointly generate answers, which also achieves strong performance on KGQA tasks.
3 Preliminary
Knowledge Graphs (KGs) contain abundant factual knowledge in the form of a set of triples: G = {(e, r, e′) | e, e′ ∈ E, r ∈ R}, where E and R denote the set of entities and relations, respectively.
Relation Paths are a sequence of relations: z = r_1 → r_2 → … → r_l, where r_i denotes the i-th relation in the path and l denotes the length of the path.
Reasoning Paths are the instances of a relation path z in KGs: w_z = e_0 –r_1→ e_1 –r_2→ … –r_l→ e_l, where e_i denotes the i-th entity and r_i denotes the i-th relation in the relation path z.
Example 1.
Given a relation path z = marry_to → father_of, a reasoning path instance could be w_z = Alice –marry_to→ Bob –father_of→ Charlie, which denotes that “Alice” is married to “Bob” and “Bob” is the father of “Charlie”.
Knowledge Graph Question Answering (KGQA) is a typical reasoning task based on KGs. Given a natural language question q and a KG G, the task aims to design a function f to predict answers a ∈ A_q based on knowledge from G, i.e., a = f(q, G). Following previous works (Sun et al., 2019; Jiang et al., 2022), we assume the entities e_q ∈ T_q mentioned in q and the answers a ∈ A_q are labeled and linked to the corresponding entities in G, i.e., T_q, A_q ⊆ E.
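The definitions above can be made concrete with a tiny Python sketch; the triples, entity names, and the `ReasoningPath` helper are illustrative assumptions, not code or data from the paper:

```python
from dataclasses import dataclass

# A toy KG as a set of (head, relation, tail) triples, mirroring the
# definition G = {(e, r, e') | e, e' in E, r in R}. All names are made up.
KG = {
    ("Alice", "marry_to", "Bob"),
    ("Bob", "father_of", "Charlie"),
}

@dataclass
class ReasoningPath:
    """An instance of a relation path: e_0 -r_1-> e_1 -> ... -r_l-> e_l."""
    entities: list   # [e_0, e_1, ..., e_l]
    relations: list  # [r_1, r_2, ..., r_l]

    def relation_path(self):
        # The relation path z underlying this reasoning path instance.
        return tuple(self.relations)

w = ReasoningPath(["Alice", "Bob", "Charlie"], ["marry_to", "father_of"])
assert w.relation_path() == ("marry_to", "father_of")
```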
4 Approach
In this section, we introduce our method: reasoning on graphs (RoG). We present a novel planning-retrieval-reasoning framework that synergizes LLMs and KGs to conduct faithful and interpretable reasoning for KGQA. The overall framework of RoG is illustrated in Figure 2.
![Figure 2](/html/2310.01061/assets/x2.png)
4.1 Reasoning on Graphs: Planning-Retrieval-Reasoning
Recently, many techniques have been explored to improve the reasoning ability of LLMs through planning, which first prompts LLMs to generate a reasoning plan and then conduct reasoning based on it (Wang et al., 2023c). However, LLMs are known to have hallucination issues, which make them prone to generating incorrect plans and lead to wrong answers (Ji et al., 2023). To address this issue, we present a novel planning-retrieval-reasoning framework, which grounds the reasoning plans in KGs and then retrieves faithful reasoning paths for LLM reasoning.
Relation paths, which capture semantic relations between entities, have been utilized in many reasoning tasks on KGs (Wang et al., 2021; Xu et al., 2022). Moreover, compared to the dynamically updated entities, the relations in KGs are more stable (Wang et al., 2023a). By using relation paths, we can always retrieve the latest knowledge from KGs for reasoning. Therefore, relation paths can serve as faithful plans for reasoning the answers to KGQA tasks.
Example 2.
Given a question “Who is the child of Alice”, we can generate a relation path as the plan: z = marry_to → father_of. This relation path expresses the plan: 1) find the person that “Alice” is married to; 2) find the child of that person. We can execute the plan (relation path) by retrieving a reasoning path from KGs: Alice –marry_to→ Bob –father_of→ Charlie. Finally, we can answer the question based on the reasoning path, which is “Charlie”.
By treating relation paths as plans, we can make sure the plans are grounded by KGs, which enables LLMs to conduct faithful and interpretable reasoning on graphs. In a nutshell, we formulate our RoG as an optimization problem that aims to maximize the probability of reasoning the answer a from a knowledge graph G given the question q by generating relation paths z as the plan:
P_θ(a | q, G) = Σ_{z ∈ Z} P_θ(a | q, z, G) P_θ(z | q),   (1)
where θ denotes the parameters of LLMs, z denotes the relation paths (plans) generated by LLMs, and Z denotes the set of possible relation paths. The latter term P_θ(z | q) is the probability of generating a faithful relation path z grounded by the KG, given the question q, which is realized by the planning module. The former term P_θ(a | q, z, G) is the probability of reasoning an answer a given the question q, relation path z, and KG G, which is computed by the retrieval-reasoning module.
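The factorization described above — the answer probability marginalized over plans as a sum of (reasoning probability × planning probability) — can be sketched numerically. The toy distributions below are made-up numbers for illustration, not model outputs:

```python
# P(z|q): a hypothetical planning-module distribution over relation paths.
plan_prob = {
    ("marry_to", "father_of"): 0.7,
    ("sibling_of",): 0.3,
}
# P(a|q,z,G): a hypothetical retrieval-reasoning distribution per plan.
answer_prob = {
    ("marry_to", "father_of"): {"Charlie": 0.9, "Dana": 0.1},
    ("sibling_of",): {"Charlie": 0.2, "Dana": 0.8},
}

def p_answer(a):
    # Marginalize over plans z: P(a|q,G) = sum_z P(a|q,z,G) * P(z|q).
    return sum(plan_prob[z] * answer_prob[z].get(a, 0.0) for z in plan_prob)

assert abs(p_answer("Charlie") - (0.7 * 0.9 + 0.3 * 0.2)) < 1e-9
```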
4.2 Optimization Framework
Despite the advantage of generating relation paths as plans, LLMs have zero knowledge of the relations contained in KGs. Therefore, LLMs cannot directly generate relation paths grounded by KGs as faithful plans. Moreover, LLMs might not understand the retrieved reasoning paths correctly or know how to conduct reasoning based on them. To address these issues, we design two instruction-tuning tasks: 1) planning optimization, which distills the knowledge from KGs into LLMs to generate faithful relation paths as plans; and 2) retrieval-reasoning optimization, which enables LLMs to reason based on the retrieved reasoning paths.
The objective function in equation 1 can be optimized by maximizing the evidence lower bound (ELBO) (Jordan et al., 1999), which is formulated as
log P(a | q, G) ≥ E_{z ∼ Q(z)}[log P_θ(a | q, z, G)] − D_KL(Q(z) ∥ P_θ(z | q)),   (2)
where Q(z) denotes the posterior distribution of faithful relation paths grounded by KGs. The latter term minimizes the KL divergence between the posterior Q(z) and the prior P_θ(z | q), which encourages LLMs to generate faithful relation paths (planning optimization). The former term maximizes the expectation that the retrieval-reasoning module generates correct answers based on the relation paths and KGs (retrieval-reasoning optimization).
Planning optimization. In planning optimization, we aim to distill the knowledge from KGs into LLMs to generate faithful relation paths as plans. This can be achieved by minimizing the KL divergence with the posterior distribution of faithful relation paths Q(z), which can be approximated by the valid relation paths in KGs.
Given a question q and an answer a, we can find the path instances w_z(e_q, e_a) connecting the question entity e_q and the answer entity e_a in KGs. The corresponding relation path z can be considered valid and serve as a faithful plan for answering the question q. Therefore, the posterior distribution can be formally approximated as
Q(z) ≃ Q(z | a, q, G) = 1(∃ w_z(e_q, e_a) ∈ G) / |Z*|,   (3)
where ∃ w_z(e_q, e_a) ∈ G denotes the existence of a path instance connecting the question entity e_q and the answer entity e_a in G, 1(·) is the indicator function, and Z* denotes the set of valid relation paths. To reduce the number of valid relation paths, we only consider the shortest paths between e_q and e_a in KGs (Zhang et al., 2022). Therefore, the KL divergence can be calculated as
D_KL(Q(z) ∥ P_θ(z | q)) = E_{z ∼ Q(z | a, q, G)}[log Q(z) − log P_θ(z | q)] ≃ −(1/|Z*|) Σ_{z ∈ Z*} log P_θ(z | q) + C,   (4)

where C is a constant independent of θ.
By optimizing equation 4, we maximize the probability of LLMs generating faithful relation paths by distilling the knowledge from KGs.
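The shortest-path supervision described above can be sketched as a breadth-first search over triples. The function and toy KG below are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

def shortest_relation_paths(kg, e_q, e_a):
    """Collect the relation sequences of all shortest paths from e_q to e_a,
    approximating the posterior Q(z) with shortest valid relation paths.
    `kg` is a list of (head, relation, tail) triples. Illustrative only."""
    adj = {}
    for h, r, t in kg:
        adj.setdefault(h, []).append((r, t))
    queue = deque([(e_q, [])])
    seen_depth = {e_q: 0}
    shortest, results = None, set()
    while queue:
        node, rels = queue.popleft()
        if shortest is not None and len(rels) > shortest:
            break  # BFS order: no shorter path remains past this point
        if node == e_a and rels:
            shortest = len(rels)
            results.add(tuple(rels))
            continue
        for r, t in adj.get(node, []):
            d = len(rels) + 1
            if seen_depth.get(t, d) >= d:  # keep ties: multiple shortest paths
                seen_depth[t] = d
                queue.append((t, rels + [r]))
    return results
```

For example, on a KG with Alice –marry_to→ Bob –father_of→ Charlie, the only shortest relation path from Alice to Charlie is ("marry_to", "father_of"); adding a direct Alice –parent_of→ Charlie triple makes ("parent_of",) the unique shortest plan instead.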
Retrieval-reasoning optimization. In retrieval-reasoning optimization, we aim to enable LLMs to conduct reasoning based on the retrieved reasoning paths. For the retrieval-reasoning module, we follow the FiD framework (Izacard & Grave, 2021), which allows reasoning on multiple retrieved reasoning paths, formulated as
P_θ(a | q, z, G) = P_θ(a | q, W_z),   (5)

where W_z denotes the set of reasoning paths retrieved from G following the relation path z.
By approximating the expectation with the K sampled plans Z_K generated by the planning module, the objective function of reasoning optimization can be written as
L_reason = log P_θ(a | q, W_{Z_K}),   (6)

where W_{Z_K} denotes the reasoning paths retrieved with the K sampled plans.
This maximizes the probability of LLMs generating correct answers based on the retrieved reasoning paths.
The final objective function of RoG is the combination of the planning optimization and retrieval-reasoning optimization, which can be formulated as
L = log P_θ(a | q, W_{Z_K}) + (1/|Z*|) Σ_{z ∈ Z*} log P_θ(z | q).   (7)
From equation 7, we can see that we adopt the same LLM for both planning and reasoning, jointly trained on the two instruction-tuning tasks (planning and retrieval-reasoning). We discuss the implementation details of these two tasks in the following subsections.
4.3 Planning Module
The planning module aims to generate faithful relation paths as plans for answering the question. To utilize the instruction-following ability of LLMs (Wei et al., 2021), we design a simple instruction template that prompts LLMs to generate relation paths:
where <Question> indicates the question q. The question together with the instruction template is fed into LLMs to generate the relation paths, which are structurally formatted as a sentence:
<PATH> r_1 <SEP> r_2 <SEP> … <SEP> r_l </PATH>
where <PATH>, <SEP>, </PATH> are special tokens indicating the start, separator, and end of the relation path, respectively. (A relation name could be split into multiple tokens; for example, “born_in” could be split into “born” and “_in” by the tokenizer. In this way, we can fully utilize the semantic information in relation names and generalize to different KGs.)
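A hypothetical serializer for this plan format (the special tokens come from the paper; the function itself is illustrative):

```python
def serialize_plan(relations):
    """Format a relation path as '<PATH> r_1 <SEP> ... <SEP> r_l </PATH>'."""
    return "<PATH> " + " <SEP> ".join(relations) + " </PATH>"

# A single-relation plan has no separators:
assert serialize_plan(["people.person.children"]) == \
    "<PATH> people.person.children </PATH>"
```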
Therefore, the optimization of the planning distribution P_θ(z | q) can be achieved as
P_θ(z | q) = Π_{i=1}^{|z|} P_θ(z_i | z_{<i}, q),   (8)
where P_θ(z | q) denotes the prior distribution of generating a faithful relation path z, and P_θ(z_i | z_{<i}, q) denotes the probability of each token z_i in z generated by LLMs.
4.4 Retrieval-Reasoning Module
Retrieval. Given a question q and a relation path z as the plan, the retrieval module aims to retrieve the reasoning paths W_z from the KG G. The retrieval process can be conducted by finding paths in G that start from the question entities e_q ∈ T_q and follow the relation path z, formulated as
W_z = {w_z(e_q, e_a) | w_z(e_q, e_a) ∈ G, e_q ∈ T_q},   (9)
We adopt a constrained breadth-first search to retrieve the reasoning paths from KGs. In experiments, all retrieved paths are used for reasoning. The detailed retrieval algorithm can be found in Section A.2.
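A minimal sketch of such a constrained breadth-first search; the paper's actual algorithm is in its Section A.2, so the function and triple format below are assumptions for illustration:

```python
def retrieve_reasoning_paths(kg, e_q, relation_path):
    """Starting from a question entity, follow the plan's relations in
    order and return every matching entity sequence. `kg` is a list of
    (head, relation, tail) triples. Illustrative sketch only."""
    frontier = [[e_q]]
    for rel in relation_path:
        next_frontier = []
        for path in frontier:
            # Expand the last entity of each partial path along `rel`.
            for h, r, t in kg:
                if h == path[-1] and r == rel:
                    next_frontier.append(path + [t])
        frontier = next_frontier
    return frontier  # each element is [e_q, e_1, ..., e_l]

kg = [("Alice", "marry_to", "Bob"), ("Bob", "father_of", "Charlie")]
assert retrieve_reasoning_paths(kg, "Alice", ["marry_to", "father_of"]) == \
    [["Alice", "Bob", "Charlie"]]
```

If no path instance matches the plan, the search simply returns an empty list, which is how an unfaithful (ungrounded) plan manifests at retrieval time.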
Although we could use the retrieved reasoning paths directly and obtain the answers with majority voting, the retrieved reasoning paths could be noisy and irrelevant to the question, which leads to incorrect answers (He et al., 2021; Zhang et al., 2022). Therefore, we propose a reasoning module that exploits the ability of LLMs to identify the important reasoning paths and answer the question based on them.
Reasoning. The reasoning module takes the question q and a set of retrieved reasoning paths W_z to generate answers a. Similarly, we design a reasoning instruction prompt to guide LLMs to conduct reasoning based on the retrieved reasoning paths W_z:
where <Reasoning Paths> denotes the retrieved reasoning paths W_z, which are also formatted as a series of structured sentences. The details of the prompts can be found in Section A.9.
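A sketch of assembling such a reasoning prompt; the instruction wording below is a paraphrase rather than the paper's exact template (see their Section A.9), and the "`->`"-joined path serialization is likewise an assumption:

```python
def build_reasoning_prompt(question, reasoning_paths):
    """Render retrieved reasoning paths and the question into one prompt.
    Each path is a flat [entity, relation, entity, ...] sequence."""
    path_lines = "\n".join(
        " -> ".join(str(step) for step in path) for path in reasoning_paths
    )
    return (
        "Based on the reasoning paths, please answer the given question.\n"
        f"Reasoning Paths:\n{path_lines}\n"
        f"Question: {question}"
    )

prompt = build_reasoning_prompt(
    "Who is the child of Alice?",
    [["Alice", "marry_to", "Bob", "father_of", "Charlie"]],
)
assert "Alice -> marry_to -> Bob" in prompt
```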
The optimization of can be written as
P_θ(a | q, W_z) = Π_{i=1}^{|a|} P_θ(a_i | a_{<i}, q, W_z),   (10)
where P_θ(a | q, W_z) denotes the probability of reasoning the correct answer a based on the reasoning paths W_z, and a_i denotes the i-th token of the answer a.
5 Experiment
In our experiments, we aim to answer the following research questions:
- RQ1: Can RoG achieve state-of-the-art performance on the KGQA tasks?
- RQ2: Can the planning module of RoG be integrated with other LLMs to improve their performance?
- RQ3: Can RoG be finetuned and effectively transferred to other knowledge graphs?
- RQ4: Can RoG conduct faithful reasoning and generate interpretable reasoning results?
5.1 Experiment Settings
Datasets. We evaluate the reasoning ability of RoG on two benchmark KGQA datasets:
Datasets | #Train | #Test | Max #hop |
WebQSP | 2,826 | 1,628 | 2 |
CWQ | 27,639 | 3,531 | 4 |
WebQuestionSP (WebQSP) (Yih et al., 2016) and Complex WebQuestions (CWQ) (Talmor & Berant, 2018), which contain up to 4-hop questions. The statistics of the datasets are given in Table 1. Freebase (Bollacker et al., 2008) is the background knowledge graph for both datasets, which contains around 88 million entities, 20 thousand relations, and 126 million triples. The details of the datasets are described in Section A.3.
Baselines. We compare RoG with 21 baselines grouping into 5 categories: 1) Embedding-based methods, 2) Retrieval-augmented methods, 3) Semantic parsing methods, 4) LLMs, and 5) LLMs+KGs methods. The details of each baseline are described in Section A.4.
Evaluation Metrics. Following previous works, we use Hits@1 and F1 as the evaluation metrics. Hits@1 measures the proportion of questions whose top-1 predicted answer is correct. Since a question may correspond to multiple answers, F1 considers the coverage of all answers, balancing the precision and recall of the predicted answers.
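These two metrics can be sketched directly; the helper names below are our own, not from the paper's evaluation code:

```python
def hits_at_1(pred_top1, gold_answers):
    """Fraction of questions whose top-1 prediction is among the gold answers.
    `pred_top1` is one prediction per question; `gold_answers` a set each."""
    correct = sum(p in gold for p, gold in zip(pred_top1, gold_answers))
    return correct / len(gold_answers)

def answer_f1(predicted, gold):
    """Set-based F1 for one question that may have multiple gold answers."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# One of two questions answered correctly at rank 1:
assert hits_at_1(["a", "x"], [{"a", "b"}, {"y"}]) == 0.5
# Predicting {a, b} against gold {a, c}: precision = recall = 0.5, F1 = 0.5.
assert answer_f1(["a", "b"], ["a", "c"]) == 0.5
```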
Implementations. For RoG, we use LLaMA2-Chat-7B (Touvron et al., 2023) as the LLM backbone, which is instruction-finetuned on the training splits of WebQSP and CWQ as well as Freebase for 3 epochs. We generate the top-3 relation paths using beam search for each question. Since UniKGQA (Jiang et al., 2022) and DECAF (Yu et al., 2022a) are state-of-the-art methods, we directly refer to their results and those of the other baselines reported in their papers for comparison. For LLMs, we use zero-shot prompting to conduct KGQA. The detailed settings are described in Section A.5.
Type | Methods | WebQSP | CWQ | ||
Hits@1 | F1 | Hits@1 | F1 | ||
Embedding | KV-Mem (Miller et al., 2016) | 46.7 | 34.5 | 18.4 | 15.7 |
EmbedKGQA (Saxena et al., 2020) | 66.6 | - | 45.9 | - | |
NSM (He et al., 2021) | 68.7 | 62.8 | 47.6 | 42.4 | |
TransferNet (Shi et al., 2021) | 71.4 | - | 48.6 | - | |
KGT5 Saxena et al. (2022) | 56.1 | - | 36.5 | - | |
Retrieval | GraftNet (Sun et al., 2018) | 66.4 | 60.4 | 36.8 | 32.7 |
PullNet (Sun et al., 2019) | 68.1 | - | 45.9 | - | |
SR+NSM (Zhang et al., 2022) | 68.9 | 64.1 | 50.2 | 47.1 | |
SR+NSM+E2E (Zhang et al., 2022) | 69.5 | 64.1 | 49.3 | 46.3 | |
Semantic Parsing | SPARQL (Sun et al., 2020) | - | - | 31.6 | - |
QGG (Lan & Jiang, 2020) | 73.0 | 73.8 | 36.9 | 37.4 | |
ArcaneQA (Gu & Su, 2022) | - | 75.3 | - | - | |
RnG-KBQA (Ye et al., 2022) | - | 76.2 | - | - | |
LLMs | Flan-T5-xl (Chung et al., 2022) | 31.0 | - | 14.7 | - |
Alpaca-7B (Taori et al., 2023) | 51.8 | - | 27.4 | - | |
LLaMA2-Chat-7B (Touvron et al., 2023) | 64.4 | - | 34.6 | - | |
ChatGPT | 66.8 | - | 39.9 | - | |
ChatGPT+CoT | 75.6 | - | 48.9 | - | |
LLMs+KGs | KD-CoT (Wang et al., 2023b) | 68.6 | 52.5 | 55.7 | - |
UniKGQA (Jiang et al., 2022) | 77.2 | 72.2 | 51.2 | 49.1 | |
DECAF (DPR+FiD-3B) (Yu et al., 2022a) | 82.1 | 78.8 | - | - | |
RoG | 85.7 | 70.8 | 62.6 | 56.2 |
5.2 RQ1: KGQA Performance Comparison
Main Results. In this section, we compare RoG with other baselines on KGQA tasks. The results are shown in Table 2. Our method achieves the best performance on both datasets across most metrics. Specifically, compared to the SOTA method DECAF (Yu et al., 2022a) on WebQSP, our method improves Hits@1 by 4.4%. On the CWQ dataset, which is more challenging due to multi-hop questions, our method improves both Hits@1 and F1 by 22.3% and 14.4% against the SOTA model UniKGQA (Jiang et al., 2022). These results demonstrate the superior reasoning ability of our method in KGQA.
Among other methods, retrieval-augmented approaches outperform conventional embedding-based methods by retrieving relevant subgraphs from KGs, which reduces reasoning complexity. Furthermore, SR+NSM and SR+NSM+E2E adopt relation-path-based retrieval, which achieves better performance and highlights the importance of relation paths. Semantic parsing methods perform better than retrieval methods on WebQSP but worse on CWQ due to the complexity of generating logical queries for the complex questions in CWQ. Although LLM-based methods achieve comparable performance, they are limited by hallucinations and lack of knowledge, as shown in Section 5.5. The LLMs+KGs methods achieve the second-best performance, which demonstrates the effectiveness of unifying KGs and LLMs for reasoning.
Ablation Study. We conduct an ablation study to analyze the effectiveness of the planning module and reasoning module in our method (RoG). We compare four variants: 1) w/o planning, where we remove the planning module and perform reasoning without retrieved reasoning paths; 2) w/o reasoning, where we remove the reasoning module and use all answers from retrieved reasoning paths as results; 3) w/ random plans, where we randomly retrieve reasoning paths from KGs and feed them into the reasoning module; 4) w/ vote reasoning, where we adopt the majority voting to select top-5 answers from retrieved reasoning paths. The results are shown in Table 3.
Method | WebQSP | CWQ | ||||
Precision | Recall | F1 | Precision | Recall | F1 | |
RoG | 74.77 | 75.84 | 70.81 | 57.69 | 58.19 | 56.17 |
RoG w/o planning | 57.26 | 50.16 | 49.69 | 35.35 | 34.77 | 33.76 |
RoG w/o reasoning | 46.90 | 79.85 | 49.56 | 18.88 | 67.89 | 22.26 |
RoG w/ random plans | 38.66 | 38.31 | 35.24 | 38.99 | 39.29 | 37.64 |
RoG w/ vote reasoning | 54.80 | 60.44 | 47.96 | 22.92 | 47.98 | 26.52 |
From the results, it is evident that without the planning module, our method degenerates to a conventional LLM that relies solely on the question as input and suffers from the lack-of-knowledge issue. Although removing the reasoning module leads to high recall due to an increased number of answers, precision drops significantly because of noise in the retrieved paths. This demonstrates the effectiveness of the reasoning module in identifying important reasoning paths and filtering out noise. Furthermore, using random plans performs worse than removing the planning module, highlighting the importance of a planning module that generates faithful reasoning plans. Simple majority-vote reasoning improves the results, which also demonstrates the necessity of the reasoning module.
5.3 RQ2: Plug-and-Play RoG Planning Module
Methods | WebQSP | CWQ | ||
Hits@1 | Recall | Hits@1 | Recall | |
ChatGPT | 66.77 | 49.27 | 39.90 | 35.07 |
ChatGPT + RoG Planning | 81.51 | 71.60 | 52.68 | 48.51 |
Alpaca-7B | 51.78 | 33.65 | 27.44 | 23.62 |
Alpaca-7B + RoG Planning | 56.16 | 74.20 | 44.04 | 38.46 |
LLaMA2-Chat-7B | 64.37 | 44.61 | 34.60 | 29.91 |
LLaMA2-Chat-7B + RoG Planning | 74.20 | 56.16 | 56.41 | 51.99 |
Flan-T5-xl | 30.95 | 17.08 | 14.69 | 12.25 |
Flan-T5-xl + RoG Planning | 67.87 | 44.93 | 37.81 | 32.57 |
In this section, we evaluate the effectiveness of integrating the planning module of RoG with different LLMs during inference to improve their performance. Specifically, we first adopt the planning module of RoG to generate relation paths and feed the retrieved reasoning paths as context into different LLMs for reasoning. The results are presented in Table 4. Because it is difficult to extract the exact set of answers from the LLMs’ outputs, we only report the Hits@1 and Recall metrics.
From the results, we can see that the performance of all LLMs is substantially improved by integrating the planning module of RoG. Specifically, the Hits@1 of ChatGPT, Alpaca-7B, LLaMA2-Chat-7B, and Flan-T5-xl is improved by 22.1%, 8.5%, 15.3%, and 119.3%, respectively. This demonstrates that the planning module of RoG can be seamlessly integrated with other LLMs to improve their performance without retraining.
Table 5: Performance of RoG on MetaQA-3hop.

Strategies | MetaQA-3hop | |
 | Hits@1 | F1 |
RoG (train from scratch) | 84.81 | 41.32 |
RoG (transfer from Freebase) | 88.98 | 50.68 |
5.4 RQ3: Transferability to Other KGs
We also evaluate the transferability of RoG to other KGs. We select the MetaQA-3hop dataset (Zhang et al., 2018), which is based on the Wiki-Movies KG (https://research.fb.com/downloads/babi). We select 1,000 samples from the training split and utilize two training strategies to finetune RoG: 1) training from scratch, where we directly optimize RoG with the 1,000 samples; 2) transfer from Freebase, where we conduct further finetuning based on the RoG trained on Freebase. The results are shown in Table 5. From the results, we can see that transferring from Freebase achieves better performance than training from scratch, which demonstrates the transferability of RoG to other KGs.
![Figure 3](/html/2310.01061/assets/x3.png)
Figure 3: (Top) Faithfulness of the generated relation paths. The green bars indicate the average number of retrieved reasoning paths, the solid lines denote the answer coverage of the retrieved reasoning paths, and the dashed lines denote the answer coverage of the reasoning module based on the retrieved reasoning paths.
Table 6: Examples of ChatGPT + CoT lacking knowledge and the interpretable predictions of RoG.
Question |
What type of government is used in the country with Northern District? |
Answer | Parliamentary system |
ChatGPT + CoT | To determine what type of government is used in a country with a Northern District, we need more specific information. The term ”Northern District” does not specify the name of any particular country or provide enough context to identify its government type. Could you please provide additional details or clarify the specific country you are referring to? |
Reasoning Paths | Northern District → location.administrative_division.first_level_division_of → Israel → government.form_of_government.countries → Parliamentary system |
RoG | The type of government used in the country with the Northern District is a Parliamentary system. Explanations: 1. Northern District → location.administrative_division.first_level_division_of → Israel: The Northern District is a part of Israel. 2. Israel → government.form_of_government.countries → Parliamentary system: Israel has a Parliamentary system of government. So, by following the reasoning paths provided, we can conclude that the country with the Northern District, which is Israel, uses a Parliamentary system of government. |
Question | Where is the home stadium of the team who won the 1946 World Series championship? |
Answer | Busch Stadium |
ChatGPT + CoT | Step 1: Identify the team that won the 1946 World Series championship. The team that won the 1946 World Series championship was the St. Louis Cardinals. Step 2: Determine the home stadium of the St. Louis Cardinals during that time. The St. Louis Cardinals’ home stadium during the 1946 season was Sportsman’s Park. Therefore, the possible answer to the question ”Where is the home stadium of the team who won the 1946 World Series championship?” is: - Sportsman’s Park. |
Reasoning Paths | 1946 World Series → sports.sports_team.championships → St. Louis Cardinals → sports.sports_team.arena_stadium → Busch Stadium; 1946 World Series → sports.sports_team.championships → St. Louis Cardinals → sports.sports_team.arena_stadium → Roger Dean Stadium |
RoG | Based on the reasoning paths provided, the home stadium of the team who won the 1946 World Series championship is Busch Stadium. This conclusion is based on the following path: 1. 1946 World Series → sports.sports_team.championships → St. Louis Cardinals → sports.sports_team.arena_stadium → Busch Stadium. The reasoning path leads us to the St. Louis Cardinals as the team that won the 1946 World Series, and Busch Stadium is the stadium associated with the St. Louis Cardinals. Therefore, Busch Stadium is the home stadium of the team that won the 1946 World Series championship. |
5.5 RQ4: Faithful reasoning and interpretable results
Quantitative Results. To evaluate the faithfulness of the relation paths, we illustrate the quantitative results in Figure 3. In experiments, we sweep the number K of top-K relation paths generated by RoG. From the results, we can see that the number of retrieved reasoning paths increases with K, which also leads to more answers being covered (recall). This demonstrates the faithfulness of the relation paths in retrieving the answers. However, more retrieved reasoning paths also lead to more noise and longer retrieval time (shown in Section A.6.1), which decreases the precision and makes little contribution to the final results (reasoning F1). Therefore, we set K = 3 in experiments.
Case studies. We also illustrate two case studies in Table 6 and Table 7. In Table 6, we can see that ChatGPT+CoT suffers from a lack of knowledge and cannot answer the question. In contrast, RoG can generate faithful relation paths and retrieve valid reasoning paths from KGs for reasoning. Besides, RoG can provide interpretable explanations based on the reasoning paths. In Table 7, we can see that ChatGPT+CoT suffers from hallucination and generates an incorrect answer. In contrast, although the retrieved reasoning paths contain noise, the reasoning module can identify the correct reasoning paths and conduct faithful reasoning. These results demonstrate the effectiveness of RoG in conducting faithful reasoning and generating interpretable results. More cases can be found in Sections A.7 and A.8.
6 Conclusion
In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to conduct faithful and interpretable reasoning. To address the issues of hallucinations and lack of knowledge, we present a planning-retrieval-reasoning framework, which allows LLMs to access the latest knowledge while reasoning based on faithful plans on graphs. RoG not only enhances the reasoning capability of LLMs by distilling knowledge from KGs through training but also enables seamless integration with any LLMs during inference. Extensive experiments on two benchmark KGQA datasets demonstrate the superiority of RoG in reasoning ability and interpretability.
References
- Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
- Besta et al. (2023) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250, 2008.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- Creswell & Shanahan (2022) Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022.
- Gu & Su (2022) Yu Gu and Yu Su. Arcaneqa: Dynamic program induction and contextualized encoding for knowledge base question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pp. 1718–1731, 2022.
- He et al. (2021) Gaole He, Yunshi Lan, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. Improving multi-hop knowledge base question answering by learning intermediate supervision signals. In Proceedings of the 14th ACM international conference on web search and data mining, pp. 553–561, 2021.
- He et al. (2022) Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language model inference. arXiv preprint arXiv:2301.00303, 2022.
- Hong et al. (2023) Ruixin Hong, Hongming Zhang, Hong Zhao, Dong Yu, and Changshui Zhang. Faithful question answering with monte-carlo planning. ACL 2023, 2023.
- Hu et al. (2023) Nan Hu, Yike Wu, Guilin Qi, Dehai Min, Jiaoyan Chen, Jeff Z Pan, and Zafar Ali. An empirical study of pre-trained language models in simple knowledge graph question answering. World Wide Web, pp. 1–32, 2023.
- Huang & Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. ACL 2023, 2023.
- Izacard & Grave (2021) Gautier Izacard and Édouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 874–880, 2021.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- Jiang et al. (2022) Jinhao Jiang, Kun Zhou, Xin Zhao, and Ji-Rong Wen. Unikgqa: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph. In The Eleventh International Conference on Learning Representations, 2022.
- Jordan et al. (1999) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37:183–233, 1999.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, 2020.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, 2022.
- Lan & Jiang (2020) Yunshi Lan and Jing Jiang. Query graph generation for answering multi-hop complex questions from knowledge bases. Association for Computational Linguistics, 2020.
- Li et al. (2023) Shiyang Li, Yifan Gao, Haoming Jiang, Qingyu Yin, Zheng Li, Xifeng Yan, Chao Zhang, and Bing Yin. Graph reasoning for question answering with triplet retrieval. In Findings of the Association for Computational Linguistics: ACL 2023, 2023.
- Luo et al. (2023) Linhao Luo, Jiaxin Ju, Bo Xiong, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Chatrule: Mining logical rules with large language models for knowledge graph reasoning. arXiv preprint arXiv:2309.01538, 2023.
- Miller et al. (2016) Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1400–1409, 2016.
- Pan et al. (2023) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. Unifying large language models and knowledge graphs: A roadmap. arXiv preprint arXiv:2306.08302, 2023.
- Pan et al. (2022) Xiaoman Pan, Wenlin Yao, Hongming Zhang, Dian Yu, Dong Yu, and Jianshu Chen. Knowledge-in-context: Towards knowledgeable semi-parametric language models. In The Eleventh International Conference on Learning Representations, 2022.
- Saxena et al. (2020) Apoorv Saxena, Aditay Tripathi, and Partha Talukdar. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 4498–4507, 2020.
- Saxena et al. (2022) Apoorv Saxena, Adrian Kochsiek, and Rainer Gemulla. Sequence-to-sequence knowledge graph completion and question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2814–2828, 2022.
- Shi et al. (2021) Jiaxin Shi, Shulin Cao, Lei Hou, Juanzi Li, and Hanwang Zhang. Transfernet: An effective and transparent framework for multi-hop question answering over relation graph. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4149–4158, 2021.
- Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4231–4242, 2018.
- Sun et al. (2019) Haitian Sun, Tania Bedrax-Weiss, and William Cohen. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2380–2390, 2019.
- Sun et al. (2020) Yawei Sun, Lingling Zhang, Gong Cheng, and Yuzhong Qu. Sparqa: skeleton-based semantic parsing for complex questions over knowledge bases. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 8952–8959, 2020.
- Tafjord et al. (2022) Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Entailer: Answering questions with faithful and truthful chains of reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2078–2093, 2022.
- Talmor & Berant (2018) Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 641–651, 2018.
- Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: an instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Wang et al. (2021) Hongwei Wang, Hongyu Ren, and Jure Leskovec. Relational message passing for knowledge graph completion. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 1697–1707, 2021.
- Wang et al. (2023a) Jiapu Wang, Boyue Wang, Meikang Qiu, Shirui Pan, Bo Xiong, Heng Liu, Linhao Luo, Tengfei Liu, Yongli Hu, Baocai Yin, et al. A survey on temporal knowledge graph completion: Taxonomy, progress, and prospects. arXiv preprint arXiv:2308.02457, 2023a.
- Wang et al. (2023b) Keheng Wang, Feiyu Duan, Sirui Wang, Peiguang Li, Yunsen Xian, Chuantao Yin, Wenge Rong, and Zhang Xiong. Knowledge-driven cot: Exploring faithful reasoning in llms for knowledge-intensive question answering. arXiv preprint arXiv:2308.13259, 2023b.
- Wang et al. (2023c) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023c.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022.
- Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Xu et al. (2022) Xiaohan Xu, Peng Zhang, Yongquan He, Chengpeng Chao, and Chaoyang Yan. Subgraph neighboring relations infomax for inductive link prediction on knowledge graphs. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022.
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Yasunaga et al. (2021) Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. Qa-gnn: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 535–546, 2021.
- Ye et al. (2022) Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6032–6043, 2022.
- Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 201–206, 2016.
- Yu et al. (2022a) Donghan Yu, Sheng Zhang, Patrick Ng, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Yiqun Hu, William Yang Wang, Zhiguo Wang, and Bing Xiang. Decaf: Joint decoding of answers and logical forms for question answering over knowledge bases. In The Eleventh International Conference on Learning Representations, 2022a.
- Yu et al. (2022b) Donghan Yu, Chenguang Zhu, Yuwei Fang, Wenhao Yu, Shuohang Wang, Yichong Xu, Xiang Ren, Yiming Yang, and Michael Zeng. Kg-fid: Infusing knowledge graph in fusion-in-decoder for open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4961–4974, 2022b.
- Zhang et al. (2022) Jing Zhang, Xiaokang Zhang, Jifan Yu, Jian Tang, Jie Tang, Cuiping Li, and Hong Chen. Subgraph retrieval enhanced model for multi-hop knowledge base question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5773–5784, 2022.
- Zhang et al. (2021) Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. Greaselm: Graph reasoning enhanced language models. In International conference on learning representations, 2021.
- Zhang et al. (2018) Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander Smola, and Le Song. Variational reasoning for question answering with knowledge graph. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
Appendix A Appendix
A.1 Detailed Related Work
A.1.1 LLM Reasoning Prompt
Many studies have been proposed to harness the reasoning ability of LLMs to handle complex tasks through prompting (Wei et al., 2022; Wang et al., 2022; Yao et al., 2023; Besta et al., 2023). Chain-of-Thought (Wei et al., 2022) enables LLMs to generate a reasoning chain that can aid reasoning. Tree of thoughts (Yao et al., 2023) expands the reasoning chain into a tree structure to explore more reasoning paths. Graph of thoughts (Besta et al., 2023) further models the reasoning chain as a graph with an aggregation operation to synergize the reasoning paths. Plan-and-solve (Wang et al., 2023c) prompts LLMs to generate a plan and execute it. DecomP (Khot et al., 2022) prompts LLMs to decompose the reasoning task into a series of sub-tasks and solve them step by step. However, the problems of hallucination and lack of knowledge affect the faithfulness of the reasoning. ReACT (Yao et al., 2022) treats LLMs as agents that interact with the environment to obtain the latest knowledge for reasoning. To explore faithful reasoning, Entailer (Tafjord et al., 2022) introduces a verifier to validate the reasoning steps generated by LLMs. Creswell & Shanahan (2022) present a framework with two LLMs that are used for selecting and generating reasoning steps, respectively. FAME (Hong et al., 2023) introduces Monte-Carlo planning to generate faithful reasoning steps. RR (He et al., 2022) and KD-CoT (Wang et al., 2023b) retrieve relevant knowledge from KGs to produce faithful reasoning plans for LLMs.
A.1.2 Knowledge Graph Question Answering
Embedding-based methods model the entities and relations in an embedding space and design special model architectures to reason about answers. KV-Mem (Miller et al., 2016) adopts a Key-Value memory network to store triples for reasoning. EmbedKGQA (Saxena et al., 2020) and NSM (He et al., 2021) utilize sequential models to mimic the multi-hop reasoning process. QA-GNN (Yasunaga et al., 2021) and Greaselm (Zhang et al., 2021) further adopt graph neural networks to capture the graph structure for reasoning. However, these methods need to design different model architectures, which makes them neither flexible nor generalizable.
Retrieval-augmented methods aim to retrieve relevant facts from the KGs to improve reasoning performance. Early works adopt PageRank or random walk algorithms to retrieve subgraphs from KGs for reasoning (Sun et al., 2018; 2019). However, they ignore the semantic information in questions, leading to noisy retrieval results. Zhang et al. (2022) propose a relation path-based subgraph retrieval, resulting in better retrieval and QA performance. Another line of studies retrieves triples from KGs via BM25 (Li et al., 2023) or DPR (Karpukhin et al., 2020; Yu et al., 2022b) to improve the performance of LLMs. These methods discard the structural information in KGs, which leads to suboptimal results. Recently, UniKGQA (Jiang et al., 2022) unifies the graph retrieval and reasoning process into a single model with LLMs, achieving state-of-the-art performance on KGQA tasks.
Semantic parsing methods parse the question into a structured query (e.g., SPARQL) that can be executed by a query engine to obtain answers (Sun et al., 2020; Lan & Jiang, 2020). ArcaneQA (Gu & Su, 2022) dynamically generates the query based on results from previous steps. RnG-KBQA (Ye et al., 2022) first enumerates all possible queries and then ranks them to get the final output. These methods heavily rely on the quality of the generated queries: if a query is not executable, no answers will be generated. DECAF (Yu et al., 2022a) combines semantic parsing and LLM reasoning to jointly generate answers, which also achieves salient performance on KGQA tasks.
A.2 Retrieval Algorithm
Given a question q and a relation path z as the plan, we adopt a constrained breadth-first search to retrieve the reasoning paths. The pseudocode is presented in Algorithm 1.
We first initialize a queue of current reasoning paths with the entities in the question (lines 3-5). Then, we iteratively expand each reasoning path in the queue by adding the triples connected to its frontier entity, following the relations in the relation path z (lines 11-19). A reasoning path is expanded until its length equals the length of the relation path. The fully expanded reasoning paths are added to the result set as the final output (lines 8-10).
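The constrained breadth-first search described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the released Algorithm 1; the triple format and function names are assumptions:

```python
from collections import defaultdict

def retrieve_reasoning_paths(kg_triples, question_entities, relation_path):
    """Expand paths from the question entities, following only the relations
    given in the plan, in order (a constrained breadth-first search)."""
    # Index triples by (head, relation) for fast expansion.
    index = defaultdict(list)
    for head, relation, tail in kg_triples:
        index[(head, relation)].append(tail)

    # Initialize the frontier with the question entities and empty paths.
    frontier = [(entity, []) for entity in question_entities]
    for relation in relation_path:  # follow the plan step by step
        next_frontier = []
        for entity, path in frontier:
            for tail in index[(entity, relation)]:
                next_frontier.append((tail, path + [(entity, relation, tail)]))
        frontier = next_frontier

    # Only paths whose length equals the plan's length survive the loop.
    return [path for _, path in frontier]

# Toy KG from the case study in Table 6:
kg = [
    ("1946 World Series", "sports.sports_team.championships", "St. Louis Cardinals"),
    ("St. Louis Cardinals", "sports.sports_team.arena_stadium", "Busch Stadium"),
    ("St. Louis Cardinals", "sports.sports_team.arena_stadium", "Roger Dean Stadium"),
]
plan = ["sports.sports_team.championships", "sports.sports_team.arena_stadium"]
paths = retrieve_reasoning_paths(kg, ["1946 World Series"], plan)
```

Note that both stadiums are retrieved here, matching the paper's observation that retrieval can return noisy paths which the reasoning module must then filter.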
A.3 Datasets
We adopt two benchmark KGQA datasets in this work: WebQuestionSP (WebQSP) (Yih et al., 2016; https://www.microsoft.com/en-us/download/details.aspx?id=52763) and Complex WebQuestions (CWQ) (Talmor & Berant, 2018; https://www.tau-nlp.sites.tau.ac.il/compwebq). We follow previous works (Sun et al., 2018; Jiang et al., 2022) and use the same train and test splits for fair comparison. The statistics of the answer numbers and reasoning hops are presented in Table 8 and Table 9, respectively.
To evaluate the transferability of RoG to other KGs, we further select the MetaQA-3hop dataset (Zhang et al., 2018), which is based on the Wiki-Movies KG (https://research.fb.com/downloads/babi). We select 1,000 samples from the training split. The statistics of the dataset are presented in Table 10.
Both WebQSP and CWQ can be reasoned over Freebase (Bollacker et al., 2008; https://github.com/microsoft/FastRDFStore). To reduce the size of the KG, following previous works (He et al., 2021; Jiang et al., 2022), we construct a subgraph of Freebase by extracting all triples within the maximum reasoning hops of the question entities in WebQSP and CWQ. Similarly, we construct a subgraph of the Wiki-Movies KG for MetaQA-3hop. The statistics of the constructed KGs are presented in Table 11.
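The hop-bounded subgraph extraction can be sketched as follows. This is a minimal illustration of the idea, not the authors' preprocessing code; in particular, treating edges as undirected for reachability is an assumption here:

```python
from collections import defaultdict

def khop_subgraph(triples, seed_entities, max_hops):
    """Extract all triples reachable within max_hops of the seed entities."""
    neighbors = defaultdict(list)
    for head, relation, tail in triples:
        neighbors[head].append((head, relation, tail))
        neighbors[tail].append((head, relation, tail))  # undirected reachability

    visited = set(seed_entities)
    frontier = set(seed_entities)
    subgraph = set()
    for _ in range(max_hops):
        next_frontier = set()
        for entity in frontier:
            for head, relation, tail in neighbors[entity]:
                subgraph.add((head, relation, tail))
                for endpoint in (head, tail):
                    if endpoint not in visited:
                        visited.add(endpoint)
                        next_frontier.add(endpoint)
        frontier = next_frontier
    return subgraph
```

Applied to the full Freebase dump, this kind of bounded expansion from the question entities yields the subgraph sizes reported in Table 11.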
Dataset | #Ans = 1 | 2 ≤ #Ans ≤ 4 | 5 ≤ #Ans ≤ 9 | #Ans ≥ 10 |
WebQSP | 51.2% | 27.4% | 8.3% | 12.1% |
CWQ | 70.6% | 19.4% | 6% | 4% |
Dataset | 1 hop | 2 hop | 3 hop |
WebQSP | 65.49 % | 34.51% | 0.00% |
CWQ | 40.91 % | 38.34% | 20.75% |
Datasets | #Train | #Test | #hop |
MetaQA-3hop | 1,000 | 14,274 | 3 |
KG | #Entities | #Relations | #Triples |
Freebase | 2,566,291 | 7,058 | 8,309,195 |
Wiki-Movie | 43,234 | 9 | 133,582 |
A.4 Baselines
We compare RoG with 21 baselines grouped into 5 categories: 1) embedding-based methods, 2) retrieval-augmented methods, 3) semantic parsing methods, 4) LLMs, and 5) LLMs+KGs methods. The details of each baseline are described as follows.
Embedding-based methods.
- KV-Mem (Miller et al., 2016) adopts a Key-Value memory network to store triples and performs multi-hop reasoning by iteratively operating on the memory.
- EmbedKGQA (Saxena et al., 2020) models reasoning on KGs as a sequential link prediction problem using the embeddings of entities and questions.
- NSM (He et al., 2021) utilizes a sequential model to mimic the multi-hop reasoning process.
- TransferNet (Shi et al., 2021) adopts a graph neural network to capture the relevance between entities and questions for reasoning.
- KGT5 (Saxena et al., 2022) finetunes a sequence-to-sequence framework on KGs and generates answers based on the input question.
Retrieval-augmented methods.
- GraftNet (Sun et al., 2018) retrieves relevant subgraphs from KGs with entity linking.
- PullNet (Sun et al., 2019) trains a retrieval model composed of an LSTM and a graph neural network to retrieve a question-specific subgraph.
- SR+NSM (Zhang et al., 2022) proposes a relation-path retrieval to retrieve subgraphs for multi-hop reasoning.
- SR+NSM+E2E (Zhang et al., 2022) further adopts an end-to-end training strategy to jointly train the retrieval and reasoning modules of SR+NSM.
Semantic parsing methods.
- SPARQA (Sun et al., 2020) presents a novel skeleton grammar to represent the high-level structure of a complex question with language models.
- QGG (Lan & Jiang, 2020) generates a query graph for a question by simultaneously adding constraints and extending relation paths.
- ArcaneQA (Gu & Su, 2022) dynamically generates the query based on results from previous steps.
- RnG-KBQA (Ye et al., 2022) first enumerates all possible queries and then ranks them to get the final output.
Large language models (LLMs).
- Flan-T5 (Chung et al., 2022) is an enhanced version of T5 that is instruction-finetuned on a mixture of tasks.
- Alpaca (Taori et al., 2023) is based on LLaMA and finetuned on an instruction-following dataset.
- LLaMA2-Chat (Touvron et al., 2023) is a large language model optimized for dialogue.
- ChatGPT (https://openai.com/blog/chatgpt) is a powerful closed-source LLM that can follow instructions to conduct complex tasks.
- ChatGPT+CoT (Wei et al., 2022) uses a Chain-of-Thought prompt to improve the reasoning ability of ChatGPT.
LLMs+KGs methods.
- KD-CoT (Wang et al., 2023b) retrieves relevant knowledge from KGs to generate faithful reasoning plans for LLMs.
- UniKGQA (Jiang et al., 2022) unifies the graph retrieval and reasoning process into a single model with LLMs, which achieves state-of-the-art performance on KGQA tasks.
- DECAF (Yu et al., 2022a) combines semantic parsing and LLM reasoning to jointly generate answers, which also achieves salient performance on KGQA tasks.
A.5 Implementation Settings
For RoG, we use LLaMA2-Chat-7B (Touvron et al., 2023) as the LLM backbone, which is instruction-finetuned on the training splits of WebQSP and CWQ as well as Freebase for 3 epochs. The batch size is set to 4 and the learning rate is set to 2e-5. We use a cosine learning rate scheduler with the warmup ratio set to 0.03. The training is conducted on 2 A100-80G GPUs for 38 hours. During inference, we first adopt the LLM to generate the top-K relation paths with the highest probability as the plans. Then, we adopt Algorithm 1 to retrieve reasoning paths, which are fed into the LLM to reason out the final answers.
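Before retrieval, each generated plan string has to be parsed back into a list of relations. A minimal sketch, assuming a `<PATH> … <SEP> … </PATH>` serialization of relation paths; the actual delimiter tokens in the released code may differ:

```python
def parse_plan(text, sep="<SEP>", start="<PATH>", end="</PATH>"):
    """Parse a generated plan string into an ordered list of relations.

    The delimiter tokens here are an assumed serialization format,
    not necessarily the one used by the authors.
    """
    inner = text.split(start, 1)[-1].split(end, 1)[0]
    return [relation.strip() for relation in inner.split(sep) if relation.strip()]

# One plan string per beam; each parses into a relation path for Algorithm 1:
plan = parse_plan(
    "<PATH> language.human_language.main_country <SEP> "
    "location.country.languages_spoken </PATH>"
)
```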
A.6 Additional Experiment Results
A.6.1 Retrieval Costs
We present the retrieval time and the number of retrieved reasoning paths in Figure 4. From the results, we can see that the retrieval time increases with the number of top-K relation paths. Therefore, we should select a proper K to balance the retrieval time against the number of retrieved reasoning paths. In experiments, we set K = 3.
A.6.2 Performance on different hops
We present the performance of RoG and its variants on different hops of questions in Table 12. From the results, we can see that RoG achieves better performance than its variants across different hops of questions. This demonstrates the importance of relation paths for improving the reasoning performance of LLMs on complex questions.
Methods | WebQSP 1 hop | WebQSP 2 hop | WebQSP 3 hop | CWQ 1 hop | CWQ 2 hop | CWQ 3 hop |
RoG | 77.03 | 64.86 | - | 62.88 | 58.46 | 37.82 |
RoG w/o reasoning | 57.06 | 25.49 | - | 17.06 | 34.25 | 17.07 |
RoG w/o planning | 50.33 | 51.66 | - | 31.04 | 33.93 | 23.29 |
A.6.3 Performance on different answer numbers
We also present the performance of RoG and its variants on questions with different numbers of answers in Table 13. From the results, we can see that RoG achieves better performance than its variants across different numbers of answers. Specifically, as the number of answers increases, the performance of RoG w/o planning decreases significantly due to the lack of knowledge from KGs. Although RoG w/o reasoning can retrieve more answers to improve performance, it is still inferior to RoG because it lacks the reasoning ability to remove the noise.
Methods | WebQSP #Ans = 1 | WebQSP 2 ≤ #Ans ≤ 4 | WebQSP 5 ≤ #Ans ≤ 9 | WebQSP #Ans ≥ 10 | CWQ #Ans = 1 | CWQ 2 ≤ #Ans ≤ 4 | CWQ 5 ≤ #Ans ≤ 9 | CWQ #Ans ≥ 10 |
RoG | 67.89 | 79.39 | 75.04 | 58.33 | 56.90 | 53.73 | 58.36 | 43.62 |
RoG w/o reasoning | 33.49 | 52.80 | 58.05 | 66.01 | 16.61 | 27.06 | 40.10 | 34.45 |
RoG w/o planning | 55.03 | 51.08 | 44.81 | 27.00 | 34.08 | 34.16 | 31.67 | 25.21 |
A.7 Case Studies: Relation Paths
We illustrate several examples of relation paths generated by RoG in Table 14.
A.8 Case Studies: Interpretable Results
We illustrate several examples of interpretable reasoning results generated by RoG in Table 15.
A.9 Prompts
The planning module aims to generate faithful relation paths as plans for answering the question. The instruction template is presented as follows:
where <Question> indicates the question.
The reasoning module takes the question and a set of retrieved reasoning paths to generate the answers. The instruction template is presented as follows:
where <Reasoning Paths> denotes the retrieved reasoning paths, which are formatted as a series of structural sentences.
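Linearizing a retrieved path into such a structural sentence can be sketched as follows; the `->` delimiter is an assumption here, not necessarily the authors' exact formatting:

```python
def path_to_sentence(path):
    """Format a reasoning path [(head, relation, tail), ...] as a
    structural sentence like 'e0 -> r1 -> e1 -> r2 -> e2'."""
    if not path:
        return ""
    parts = [path[0][0]]  # start from the head of the first triple
    for _, relation, tail in path:  # assumes consecutive triples are chained
        parts += [relation, tail]
    return " -> ".join(parts)

sentence = path_to_sentence([
    ("Lou Seal", "sports.mascot.team", "San Francisco Giants"),
    ("San Francisco Giants", "sports.sports_championship_event.champion", "2014 World Series"),
])
```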
To exploit the explanation ability of RoG, we design a new instruction template for the reasoning module to generate interpretable results. The instruction template is presented as follows:
where <Examples> denotes few-shot human-annotated examples that demonstrate the explanation process.
Question | Top-3 Relation Paths |
what does jamaican people speak? | |
where is jamarcus russell from? | |
where did edgar allan poe died? | |
what highschool did harper lee go to? | |
what are the songs that justin bieber wrote? | |
what are the religions practiced in indonesia? | |
Lou Seal is the mascot for the team that last won the World Series when? | |
What country in the Caribbean contains Saint Michael Parish? | |
What type of government is used in the country with Northern District? | |
The people from the country that contains Nord-Ouest Department speak what languages today? | |
What stadium does the team with mascot named Hank play at? | |
Which popular sports team in Spain, that won the 2014 Eurocup Finals championship? | |
What educational institution with the mascot named Washington University Bear did Tennessee Williams go to? | |
Who is the current head coach of the NFL squad owned by the Rooney family? | |
What is the home field of the sports team whose mascot is named Fredbird? | |
Question | Lou Seal is the mascot for the team that last won the World Series when? |
Answer | 2014 |
Reasoning Paths | Lou Seal → sports.mascot.team → San Francisco Giants → sports.sports_championship_event.champion → 2014 World Series |
RoG | Based on the provided reasoning paths, we can determine when the team for which Lou Seal is the mascot last won the World Series: 1. Lou Seal is associated with the San Francisco Giants. 2. The San Francisco Giants are associated with winning the 2014 World Series. Therefore, the San Francisco Giants, the team for which Lou Seal is the mascot, last won the World Series in 2014. |
Question | what is my timezone in louisiana? |
Answer | Central Time Zone |
Reasoning Paths | Louisiana → location.location.partially_containedby → Tangipahoa River → geography.river.origin → Osyka → location.location.time_zones → Central Time Zone; Louisiana → location.location.containedby → Alexandria → location.location.containedby → United States of America → location.location.time_zones → Central Time Zone |
RoG | Based on the reasoning paths provided, it can be determined that Louisiana is in the Central Time Zone. Here’s the explanation: 1. The first reasoning path tells us that Louisiana is partially contained by the Tangipahoa River, which is a geographical feature. However, it doesn’t directly mention the time zone. 2. The second reasoning path establishes that Louisiana is contained by Alexandria, which is in turn contained by the United States of America. The United States of America is in the Central Time Zone. Therefore, based on these reasoning paths, the time zone in Louisiana is the Central Time Zone. |
Question | Which child of Walt Disney died from lung cancer? |
Answer | Sharon Mae Disney |
Reasoning Paths | Lung cancer → people.deceased_person.cause_of_death → Sharon Mae Disney; Lung cancer → people.deceased_person.cause_of_death → Walt Disney; Walt Disney → people.deceased_person.cause_of_death → Lung cancer; Walt Disney → people.person.children → Diane Disney Miller; Walt Disney → people.cause_of_death.people → Circulatory collapse; Walt Disney → people.person.parents → Flora Call Disney; Walt Disney → people.person.parents → Elias Disney; Walt Disney → people.person.parents → Sharon Mae Disney |
RoG | Based on the reasoning paths provided, it appears that Sharon Mae Disney is the child of Walt Disney who died from lung cancer. The reasoning path "Walt Disney → people.deceased_person.cause_of_death → Lung cancer" directly connects Walt Disney to lung cancer as a cause of death, and since Sharon Mae Disney is also linked to lung cancer as a cause of death through the path "Lung cancer → people.deceased_person.cause_of_death → Sharon Mae Disney," it can be concluded that Sharon Mae Disney is the child of Walt Disney who died from lung cancer. |