From RAG to RICHES: Retrieval Interlaced with Sequence Generation
Abstract
We present Riches, a novel approach that interleaves retrieval with sequence generation tasks. Riches offers an alternative to conventional RAG systems by eliminating the need for a separate retriever and generator. It retrieves documents by directly decoding their contents, constrained on the corpus. Unifying retrieval with generation allows us to adapt to diverse new tasks via prompting alone. Riches can work with any instruction-tuned model, without additional training. It provides attributed evidence, supports multi-hop retrievals and interleaves thoughts to plan what to retrieve next, all within a single decoding pass of the LLM. We demonstrate the strong performance of Riches across ODQA tasks including attributed and multi-hop QA.
1 Introduction
Large language models (LLMs) have increasingly become the backbone for much of natural language processing and there has been a push to formulate a wide range of tasks as sequence to sequence transduction.
However, when LLMs need to interact with non-parametric knowledge in the form of an external evidence corpus, the typical approaches chain LLM generations with calls to a separate retrieval model as part of a multi-system pipeline.
In this paper we introduce a new approach, Riches (Retrieval Interlaced with Sequence Generation) which can natively interleave text generations with retrievals from an evidence corpus using a single LLM and decoding process.
Riches builds on previous work that demonstrated the application of constrained decoding to retrieval over a corpus (Jain et al., 2023; Bevilacqua et al., 2022) but extends this work to support multiple retrievals, entwined in a standard text generation procedure.
In this approach, we retrieve documents by directly decoding their contents or related natural language retrieval keys that point to the documents they were generated from.
For example, Figure 1 illustrates a solution from Riches to multi-hop question answering (Yang et al., 2018), where evidence must be retrieved from multiple separate documents. Riches iteratively generates an unconstrained ‘thought’ about what needs to be retrieved and then generates a supporting proposition derived from an evidence corpus and tied to an original piece of supporting text. Riches executes this task in a single decoder pass.
For this example task, which is evaluated alongside others in Section 6, we have built on recent advances in chain-of-thought reasoning via prompting alone (Yao et al., 2022) but have directly integrated the retrieval step without needing to account for any interaction with an external retrieval system.
The observations we build this work on are:
1. LLMs are knowledge warehouses: They internalise and generalise over vast quantities of training data and are often able to generate surprisingly accurate knowledge in response to complex inputs (Sun et al., 2022). However, they are also susceptible to hallucination and cannot account for fresh knowledge that was not available at the time of training. That is where retrieval shines.
2. LLM decoding is a search process: Language model decoders search for a single sequence in the set of all possible token sequences (Graves, 2012). Retrievers just need to constrain this search space to those sequences that are known to exist in a corpus of interest.
3. Unifying tasks unlocks rapid development via prompting: By unifying retrieval with generation in a single decoder pass, we create a system that can be adapted to diverse new tasks via prompting alone, directly benefiting from the advances in instruction following. We later show that Riches works with an off-the-shelf instruction-tuned model, without any additional training. This is in contrast to pipelines that need to be rebuilt or retrained on a task-by-task basis.
There is another advantage of using language models as search agents. Of the two core operations in retrieval, indexing and search, indexing is constrained by corpus size, while search typically depends only on the index structure. Using large language models for indexing billion-token corpora is highly expensive, but search does not face the same bottleneck. This enables us to unlock the knowledge stored in very large models for retrieval.
This work overlaps with a variety of related work focusing on retrieval, retrieval augmented generation Lewis et al. (2020), reasoning in language models, and open domain question answering.
We discuss their connections to Riches in Section 2, then introduce the key components of the generalizable Riches approach in Section 3.
While Riches is applicable to any task that can be reduced to an interleaved generation of unconstrained text and pre-defined retrieval keys, we validate the approach with tasks in open domain question answering and show how it natively supports single-hop question answering, including the case where attribution to a source text is required; multi-hop question answering; and interleaving retrieval with ‘planning steps’ that enhance the retrieval performance. Results are presented in Section 6.2 along with qualitative examples and analysis in Section 6.3 to help motivate the approach.
2 Related Work
Retrieval Augmented Generation (RAG)
ODQA tasks predominantly employ the RAG approach (Lewis et al., 2020), where typically a dense retriever (Karpukhin et al., 2020) retrieves documents from an evidence corpus and feeds them to a language model that produces the final answer.
These pipelines involve switching between heterogeneous models and are hard to train in concert.
Moreover, dense retrievers fail to generalize out-of-domain (Thakur et al., 2021).
Generative Retrieval
Generative retrieval techniques (Metzler et al., 2021) shift the onus of search from a non-parametric nearest neighbor scan to language models. Differentiable Search Index (Tay et al., 2022) memorizes a mapping from queries to opaque document identifiers; however, memorization struggles to generalize to an unseen corpus (Pradeep et al., 2023).
An alternative approach is to use natural language keys as document identifiers, where keys are decoded under constraints so that they lie in the corpus (De Cao et al., 2020; Bevilacqua et al., 2022). These systems still need an external model to generate answers.
1-Pager (Jain et al., 2023) unifies evidence and answer generation by generating a sequence of keywords that map to a document. However, isolated keywords limit context understanding and suffer similar pitfalls to lexical matching.
Recitation
Separate from retrieval augmentation, language models have been shown to recite entire passages from memory (Sun et al., 2022; Yu et al., 2022). But these passages are prone to hallucination.
Our aim is to intersect contextual passage generation with corpus grounding.
GopherCite (Menick et al., 2022), a noteworthy work in this direction, generates quotes verbatim from a small set of documents using constrained decoding.
Riches aims to scale this to a billion-token corpus.
Iterative reasoning and Search
Recently, there have been several efforts to improve multi-hop question answering through better reasoning (Asai et al., 2023) and planning (Adolphs et al., 2021; Yao et al., 2022). Language models have also been applied to the task of search to explore alternative paths (Yao et al., 2023; Hao et al., 2023).
Our work builds on these advances in reasoning while integrating search within generation.
3 Retrieving while Generating
We present a method of interleaving unconstrained text generation with the generation of retrieval keys that point into a retrieval corpus.
For example, Figure 1 shows generations that interleave unconstrained ‘thoughts’ with evidence sentences drawn from a predefined corpus for a multi-hop question answering task.
Later in this section we’ll introduce a number of different choices of retrieval key as well as a variety of tasks that benefit from interleaved generation and retrieval.
However, for now we simply define a retrieval key as a sequence of tokens that exists in a pre-defined finite set of sequences $K$, where every entry is associated with one or more documents in an underlying corpus $C$.
Formally, we focus on the sequence to sequence transduction task where we predict an output sequence $\mathbf{y}$ conditioned on an input sequence $\mathbf{x}$, and we mark the start and end of a retrieval key in $\mathbf{y}$ with the special markers << and >>.
If we let $Q(\mathbf{y})$ be a function that returns all retrieval key spans from $\mathbf{y}$ (i.e. every $q \in Q(\mathbf{y})$ appears in $\mathbf{y}$ as << $q$ >>), then we can update the standard autoregressive language modeling probability

$$P_\theta(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{|\mathbf{y}|} P(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}, \theta) \tag{1}$$

to include the indicator function $\mathbb{1}_K(q)$ that maps elements of $K$ onto one and otherwise to zero:

$$P_\theta(\mathbf{y} \mid \mathbf{x}, K) = \frac{1}{Z} \prod_{q \in Q(\mathbf{y})} \mathbb{1}_K(q) \prod_{i=1}^{|\mathbf{y}|} P(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}, \theta) \tag{2}$$

where $Z$ is a normalizing term that accounts for the probability mass assigned by Equation 1 to disallowed sequences.
In practice, we do not need to compute $Z$ and can sample from Equation 2 in the usual way, one token at a time, by simply zeroing out the probability of disallowed continuations as presented in Section 3.1.
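The per-token masking described above can be illustrated with a minimal Python sketch. The function name `mask_and_renormalize`, the dictionary representation of the next-token distribution and the toy probabilities are illustrative assumptions rather than part of the Riches implementation. Note also that the per-step renormalization used here is one simple choice for sampling and is not the same quantity as the global normalizing term in Equation 2.

```python
import math


def mask_and_renormalize(log_probs, allowed):
    """Zero out disallowed continuations and renormalize the survivors.

    log_probs: dict mapping token -> log-probability from the model.
    allowed:   set of tokens that keep the partial key inside the key set.
    """
    kept = {tok: lp for tok, lp in log_probs.items() if tok in allowed}
    if not kept:
        return {}  # dead end: no continuation stays inside the key set
    # Log-sum-exp over the surviving tokens gives the per-step normalizer.
    top = max(kept.values())
    log_z = top + math.log(sum(math.exp(lp - top) for lp in kept.values()))
    return {tok: lp - log_z for tok, lp in kept.items()}


# Toy next-token distribution (hypothetical numbers).
step = {"Paris": math.log(0.5), "London": math.log(0.3), "Rome": math.log(0.2)}
masked = mask_and_renormalize(step, allowed={"Paris", "Rome"})
```

In this toy step the disallowed token is dropped and the remaining mass is shared between the two permitted tokens in their original proportion. With beam search, where only the ranking of continuations matters, the renormalization could be skipped.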
3.1 Constrained Beam Decoding
We opt for beam search (Graves, 2012) as our decoding strategy to simulate a heuristic best-first search. Here, the action or next-node space is the entire vocabulary. At each time step, the LLM estimates the value of each node (token) given the paths explored so far and adds them to the fixed-size queue (the beam). Figure 2 visualizes how the beam progresses over decoding timesteps.
Unlike regular beam decoding, where the top decoded sequences have only small variations, constraints impose sparsity over the search space, resulting in diverse beams.
In Section 3.3, we discuss how beam search can hurt unconstrained generation and suggest a hybrid decoding strategy as a workaround.
Constrained decoding can also gain from more sophisticated algorithms such as value-based decoding (Ren et al., 2017) and look-ahead scoring and planning (Lu et al., 2021; Hao et al., 2023).
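To make the search view concrete, the following is a small sketch of beam search restricted to permitted continuations. It is not the Riches decoder: `score_fn` stands in for the LLM, `allowed_fn` stands in for the corpus index, and the three-key toy corpus and fixed token probabilities are hypothetical.

```python
import math

EOS = "</s>"
KEYS = [("a", "b"), ("a", "c"), ("d", "e")]  # toy retrieval keys (hypothetical)


def allowed_fn(prefix):
    """Return the tokens that keep `prefix` inside the toy key set."""
    n = len(prefix)
    nxt = {k[n] for k in KEYS if len(k) > n and k[:n] == prefix}
    if prefix in KEYS:
        nxt.add(EOS)  # a complete key may be closed
    return nxt


def score_fn(prefix):
    """Stand-in for the LLM: a fixed distribution that prefers 'x'."""
    probs = {"x": 0.4, "a": 0.3, "c": 0.1, "d": 0.1, "b": 0.05, "e": 0.03, EOS: 0.02}
    return {tok: math.log(p) for tok, p in probs.items()}


def constrained_beam_search(score_fn, allowed_fn, beam_size, max_len):
    beam = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, total in beam:
            log_probs = score_fn(prefix)
            for tok in allowed_fn(prefix):  # disallowed tokens are never expanded
                if tok in log_probs:
                    candidates.append((prefix + (tok,), total + log_probs[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for seq, total in candidates[:beam_size]:
            (finished if seq[-1] == EOS else beam).append((seq, total))
        if not beam:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)


results = constrained_beam_search(score_fn, allowed_fn, beam_size=2, max_len=3)
```

Although the stand-in model assigns its highest probability to the token `x`, that token never enters the beam because no key in the toy corpus contains it. Every returned sequence is a key from `KEYS` followed by the end marker.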
3.2 Efficient Constraints via the FM-Index
During decoding, model outputs are constrained to the corpus by masking out any continuation not in the corpus. To compute the continuations of a sequence, we use the FM-index (Ferragina and Manzini, 2000), a compressed suffix array augmented with additional data structures to support fast substring search operations. Unlike a trie, it is also highly space-economical due to the compression.
Given a prefix, the FM-index can efficiently compute the next allowed tokens in O(Vocab), independent of the corpus size. Below is the pseudo code for the modified decoding process.
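The original pseudocode listing is not available in this text. As a stand-in, the sketch below illustrates only the lookup that the index must support: given a prefix, return the set of tokens that can follow it somewhere in the corpus. It uses a plain, uncompressed sorted list of suffixes rather than an FM-index, so it has neither the compression nor the corpus-independent lookup cost described above. The class name `SuffixArrayIndex`, the separator token and the two toy documents are assumptions made for illustration.

```python
from bisect import bisect_left

SEP = "\x00"  # hypothetical document separator, assumed absent from the vocabulary


class SuffixArrayIndex:
    """Toy substring index over tokenized documents (not an FM-index)."""

    def __init__(self, documents):
        tokens = []
        for doc in documents:
            tokens.extend(doc)
            tokens.append(SEP)  # keeps matches from running across documents
        # Materializing every suffix is quadratic in memory; fine for a toy corpus.
        self.suffixes = sorted(tuple(tokens[i:]) for i in range(len(tokens)))

    def next_tokens(self, prefix):
        """Return all tokens that follow `prefix` somewhere in the corpus."""
        prefix = tuple(prefix)
        i = bisect_left(self.suffixes, prefix)  # first suffix >= prefix
        allowed = set()
        # Suffixes that start with `prefix` are contiguous in sorted order.
        while i < len(self.suffixes) and self.suffixes[i][:len(prefix)] == prefix:
            suffix = self.suffixes[i]
            if len(suffix) > len(prefix) and suffix[len(prefix)] != SEP:
                allowed.add(suffix[len(prefix)])
            i += 1
        return allowed


index = SuffixArrayIndex([
    "Ben Stone was played by Michael Moriarty".split(),
    "Ben Stone is a character".split(),
])
```

During constrained decoding, the returned set would serve as the mask: any vocabulary entry outside `index.next_tokens(prefix)` has its probability zeroed before the next token is chosen.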
3.3 Adaptive Beam Size
In Section 4 we introduce some tasks that interleave constrained and unconstrained generation.
The constrained generations must be precise—to match the target retrieval key exactly.
The unconstrained generations are generally more robust to small variations in surface form—these only need to convey the correct information to a reader, or to provide the model room for a ‘thought’ trace when reasoning about a response.
Beam search is intended here to ensure that the model does not get stuck irretrievably after generating an incorrect constrained prefix. To make proper use of it, we introduce an adaptive decoding strategy that uses full beam decoding for the sensitive constrained sequences and greedy decoding for unconstrained ones. In practice, a constrained prefix is expanded to the beam-size most probable next tokens, while an unconstrained prefix is expanded to only the single most probable next token.
This is expected to leave the rest of the beam to be used largely for constrained sequences. Section 6.1 shows experiments with multiple decode modes.
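The expansion rule can be sketched in a few lines of Python. This is an illustrative reading of the rule rather than the actual implementation: the helper names `expansion_width` and `expand`, the token-list representation of a hypothesis and the toy log-probabilities are assumptions.

```python
import heapq


def expansion_width(tokens, beam_size, key_open="<<", key_close=">>"):
    """Return how many continuations to keep for one hypothesis."""
    inside_key = False
    for tok in tokens:
        if tok == key_open:
            inside_key = True
        elif tok == key_close:
            inside_key = False
    # Full beam while decoding a retrieval key, greedy otherwise.
    return beam_size if inside_key else 1


def expand(tokens, log_probs, beam_size):
    """Expand one hypothesis into its top continuations."""
    width = expansion_width(tokens, beam_size)
    best = heapq.nlargest(width, log_probs.items(), key=lambda kv: kv[1])
    return [(tokens + [tok], lp) for tok, lp in best]


# Hypothetical next-token log-probabilities.
toy_log_probs = {"a": -0.1, "b": -1.0, "c": -2.0, "d": -3.0}
unconstrained = expand(["keyword:", "Sarah"], toy_log_probs, beam_size=3)
constrained = expand(["<<", "Sarah"], toy_log_probs, beam_size=3)
```

Here the unconstrained hypothesis contributes a single candidate to the next step, while the hypothesis inside an open `<<` span contributes three, so most of the beam is spent where an exact match is required.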
3.4 Indexing Strategies
The FM-index used by Riches supports efficient indexing of all sub-strings in a corpus, which is useful when we want to generate corpus text verbatim.
However, it is not clear that this is the best choice of retrieval key for the auto-regressive decoder in Section 3.1. A key question in index construction is the document representation used in indexing.
In traditional lexical retrieval systems, documents are represented by the terms in them, with transformations such as stemming and weighting by corpus statistics (Robertson et al., 2009).
Neural retrieval systems transform raw text into dense vector representations and offload representation computation to the neural network. But even in this case, proper document chunking and/or multi-vector document representations significantly impact final performance (Lee et al., 2021; Khattab and Zaharia, 2020).
In this section, we introduce a few different choices of retrieval keys, including a propositional index that requires indexing-time neural computation. A key consideration here is the interplay between the retrieval index and the search strategy.
Document Title and Section Headers
Many retrieval corpora such as Wikipedia have consistent structures in the form of titles and sometimes sub-titles and metadata. This provides a hierarchical structure such that one can first decode titles, sub-titles and then the document.
Paragraph Sub-string
A natural option for a retrieval key is any sub-string of the unit of text being indexed.
In most open domain question answering approaches, the paragraph is the de facto unit of evidence.
We can index paragraphs efficiently using the FM-index (Section 3.2) and decode sub-strings directly with Riches to get pointers into the retrieval corpus.
It should be noted that this yields an inherently many-to-many mapping between paragraphs and retrieval keys, but that the mapping is in effect one-to-one for longer sequences of tokens.
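The many-to-many behaviour of sub-string keys is easy to see on a toy corpus. The sketch below uses a brute-force scan instead of an index, and the function name `paragraphs_containing`, the paragraph ids and the two example paragraphs are hypothetical.

```python
def paragraphs_containing(key, paragraphs):
    """Return the ids of paragraphs in which the token sequence `key` occurs."""
    n = len(key)
    return sorted(
        pid
        for pid, toks in paragraphs.items()
        if any(toks[i:i + n] == key for i in range(len(toks) - n + 1))
    )


# Two toy paragraphs, tokenized on whitespace.
paragraphs = {
    "p1": "The Prudential Center is in Newark".split(),
    "p2": "The Prudential Center is home to the New Jersey Devils".split(),
}

short_key = paragraphs_containing("The Prudential Center".split(), paragraphs)
long_key = paragraphs_containing("is home to".split(), paragraphs)
```

The short key points to both paragraphs, while the slightly longer and more specific one resolves to a single paragraph, which is the effect the text describes for longer token sequences.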
Sentence Sub-string
Similarly, individual sentences form a natural retrieval key. Sentences are smaller units of information than passages, but may not be interpretable on their own.
Propositional Index
The above choices do not perform any non-trivial indexing step, unlike standard approaches in information retrieval where documents are mapped to sparse or dense vectors.
The omission of this indexing step may be desirable, but it also forces Riches to deal with the non-uniform and diffuse information in raw text.
An alternative that is closer, in intent, to the offline indexing step used by other IR systems is to map each indexed chunk to a set of uniformly structured propositions (Min et al., 2023; Chen et al., 2022). A proposition is a stand-alone unit that efficiently encodes a small, atomic chunk of factual information. For example, instead of the sentence "He has 7M followers on Twitter", a proposition would be decontextualized to "Tom Cruise has 7M followers on Twitter."
We adopt a pre-existing propositional index from Chen et al. (2023), described in Section 5.1.
Section 6.1 compares various retrieval keys for the ODQA task, with illustrations in Section A.4.
4 Interleaving Retrieval and Generation
We have presented a method of interleaving unconstrained text generation with constrained generation of retrieval keys.
In this section we introduce a handful of tasks that make use of this interleaving either as a core task requirement, or as a means to an end by interleaving ‘thoughts’ with retrieval actions to help guide search.
Attributed Question Answering
We apply Riches to the open domain question answering (ODQA) task where we score both the ability to correctly predict a short answer string and retrieve attribution for that answer Bohnet et al. (2022).
See Table 1 for examples.
Multi-hop Question Answering
Interleaving between generation and retrieval can be powerful in multi-hop reasoning, where the model needs to retrieve and stitch together knowledge from multiple sources.
Examples of Riches outputs for multi-hop QA are given in Table 2.
"Thinking" for Retrieval
Multi-step questions often require breaking down a query into smaller steps and reasoning or planning what to retrieve next. Foreshadowing retrieval with thoughts is crucial in this context. It helps direct the retrieval process, avoid repetitions, and, more importantly, allows for iterating upon and correcting previously erroneous retrievals. A few such demonstrations can be found in Table 2.
Query | Dense Retriever (GTR-xxl) | Riches Retrieval Output |
---|---|---|
# Leveraging Parametric knowledge | | |
who was the actor that played ben stone on law and order | Benjamin Stone was a former Law and Order character | Ben Stone was played by Michael Moriarty |
jharkhand which festival is associated with cattle worship | Gaijatra is the festival of cows | Sohrai is a cattle festival of Jharkhand. |
# Better Context understanding | | |
who plays at the prudential center in newark | The Prudential Center is in Newark. | The Prudential Center is home to the New Jersey Devils |
when was coffee first made into a drink | In 1724, society in Lima discovered an exotic drink called coffee. | The earliest credible evidence of coffee-drinking or knowledge of the coffee tree appears in the middle of the 15th century. |
Table 1: Examples of Riches vs. dense retrieval for single-hop QA. Only the retrieved text is shown for illustration.
Query | Riches Output |
---|---|
who had a baby at 100 in the bible (# Keywords guide retrieval) | keyword: Sarah’s age at birth of Isaac << Sarah was 90 years old when Isaac was born. >> keyword: Abraham’s age at birth of Isaac << Abraham was 100 years old when Isaac was born. >> answer: Sarah and Abraham |
when did the united states start using the death penalty (# keywords help recover from inaccurate retrieval) | keyword: Capital punishment in the United States << Capital punishment was reinstated in the United States in 1976. >> keyword: First execution in the United States << The first recorded execution in the United States took place in 1608 at the Jamestown Colony in Virginia. >> answer: 1608 |
What distinction is held by the former NBA player who was a member of the Charlotte Hornets during their 1992-93 season and was head coach for the WNBA team Charlotte Sting? (# Planning what to retrieve) | keyword: member of the Charlotte Hornets during their 1992-93 season << Muggsy Bogues played for the Charlotte Hornets from 1988 to 1997. >> keyword: Muggsy Bogues’ distinction << Muggsy Bogues is the shortest player ever to play in the National Basketball Association (NBA). >> answer: shortest player ever to play in the National Basketball Association (NBA) |
Table 2: Example iterative retrieval outputs from Riches. Remarks are annotated as (# Comments).
5 Experimental Setup
5.1 Datasets
Queryset
Our experiments are focused on open domain question answering tasks including both single and multi-hop benchmarks. For single-hop, we use the Open-NQ Kwiatkowski et al. (2019) dataset. To evaluate multi-hop reasoning, we look into Hotpot-QA (Yang et al., 2018) and Musique-Ans (Trivedi et al., 2022). The latter includes varying hops and different composition operations, offering a rich test-bed for how well Riches can generalize across a diverse range of queries.
Corpus
Section 3.4 describes multiple strategies to index the corpus. Each type of retrieval key needs to be accompanied by its own corpus. Title, passage and sentence keys are derived from the Wikipedia corpus presented in Bohnet et al. (2022).
For propositions, we re-use the Factoid-Wiki corpus built by Chen et al. (2023). This is derived from the Bohnet et al. (2022) corpus by decomposing passages into smaller, compact propositions using a finetuned Flan-T5-large (Wei et al., 2021) model. We drop the titles from Factoid-Wiki and only use the propositions (see Section A.2).
5.2 Evaluation
The standard metric for ODQA benchmarks has predominantly been F1 answer match accuracy.
However, language models are prone to hallucination, and F1 alone can be misleading because the answer may not be conditioned on the evidence. Attribution (Rashkin et al., 2021) helps us trade off answer accuracy against faithfulness to the evidence.
Thus, we measure two competing metrics: i) end-to-end answer accuracy with F1 and ii) attribution of the answer to evidence using AutoAIS (Bohnet et al., 2022).
AutoAIS, or AIS for short, is automatically computed by classifying whether the evidence text entails the question and predicted answer pair. We re-use the NLI scorer and formulation from Bohnet et al. (2022) (see details in Section A.2).
The evidence text here is the concatenation of all retrieval keys in the Riches output. The unconstrained thoughts are discarded from evaluation.
Only the top beam output is considered for evaluation.
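A rough Python sketch of the post-processing implied by this setup is given below: it extracts the `<< ... >>` spans as evidence, reads the text after `answer:` as the prediction, and computes a token-overlap F1. The helper names, the `answer:` parsing rule and the simplified F1 (lowercased whitespace tokens, without the punctuation and article normalization that QA evaluation scripts commonly apply) are assumptions. The NLI-based AutoAIS scorer is not reproduced here.

```python
import re
from collections import Counter

KEY_PATTERN = re.compile(r"<<\s*(.*?)\s*>>", re.DOTALL)


def extract_evidence(output):
    """Concatenate all retrieval keys; unconstrained thoughts are dropped."""
    return " ".join(KEY_PATTERN.findall(output))


def extract_answer(output):
    """Return the text following 'answer:' (empty string if absent)."""
    match = re.search(r"answer:\s*(.*)", output)
    return match.group(1).strip() if match else ""


def token_f1(prediction, reference):
    """Simplified token-level F1 between two answer strings."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Output in the format of Table 2.
output = (
    "keyword: Sarah's age at birth of Isaac "
    "<< Sarah was 90 years old when Isaac was born. >> "
    "keyword: Abraham's age at birth of Isaac "
    "<< Abraham was 100 years old when Isaac was born. >> "
    "answer: Sarah and Abraham"
)
evidence = extract_evidence(output)
answer = extract_answer(output)
```

The `evidence` string would then be passed, together with the question and `answer`, to an entailment classifier for attribution scoring.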
5.3 Models and Inference
Throughout our experiments, we use off-the-shelf instruction-tuned models in a few-shot setting, without any fine-tuning. We test the instruction-tuned versions of PaLM2-M and its larger variant PaLM2-L (Anil et al., 2023) based on stacked Transformer architecture. We use 3 example demonstrations in our prompt (Section A.1), with different sets of examples for single-hop (NQ) and multi-hop (Hotpot, Musique) datasets. The unconstrained sequences or thoughts are formulated as hint keywords.
Our final setup uses a beam of 10 with constrained decoding (Section 3.1), adaptive beam size (Section 3.3) and propositions as retrieval keys. Later in Section 6, we ablate these choices. Note that only the top-beam result is considered for evaluation.
5.4 Baselines
We experiment with 3 types of baselines: no retrieval, the standard dense retriever and an iterative retriever suited for multi-hop QA. We opt for baselines that test the out-of-domain system performance, similar to Riches.
No Retrieval
We compare to a few-shot unconstrained baseline using the same 3-shot prompt as Riches, allowing for chain-of-thought reasoning. The setup generates an answer along with hallucinated evidence that is not grounded in a corpus. This is a measure of the model’s memorization capabilities.
Generalized Dense Retriever
For single-hop QA, we compare our approach against the Generalized T5 retriever (GTR-xxl, 11B variant) (Ni et al., 2021). GTR undergoes multi-staged training, first on unsupervised web-mined corpus and then supervised search datasets including NQ. It has been shown to generalize well out-of-domain.
However, GTR and other conventional dense retrievers provide only retrieved documents, not the answers themselves. To extract answers, we use the PaLM2-M model in a few-shot setting (see Section A.1 for the details).
Since Riches generates a single output with a varying number of interleaved documents, direct comparison with dense retrievers that fetch a fixed top-k documents is challenging. We set k to a value equivalent to the mean number of documents Riches fetches for single-hop. When retrieval keys are different, such as passages vs. propositions, we approximately match the number of tokens used by both setups. In our final experiments, we compare against k=1 passage and k=2 propositions for GTR-xxl.
Iterative Retrieval (Iter)
For multi-hop QA, we adopt a popular method where the question is decomposed into sub-queries (Khot et al., 2022). At each step, passages are retrieved for a sub-query and fed as input for the next query, until the process converges to an answer.
The method has the same surface form as Riches, except for the key distinction that each step requires switching between a heterogeneous mix of models.
In our experiments, we retrieve the top-1 document with GTR-xxl and use PaLM2-M few-shot for both decomposing the query and generating the final answer (see the prompt in Section A.1). The maximum number of steps is set to 4, by which point most queries converge.
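The control flow of such an iterative baseline can be outlined as follows. This is only a schematic: `decompose`, `retrieve` and `answer` are hypothetical callables standing in for the few-shot decomposer, the dense retriever and the few-shot answerer, and the stub behaviour below is invented purely to make the loop runnable.

```python
def iterative_retrieval(question, decompose, retrieve, answer, max_steps=4):
    """Alternate between sub-query generation and top-1 retrieval."""
    evidence = []
    for _ in range(max_steps):
        sub_query = decompose(question, evidence)  # call to the decomposer model
        if sub_query is None:  # decomposer signals convergence
            break
        evidence.append(retrieve(sub_query))  # call to a separate retriever
    return answer(question, evidence), evidence


# Hypothetical stubs: a fixed two-step plan and a lookup-table retriever.
PLAN = ["sub-query 1", "sub-query 2"]
PASSAGES = {"sub-query 1": "passage 1", "sub-query 2": "passage 2"}


def decompose(question, evidence):
    return PLAN[len(evidence)] if len(evidence) < len(PLAN) else None


def retrieve(sub_query):
    return PASSAGES[sub_query]


def answer(question, evidence):
    return "answer from " + ", ".join(evidence)


final_answer, used = iterative_retrieval("toy question", decompose, retrieve, answer)
```

Each pass through the loop crosses a model boundary twice, once for the decomposer and once for the retriever, which is the switching cost that a single decoding pass avoids.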
6 Results
Retrieval Key | Hits@1 |
---|---|
Title | 19.5 |
Paragraph with Title | 15.5 |
Paragraph | 19.0 |
Sentence with Title | 19.1 |
Sentence | 20.6 |
Proposition | 33.9 |
Table 3: Comparison of retrieval keys on NQ.
Retriever | Answerer | NQ F1 | NQ AutoAIS | Hotpot F1 | Hotpot AutoAIS | Musique F1 | Musique AutoAIS |
---|---|---|---|---|---|---|---|
No Retrieval | | | | | | | |
Unconstrained PaLM2-M | | 41.4 | - | 39.1 | - | 18.3 | - |
Dense Retrieval | | | | | | | |
GTR Passage | PaLM2-M | 41.9 | 48.7 | 34.9 | 19.6 | 7.2 | 17.9 |
GTR Proposition | PaLM2-M | 36.6 | 63.2 | 27.4 | 18.5 | 10.5 | 20.4 |
Iterative | PaLM2-M | 34.4 | 66.8 | 34.2 | 30.9 | 17.5 | 38.4 |
Riches | | | | | | | |
PaLM2-M | | 40.2 | 59.2 | 41.0 | 36.5 | 19.1 | 39.6 |
PaLM2-L | | 46.7 | 59.6 | 51.1 | 35.6 | 28.2 | 37.5 |
Table 4: Overall performance comparison for Riches. For dense retrievers, the top-k retrieved documents are fed to the few-shot answerer, where k=1 for GTR Passage and k=2 for GTR Proposition. For iterative retrieval, up to 4 documents are retrieved, with k=1 at each step.
Beam | F1 | AutoAIS |
---|---|---|
1 | 19.3 | 26.1 |
5 | 35.8 | 58.7 |
10 | 40.2 | 59.2 |
Table 5: Effect of beam size on NQ with PaLM2-M.
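Unconst. Keywords | Adaptive Beam | NQ F1 | NQ AIS | Hotpot F1 | Hotpot AIS |
---|---|---|---|---|---|
X | X | 37.9 | 57.5 | 39.2 | 33.9 |
✓ | X | 36.9 | 51.5 | 38.4 | 32.3 |
✓ | ✓ | 40.2 | 59.2 | 41.0 | 36.5 |
Table 6: Interleaving unconstrained keywords and retrieval keys with adaptive beam search. Greedily decoding the unconstrained sub-sequences lets the constrained retrievals make full use of the beam.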
In the following sections, we investigate the key building blocks of Riches: i) indexing strategies (Section 3.4) amenable to auto-regressive decoding, ii) the effect of beam decoding (Section 3.1), and iii) suitable mechanisms to interleave thoughts and retrieval keys (Section 3.3). Finally, we compare Riches against conventional retrieval systems. We also analyze wins and losses in detail to understand the strengths and pitfalls of the system.
6.1 Riches building blocks
Retrieval Keys
We explore the following retrieval key candidates as detailed in Section 3.4:
a) Title: Wikipedia page and section titles, ranking paragraphs within the section using TF-IDF scores.
b) Paragraph with Title: Decodes the page title, section title, and full paragraph.
c) Paragraph: Decodes the paragraph only.
d) Sentence: Uses individual sentences.
e) Proposition: Uses atomic information units derived from paragraphs.
Table 3 shows that among the retrieval keys explored, the propositional index is best aligned with our decoding search strategy, perhaps because its compact nature is best suited to autoregressive decoding. An in-depth analysis of retrieval keys is provided in Section A.4.
In the following experiments, we use propositions as our retrieval key.
Effect of Beam size
Table 5 shows how greedy decoding can get stuck with poor retrieval keys. A larger beam enables better search space exploration, albeit with diminishing returns. In our final experiments, we use a beam of 10.
Interleaving with Adaptive Beam
Table 6 shows the impact of interleaving thoughts with retrieval keys.
First, we note that an adaptive beam is crucial for interleaving unconstrained and constrained sequences. Without an adaptive beam, minor irrelevant variations in unconstrained thoughts can consume and overwhelm the available space in the beam. By greedily decoding unconstrained sequences, the beam space is preserved for backtracking during document search.
Once an adaptive beam is in place, the insertion of keywords improves both answer and retrieval performance, reminiscent of the chain-of-thought technique, here used to enable better retrieval.
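The adaptive beam policy described above can be illustrated with a small helper that picks the beam width from the decoding state. This is a sketch under stated assumptions, not the authors' decoder: it assumes constrained spans are delimited by the `<<` and `>>` markers used in the prompts of Section A.1, and the function names are hypothetical.

```python
BEAM_SIZE = 10  # beam used in the final experiments


def in_constrained_span(generated: str) -> bool:
    """True when the text so far contains an unclosed '<<' retrieval marker."""
    return generated.count("<<") > generated.count(">>")


def adaptive_beam_width(generated: str, beam_size: int = BEAM_SIZE) -> int:
    """Width 1 (greedy) for free-form thoughts, full beam inside '<< ... >>'."""
    return beam_size if in_constrained_span(generated) else 1


# Width chosen at three points of one generation.
widths = [
    adaptive_beam_width("keyword: NFL revenues "),
    adaptive_beam_width("keyword: NFL revenues << NFL rev"),
    adaptive_beam_width("keyword: NFL revenues << NFL revenues are high. >> keyword:"),
]
```

Keeping the width at 1 outside retrieval spans means paraphrases of a thought never compete for beam slots, so all slots remain available for alternative corpus continuations.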
6.2 Overall Results
Table 4 shows the overall performance of Riches across various datasets. First, we compare with the no-retrieval baseline and observe that jointly retrieving and answering does not negatively impact the model’s answering capabilities.
For single-hop tasks, Riches competes well with dense retrievers, offering higher answer accuracy at the expense of attribution. In multi-hop QA, Riches excels, outperforming iterative baselines by +15 F1 points on Hotpot and +11 on Musique, with comparable or better attribution. The increase in answer accuracy with the larger PaLM2-L model suggests improved performance with larger model sizes. Notably, Riches achieves these results with a single inference pass, unlike the Iterative baseline, which requires a model call at each sub-query step.
6.3 Qualitative analysis
We inspect 50 win and 50 loss examples to analyze the strengths and weaknesses of the system.
Wins
Several properties distinguish Riches from dense retrievers: a) Riches allows large language models to utilize their parametric knowledge for retrieval. Since the search operation in Riches is independent of corpus size, it can employ much larger models at query time. b) The inherent alignment of instruction-tuned models enables them to retrieve contextually relevant passages, whereas dense retrievers may sometimes latch onto keywords. c) The interleaved thoughts guide the model toward more accurate retrievals. Table 1 demonstrates these scenarios for single-hop retrievals and Table 2 for multi-hop retrievals.
Can the model retrieve what it doesn’t know?
A language model may hold stale or incorrect information. However, Riches can often override the model’s pre-existing knowledge and generate correct answers by constraining on the corpus (Section A.3).
Failure mode | Queries (%) |
---|---|
Index Failure | 40% |
Search Failure | 52% |
Attribution Failure | 8% |
Table 7: Loss categories for Riches on Hotpot-QA.
Losses
We inspect 50 failed queries and categorize the losses (Table 7) as follows: a) Index failure: the proposition is absent from the index or is not decontextualized. b) Search failure: the proposition exists in the index but could not be generated. c) Attribution failure: the answer is only partially attributed, with the LLM hallucinating based on partial evidence. See Section A.3 for examples.
7 Conclusion
Retrieval has so far been alienated from the rapid progress in instruction tuning. This work makes the following contributions:
i) an approach that can seamlessly integrate retrieval with generation; ii) a thorough investigation of indexing and search strategies that enable such an approach to be effective; iii) a proof of concept of the capabilities of such a system on a variety of QA tasks.
We hope the ideas introduced in this work fuel progress in aligning retrieval to generation and simplifying it.
8 Limitations
First, we note the limitations of our experimental setup. All our experiments are based on Wikipedia, a corpus heavily seen during pre-training. This work does not analyze how Riches fares on corpora unseen during pre-training. Furthermore, we only examine a handful of factoid question-answering tasks due to the lack of objective evaluations. Performance on tasks such as long-form QA is deferred to future work.
There are also certain inherent limitations of Riches. It forces verbatim emission of corpus text, which might be overkill for tasks where a similarity-based metric is sufficient. Riches lacks the ability to retrieve dozens of documents, a necessity for certain summarization tasks. For long documents with diffuse information, rewriting into propositions adds complexity and can be cumbersome.
Lastly, while the search operation in Riches is independent of corpus size, the use of beam search and the communication between the FM-index and the Transformer model can slow down inference.
9 Ethical Considerations
All artifacts used in this paper, including models, datasets, and baselines, are under permissive licenses and publicly available. We have attempted to provide detailed information to facilitate the reproduction of our results.
Our findings are based on English-language data from Wikipedia, and we have not tested the generalizability of our claims to other languages or domains.
Lastly, the datasets used in this work are not expected to contain any offensive content. However, it is important to note that Large Language Models (LLMs) can exhibit biases related to gender, race, and region, and are also prone to hallucination. Although Riches aims to ground its generation in an external corpus, some biases may still be present.
References
- Adolphs et al. (2021) Leonard Adolphs, Benjamin Boerschinger, Christian Buck, Michelle Chen Huebscher, Massimiliano Ciaramita, Lasse Espeholt, Thomas Hofmann, Yannic Kilcher, Sascha Rothe, Pier Giuseppe Sessa, et al. 2021. Boosting search engines with interactive agents. arXiv preprint arXiv:2109.00527.
- Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
- Asai et al. (2023) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.
- Bevilacqua et al. (2022) Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-tau Yih, Sebastian Riedel, and Fabio Petroni. 2022. Autoregressive search engines: Generating substrings as document identifiers. arXiv preprint arXiv:2204.10628.
- Bohnet et al. (2022) Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, et al. 2022. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
- Chen et al. (2022) Sihao Chen, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, and Tal Schuster. 2022. Propsegment: A large-scale corpus for proposition-level segmentation and entailment recognition. arXiv preprint arXiv:2212.10750.
- Chen et al. (2023) Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Dong Yu, and Hongming Zhang. 2023. Dense x retrieval: What retrieval granularity should we use? arXiv preprint arXiv:2312.06648.
- De Cao et al. (2020) Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2020. Autoregressive entity retrieval. arXiv preprint arXiv:2010.00904.
- Ferragina and Manzini (2000) Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data structures with applications. In Proceedings 41st annual symposium on foundations of computer science, pages 390–398. IEEE.
- Graves (2012) Alex Graves. 2012. Sequence transduction with recurrent neural networks.
- Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.
- Jain et al. (2023) Palak Jain, Livio Baldini Soares, and Tom Kwiatkowski. 2023. 1-pager: One pass answer generation and evidence retrieval. arXiv preprint arXiv:2310.16568.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
- Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
- Lee et al. (2021) Jinhyuk Lee, Alexander Wettig, and Danqi Chen. 2021. Phrase retrieval learns passage retrieval, too. arXiv preprint arXiv:2109.08133.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
- Lu et al. (2021) Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, et al. 2021. Neurologic a* esque decoding: Constrained text generation with lookahead heuristics. arXiv preprint arXiv:2112.08726.
- Menick et al. (2022) Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. 2022. Teaching language models to support answers with verified quotes.
- Metzler et al. (2021) Donald Metzler, Yi Tay, Dara Bahri, and Marc Najork. 2021. Rethinking search: making domain experts out of dilettantes. ACM SIGIR Forum, 55(1):1–27.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
- Ni et al. (2021) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
- Pradeep et al. (2023) Ronak Pradeep, Kai Hui, Jai Gupta, Adam D Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Q Tran. 2023. How does generative retrieval scale to millions of passages? arXiv preprint arXiv:2305.11841.
- Rashkin et al. (2021) Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2021. Measuring attribution in natural language generation models. arXiv preprint arXiv:2112.12870.
- Ren et al. (2017) Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. 2017. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 290–298.
- Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
- Sun et al. (2022) Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. 2022. Recitation-augmented language models. arXiv preprint arXiv:2210.01296.
- Tay et al. (2022) Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022. Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems, 35:21831–21843.
- Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663.
- Trivedi et al. (2022) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.
- Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- Yu et al. (2022) Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2022. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063.
Appendix A
Dataset | Split | Queries | Hops |
---|---|---|---|
Open-NQ | Test | 3610 | 1 |
Hotpot | Dev | 7405 | 2 |
MuSiQue-Ans | Dev | 2412 | 2-4 |
Table 8: ODQA datasets used in our experiments.
Corpus | Docs | Avg Words |
---|---|---|
Passage | 40M | 58.5 |
Sentence | 114M | 21.0 |
Propositions | 256M | 11.0 |
Table 9: Retrieval corpora used in our experiments.
A.1 Experiment Details
For given input query, write 1-3 passages to answer the query.
Write a hint keyword and a passage contained within << and >>.
A passage must be a complete sentence and not a phrase. It must contain complete context for answering the query and should not begin with it, he, they etc.
Do not repeat any passages. Aim for new keywords.
question: The football manager who recruited Cristiano Ronaldo managed Manchester United during what timeframe?
passage: keyword: Cristiano Ronaldo’s recruiting manager << Alex Ferguson recruited Cristiano Ronaldo >> keyword: Sir Alex Ferguson’s tenure at Manchester United << Sir Alex Ferguson managed Manchester United from 1986 to 2013. >>
answer: 1986 to 2013
question: Were Eatza Pizza and Your Pie founded in the same state?
passage: keyword: Eatza Pizza founded in state << Eatza Pizza was founded in Arizona >> keyword: Your Pie founded in state << Your Pie was founded in Athens, Georgia >>
answer: no
question: In which stadium do the teams owned by Myra Kraft’s husband play?
passage: keyword: Myra Kraft’s husband << Robert Kraft’s wife is Myra Kraft. >> keyword: Robert Kraft’s team << Robert Kraft is the owner of the New England Patriots. >> keyword: New England Patriots stadium << Gillette Stadium is the home of the New England Patriots. >>
answer: Gillette Stadium
question: <question>
passage:
Table 10: Few-shot prompt used for multi-hop QA in Riches.
For given input query, write 1-3 passages to answer the query.
Write a hint keyword and a passage contained within << and >>.
A passage must be a complete sentence and not a phrase. It must contain complete context for answering the query and should not begin with it, he, they etc.
Do not repeat any passages. Aim for new keywords.
question: who is the owner of phoenix mall pune?
passage: keyword: Phoenix Market City owner << Phoenix Market City is developed by Phoenix Mills Limited. >>
answer: Phoenix Mills Limited
question: what brings in more money nba or nfl?
passage: keyword: NFL revenues << NFL revenues are well over $10 billion per season. >> keyword: NBA revenue << NBA amasses about $6 billion annually. >>
answer: NFL
question: when was the french national anthem adopted?
passage: keyword: French national anthem << La Marseillaise became the national anthem of France. >> keyword: La Marseillaise adoption << La Marseillaise was adopted by France in 1795. >>
answer: 1795
question: question
passage:
Table 11: Few-shot prompt used for single-hop QA in Riches.
In-context prompts
We use two different sets of few-shot demonstrations for the single-hop (NQ) and multi-hop (Hotpot, Musique) datasets, displayed in Table 11 and Table 10 respectively. Both prompts carry the same instruction, but the multi-hop variant provides demonstrations with multiple evidence passages.
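Since both prompts ask for output of the form `keyword: ... << passage >>` followed by an `answer:` line, a generation can be split back into its parts with two regular expressions. The sketch below is one possible parser and only assumes that surface format; the helper name `parse_output` is hypothetical and is not part of the described system.

```python
import re
from typing import List, Tuple

# A thought ("keyword: ...") followed by a retrieved passage inside << ... >>.
_STEP = re.compile(r"keyword:\s*(.*?)\s*<<\s*(.*?)\s*>>", re.DOTALL)
# The final answer line.
_ANSWER = re.compile(r"answer:\s*(.*)")


def parse_output(text: str) -> Tuple[List[Tuple[str, str]], str]:
    """Return ([(keyword, passage), ...], answer) from one model output."""
    steps = [(keyword, passage) for keyword, passage in _STEP.findall(text)]
    match = _ANSWER.search(text)
    answer = match.group(1).strip() if match else ""
    return steps, answer


example = (
    "keyword: NFL revenues << NFL revenues are well over $10 billion per season. >> "
    "keyword: NBA revenue << NBA amasses about $6 billion annually. >>\n"
    "answer: NFL"
)
steps, answer = parse_output(example)
```

The passages recovered this way are the attributed evidence, and the keywords are the unconstrained thoughts that preceded each retrieval.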
Computing constraints
An example of constrained decoding is illustrated in Figure 3.
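To make the idea of corpus-constrained decoding concrete, the sketch below restricts each decoding step to tokens that continue some entry of a toy corpus. It is a simplified illustration and not the paper's implementation: a plain prefix trie stands in for the FM-index, whitespace-separated words stand in for model tokens, and `score` is a hypothetical stand-in for the LLM's next-token scores.

```python
from typing import Callable, Dict, Iterable, List, Set

END = "<eos>"  # hypothetical end-of-key marker


def build_trie(keys: Iterable[List[str]]) -> Dict:
    """Nested-dict trie over tokenized retrieval keys."""
    root: Dict = {}
    for tokens in keys:
        node = root
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return root


def allowed_next_tokens(trie: Dict, prefix: List[str]) -> Set[str]:
    """Tokens that keep the prefix consistent with at least one corpus key."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node.keys())


def constrained_greedy(trie: Dict, score: Callable[[List[str], str], float]) -> List[str]:
    """Greedy decoding where every step is masked to the allowed tokens."""
    prefix: List[str] = []
    while True:
        allowed = allowed_next_tokens(trie, prefix)
        if not allowed:
            return prefix
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(allowed), key=lambda tok: score(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)


corpus = [
    "Mack Rides is based in Germany".split(),
    "Mack Rides is a manufacturer".split(),
]
trie = build_trie(corpus)
options = allowed_next_tokens(trie, ["Mack", "Rides", "is"])
decoded = constrained_greedy(trie, lambda prefix, tok: 1.0 if tok == "based" else 0.0)
```

Whatever the scorer prefers, the decoded sequence is guaranteed to be an entry of the corpus, which is what makes the output directly attributable. Replacing the greedy choice with a beam gives the search the ability to backtrack from a poor prefix.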
Answer the ’question’ only based on the given ’passage’.
If the ’passage’ lacks context or is not relevant, say ’Cannot answer’ else say generate a short answer.
Do not answer the query from outside the scope of the passage.
question: what brings in more money nba or nfl?
passage: NFL revenues are well over $10 billion per season.
NBA amasses about $6 billion annually.
answer: NFL
question: when did they put warnings on cigarette packs
passage: Tobacco packaging 1978’s warning was not removed, so now every cigarette pack contains both warnings (one on each lateral).
answer: Cannot Answer
question: when was the french national anthem adopted?
passage: La Marseillaise became the national anthem of France.
La Marseillaise was adopted by France in 1795.
answer: 1795
question: question
passage: passage
answer:
Table 12: Few-shot prompt used for extracting answers from propositions.
You are given a multi-hop ‘question‘. Decompose it into simple single-hop query, passage. And finally write the overall answer.
question: In what country was Lost Gravity manufactured?
query: Who manufactured The Lost Gravity (roller coaster)?
passage: Lost Gravity is a steel roller coaster at Walibi Holland manufactured by Mack Rides.
query: Mack Rides is from which country?
passage: Mack Rides is based in Germany.
answer: Germany
question: Do James Cameron and Christopher Nolan share their profession?
query: What is the profession of James Cameron?
passage: James Cameron is a Director.
query: What is the profession of Christopher Nolan?
passage: Christopher Nolan is a Director.
answer: Yes
question: The actor that stars as Joe Proctor on the series "Power" also played a character on "Entourage" that has what last name?
query: Who is the actor that stars as Joe Proctor on the series "Power"?
passage: Joe Proctor on the series "Power" was potrayed by Jerry Ferrara.
query: Jerry Ferrara played a character on Entourage named what?
passage: Jerry Ferrara played the character of Assante on Entourage.
answer: Assante
question: <question>
<sub-query steps so far>
Table 13: Few-shot prompt for the Iterative baseline.
Baselines
For the dense-retriever baseline, answers are extracted from retrieved passages with an external reader. We use PaLM2-M with a few-shot prompt (Table 12).
For the iterative retrieval baseline, we use PaLM2-M for both query decomposition and answering. At each step, the model can choose to generate a sub-query or the final answer. The unified prompt is provided in Table 13.
A.2 Evaluation
Datasets
We use the Musique-Ans subset of Musique (Trivedi et al., 2022), which consists of answerable queries.
Details of the query sets evaluated can be found in Table 8.
To make retrieval challenging, we use the full Wikipedia corpus for retrieval (Table 9). This differs from the typical Hotpot and Musique settings, which use the first paragraph of each Wikipedia page (5M documents) and the documents associated with the query set (1.3M), respectively.
Metrics
AutoAIS is an automated way of measuring AIS (Attributable to Identified Source) (Rashkin et al., 2021). AutoAIS formulates evaluation as a Natural Language Inference (NLI) task that asks a model whether the question and answer are entailed by the provided evidence. We re-use a T5-11B checkpoint finetuned on a collection of NLI tasks from Bohnet et al. (2022).
Question answering is formulated as an NLI task as follows:
premise: <retrieved evidence1> <retrieved evidence2> ... hypothesis: The answer to the question ’<question>’ is ’<predicted answer>’
The NLI scorer provides a calibrated score between 0 (not attributed) and 1 (completely attributed), which is averaged over the dataset to obtain the final AutoAIS score.
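The aggregation step can be sketched as follows. This is an illustrative outline only: `nli_score` is a hypothetical stand-in for the finetuned T5-11B scorer, the dictionary keys are invented for the example, and the toy scorer merely makes the snippet runnable.

```python
from typing import Callable, Dict, List


def build_claim(question: str, answer: str) -> str:
    """Turn a question-answer pair into the declarative claim to be checked."""
    return f"The answer to the question '{question}' is '{answer}'"


def auto_ais(
    examples: List[Dict],
    nli_score: Callable[[str, str], float],
) -> float:
    """Mean attribution score over a dataset.

    Each example holds 'question', 'answer' and a list of 'retrievals'.
    nli_score(evidence, claim) must return a value in [0, 1].
    """
    if not examples:
        return 0.0
    total = 0.0
    for ex in examples:
        evidence = " ".join(ex["retrievals"])  # concatenate all retrieved evidence
        claim = build_claim(ex["question"], ex["answer"])
        total += nli_score(evidence, claim)
    return total / len(examples)


# Toy scorer (hypothetical): full credit only if the evidence mentions 1795.
def toy_nli_score(evidence: str, claim: str) -> float:
    return 1.0 if "1795" in evidence else 0.0


data = [
    {
        "question": "when was the french national anthem adopted?",
        "answer": "1795",
        "retrievals": ["La Marseillaise was adopted by France in 1795."],
    },
    {
        "question": "when was the french national anthem adopted?",
        "answer": "1795",
        "retrievals": ["La Marseillaise became the national anthem of France."],
    },
]
score = auto_ais(data, toy_nli_score)
```

In the second toy example the answer is the same but the evidence does not support it, so it contributes nothing to the average; this is the behaviour that separates attribution from answer accuracy.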
Query | Retrievals | Comment |
---|---|---|
Index failure | | |
how many episodes of touching evil are there | A total of 35 episodes were produced. | Proposition lacks context |
who is the coach for the ottawa senators | D. J. Smith is the head coach of the Ottawa Senators. | Incorrect proposition generated |
Search failure | | |
what age do you need to be to buy a bb gun | 18 years of age or older. | Partial phrase decoded |
how many seasons of the bastard executioner are there | The Bastard Executioner is an American historical fiction drama television series. The Bastard Executioner is an American historical fiction drama television series. | Repeated retrieval |
who plays gram on the young and the restless | The Young and the Restless is an American television soap opera. The Young and the Restless was first broadcast on March 26, 1973. | Irrelevant |
Table 14: Example losses in Riches.
Question | Unconstrained Generation | Constrained Generation |
---|---|---|
who was the actor that played ben stone on law and order | Ben Stone was played by actor Jerry Orbach. | Ben Stone was played by Michael Moriarty. |
how many pieces in a terry’s chocolate orange | Terry’s Chocolate Orange is made with 32 segments | Terry’s Chocolate Orange is divided into 20 segments |
who sings the song only in my dreams | The song "Only in My Dreams" is sung by the band Air Supply. | Only in My Dreams is the debut single by Debbie Gibson. |
Table 15: Unconstrained vs. constrained generation. Examples where the unconstrained LLM emits an incorrect answer, but constraining on the corpus helps Riches override this prior knowledge and arrive at the correct answer.
A.3 Extended Qualitative Analysis
Table 14 provides examples of the loss categories defined in Section 6.
Table 15 showcases a few selected examples where the unconstrained model emits an incorrect answer, but constraining on the corpus guides it towards the correct answer.
A.4 Index representation qualitative analysis
In this section we analyse retrievals from a few select examples from the OpenNQ development dataset. We compare retrievals using different document representations in the index, including: title + tf-idf, passage, sentence and proposition (see Section 3.4).
Question: who holds the world record for the most world records
Gold Answer: Ashrita Furman
- title + tf-idf: List of world records in track cycling. World records in the sport of track cycling are ratified by the Union Cycliste Internationale (UCI).
- Passage: According to the Guinness Book of World Records, the record for the largest event was set in 2009.
- Sentence: As of January 2021, he has 1.7 million followers on Twitter
- Proposition: Ashrita Furman holds the Guinness world record for the most Guinness world records.
- Source text for proposition: Ashrita Furman (born Keith Furman, September 16, 1954) is a Guinness World Records record-breaker. As of 2017, Furman has set more than 600 official Guinness Records and currently holds 531 records, thus holding the Guinness world record for the most Guinness world records.
Analysis: In this example, only the proposition-based retrieval was able to retrieve a correct answer. Part of the reason why the passage or sentence representation is hard to retrieve with auto-regressive decoding is that the main evidence for this answer in the "source text for proposition" comes at the end of a complex sentence ("[…] thus holding the Guinness world record for the most Guinness world records").
Question: who has the most number one single hits
Gold Answer: The Beatles
• title + tf-idf: List of Billboard Hot 100 number-one singles of the 1990s. This article is about the American Billboard Hot 100 chart held during the 1990s.
• Passage: According to the Recording Industry Association of America (RIAA), the album has sold more than 500,000 copies in the United States,
• Sentence: As of 2021, The Beatles remain the act with the most Christmas number-one singles in the United Kingdom.
• Proposition: The Beatles hold the record for most number-one hits on the Billboard Hot 100.
• Source text for proposition: The Beatles, Awards and achievements. As of 2017, they hold the record for most number-one hits on the Billboard Hot 100, with twenty.
Analysis: The correct answer is retrieved in this example using both the sentence-based and the proposition-based representations. However, the sentence representation here does not provide proper justification for the answer. Note that the propositional representation correctly decontextualizes the pronoun in "[…] they hold the record […]" to "The Beatles hold the record […]", making the retrieval easier with constrained decoding.
Question: how many episodes of sabrina the teenage witch are there
Gold Answer: 163
• title + tf-idf: Sabrina the Teenage Witch (1996 TV series). The first four seasons aired on ABC from September 27, 1996 to May 5, 2000. The final three seasons ran on The WB from September 22, 2000 to April 24, 2003.
• Passage: Sabrina the Teenage Witch is an American television sitcom created by Nell Scovell, based on the Archie Comics series of the same name.
• Sentence: Sabrina the Teenage Witch is an American television sitcom created by Nell Scovell, based on the Archie Comics series of the same name.
• Proposition: Sabrina the Teenage Witch had 163 episodes.
• Source text for proposition: This is an episode list for Sabrina the Teenage Witch, an American sitcom that debuted on ABC in 1996. From Season 5, the program was aired on The WB. The series ran for seven seasons totaling 163 episodes.
Analysis: All retrievals using non-propositional representations select part of the main article for "Sabrina the Teenage Witch". This article, however, does not contain the answer to the question. In the propositional case, there is a straightforward proposition constructed from a passage of the "List of Sabrina the Teenage Witch episodes" article. Note that the source passage contains a reference that becomes ambiguous out of context ("The series" is decontextualized to "Sabrina the Teenage Witch" in the proposition).
Question: what is dj’s boyfriends name on full house
Gold Answers: Steve Hale, Steven "Steve" Hale, rich kid Nelson, or Viper
• title + tf-idf: Full House (season 8). The eighth and final season of the ABC sitcom Full House originally aired between September 27, 1994 and May 23, 1995.
• Passage: Full House (1987–1995) and its Netflix sequel Fuller House.
• Sentence: In the 1990s, she appeared in the films Blues Brothers 2000
• Proposition: Steve Hale was D.J.’s boyfriend in seasons six and seven.
• Source text for proposition: Full House, Production, Casting. As babies, the children were played by Daniel and Kevin Renteria, and in season six, the roles of the twins were succeeded by Blake and Dylan Tuomy-Wilhoit. The last main character added was Steve Hale, who was D.J.’s boyfriend in seasons six and seven. He was played by Scott Weinger.
Analysis: The source sentence with the correct answer presents a challenge for auto-regressive decoding, since the sentence prefix focuses on an aspect unrelated to the question ("The last main character added […]"). With propositionalization, the sentence structure becomes aligned with the question, but decoding it requires that the model already knows the answer to the question, given that the first entity in the sentence is the answer.
Question: who is the girl in green day 21 guns
Gold Answer: Lisa Stelly
• title + tf-idf: Boulevard of Broken Dreams (Green Day song), Music video. The video won six awards at the MTV Video Music Awards in 2005, most notably for Video of the Year. It also won Best Group Video, Best Rock Video, Best Direction, Best Editing, and Best Cinematography.
• Passage: "21 Guns" is a song by American rock band Green Day. It was released as the second single from their eighth studio album, 21st Century Breakdown (2009), and serves as the sixteenth track from the album. The single was released through Reprise Records on May 25, 2009 as a digital download and July 14, 2009 as a CD single.
• Sentence: "21 Guns" is a song by American rock band Green Day.
• Proposition: The girl in the music video is Teresa Lourenco.
• Source text for proposition: The music video for "Again" features Kravitz with his girlfriend in his apartment (Gershon), whom he does not seem to be interested in. Similar to the song’s lyrical content, he meets a girl (Teresa Lourenco), who works as a waitress in a restaurant/diner.
Analysis: In this case, all retrievals fail to retrieve the correct answer. In the case of the proposition-based representation, the model decodes a proposition whose subject is an ambiguous reference ("The girl") that has not been properly decontextualized (the source passage above makes it clear that the reference is not related to the question). Interestingly, the source passage with the correct answer requires an inferential step, and its proposition representations are not decontextualized properly.
Source text with correct answer: 21 Guns (song), Music video. The video takes place with the band and the album’s two protagonists Christian (Josh Boswell) and Gloria (Lisa Stelly) taking refuge in a white room after robbing a bank.
Relevant generated propositions:
• The video takes place with the band and the album’s two protagonists Christian and Gloria.
• Gloria is played by Lisa Stelly.
To retrieve this passage using the proposition-based representation, we would need to disambiguate "The video" to "21 Guns" and perform inference over these two propositions. Alternatively, proposition generation could produce more complex propositions containing both pieces of information, such as: The "21 Guns" video takes place with the protagonist Gloria, played by Lisa Stelly.
Question: how many seasons of vampire diaries r there
Gold Answer: eight, or 8
• title + tf-idf: The Vampire Diaries. The Vampire Diaries is an American supernatural teen drama television series developed by Kevin Williamson and Julie Plec, based on the book series of the same name written by L. J. Smith. The series premiered on The CW on September 10, 2009, and concluded on March 10, 2017, having aired 171 episodes over eight seasons.
• Passage: The Vampire Diaries is an American supernatural teen drama television series developed by Kevin Williamson and Julie Plec, based on the book series of the same name written by L. J. Smith. The series premiered on The CW on September 10, 2009, and concluded on March 10, 2017, having aired 171 episodes over eight seasons.
• Sentence: The series premiered on The CW on September 10, 2009, and concluded on March 10, 2017, having aired 171 episodes over eight seasons.
• Proposition: The Vampire Diaries is an American supernatural drama television series.
• Source text for proposition: The Vampire Diaries is an American supernatural drama television series that premiered on The CW on September 10, 2009, and concluded on March 10, 2017 after airing eight seasons.
Analysis: In this case, only the retrieval using the proposition-based representation is incorrect. We believe the retrieval fails here due to improper decontextualization of the correct answer passage. The sentence with the correct answer includes the proposition: The series aired 171 episodes over eight seasons. Making it difficult for the model to
A.5 Computations involved
Evaluating the precise compute cost of Riches depends on the specific implementation of the decoding algorithm, but we can sketch the key operations involved in retrieval: indexing and search.
Indexing cost depends on the number of items in the corpus, the size of the model used to rewrite each passage into propositions, and the average passage length. The overall indexing cost grows with these quantities and is similar in magnitude to the cost of encoding the corpus in dense retrieval, differing only by a constant factor. Note that our experiments use a T5-large backbone (770M) for Riches, much smaller than the T5-xxl (11B) used in the dense baselines.
Now consider the search operation. At each auto-regressive step, besides standard decoding, the only additional operation is computing FM-index constraints, which consumes CPU resources. Although the index itself is efficient, communication between the index on the host and the Transformer model on the GPU/TPU adds latency to each decoding step. In contrast, RAG systems retrieve documents from the index in a single nearest-neighbor scan. Even there, however, the retrieved documents still need to be encoded as input to the language model.
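The per-step constraint lookup described above can be illustrated with a minimal sketch. This is not the Riches implementation: a naive dictionary of token n-grams stands in for the FM-index, whitespace splitting stands in for the model's tokenizer, and a word-overlap score stands in for the LLM's next-token scores. The names `NaiveSubstringIndex`, `constrained_greedy_decode`, and `overlap_score` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


class NaiveSubstringIndex:
    """Toy stand-in for an FM-index (assumption: quadratic memory is fine
    for a demo). Maps every token n-gram that occurs in the corpus to the
    set of tokens that can follow it."""

    def __init__(self, corpus):
        self.next_tokens = defaultdict(set)
        for doc in corpus:
            tokens = doc.split()  # whitespace tokens, not a real tokenizer
            for start in range(len(tokens)):
                for end in range(start, len(tokens)):
                    # tokens[start:end] is the prefix, tokens[end] continues it
                    self.next_tokens[tuple(tokens[start:end])].add(tokens[end])

    def allowed(self, prefix):
        """Tokens that keep `prefix` a contiguous token span of the corpus."""
        return self.next_tokens.get(tuple(prefix), set())


def constrained_greedy_decode(index, score_fn, max_steps=20):
    """Greedy decoding in which every step is restricted to corpus continuations.

    In a real system the lookup in `index` would run on the host CPU, while
    `score_fn` would be the language model running on an accelerator.
    """
    decoded = []
    for _ in range(max_steps):
        candidates = index.allowed(decoded)  # the extra per-step operation
        if not candidates:  # no continuation exists in the corpus: stop
            break
        # sorted() makes tie-breaking deterministic for this toy example
        best = max(sorted(candidates), key=lambda tok: score_fn(decoded, tok))
        decoded.append(best)
    return decoded


corpus = [
    "Sabrina the Teenage Witch had 163 episodes",
    "The Vampire Diaries aired eight seasons",
]
query_words = set("how many episodes of sabrina the teenage witch are there".split())


def overlap_score(prefix, token):
    """Hypothetical scorer: prefers tokens that also appear in the question."""
    return 1.0 if token.lower() in query_words else 0.0


index = NaiveSubstringIndex(corpus)
result = " ".join(constrained_greedy_decode(index, overlap_score))
print(result)
```

With this toy corpus, the decoder reproduces the first proposition verbatim, because every step may only emit a token that extends a span already present in the corpus. The sketch also makes the latency point concrete: the constraint lookup sits inside the decoding loop, so it is paid once per generated token rather than once per query.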