LongRAG: Enhancing Retrieval-Augmented Generation
with Long-context LLMs
Abstract
In the traditional RAG framework, the basic retrieval units are normally short.
The common retrievers like DPR normally work with 100-word Wikipedia paragraphs.
Such a design forces the retriever to search over a large corpus to find the ‘needle’ unit.
In contrast, the readers only need to extract answers from the short retrieved units.
Such an imbalanced ‘heavy’ retriever and ‘light’ reader design can lead to sub-optimal performance.
In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a ‘long retriever’ and a ‘long reader’.
LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before.
By increasing the unit size, we significantly reduce the total units from 22M to 600K.
This significantly lowers the burden on the retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki).
Then we feed the top-k retrieved units (about 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction.
Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA (full-wiki), which is on par with the (fully-trained) SoTA model.
Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
Ziyan Jiang, University of Waterloo, ziyanjiang528@gmail.com
Xueguang Ma, University of Waterloo, x93ma@uwaterloo.ca
Wenhu Chen, University of Waterloo, wenhuchen@uwaterloo.ca
1 Introduction
Retrieval-Augmented Generation (RAG) methods have long been employed to enhance large language models (LLMs) Mialon et al. (2023). Knowledge in the form of natural language can be entirely offloaded from the parametric knowledge of LLMs by leveraging a standalone retrieval component over an external corpus. The existing RAG framework tends to use short retrieval units, such as 100-word passages in popular open-domain question-answering tasks Chen et al. (2017); Lewis et al. (2020); Karpukhin et al. (2020). The retriever is tasked with finding the “needle” (i.e., the precise tiny retrieval unit) in the “haystack” (i.e., the massive corpus with tens of millions of information units). Subsequently, the retrieved units are passed to the reader to generate the final response. The reader, by contrast, only needs to extract answers from these retrievals, which is a fairly easy task. This kind of imbalanced design, with a “heavy” retriever and a “light” reader, puts too much pressure on the retriever. Therefore, existing RAG models Izacard and Grave (2020b) have to recall huge numbers of units, such as the top-100/200, combined with an additional re-ranker to achieve the best performance. Moreover, short retrieval units can lead to semantic incompleteness due to document truncation, causing information loss and ultimately hurting the end performance.
This design choice was made in an era when NLP models were heavily restricted by their ability to handle long contexts.
With the recent advances in long-context language models, the retriever and reader can potentially handle up to 128K or even millions of tokens as input Reid et al. (2024); Achiam et al. (2023).
In this paper, we revisit this design choice for open-domain question answering and propose the LongRAG framework as a solution to balance the workload between the retriever and the reader, as illustrated in Figure 1. There are three important designs in our framework:
1. Long Retrieval Unit: By using entire Wikipedia documents or grouping multiple related documents, we can construct long retrieval units with more than 4K tokens. This design could also significantly reduce the corpus size (number of retrieval units in the corpus). Then, the retriever’s task becomes much easier with more complete information.
2. Long Retriever: The long retriever will identify coarse relevant information for the given query by searching through all the long retrieval units in the corpus. Only the top 4 to 8 retrieval units (without re-ranking) are used for the next step.
3. Long Reader: The long reader will further extract answers from the concatenation of retrievals, which is normally around 30K tokens. We simply prompt an existing long-context LM (like Gemini or GPT4) with the question to produce the answers in a zero-shot fashion.
These three novel designs significantly boost the overall performance of RAG on open-domain question-answering tasks like NQ Kwiatkowski et al. (2019) and HotpotQA Yang et al. (2018). LongRAG has several advantages: 1) It does not require additional re-rankers and the best results can be attained by only considering the top 4-8 retrieved units. 2) The long retrieval unit amalgamates comprehensive information from related documents, which can be used directly to answer multi-hop questions without iterative retrieval.
In our experiments, we adopt off-the-shelf retrievers like BGE Xiao et al. (2023) and readers like Gemini-1.5-Pro (Reid et al., 2024) or GPT-4o OpenAI (2024) without any tuning on NQ or HotpotQA. We reduce the NQ corpus size from 22M to 600K document units, which improves the answer recall@1 from 52% (DPR) to 71%.
Similarly, we reduce the HotpotQA corpus size from 5M to 500K, which improves the recall@2 from 47% (DPR) to 72%. The improvement in the retriever can significantly benefit the reader model. By exploiting the long-context understanding ability of GPT-4o, LongRAG can achieve an EM of 62% on NQ and 64% on HotpotQA. These results are comparable to the strongest fully trained RAG models like Atlas Izacard et al. (2022) and MDR Xiong et al. (2020b).
We perform ablation studies in subsection 4.4 to show why longer retrieval units are necessary. Given a budget of 40K recall tokens, with ‘short retriever units’, we can increase the number of recalled units to reach a very high recall score (91% for recall@200). However, the end performance dips significantly due to the huge number of ‘hard negatives’, which confuse the reader. With ‘long retriever units’, we observe an entirely different trend. As we recall more units (from 1 to 8 units), both the recall and end performance increase or plateau. The impact of ‘hard negatives’ is much less severe in LongRAG. This shows that LongRAG can better exploit the advances in long-context LLMs (the reader). As long-context methods evolve, the performance of LongRAG will continue to improve. Therefore, we believe modern RAG systems should reconsider the granularity of their retrieval units to exploit the advantages of current long-context LLMs.
Meanwhile, there is still room for improvement in our framework, particularly the need for stronger long embedding models, as shown in Table 3. Additionally, more general methods to formulate long retrieval units beyond hyperlinks will be helpful.
2 Related Work
2.1 Retrieval-Augmented Generation
Augmenting language models with information retrieved from large corpora has become a popular and effective approach for knowledge-intensive tasks, particularly open-domain question answering. The predominant architecture follows a retriever-reader style Chen et al. (2017); Guu et al. (2020), where the input query retrieves information from a corpus, and a language model uses this information as additional context to make a final prediction. Recent work has focused on improving the retriever (Karpukhin et al., 2020; Xiong et al., 2020a; Qu et al., 2020; Xiong et al., 2020b; Khalifa et al., 2023), enhancing the reader (Izacard and Grave, 2020b; Cheng et al., 2021; Yu et al., 2021; Borgeaud et al., 2022), fine-tuning the retriever and reader jointly (Yu, 2022; Izacard et al., 2022; Singh et al., 2021; Izacard and Grave, 2020a), and integrating the retriever with the black-box language model (Yu et al., 2023; Shi et al., 2023; Trivedi et al., 2022). However, the impact of document granularity on the effectiveness and efficiency of the retrieval-augmented generation pipeline remains underexplored.
2.2 Long Context Large Language Models
The effectiveness of Transformer-based models is hindered by the quadratic increase in computational cost relative to sequence length, especially when dealing with long context inputs. Different approaches have been proposed to mitigate this cost, including sliding memory windows and chunk segmentation (Hao et al., 2022; Ratner et al., 2023; Zhu et al., 2024b). FlashAttention Dao et al. (2022) has also been a pivotal strategy, significantly reducing the memory footprint to almost linear w.r.t. sequence length.
RoPE Su et al. (2021) and ALiBi Press et al. (2021) position encodings, which have been widely used in the literature, have shown potential to enable length extrapolation. Recent endeavors have explored diverse strategies to tackle this challenge, mainly position reorganization (Jin et al., 2024; An et al., 2024) and position interpolation (Chen et al., 2023a; Peng et al., 2023; Liu et al., 2024). Furthermore, alternative architectures beyond the Transformer have been explored to handle long inputs more naturally. These diverse approaches claim to enhance the capabilities of LLMs in processing long context inputs more efficiently.
2.3 Long Context Embedding
Recent efforts also increased the context length for embedding models, extending the supported text snippet length from a limit of 512 tokens to 32k tokens. Typically, the development of long-context embedding models involves first obtaining a long-context backbone model. This can be achieved either by pre-training with long inputs from scratch Günther et al. (2023); Nussbaum et al. (2024); Chen et al. (2024) or by utilizing existing large language models that support longer context Wang et al. (2023).
Additionally, some works extend the capabilities of existing embedding models to handle long contexts by applying LLM context window extension methods to embedding models Zhu et al. (2024a); Peng and Quesnelle (2023), or by employing state-space encoder models Saad-Falcon et al. (2024).
3 LongRAG
Our proposed LongRAG framework comprises two components: the Long Retriever and the Long Reader. An illustrative example of these two components is depicted in Figure 2.
3.1 Long Retriever
The traditional RAG framework employs smaller retrieval units and prioritizes retrieving the exact fine-grained short context containing the answer. In contrast, our proposed LongRAG framework places greater emphasis on recall, aiming to retrieve relevant context with much coarser granularity. This design choice shifts more burden from the retriever to the reader, which must extract the exact answers from the relevant context.
We denote our corpus for retrieval as $\mathcal{C} = \{d_1, d_2, \ldots, d_D\}$, which is a collection of $D$ documents. Formally speaking, the long context retriever is a function $\mathcal{F}: (q, \mathcal{C}) \rightarrow \mathcal{C}_\mathcal{F}$ that takes as input a question $q$ and a corpus $\mathcal{C}$ and returns a filtered set of texts $\mathcal{C}_\mathcal{F} \subset \mathcal{C}$. In traditional RAG, $\mathcal{C}_\mathcal{F}$ is usually small, containing hundreds of tokens that should hold the exact information related to the question $q$. In our framework, $\mathcal{C}_\mathcal{F}$ is usually more than 4K tokens and contains relevant but not exact information related to the question $q$. The long retriever function $\mathcal{F}$ is then divided into three steps:
Formulate long retrieval units
A function is applied to the corpus to form $M$ retrieval units: $\mathcal{G}(\mathcal{C}) = \{g_1, g_2, \ldots, g_M\}$. In traditional RAG, the retrieval unit $g$ is typically a short passage split from a document $d$, containing hundreds of tokens. In our framework, $g$ could be as long as the whole document or even a group of documents, resulting in much longer retrieval units. We group the documents based on their relationships, using hyperlinks embedded within each document. The grouping algorithm is shown in Algorithm 1. The output group is a list of documents that are related to each other. A longer retrieval unit has two advantages: first, it ensures the semantic integrity of each retrieval unit; second, it provides much richer context for tasks that require information from multiple documents.
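The grouping step can be illustrated with a small sketch. Algorithm 1 is not reproduced in this section, so the code below should not be read as the authors' algorithm: it is a simplified greedy variant, assuming that hyperlinks are available as an adjacency map and that each group is capped by a token budget. The names `group_documents`, `links`, `token_counts`, and `max_tokens` are hypothetical.

```python
from typing import Dict, List, Set


def group_documents(
    links: Dict[str, Set[str]],
    token_counts: Dict[str, int],
    max_tokens: int = 4000,
) -> List[List[str]]:
    """Greedily merge each document with its linked neighbours under a token cap.

    Assumption (not from the paper): every key of `links` has an entry in
    `token_counts`, and a document belongs to exactly one group.
    """
    assigned: Set[str] = set()
    groups: List[List[str]] = []
    # Visit well-connected documents first so that they seed the groups.
    for doc in sorted(links, key=lambda d: (-len(links[d]), d)):
        if doc in assigned:
            continue
        group, size = [doc], token_counts[doc]
        assigned.add(doc)
        for neighbour in sorted(links[doc]):
            if neighbour in assigned or neighbour not in token_counts:
                continue
            if size + token_counts[neighbour] > max_tokens:
                continue  # adding this neighbour would exceed the budget
            group.append(neighbour)
            size += token_counts[neighbour]
            assigned.add(neighbour)
        groups.append(group)
    return groups


links = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": set()}
token_counts = {"A": 1500, "B": 2000, "C": 1000, "D": 300}
groups = group_documents(links, token_counts, max_tokens=4000)
```

In this toy example, `A` absorbs `B` (3,500 tokens in total) but not `C`, which would push the group past the 4,000-token cap, so `C` and the unlinked `D` each form their own unit.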
Similarity search
We utilize an encoder, denoted as $E_Q(\cdot)$, to map the input question to a $d$-dimensional vector. Additionally, we employ a different encoder, $E_C(\cdot)$, to map the retrieval unit to a $d$-dimensional vector. We define the similarity between the question and the retrieval unit using the dot product of their vectors:
$$\mathrm{sim}(q, g) = E_Q(q)^{T} E_C(g)$$
In LongRAG settings, computing $E_C(g)$ is challenging given the length of $g$, so we resort to the following approximation:
$$\mathrm{sim}(q, g) = E_Q(q)^{T} E_C(g) \approx \max_{g' \subseteq g} \left( E_Q(q)^{T} E_C(g') \right)$$
We approximate it by taking the maximum score over all chunks $g'$ within the retrieval unit $g$, akin to the MaxP design in Dai and Callan (2019). We consider different levels of granularity for $g'$, including passage level, document level, and the complete grouped document. The empirical study of these settings is in Table 3.
With this similarity score setup, we retrieve the top $k$ retrieval units closest to the given query. For efficient retrieval, we precompute the embedding of each chunk $g'$ and perform exact inner product search with a FAISS index (Johnson et al., 2019).
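The max-over-chunks scoring described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes chunk embeddings have already been computed by some encoder, and it uses a brute-force loop in plain Python in place of the FAISS index mentioned in the text. The names `unit_score` and `retrieve_top_k` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit_score(query_vec: Vector, chunk_vecs: Sequence[Vector]) -> float:
    """Score a retrieval unit by its best-matching chunk (MaxP-style)."""
    return max(dot(query_vec, chunk) for chunk in chunk_vecs)


def retrieve_top_k(
    query_vec: Vector,
    units: Dict[str, Sequence[Vector]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k retrieval units with the highest max-chunk score."""
    scored = [(uid, unit_score(query_vec, chunks)) for uid, chunks in units.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings; real embeddings would come from the encoders.
query = [1.0, 0.0]
units = {
    "g1": [[0.2, 0.9], [0.8, 0.1]],
    "g2": [[0.5, 0.5]],
    "g3": [[0.1, 0.3], [0.9, 0.0]],
}
top2 = retrieve_top_k(query, units, k=2)
```

Here `g3` ranks first because one of its chunks aligns closely with the query, even though its other chunk does not; a unit only needs one strong chunk to be retrieved.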
Aggregate retrieval result
We concatenate the top $k$ retrieval units into the long context as the retrieval result, denoted by $\mathcal{C}_\mathcal{F} = \mathrm{Concat}(g^1, g^2, \ldots, g^k)$. Depending on the selection of retrieval units, a larger retrieval unit size will result in a smaller value of $k$ being used. For instance, if the retrieval unit is a passage, $k$ is roughly 100 or more; if it’s a document, $k$ is around 10; and for grouped documents as retrieval units, we typically set $k$ to 4 to 8.
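As a small illustration of the aggregation step, the sketch below joins the text of the top-ranked units in rank order. The function name `build_long_context` and the blank-line separator are assumptions made for the example; the section does not specify how units are delimited.

```python
from typing import Dict, List


def build_long_context(
    ranked_unit_ids: List[str],
    unit_texts: Dict[str, str],
    k: int,
    separator: str = "\n\n",
) -> str:
    """Concatenate the text of the top-k retrieval units, best unit first."""
    return separator.join(unit_texts[uid] for uid in ranked_unit_ids[:k])


unit_texts = {"g1": "Text of unit 1.", "g2": "Text of unit 2.", "g3": "Text of unit 3."}
context = build_long_context(["g3", "g1", "g2"], unit_texts, k=2)
```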
3.2 Long Reader
The long reader operates straightforwardly. We feed the related instruction $i$, the question $q$, and the long retrieval result $\mathcal{C}_\mathcal{F}$ into an LLM, enabling it to reason over the long context and generate the final output.
It’s important that the LLM used in the long reader can handle long contexts and does not exhibit excessive position bias. We select Gemini-1.5-Pro Reid et al. (2024) and GPT-4o OpenAI (2024) as our long reader given their strong ability to handle long context input.
We utilize different approaches for short and long contexts. For short contexts, typically containing fewer than 1K tokens, we instruct the reader to directly extract the answer from the provided context retrieved from the corpus. For long contexts, typically longer than 4K tokens, we empirically find that using a similar prompt as for short contexts, where the model extracts the final answer directly from the long context, often leads to decreased performance. Instead, the most effective approach is to utilize the LLM as a chat model. Initially, it outputs a long answer, typically spanning a few words to a few sentences. Subsequently, we prompt it to generate a short answer by further extracting it from the long answer. The prompt is provided in Appendix 6.1.
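The two-step reading procedure for long contexts can be sketched as follows. This is only an outline under stated assumptions: `llm` is a hypothetical callable that maps a prompt string to a completion string (standing in for a Gemini or GPT-4o client), and the prompt wording here is invented for illustration rather than taken from the paper's appendix.

```python
from typing import Callable

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]


def read_long_context(llm: LLM, question: str, long_context: str) -> str:
    """Two-step reading: free-form long answer first, then a short extracted answer."""
    # Step 1: let the model answer in chat style over the long context.
    long_prompt = (
        "Answer the question using the Wikipedia context below.\n\n"
        f"Context:\n{long_context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    long_answer = llm(long_prompt)

    # Step 2: extract a short answer from the long answer only.
    short_prompt = (
        "Extract the shortest span from the long answer that answers the question.\n\n"
        f"Question: {question}\n"
        f"Long answer: {long_answer}\n"
        "Short answer:"
    )
    return llm(short_prompt).strip()


# A stub model for demonstration; a real reader would call an LLM API here.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    if len(calls) == 1:
        return "The Eiffel Tower is located in Paris, France."
    return " Paris \n"


result = read_long_context(fake_llm, "Where is the Eiffel Tower?", "(long context here)")
```

Note that the second prompt does not include the long context, so the extraction step operates on a few sentences rather than on tens of thousands of tokens.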
4 Experiments
In this section, we will first detail the dataset we adopt, and then demonstrate the retriever performance. Finally, we will show the end question-answering performance.
| Retrieval Unit | Corpus Size | Num of Retrieval Units | Average Num of Tokens (Corpus) | Average Num of Tokens (Test Set) | Answer Recall (AR) |
|---|---|---|---|---|---|
| Passage | 22M | 1 | 120 | 130 | 52.24 |
| | | 100 | 12K | 14K | 89.92 |
| | | 200 | 24K | 28K | 91.30 |
| Document | 3M | 1 | 820 | 4K | 69.45 |
| | | 5 | 4K | 18K | 85.37 |
| | | 10 | 8K | 34K | 88.12 |
| Grouped Documents | 600K | 1 | 4K | 6K | 71.69 |
| | | 4 | 16K | 25K | 86.30 |
| | | 8 | 32K | 50K | 88.53 |
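Table 1: Retrieval performance on NQ. Using the long-context retriever (with an average of up to 6K tokens per retrieval unit) compresses the corpus size by up to 30 times (from 22M to 600K) and improves top-1 answer recall by about 20 points (from 52.24 to 71.69). In addition, long-context retrieval needs far fewer retrieval units (about 10x fewer) to achieve comparable results. Integrating long-context retrieval therefore significantly alleviates the burden on the retriever model.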
4.1 Data
Our proposed methods are tested on two Wikipedia-related question answering datasets: Natural Questions and HotpotQA.
Natural Questions
(Kwiatkowski et al., 2019) was designed for end-to-end question answering. The questions were mined from real Google search queries and the answers were spans in Wikipedia articles identified by annotators. This dataset contains 3,610 questions.
HotpotQA
(Yang et al., 2018) consists of two-hop questions over diverse topics. We focus on the fullwiki setting in which two Wikipedia passages are required to answer the questions. Since the gold passages for the test set are not available, we follow prior work (Xiong et al., 2020b) and evaluate on the development set, which has 7,405 questions. There are two main question types in HotpotQA: (1) comparison questions usually require contrasting two entities and (2) bridge questions can be answered by following a connecting entity that links one document to another.
Wikipedia (Knowledge Source)
We use different versions of English Wikipedia for different datasets following previous works Lewis et al. (2020); Yang et al. (2018). For NQ, we use the Wikipedia dumps from December 20, 2018, which contain approximately 3 million documents and 22 million passages. For HotpotQA, we use the abstract paragraphs from the October 1, 2017 dump, which contain around 5 million documents. For each page, only the plain text is extracted and all structured data sections such as lists, tables and figures are stripped from the document.
4.2 Retrieval Performance
Metrics
Retrieval performance is measured using Answer Recall (AR) and Recall (R). For NQ, we use only answer recall, while for HotpotQA, we use both metrics. Answer Recall is the recall of the answer string in all the retrieved documents that we plan to use in the reader. For example, if the retrieval unit is at the “passage” level and the number of retrieval units is 100, answer recall measures whether the answer string is present in these 100 passages. For HotpotQA, we compute AR only for questions with span answers, specifically the “bridge” type questions, while ignoring yes/no and comparison questions, following previous work (Khalifa et al., 2022). Recall used for HotpotQA measures whether the two gold documents are present in all the retrieved results. For example, if the retrieval unit is at the “document” level and the number of retrieval units is 10, recall measures whether both gold documents are present among the 10 retrieved documents.
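The two metrics can be written down directly from their definitions. The sketch below is an illustration rather than the evaluation code used in the paper: matching is done by case-insensitive substring search, which is an assumption (the exact string normalization is not specified here), and the function names are hypothetical.

```python
from typing import Iterable, List, Sequence


def answer_recall_hit(retrieved_texts: Sequence[str], answers: Iterable[str]) -> bool:
    """True if any gold answer string appears in any retrieved unit."""
    lowered = [text.lower() for text in retrieved_texts]
    return any(ans.lower() in text for ans in answers for text in lowered)


def recall_hit(retrieved_doc_ids: Iterable[str], gold_doc_ids: Iterable[str]) -> bool:
    """True if every gold document is among the retrieved documents."""
    return set(gold_doc_ids) <= set(retrieved_doc_ids)


def rate(hits: List[bool]) -> float:
    """Percentage of questions counted as hits."""
    return 100.0 * sum(hits) / len(hits)


ar = answer_recall_hit(["Paris is the capital of France."], ["paris"])
r_full = recall_hit(["doc1", "doc2", "doc7"], ["doc1", "doc2"])
r_partial = recall_hit(["doc1", "doc7"], ["doc1", "doc2"])
overall = rate([ar, r_full, r_partial, True])
```

For HotpotQA, `recall_hit` requires both gold documents to be retrieved, so a question with only one of the two found counts as a miss.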
Experiment Setup
We leverage the open-source dense retrieval toolkit Tevatron Gao et al. (2022) for all our retrieval experiments. The base embedding model we use is bge-large-en-v1.5, a general-purpose embedding model that is not specifically trained on our test data.
| Retrieval Unit | Corpus Size | Num of Retrieval Units | Average Num of Tokens (Corpus) | Average Num of Tokens (Test Set) | Recall (R) | Answer Recall (AR) |
|---|---|---|---|---|---|---|
| Document | 5.2M | 2 | 130 | 200 | 30.01 | 47.75 |
| | | 100 | 6.5K | 10K | 74.84 | 84.67 |
| | | 200 | 13K | 20K | 79.68 | 88.34 |
| Grouped Documents | 500K | 2 | 1K | 8K | 56.30 | 72.49 |
| | | 8 | 4K | 29K | 74.71 | 84.40 |
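Table 2: Retrieval performance on HotpotQA. Similar to the findings on NQ, long-context retrieval significantly alleviates the burden on the retriever component of the full RAG framework.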
Table 1 and Table 2 show the retrieval results on NQ and HotpotQA. In the NQ dataset, we utilize three different retrieval units, ranging from shorter to longer: passage, document, and grouped documents. The tables report two kinds of average number of tokens in each retrieval unit: one for the entire corpus and one for each test set. The retrieval units for each test case can sometimes be much longer than the average size across the whole corpus, as the corpus might include some Wikipedia pages with very few words, while the test cases may focus more on longer documents. Generally, our long-context retriever (at the document level and grouped document level) uses retrieval units containing an average of 6K tokens. Using longer retrieval units has several advantages: 1) It significantly alleviates the burden on the retriever by compressing the corpus size by approximately 30 times, from 22M to 600K. The top-1 answer recall improves by about 20 points, from 52.24 to 71.69. We can use significantly fewer retrieval units to achieve comparable retrieval performance. For instance, 8 retrieval units at the grouped document level achieve similar recall to 100 retrieval units at the passage level. 2) It provides more comprehensive information to the reader. In the original passage-level RAG setup, information might be incomplete due to the chunking operation. In the HotpotQA dataset, we observe similar results. One notable difference is that in HotpotQA, the retrieval units are only at the document level and grouped document level, as HotpotQA uses only abstract paragraphs from each Wikipedia page.
| Model | Granularity | AR@1 |
|---|---|---|
| BGE-Large | 512-tokens chunk | 71.7% |
| E5-Mistral-7B | 4000-tokens chunk | 54.2% |
| E5-Mistral-7B | entire grouped retrieval unit | 23.4% |
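Table 3: Different methods to encode the long retrieval unit. Using a general embedding model and approximating by maximizing the similarity scores between the query and all chunks within the retrieval unit is better than using a long embedding model to encode the entire context.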
Encode the long retrieval unit
As discussed in Section 3.1, it’s very challenging to employ an encoder, $E_C(\cdot)$, to map the retrieval unit $g$ to a $d$-dimensional vector when $g$ is very long. Therefore, we use an approximation in our proposed system. Table 3 demonstrates that our approximation, $\mathrm{sim}(q, g) \approx \max_{g' \subseteq g} \left( E_Q(q)^{T} E_C(g') \right)$, is much more effective than encoding the entire long context directly.
We compare three methods: 1) Using the general embedding model “bge-large-en-v1.5” (Xiao et al., 2023), with $g'$ selected as text of 512-token size. 2) Using the long embedding model “E5-Mistral-7B” (Zhu et al., 2024a), with $g'$ selected as the whole document, which has an average size of 4K tokens. 3) Using the long embedding model “E5-Mistral-7B” with no approximation, encoding the entire $g$ directly, where $g$ has an average size of 6K tokens. The table shows that our approximation, which takes the maximum score between the query and each text piece from the long context, produces much better results than encoding the context directly with the long embedding model. We believe future improvements in long embedding models will further enhance our framework and reduce memory consumption.
| Method | EM |
|---|---|
| Closed-Book | |
| GPT-4-Turbo Achiam et al. (2023) | 41.2 |
| Gemini-1.5-Pro Reid et al. (2024) | 47.8 |
| Claude-3-Opus Anthropic (2024) | 49.2 |
| Fully-supervised RAG | |
| REALM Guu et al. (2020) | 40.4 |
| DPR Karpukhin et al. (2020) | 41.5 |
| RAG Lewis et al. (2020) | 44.5 |
| RETRO Borgeaud et al. (2022) | 45.5 |
| RePAQ Lewis et al. (2021) | 47.8 |
| Fusion-in-Decoder (Izacard and Grave, 2020b) | 51.4 |
| EMDR2 Singh et al. (2021) | 52.5 |
| Atlas (Izacard et al., 2022) | 64.0 |
| No Fine-tuning RAG | |
| REPLUG (Shi et al., 2023) | 45.5 |
| LongRAG (Gemini-1.5-Pro; Recall 4 units) | 58.6 |
| LongRAG (GPT-4o; Recall 4 units) | 62.7 |
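Table 4: QA results on the NQ dataset. We compare against three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where the RAG framework is used and the model is fully supervised and trained on the training data; and no fine-tuning RAG, which employs the RAG framework without any tuning.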
| Method | EM |
|---|---|
| Closed-Book | |
| Claude-3-Opus Anthropic (2024) | 32.8 |
| Gemini-1.5-Pro Reid et al. (2024) | 33.9 |
| GPT-4-Turbo Achiam et al. (2023) | 42.4 |
| Fully-supervised RAG | |
| CogQA Ding et al. (2019) | 37.1 |
| DrKIT Dhingra et al. (2020) | 42.1 |
| Transformer-XH Zhao et al. (2019) | 51.6 |
| QAMAT+ Chen et al. (2023b) | 57.6 |
| HGN Fang et al. (2019) | 59.7 |
| PathRetriever Asai et al. (2019) | 60.0 |
| HopRetrieve Li et al. (2021) | 62.1 |
| MDR Xiong et al. (2020b) | 62.3 |
| HopRetrieve-plus Li et al. (2021) | 66.5 |
| AISO Zhu et al. (2021) | 68.1 |
| COS Ma et al. (2023) | 68.2 |
| No Fine-tuning RAG | |
| DSP Khattab et al. (2022) | 51.4 |
| PromptRank Khalifa et al. (2023) | 55.7 |
| LongRAG (Gemini-1.5-Pro; Recall 8 units) | 57.5 |
| LongRAG (GPT-4o; Recall 8 units) | 64.3 |
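Table 5: QA results on the HotpotQA dev set. We compare against three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where the RAG framework is used and the model is fully supervised and trained on the training data; and no fine-tuning RAG, which employs the RAG framework without any tuning.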
4.3 Full QA Performance
We use Gemini-1.5-Pro and GPT-4o as the readers in our LongRAG framework.
The prompts we use for our experiments are in Table 6. We also refine the standard exact match rate definition to evaluate LongRAG’s performance more fairly. More details can be found in Section 6.2.
We compare our model with several groups of strong previous models as baselines. The first group is “Closed-Book”: these baselines use no retrieval component; instead, state-of-the-art LLMs are prompted to produce the final answer directly. We evaluate Gemini-1.5-Pro (Reid et al., 2024), Claude-3-Opus (Anthropic, 2024) and GPT-4-Turbo (Achiam et al., 2023) in this setting. All models are evaluated with 16-shot in-context learning and direct prompting. The second group is “Fully-supervised RAG”: these baselines involve fully-supervised fine-tuning on the training dataset. The third group is “No Fine-tuning RAG”: these baselines do not involve any supervised fine-tuning on the training dataset.
The QA results on NQ are presented in Table 4, and the QA results on HotpotQA are presented in Table 5. On the NQ dataset, LongRAG achieves an exact match rate of 62.7, which is on par with the strongest fine-tuned RAG models such as Atlas (64.0). On the HotpotQA dataset, LongRAG achieves an exact match rate of 64.3, which is also close to the SoTA fully-supervised RAG frameworks.
4.4 Ablation Studies
We perform several in-depth ablations to understand which factors are important in our LongRAG system, including "unit size" and "reader variant".
Retrieval Unit Selection
Figure 3 and Figure 4 compare different settings of LongRAG. These figures use 200 random test cases from the test set to compare different retrieval unit granularities and the optimal number of retrieval units fed to the reader. On the NQ dataset, we have two observations. First, regardless of which retrieval unit is selected, there is a turning point beyond which feeding more retrieval units into the reader becomes detrimental. This is due to the excessive burden placed on the reader, which prevents it from effectively understanding and extracting relevant information from the long context. For passage-level retrieval units, the turning point is between 100 and 200 units; for document-level retrieval units, it is between 5 and 10; and for grouped-document-level units, it is between 4 and 8. In general, the most suitable context length fed into the reader is around 30K tokens. Second, semantic integrity matters: comparing passage-level retrieval units with document-level or grouped-document-level units highlights the advantage of using longer and more complete retrieval units.
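The turning-point observation can be read as a length budget on the reader input. The sketch below is only an illustration of that idea, not the procedure used in the paper (which fixes the number of units per granularity): the function name `select_units`, the `(text, token_count)` input format, and the hard 30,000-token cutoff are all assumptions.

```python
from typing import List, Tuple


def select_units(
    ranked_units: List[Tuple[str, int]],
    token_budget: int = 30_000,  # assumed cutoff, taken from the ~30K figure above
) -> List[str]:
    """Keep top-ranked retrieval units until the next one would exceed the budget.

    ranked_units: (text, token_count) pairs, best-ranked first.
    """
    selected: List[str] = []
    used = 0
    for text, n_tokens in ranked_units:
        if used + n_tokens > token_budget:
            break  # adding this unit would overload the reader
        selected.append(text)
        used += n_tokens
    return selected


# Hypothetical unit sizes: ~4K-token grouped units vs. ~150-token passages.
grouped = [(f"group-{i}", 4_000) for i in range(10)]
passages = [(f"passage-{i}", 150) for i in range(300)]
print(len(select_units(grouped)), len(select_units(passages)))
```

With these made-up sizes, the budget admits a handful of long units or a couple of hundred passages, which matches the order of magnitude of the turning points reported above.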
Recall vs. EM
In Figure 5, we compare the relationship between retrieval recall and end performance across varying context lengths for different retrieval unit selections. We observe that, under a given length budget, feeding the reader fewer but longer retrieval units reduces the introduction of distractors or hard negatives. Consequently, the end performance does not increase monotonically with the recall score. In the future, with advancements in long embedding models and improved retrieval recall for long retrieval units, we can expect better end performance.
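For readers who want to reproduce the recall axis of such a comparison, here is a minimal sketch. It assumes that answer recall@k counts a question as a hit when any gold answer string appears (case-insensitively) in the concatenation of the top-k retrieved units; that definition, the function name `answer_recall_at_k`, and the toy data are assumptions for illustration rather than details taken from this section.

```python
from typing import List, Sequence


def answer_recall_at_k(
    retrieved: Sequence[List[str]],   # per question: unit texts, best-ranked first
    answers: Sequence[List[str]],     # per question: acceptable gold answers
    k: int,
) -> float:
    """Fraction of questions whose top-k units contain at least one gold answer."""
    if not retrieved:
        return 0.0
    hits = 0
    for units, golds in zip(retrieved, answers):
        context = " ".join(units[:k]).lower()
        if any(gold.lower() in context for gold in golds):
            hits += 1
    return hits / len(retrieved)


# Toy example with two questions.
retrieved = [
    ["ARPANET was an early packet-switching network.", "Unrelated text."],
    ["Unrelated text.", "The show broadcasts from Indianapolis."],
]
answers = [["ARPANET"], ["Indianapolis"]]
print(answer_recall_at_k(retrieved, answers, k=1))
print(answer_recall_at_k(retrieved, answers, k=2))
```

Plotting this value against exact match for each context length is one way to see that higher recall does not always translate into a higher end score.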
Reader Model
In Figure 6, we compare the performance of six different readers: Gemini-1.5-Pro, GPT-4-Turbo, GPT-4o, Claude-3-Opus, Claude-3.5-Sonnet and DeepSeek-V2-Chat. The results indicate that GPT-4o achieves the highest exact match score among the six models on the 200 test questions of the NQ dataset. This suggests that GPT-4o is the most effective long reader in the LongRAG framework. Its stronger performance can be attributed to its superior ability to process and comprehend lengthy contexts, which ensures that crucial information is accurately extracted. Therefore, we mainly report the GPT-4o results in our main table.
In addition, Gemini-1.5-Pro, GPT-4-Turbo, Claude-3-Opus, and Claude-3.5-Sonnet achieve very similar results. These state-of-the-art black-box LLMs are also effective readers within the LongRAG framework. DeepSeek-V2-Chat is one of the best open-source LLMs, but its performance degrades significantly compared to the five black-box LLMs. These experiments demonstrate that our current framework depends on the long-context understanding ability of LLMs, and that we still have a long way to go in harnessing open-source LLMs within our framework.
5 Conclusion
In this paper, we propose a new framework, LongRAG, to alleviate the imbalance between the burden of the retriever and that of the reader. The LongRAG framework consists of a “long retriever” and a “long reader” component built on top of 4K-token retrieval units. Our proposed framework can reduce the corpus size by 10 to 30 times, which greatly improves the recall of the retriever. In addition, the long retrieval unit preserves the semantic integrity of each document. We test our framework on end-to-end question answering tasks and demonstrate its strong performance without any training. We believe LongRAG can pave the way for modern RAG system design.
Limitation
There are three major limitations of our proposed framework. First, it relies on the long embedding model. Although recent studies have made progress in this direction, there is still a need for stronger long embedding models. In our work, we use an approximation to calculate the semantic score with a regular embedding model, which proves more effective than using a long embedding model. Future improvements in long embedding models could help us further enhance the performance of our system and reduce the storage size of corpus embeddings if the entire long context could be encoded directly. The second limitation is that we only use a black-box LLM as the reader. A reader that supports long input and is less affected by position bias is necessary. Currently, most open-source LLMs do not meet these requirements. The third limitation is that our grouping methods are based on hyperlinks, which are specific to the Wikipedia corpus. A more general grouping method should be considered.
References
- Achiam et al. (2023)
  Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023.
  Gpt-4 technical report.
  arXiv preprint arXiv:2303.08774.
- An et al. (2024)
  Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. 2024.
  Training-free long-context scaling of large language models.
  arXiv preprint arXiv:2402.17463.
- Anthropic (2024)
  Anthropic. 2024.
  Introducing the next generation of claude.
- Asai et al. (2019)
  Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2019.
  Learning to retrieve reasoning paths over wikipedia graph for question answering.
  arXiv preprint arXiv:1911.10470.
- Borgeaud et al. (2022)
  Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022.
  Improving language models by retrieving from trillions of tokens.
  In International conference on machine learning, pages 2206–2240. PMLR.
- Chen et al. (2017)
  Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017.
  Reading wikipedia to answer open-domain questions.
  In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.
- Chen et al. (2024)
  Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024.
  Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.
  arXiv preprint arXiv:2402.03216.
- Chen et al. (2023a)
  Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023a.
  Extending context window of large language models via positional interpolation.
  arXiv preprint arXiv:2306.15595.
- Chen et al. (2023b)
  Wenhu Chen, Pat Verga, Michiel de Jong, John Wieting, and William Cohen. 2023b.
  Augmenting pre-trained language models with qa-memory for open-domain question answering.
  In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1597–1610.
- Cheng et al. (2021)
  Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2021.
  Unitedqa: A hybrid approach for open domain question answering.
  arXiv preprint arXiv:2101.00178.
- Dai and Callan (2019)
  Zhuyun Dai and Jamie Callan. 2019.
  Deeper text understanding for ir with contextual neural language modeling.
  In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pages 985–988.
- Dao et al. (2022)
  Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022.
  Flashattention: Fast and memory-efficient exact attention with io-awareness.
  Advances in Neural Information Processing Systems, 35:16344–16359.
- Dhingra et al. (2020)
  Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, and William W Cohen. 2020.
  Differentiable reasoning over a virtual knowledge base.
  arXiv preprint arXiv:2002.10640.
- Ding et al. (2019)
  Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019.
  Cognitive graph for multi-hop reading comprehension at scale.
  In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2694–2703.
- Fang et al. (2019)
  Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019.
  Hierarchical graph network for multi-hop question answering.
  arXiv preprint arXiv:1911.03631.
- Gao et al. (2022)
  Luyu Gao, Xueguang Ma, Jimmy J. Lin, and Jamie Callan. 2022.
  Tevatron: An efficient and flexible toolkit for dense retrieval.
  arXiv preprint arXiv:2203.05765.
- Günther et al. (2023)
  Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, et al. 2023.
  Jina embeddings 2: 8192-token general-purpose text embeddings for long documents.
  arXiv preprint arXiv:2310.19923.
- Guu et al. (2020)
  Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020.
  Retrieval augmented language model pre-training.
  In International conference on machine learning, pages 3929–3938. PMLR.
- Hao et al. (2022)
  Yaru Hao, Yutao Sun, Li Dong, Zhixiong Han, Yuxian Gu, and Furu Wei. 2022.
  Structured prompting: Scaling in-context learning to 1,000 examples.
  arXiv preprint arXiv:2212.06713.
- Izacard and Grave (2020a)
  Gautier Izacard and Edouard Grave. 2020a.
  Distilling knowledge from reader to retriever for question answering.
  arXiv preprint arXiv:2012.04584.
- Izacard and Grave (2020b)
  Gautier Izacard and Edouard Grave. 2020b.
  Leveraging passage retrieval with generative models for open domain question answering.
  arXiv preprint arXiv:2007.01282.
- Izacard et al. (2022)
  Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane A. Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022.
  Few-shot learning with retrieval augmented language models.
  arXiv preprint arXiv:2208.03299.
- Jin et al. (2024)
  Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. 2024.
  Llm maybe longlm: Self-extend llm context window without tuning.
  arXiv preprint arXiv:2401.01325.
- Johnson et al. (2019)
  Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019.
  Billion-scale similarity search with gpus.
  IEEE Transactions on Big Data, 7(3):535–547.
- Karpukhin et al. (2020)
  Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020.
  Dense passage retrieval for open-domain question answering.
  arXiv preprint arXiv:2004.04906.
- Khalifa et al. (2022)
  Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. 2022.
  Few-shot reranking for multi-hop qa via language model prompting.
  arXiv preprint arXiv:2205.12650.
- Khalifa et al. (2023)
  Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. 2023.
  Few-shot reranking for multi-hop qa via language model prompting.
  arXiv preprint arXiv:2205.12650.
- Khattab et al. (2022)
  Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022.
  Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp.
  arXiv preprint arXiv:2212.14024.
- Kwiatkowski et al. (2019)
  Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019.
  Natural questions: a benchmark for question answering research.
  Transactions of the Association for Computational Linguistics, 7:453–466.
- Lewis et al. (2020)
  Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020.
  Retrieval-augmented generation for knowledge-intensive nlp tasks.
  arXiv preprint arXiv:2005.11401.
- Lewis et al. (2021)
  Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021.
  Paq: 65 million probably-asked questions and what you can do with them.
  Transactions of the Association for Computational Linguistics, 9:1098–1115.
- Li et al. (2021)
  Shaobo Li, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Chengjie Sun, Zhenzhou Ji, and Bingquan Liu. 2021.
  Hopretriever: Retrieve hops over wikipedia to answer complex questions.
  In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 13279–13287.
- Liu et al. (2024)
  Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, et al. 2024.
  E^2-llm: Efficient and extreme length extension of large language models.
  Findings of ACL 2024.
- Ma et al. (2023)
  Kaixin Ma, Hao Cheng, Yu Zhang, Xiaodong Liu, Eric Nyberg, and Jianfeng Gao. 2023.
  Chain-of-skills: A configurable model for open-domain question answering.
  In The 61st Annual Meeting Of The Association For Computational Linguistics.
- Mialon et al. (2023)
  Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023.
  Augmented language models: a survey.
  arXiv preprint arXiv:2302.07842.
- Nussbaum et al. (2024)
  Zach Nussbaum, John X Morris, Brandon Duderstadt, and Andriy Mulyar. 2024.
  Nomic embed: Training a reproducible long context text embedder.
  arXiv preprint arXiv:2402.01613.
- OpenAI (2024)
  OpenAI. 2024.
  Hello gpt4-o.
- Peng and Quesnelle (2023)
  Bowen Peng and Jeffrey Quesnelle. 2023.
  Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have.
- Peng et al. (2023)
  Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023.
  Yarn: Efficient context window extension of large language models.
  arXiv preprint arXiv:2309.00071.
- Press et al. (2021)
  Ofir Press, Noah Smith, and Mike Lewis. 2021.
  Train short, test long: Attention with linear biases enables input length extrapolation.
  In International Conference on Learning Representations.
- Qu et al. (2020)
  Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2020.
  Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering.
  arXiv preprint arXiv:2010.08191.
- Ratner et al. (2023)
  Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023.
  Parallel context windows for large language models.
  In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6383–6402, Toronto, Canada. Association for Computational Linguistics.
- Reid et al. (2024)
  Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024.
  Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.
  arXiv preprint arXiv:2403.05530.
- Saad-Falcon et al. (2024)
  Jon Saad-Falcon, Daniel Y Fu, Simran Arora, Neel Guha, and Christopher Ré. 2024.
  Benchmarking and building long-context retrieval models with loco and m2-bert.
  arXiv preprint arXiv:2402.07440.
- Shi et al. (2023)
  Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023.
  Replug: Retrieval-augmented black-box language models.
  arXiv preprint arXiv:2301.12652.
- Singh et al. (2021)
  Devendra Singh, Siva Reddy, Will Hamilton, Chris Dyer, and Dani Yogatama. 2021.
  End-to-end training of multi-document reader and retriever for open-domain question answering.
  Advances in Neural Information Processing Systems, 34:25968–25981.
- Su et al. (2021)
  Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021.
  Roformer: Enhanced transformer with rotary position embedding.
  arXiv preprint arXiv:2104.09864.
- Trivedi et al. (2022)
  Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022.
  Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions.
  arXiv preprint arXiv:2212.10509.
- Wang et al. (2023)
  Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023.
  Improving text embeddings with large language models.
  arXiv preprint arXiv:2401.00368.
- Xiao et al. (2023)
  Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023.
  C-pack: Packaged resources to advance general chinese embedding.
  arXiv preprint arXiv:2309.07597.
- Xiong et al. (2020a)
  Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020a.
  Approximate nearest neighbor negative contrastive learning for dense text retrieval.
  arXiv preprint arXiv:2007.00808.
- Xiong et al. (2020b)
  Wenhan Xiong, Xiang Lorraine Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Wen-tau Yih, Sebastian Riedel, Douwe Kiela, et al. 2020b.
  Answering complex open-domain questions with multi-hop dense retrieval.
  arXiv preprint arXiv:2009.12756.
- Yang et al. (2018)
  Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018.
  Hotpotqa: A dataset for diverse, explainable multi-hop question answering.
  arXiv preprint arXiv:1809.09600.
- Yu et al. (2021)
  Donghan Yu, Chenguang Zhu, Yuwei Fang, Wenhao Yu, Shuohang Wang, Yichong Xu, Xiang Ren, Yiming Yang, and Michael Zeng. 2021.
  Kg-fid: Infusing knowledge graph in fusion-in-decoder for open-domain question answering.
  arXiv preprint arXiv:2110.04330.
- Yu (2022)
  Wenhao Yu. 2022.
  Retrieval-augmented generation across heterogeneous knowledge.
  In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 52–58.
- Yu et al. (2023)
  Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, and Ashish Sabharwal. 2023.
  Improving language models via plug-and-play retrieval feedback.
  arXiv preprint arXiv:2305.14002.
- Zhao et al. (2019)
  Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul Bennett, and Saurabh Tiwary. 2019.
  Transformer-xh: Multi-evidence reasoning with extra hop attention.
  In International Conference on Learning Representations.
- Zhu et al. (2024a)
  Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2024a.
  Longembed: Extending embedding models for long context retrieval.
  arXiv preprint arXiv:2404.12096.
- Zhu et al. (2024b)
  Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2024b.
  PoSE: Efficient context window extension of LLMs via positional skip-wise training.
  In The Twelfth International Conference on Learning Representations.
- Zhu et al. (2021)
  Yunchang Zhu, Liang Pang, Yanyan Lan, Huawei Shen, and Xueqi Cheng. 2021.
  Adaptive information seeking for open-domain question answering.
  In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3615–3626.
Method | Prompt |
---|---|
Closed-Book | Here are some examples of questions and their corresponding answer, each with a “Question” field and an “Answer” field. Answer the question directly and don’t output other thing. “Question”: … “Answer”: … “Question”: … “Answer”: … “Question”: … “Answer”: … … “Question”: … “Answer”: … Answer the following question. “Question”: who is the owner of reading football club “Answer”: |
LongRAG | Turn 1: Go through the following context and then answer the question. The context is a list of Wikipedia documents, ordered by title: …. Each Wikipedia document contains a title field and a text field. The context is: “Title”: … “Text”: … “Title”: … “Text”: … “Title”: … “Text”: … Find the useful documents from the context, then answer the question: …. Answer the question directly. Your response should be very concise. Turn 2: You have been provided with a question and its long answer. Your task is to derive a very concise short answer, extracting a substring from the given long answer. Short answer is typically an entity without any other redundant words. It’s important to ensure that the output short answer remains as simple as possible. Here a few examples: “Question”: … “Long Answer”: … “Short Answer”: … “Question”: … “Long Answer”: … “Short Answer”: … “Question”: … “Long Answer”: … “Short Answer”: … Extract the short answer of the following question and long answer: “Question”: when did the philadelphia eagles play in the super bowl last “Long Answer”: The Philadelphia Eagles last played in the Super Bowl on February 4, 2018, in Super Bowl LII. “Short Answer”: |
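Table 6: The prompts we used for all experiments. For the closed-book method, we use 16-shot in-context examples. For LongRAG, we use a two-turn approach to extract the final answer. The first turn does not require any in-context examples and generates a longer answer, typically ranging from a few words to a few sentences. In the second turn, we use 8-shot in-context examples to calibrate and extract the exact short answer, which is typically just a few words.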
Question | Ground truth | LongRAG prediction |
---|---|---|
where does the bob and tom show broadcast from | Indianapolis , Indiana | Indianapolis |
who has given the theory of unbalanced economic growth | Hirschman | Albert O. Hirschman |
when does season 6 of the next step start | 2018 | September 29, 2018 |
what was the precursor to the present day internet | the ARPANET project | ARPANET |
Table 7: Some examples demonstrating that LongRAG has extracted aliases or different forms of the ground truth.
6 Appendix
6.1 Prompts Template for Long Context Reader
Table 6 lists the prompts used in our experiments. For the closed-book method, we use 16-shot in-context examples. For LongRAG, we use a two-turn approach to extract the final answer. In the first turn, the long retrieved context and the question are concatenated as input; we do not use any in-context examples here because the context is already around 30K tokens. Empirically, we found it beneficial to let the reader first generate a longer answer, typically ranging from a few words to a few sentences. In the second turn, we use 8-shot in-context examples to guide the reader in extracting the most important part of the long answer as the short answer, which is typically just a few words.
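The two-turn procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_turn1_prompt`, `build_turn2_prompt`, `two_turn_answer`) and the `generate` callable standing in for a long-context LLM are hypothetical, and the template strings are abridged from Table 6 without reconstructing the portions elided there with “…”.

```python
from typing import Callable, Sequence, Tuple


def build_turn1_prompt(documents: Sequence[Tuple[str, str]], question: str) -> str:
    """Turn 1: zero-shot prompt over the long retrieved context (no in-context examples)."""
    # Each document is a (title, text) pair; wording is abridged from Table 6.
    context = "\n".join(f'"Title": {title} "Text": {text}' for title, text in documents)
    return (
        "Go through the following context and then answer the question. "
        "The context is a list of Wikipedia documents, ordered by title. "
        "Each Wikipedia document contains a title field and a text field. "
        f"The context is:\n{context}\n"
        f"Find the useful documents from the context, then answer the question: {question}. "
        "Answer the question directly. Your response should be very concise."
    )


def build_turn2_prompt(
    examples: Sequence[Tuple[str, str, str]], question: str, long_answer: str
) -> str:
    """Turn 2: few-shot prompt that extracts a short answer from the long answer."""
    # Each example is a (question, long answer, short answer) triple; the paper uses 8.
    shots = "\n".join(
        f'"Question": {q} "Long Answer": {la} "Short Answer": {sa}' for q, la, sa in examples
    )
    return (
        "You have been provided with a question and its long answer. "
        "Your task is to derive a very concise short answer, extracting a substring "
        "from the given long answer. Here a few examples:\n"
        f"{shots}\n"
        "Extract the short answer of the following question and long answer:\n"
        f'"Question": {question} "Long Answer": {long_answer} "Short Answer":'
    )


def two_turn_answer(
    generate: Callable[[str], str],
    documents: Sequence[Tuple[str, str]],
    question: str,
    examples: Sequence[Tuple[str, str, str]],
) -> str:
    """Run both turns; `generate` is any function mapping a prompt to model text."""
    long_answer = generate(build_turn1_prompt(documents, question))
    return generate(build_turn2_prompt(examples, question, long_answer)).strip()
```

Keeping the reader behind a plain `generate` callable means the same two functions can be used with any long-context model client; only the first turn carries the roughly 30K-token context, so the second turn stays short even with 8 examples.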
6.2 Refined Metric
The most standard metric used in open-domain question answering tasks is EM (Exact Match), since the correct answer must be a substring within the corpus. In our framework, the reader receives a long retrieved context containing multiple documents highly related to the given query, so there is a much higher chance that an alias of the ground truth appears in the context and is extracted by the reader. As shown in Table 7, although LongRAG’s prediction doesn’t exactly match the ground truth, the prediction is clearly correct. To evaluate LongRAG’s performance more fairly, we refine the EM metric slightly: we count a prediction as an exact match if it is fewer than five tokens long (indicating that the short answer was successfully extracted as described in Section 6.1) and the ground truth is a substring of the prediction or vice versa. We have also manually verified that this refined metric indeed captures aliases or other forms of the ground truth. For the fully-supervised RAG baselines used in our paper, given that they are fine-tuned on the training data and the retrieval unit is a small snippet, we believe that the difference won’t be significant when using the refined EM.
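The refined metric can be illustrated with a short sketch. Several details here are assumptions rather than the authors' code: the function names `normalize` and `refined_em` are hypothetical, tokens are counted as whitespace-separated words, answers are normalized in the usual SQuAD style (lowercasing, removing punctuation and articles) before comparison, and a strict exact match is accepted regardless of length.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def refined_em(prediction: str, ground_truths: Sequence[str], max_tokens: int = 5) -> bool:
    """Return True if the prediction matches any ground truth under the refined EM rule."""
    pred = normalize(prediction)
    if not pred:
        return False
    # The substring rule only applies to short predictions (fewer than `max_tokens` words).
    is_short = len(pred.split()) < max_tokens
    for ground_truth in ground_truths:
        truth = normalize(ground_truth)
        if not truth:
            continue
        if pred == truth:
            return True  # standard exact match
        if is_short and (truth in pred or pred in truth):
            return True  # alias or other form of the ground truth
    return False


# The four rows of Table 7 as (prediction, ground truth) pairs.
table7 = [
    ("Indianapolis", "Indianapolis , Indiana"),
    ("Albert O. Hirschman", "Hirschman"),
    ("September 29, 2018", "2018"),
    ("ARPANET", "the ARPANET project"),
]
results = [refined_em(pred, [truth]) for pred, truth in table7]
```

Under these assumptions all four Table 7 predictions count as matches, while an unextracted long answer (a full sentence from the first turn) does not, because it exceeds the length limit. Note that the substring test is character-level, so very short predictions can match spuriously; the manual verification mentioned above is what guards against that.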
6.3 Dataset Licenses
• NQ: Apache License 2.0
• HotpotQA: CC BY-SA 4.0 License