
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Omar Khattab  Stanford University  okhattab@stanford.edu  and  Matei Zaharia  Stanford University  matei@cs.stanford.edu
(2020)
Abstract.

Recent progress in Natural Language Understanding (NLU) is driving fast-paced advances in Information Retrieval (IR), largely owed to fine-tuning deep language models (LMs) for document ranking. While remarkably effective, the ranking models based on these LMs increase computational cost by orders of magnitude over prior approaches, particularly as they must feed each query–document pair through a massive neural network to compute a single relevance score. To tackle this, we present ColBERT, a novel ranking model that adapts deep LMs (in particular, BERT) for efficient retrieval. ColBERT introduces a late interaction architecture that independently encodes the query and the document using BERT and then employs a cheap yet powerful interaction step that models their fine-grained similarity. By delaying and yet retaining this fine-granular interaction, ColBERT can leverage the expressiveness of deep LMs while simultaneously gaining the ability to pre-compute document representations offline, considerably speeding up query processing. Beyond reducing the cost of re-ranking the documents retrieved by a traditional model, ColBERT’s pruning-friendly interaction mechanism enables leveraging vector-similarity indexes for end-to-end retrieval directly from a large document collection. We extensively evaluate ColBERT using two recent passage search datasets. Results show that ColBERT’s effectiveness is competitive with existing BERT-based models (and outperforms every non-BERT baseline), while executing two orders-of-magnitude faster and requiring four orders-of-magnitude fewer FLOPs per query.

copyright: rightsretained
journalyear: 2020
copyright: acmlicensed
conference: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; July 25–30, 2020; Virtual Event, China
price: 15.00
doi: 10.1145/3397271.3401075
isbn: 978-1-4503-8016-4/20/07
Figure 1. Effectiveness (MRR@10) versus Mean Query Latency (log-scale) for a number of representative ranking models on MS MARCO Ranking (Nguyen et al., 2016). The figure also shows ColBERT. Neural re-rankers run on top of the official BM25 top-1000 results and use a Tesla V100 GPU. Methodology and detailed results are in §4.
Figure 2. Schematic diagrams illustrating query–document matching paradigms in neural IR. The figure contrasts existing approaches (sub-figures (a), (b), and (c)) with the proposed late interaction paradigm (sub-figure (d)).

1. Introduction

Over the past few years, the Information Retrieval (IR) community has witnessed the introduction of a host of neural ranking models, including DRMM (Guo et al., 2016), KNRM (Xiong et al., 2017; Dai et al., 2018), and Duet (Mitra et al., 2017; Mitra and Craswell, 2019). In contrast to prior learning-to-rank methods that rely on hand-crafted features, these models employ embedding-based representations of queries and documents and directly model local interactions (i.e., fine-granular relationships) between their contents. Among them, a recent approach has emerged that fine-tunes deep pre-trained language models (LMs) like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) for estimating relevance. By computing deeply-contextualized semantic representations of query–document pairs, these LMs help bridge the pervasive vocabulary mismatch (Zhao, 2012; Mitra et al., 2018) between documents and queries (Qiao et al., 2019). Indeed, in the span of just a few months, a number of ranking models based on BERT have achieved state-of-the-art results on various retrieval benchmarks (Nogueira and Cho, 2019; MacAvaney et al., 2019; Dai and Callan, 2019b; Yilmaz et al., 2019) and have been proprietarily adapted for deployment by Google (https://blog.google/products/search/search-language-understanding-bert/) and Bing (https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/).

However, the remarkable gains delivered by these LMs come at a steep increase in computational cost. Hofstätter et al. (Hofstätter and Hanbury, 2019) and MacAvaney et al. (MacAvaney et al., 2019) observe that BERT-based models in the literature are 100–1000$\times$ more computationally expensive than prior models—some of which are arguably not inexpensive to begin with (Ji et al., 2019). This quality–cost tradeoff is summarized by Figure 1, which compares two BERT-based rankers (Nogueira and Cho, 2019; Nogueira et al., 2019b) against a representative set of ranking models. The figure uses MS MARCO Ranking (Nguyen et al., 2016), a recent collection of 9M passages and 1M queries from Bing’s logs. It reports retrieval effectiveness (MRR@10) on the official validation set as well as average query latency (log-scale) using a high-end server that dedicates one Tesla V100 GPU per query for neural re-rankers. Following the re-ranking setup of MS MARCO, ColBERT (re-rank), the Neural Matching Models, and the Deep LMs re-rank MS MARCO’s official top-1000 documents per query. Other methods, including ColBERT (full retrieval), directly retrieve the top-1000 results from the entire collection.

As the figure shows, BERT considerably improves search precision, raising MRR@10 by almost 7% against the best previous methods; simultaneously, it increases latency by up to tens of thousands of milliseconds even with a high-end GPU. This poses a challenging tradeoff since raising query response times by as little as 100ms is known to impact user experience and even measurably diminish revenue (Kohavi et al., 2013). To tackle this problem, recent work has started exploring using Natural Language Understanding (NLU) techniques to augment traditional retrieval models like BM25 (Robertson et al., 1995). For example, Nogueira et al. (Nogueira et al., 2019c, a) expand documents with NLU-generated queries before indexing with BM25 scores and Dai & Callan (Dai and Callan, 2019a) replace BM25’s term frequency with NLU-estimated term importance. Despite successfully reducing latency, these approaches generally reduce precision substantially relative to BERT.

To reconcile efficiency and contextualization in IR, we propose ColBERT, a ranking model based on contextualized late interaction over BERT. As the name suggests, ColBERT proposes a novel late interaction paradigm for estimating relevance between a query $q$ and a document $d$. Under late interaction, $q$ and $d$ are separately encoded into two sets of contextual embeddings, and relevance is evaluated using cheap and pruning-friendly computations between both sets—that is, fast computations that enable ranking without exhaustively evaluating every possible candidate.

Figure 2 contrasts our proposed late interaction approach with existing neural matching paradigms. On the left, Figure 2 (a) illustrates representation-focused rankers, which independently compute an embedding for $q$ and another for $d$ and estimate relevance as a single similarity score between two vectors (Huang et al., 2013; Zamani et al., 2018). Moving to the right, Figure 2 (b) visualizes typical interaction-focused rankers. Instead of summarizing $q$ and $d$ into individual embeddings, these rankers model word- and phrase-level relationships across $q$ and $d$ and match them using a deep neural network (e.g., with CNNs/MLPs (Mitra et al., 2017) or kernels (Xiong et al., 2017)). In the simplest case, they feed the neural network an interaction matrix that reflects the similarity between every pair of words across $q$ and $d$. Further right, Figure 2 (c) illustrates a more powerful interaction-based paradigm, which models the interactions between words within as well as across $q$ and $d$ at the same time, as in BERT’s transformer architecture (Nogueira and Cho, 2019).

These increasingly expressive architectures are in tension. While interaction-based models (i.e., Figure 2 (b) and (c)) tend to be superior for IR tasks (Guo et al., 2019; Mitra et al., 2018), a representation-focused model—by isolating the computations between $q$ and $d$—makes it possible to pre-compute document representations offline (Zamani et al., 2018), greatly reducing the computational load per query. In this work, we observe that the fine-grained matching of interaction-based models and the pre-computation of document representations of representation-based models can be combined by retaining yet judiciously delaying the query–document interaction. Figure 2 (d) illustrates an architecture that precisely does so. As illustrated, every query embedding interacts with all document embeddings via a MaxSim operator, which computes maximum similarity (e.g., cosine similarity), and the scalar outputs of these operators are summed across query terms. This paradigm allows ColBERT to exploit deep LM-based representations while shifting the cost of encoding documents offline and amortizing the cost of encoding the query once across all ranked documents. Additionally, it enables ColBERT to leverage vector-similarity search indexes (e.g., (Johnson et al., 2017; Abuzaid et al., 2019)) to retrieve the top-$k$ results directly from a large document collection, substantially improving recall over models that only re-rank the output of term-based retrieval.

As Figure 1 illustrates, ColBERT can serve queries in tens or a few hundred milliseconds. For instance, when used for re-ranking as in “ColBERT (re-rank)”, it delivers over 170$\times$ speedup (and requires 14,000$\times$ fewer FLOPs) relative to existing BERT-based models, while being more effective than every non-BERT baseline (§4.2 & 4.3). ColBERT’s indexing—the only time it needs to feed documents through BERT—is also practical: it can index the MS MARCO collection of 9M passages in about 3 hours using a single server with four GPUs (§4.5), retaining its effectiveness with a space footprint of as little as a few tens of GiBs. Our extensive ablation study (§4.4) shows that late interaction, its implementation via MaxSim operations, and crucial design choices within our BERT-based encoders are all essential to ColBERT’s effectiveness.

Our main contributions are as follows.

  1. We propose late interaction (§3.1) as a paradigm for efficient and effective neural ranking.
  2. We present ColBERT (§3.2 & 3.3), a highly-effective model that employs novel BERT-based query and document encoders within the late interaction paradigm.
  3. We show how to leverage ColBERT both for re-ranking on top of a term-based retrieval model (§3.5) and for searching a full collection using vector similarity indexes (§3.6).
  4. We evaluate ColBERT on MS MARCO and TREC CAR, two recent passage search collections.

2. Related Work

Neural Matching Models. Over the past few years, IR researchers have introduced numerous neural architectures for ranking. In this work, we compare against KNRM (Xiong et al., 2017; Dai et al., 2018), Duet (Mitra et al., 2017; Mitra and Craswell, 2019), ConvKNRM (Dai et al., 2018), and fastText+ConvKNRM (Hofstätter et al., 2019a). KNRM proposes a differentiable kernel-pooling technique for extracting matching signals from an interaction matrix, while Duet combines signals from exact-match-based as well as embedding-based similarities for ranking. Introduced in 2018, ConvKNRM learns to match $n$-grams in the query and the document. Lastly, fastText+ConvKNRM (abbreviated fT+ConvKNRM) tackles the absence of rare words from typical word embeddings lists by adopting sub-word token embeddings.

In 2018, Zamani et al. (Zamani et al., 2018) introduced SNRM, a representation-focused IR model that encodes each query and each document as a single, sparse high-dimensional vector of “latent terms”. By producing a sparse-vector representation for each document, SNRM is able to use a traditional IR inverted index for representing documents, allowing fast end-to-end retrieval. Despite highly promising results and insights, SNRM’s effectiveness is substantially outperformed by the state of the art on the datasets with which it was evaluated (e.g., see (Yang et al., 2019; MacAvaney et al., 2019)). While SNRM employs sparsity to allow using inverted indexes, we relax this assumption and compare a (dense) BERT-based representation-focused model against our late-interaction ColBERT in our ablation experiments in §4.4. For a detailed overview of existing neural ranking models, we refer the readers to two recent surveys of the literature (Mitra et al., 2018; Guo et al., 2019).

Language Model Pretraining for IR. Recent work in NLU emphasizes the importance of pre-training language representation models in an unsupervised fashion before subsequently fine-tuning them on downstream tasks. A notable example is BERT (Devlin et al., 2018), a bi-directional transformer-based language model whose fine-tuning advanced the state of the art on various NLU benchmarks. Nogueira et al. (Nogueira and Cho, 2019), MacAvaney et al. (MacAvaney et al., 2019), and Dai & Callan (Dai and Callan, 2019b) investigate incorporating such LMs (mainly BERT, but also ELMo (Peters et al., 2018)) on different ranking datasets. As illustrated in Figure 2 (c), the common approach (and the one adopted by Nogueira et al. on MS MARCO and TREC CAR) is to feed the query–document pair through BERT and use an MLP on top of BERT’s [CLS] output token to produce a relevance score. Subsequent work by Nogueira et al. (Nogueira et al., 2019b) introduced duoBERT, which fine-tunes BERT to compare the relevance of a pair of documents given a query. Relative to their single-document BERT, this gives duoBERT a 1% MRR@10 advantage on MS MARCO while increasing the cost by at least 1.4$\times$.

BERT Optimizations. As discussed in §1, these LM-based rankers can be highly expensive in practice. While ongoing efforts in the NLU literature for distilling (Jiao et al., 2019; Tang et al., 2019), compressing (Zafrir et al., 2019), and pruning (Michel et al., 2019) BERT can be instrumental in narrowing this gap, they generally achieve significantly smaller speedups than our re-designed architecture for IR, due to their generic nature, and more aggressive optimizations often come at the cost of lower quality.

Efficient NLU-based Models. Recently, a direction emerged that employs expensive NLU computation offline. This includes doc2query (Nogueira et al., 2019c) and DeepCT (Dai and Callan, 2019a). The doc2query model expands each document with a pre-defined number of synthetic queries generated by a seq2seq transformer model that is trained to generate queries given a document. It then relies on a BM25 index for retrieval from the (expanded) documents. DeepCT uses BERT to produce the term frequency component of BM25 in a context-aware manner, essentially representing a feasible realization of the term-independence assumption with neural networks (Mitra et al., 2019). Lastly, docTTTTTquery (Nogueira et al., 2019a) is identical to doc2query except that it fine-tunes a pre-trained model (namely, T5 (Raffel et al., 2019)) for generating the predicted queries.

Concurrently with our drafting of this paper, Hofstätter et al. (Hofstätter et al., 2019b) published their Transformer-Kernel (TK) model. At a high level, TK improves the KNRM architecture described earlier: while KNRM employs kernel pooling on top of word-embedding-based interaction, TK uses a Transformer (Vaswani et al., 2017) component for contextually encoding queries and documents before kernel pooling. TK establishes a new state-of-the-art for non-BERT models on MS MARCO (Dev); however, the best non-ensemble MRR@10 it achieves is 31% while ColBERT reaches up to 36%. Moreover, due to indexing document representations offline and employing a MaxSim-based late interaction mechanism, ColBERT is much more scalable, enabling end-to-end retrieval which is not supported by TK.

Figure 3. The general architecture of ColBERT given a query $q$ and a document $d$.

3. ColBERT

ColBERT prescribes a simple framework for balancing the quality and cost of neural IR, particularly deep language models like BERT. As introduced earlier, delaying the query–document interaction can facilitate cheap neural re-ranking (i.e., through pre-computation) and even support practical end-to-end neural retrieval (i.e., through pruning via vector-similarity search). ColBERT addresses how to do so while still preserving the effectiveness of state-of-the-art models, which condition the bulk of their computations on the joint query–document pair.

Even though ColBERT’s late-interaction framework can be applied to a wide variety of architectures (e.g., CNNs, RNNs, transformers, etc.), we choose to focus this work on bi-directional transformer-based encoders (i.e., BERT) owing to their state-of-the-art effectiveness yet very high computational cost.

3.1. Architecture

Figure 3 depicts the general architecture of ColBERT, which comprises: (a) a query encoder $f_Q$, (b) a document encoder $f_D$, and (c) the late interaction mechanism. Given a query $q$ and document $d$, $f_Q$ encodes $q$ into a bag of fixed-size embeddings $E_q$ while $f_D$ encodes $d$ into another bag $E_d$. Crucially, each embedding in $E_q$ and $E_d$ is contextualized based on the other terms in $q$ or $d$, respectively. We describe our BERT-based encoders in §3.2.

Using $E_q$ and $E_d$, ColBERT computes the relevance score between $q$ and $d$ via late interaction, which we define as a summation of maximum similarity (MaxSim) operators. In particular, we find the maximum cosine similarity of each $v \in E_q$ with vectors in $E_d$, and combine the outputs via summation. Besides cosine, we also evaluate squared L2 distance as a measure of vector similarity. Intuitively, this interaction mechanism softly searches for each query term $t_q$—in a manner that reflects its context in the query—against the document’s embeddings, quantifying the strength of the “match” via the largest similarity score between $t_q$ and a document term $t_d$. Given these term scores, it then estimates the document relevance by summing the matching evidence across all query terms.

While more sophisticated matching is possible with other choices such as deep convolution and attention layers (i.e., as in typical interaction-focused models), a summation of maximum similarity computations has two distinctive characteristics. First, it stands out as a particularly cheap interaction mechanism, as we examine its FLOPs in §4.2. Second, and more importantly, it is amenable to highly-efficient pruning for top-$k$ retrieval, as we evaluate in §4.3. This enables using vector-similarity algorithms for skipping documents without materializing the full interaction matrix or even considering each document in isolation. Other cheap choices (e.g., a summation of average similarity scores, instead of maximum) are possible; however, many are less amenable to pruning. In §4.4, we conduct an extensive ablation study that empirically verifies the advantage of our MaxSim-based late interaction against alternatives.

3.2. Query & Document Encoders

Prior to late interaction, ColBERT encodes each query or document into a bag of embeddings, employing BERT-based encoders. We share a single BERT model among our query and document encoders but distinguish input sequences that correspond to queries and documents by prepending a special token [Q] to queries and another token [D] to documents.

Query Encoder. Given a textual query $q$, we tokenize it into its BERT-based WordPiece (Wu et al., 2016) tokens $q_1 q_2 ... q_l$. We prepend the token [Q] to the query. We place this token right after BERT’s sequence-start token [CLS]. If the query has fewer than a pre-defined number of tokens $N_q$, we pad it with BERT’s special [mask] tokens up to length $N_q$ (otherwise, we truncate it to the first $N_q$ tokens). This padded sequence of input tokens is then passed into BERT’s deep transformer architecture, which computes a contextualized representation of each token.

We denote the padding with masked tokens as query augmentation, a step that allows BERT to produce query-based embeddings at the positions corresponding to these masks. Query augmentation is intended to serve as a soft, differentiable mechanism for learning to expand queries with new terms or to re-weigh existing terms based on their importance for matching the query. As we show in §4.4, this operation is essential for ColBERT’s effectiveness.

Given BERT’s representation of each token, our encoder passes the contextualized output representations through a linear layer with no activations. This layer serves to control the dimension of ColBERT’s embeddings, producing $m$-dimensional embeddings for the layer’s output size $m$. As we discuss later in more detail, we typically fix $m$ to be much smaller than BERT’s fixed hidden dimension.

While ColBERT’s embedding dimension has limited impact on the efficiency of query encoding, this step is crucial for controlling the space footprint of documents, as we show in §4.5. In addition, it can have a significant impact on query execution time, particularly the time taken for transferring the document representations onto the GPU from system memory (where they reside before processing a query). In fact, as we show in §4.2, gathering, stacking, and transferring the embeddings from CPU to GPU can be the most expensive step in re-ranking with ColBERT. Finally, the output embeddings are normalized so each has L2 norm equal to one. The result is that the dot-product of any two embeddings becomes equivalent to their cosine similarity, falling in the $[-1, 1]$ range.

Document Encoder. Our document encoder has a very similar architecture. We first segment a document $d$ into its constituent tokens $d_1 d_2 ... d_m$, to which we prepend BERT’s start token [CLS] followed by our special token [D] that indicates a document sequence. Unlike queries, we do not append [mask] tokens to documents. After passing this input sequence through BERT and the subsequent linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a pre-defined list. This filtering is meant to reduce the number of embeddings per document, as we hypothesize that (even contextualized) embeddings of punctuation are unnecessary for effectiveness.

In summary, given $q = q_0 q_1 ... q_l$ and $d = d_0 d_1 ... d_n$, we compute the bags of embeddings $E_q$ and $E_d$ in the following manner, where $\#$ refers to the [mask] tokens:

(1)  $E_q := \texttt{Normalize}(\;\texttt{CNN}(\;\texttt{BERT}(``[Q]q_0q_1...q_l\#\#...\#")\;)\;)$
(2)  $E_d := \texttt{Filter}(\;\texttt{Normalize}(\;\texttt{CNN}(\;\texttt{BERT}(``[D]d_0d_1...d_n")\;)\;)\;)$
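
To ground Equations 1 and 2, the following is a minimal PyTorch sketch of both encoders, assuming the HuggingFace transformers library. The class name, the use of BERT’s [unused0]/[unused1] vocabulary slots as the [Q]/[D] markers, and the exact punctuation filter are illustrative assumptions rather than details fixed by the text above:

```python
import string
import torch
import torch.nn.functional as F
from torch import nn
from transformers import BertModel, BertTokenizer

class ColBERTEncoder(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, dim=128, query_maxlen=32):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.tok = BertTokenizer.from_pretrained("bert-base-uncased")
        self.linear = nn.Linear(self.bert.config.hidden_size, dim, bias=False)
        self.query_maxlen = query_maxlen
        # Repurpose two unused vocabulary slots as the [Q] and [D] markers.
        self.q_marker = self.tok.convert_tokens_to_ids("[unused0]")
        self.d_marker = self.tok.convert_tokens_to_ids("[unused1]")

    def _run(self, ids):
        hidden = self.bert(torch.tensor([ids])).last_hidden_state  # 1 x L x H
        return F.normalize(self.linear(hidden), p=2, dim=2)[0]     # L x dim

    def encode_query(self, q: str) -> torch.Tensor:
        body = self.tok.encode(q, add_special_tokens=False)
        ids = [self.tok.cls_token_id, self.q_marker] + body[: self.query_maxlen - 3]
        ids.append(self.tok.sep_token_id)
        # Query augmentation (Eq. 1): pad to N_q with [MASK] tokens.
        ids += [self.tok.mask_token_id] * (self.query_maxlen - len(ids))
        return self._run(ids)  # N_q x dim

    def encode_doc(self, d: str) -> torch.Tensor:
        body = self.tok.encode(d, add_special_tokens=False)
        ids = [self.tok.cls_token_id, self.d_marker] + body + [self.tok.sep_token_id]
        E_d = self._run(ids)
        # Filter (Eq. 2): drop embeddings of punctuation tokens.
        tokens = self.tok.convert_ids_to_tokens(ids)
        keep = [i for i, t in enumerate(tokens) if t not in string.punctuation]
        return E_d[keep]  # n_kept x dim
```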

3.3. Late Interaction

Given the representation of a query $q$ and a document $d$, the relevance score of $d$ to $q$, denoted as $S_{q,d}$, is estimated via late interaction between their bags of contextualized embeddings. As mentioned before, this is conducted as a sum of maximum similarity computations, namely cosine similarity (implemented as dot-products due to the embedding normalization) or squared L2 distance.

(3)  $S_{q,d} := \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^{T}$
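
In code, Equation 3 reduces to a few tensor operations; a minimal sketch, assuming `E_q` ($N_q \times m$) and `E_d` ($n \times m$) are the row-normalized bags from Equations 1 and 2:

```python
import torch

def late_interaction_score(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """Eq. 3: for each query embedding, take the max dot-product over all
    document embeddings (MaxSim), then sum over query terms.

    E_q: N_q x m; E_d: n x m. With L2-normalized rows, each dot product
    below is a cosine similarity.
    """
    sim = E_q @ E_d.T                    # N_q x n interaction matrix
    return sim.max(dim=1).values.sum()   # MaxSim per query term, summed
```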

ColBERT is differentiable end-to-end. We fine-tune the BERT encoders and train from scratch the additional parameters (i.e., the linear layer and the [Q] and [D] markers’ embeddings) using the Adam (Kingma and Ba, 2014) optimizer. Notice that our interaction mechanism has no trainable parameters. Given a triple $\langle q, d^+, d^- \rangle$ with query $q$, positive document $d^+$ and negative document $d^-$, ColBERT is used to produce a score for each document individually and is optimized via pairwise softmax cross-entropy loss over the computed scores of $d^+$ and $d^-$.
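
A single training step on one triple $\langle q, d^+, d^- \rangle$ might then look as follows. This is a sketch that reuses the illustrative `encoder` and `late_interaction_score` from the previous code blocks, not the paper’s exact training loop:

```python
import torch

# Hedged sketch: `encoder` is an instance of the illustrative ColBERTEncoder.
optimizer = torch.optim.Adam(encoder.parameters(), lr=3e-6)

def train_step(q: str, d_pos: str, d_neg: str) -> float:
    E_q = encoder.encode_query(q)
    s_pos = late_interaction_score(E_q, encoder.encode_doc(d_pos))
    s_neg = late_interaction_score(E_q, encoder.encode_doc(d_neg))
    # Pairwise softmax cross-entropy: treat [s_pos, s_neg] as 2-way logits
    # with the positive document as the correct "class".
    logits = torch.stack([s_pos, s_neg]).unsqueeze(0)  # 1 x 2
    loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```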

3.4. Offline Indexing: Computing & Storing Document Embeddings

By design, ColBERT isolates almost all of the computations between queries and documents, largely to enable pre-computing document representations offline. At a high level, our indexing procedure is straightforward: we proceed over the documents in the collection in batches, running our document encoder $f_D$ on each batch and storing the output embeddings per document. Although indexing a set of documents is an offline process, we incorporate a few simple optimizations for enhancing the throughput of indexing. As we show in §4.5, these optimizations can considerably reduce the offline cost of indexing.

To begin with, we exploit multiple GPUs, if available, for faster encoding of batches of documents in parallel. When batching, we pad all documents to the maximum length of a document within the batch (the public BERT implementations we saw simply pad to a pre-defined length). To make capping the sequence length on a per-batch basis more effective, our indexer proceeds through documents in groups of $B$ (e.g., $B =$ 100,000) documents. It sorts these documents by length and then feeds batches of $b$ (e.g., $b =$ 128) documents of comparable length through our encoder. This length-based bucketing is sometimes referred to as a BucketIterator in some libraries (e.g., allenNLP). Lastly, while most computations occur on the GPU, we found that a non-trivial portion of the indexing time is spent on pre-processing the text sequences, primarily BERT’s WordPiece tokenization. Exploiting that these operations are independent across documents in a batch, we parallelize the pre-processing across the available CPU cores.
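
A sketch of this indexing loop follows, with `encoder` standing in for the document encoder $f_D$ applied to a padded batch and `tokenize` for the CPU-bound WordPiece pre-processing; both are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

def index_group(docs, encoder, tokenize, b=128):
    """Sketch of ColBERT's indexing throughput optimizations for one group
    of up to B raw document strings.

    tokenize must be a picklable (top-level) function; documents are
    independent, so pre-processing parallelizes across CPU cores.
    """
    with ProcessPoolExecutor() as pool:
        tokenized = list(pool.map(tokenize, docs))
    # Sort by length so each batch pads only to its own maximum length.
    tokenized.sort(key=len)
    all_embeddings = []
    for i in range(0, len(tokenized), b):
        batch = tokenized[i : i + b]           # b docs of comparable length
        all_embeddings.extend(encoder(batch))  # pad to max length in batch
    return all_embeddings
```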

Once the document representations are produced, they are saved to disk using 32-bit or 16-bit values to represent each dimension. As we describe in §3.5 and 3.6, these representations are either simply loaded from disk for ranking or are subsequently indexed for vector-similarity search, respectively.

3.5. Top-$k$ Re-ranking with ColBERT

Recall that ColBERT can be used for re-ranking the output of another retrieval model, typically a term-based model, or directly for end-to-end retrieval from a document collection. In this section, we discuss how we use ColBERT for ranking a small set of $k$ (e.g., $k = 1000$) documents given a query $q$. Since $k$ is small, we rely on batch computations to exhaustively score each document (unlike our approach in §3.6). To begin with, our query serving sub-system loads the indexed document representations into memory, representing each document as a matrix of embeddings.

Given a query $q$, we compute its bag of contextualized embeddings $E_q$ (Equation 1) and, concurrently, gather the document representations into a 3-dimensional tensor $D$ consisting of $k$ document matrices. We pad the $k$ documents to their maximum length to facilitate batched operations, and move the tensor $D$ to the GPU’s memory. On the GPU, we compute a batch dot-product of $E_q$ and $D$, possibly over multiple mini-batches. The output materializes a 3-dimensional tensor that is a collection of cross-match matrices between $q$ and each document. To compute the score of each document, we reduce its matrix across document terms via a max-pool (i.e., representing an exhaustive implementation of our MaxSim computation) and reduce across query terms via a summation. Finally, we sort the $k$ documents by their total scores.
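
In tensor form, this batched scoring might be written as follows; a sketch where `D` stacks the $k$ padded document matrices and `mask` (our own addition) flags non-padding positions:

```python
import torch

def rerank_scores(E_q: torch.Tensor, D: torch.Tensor, mask: torch.Tensor):
    """Score k candidate documents against one query in a single batch.

    E_q:  N_q x m query embeddings.
    D:    k x n_max x m padded document embeddings (on the GPU).
    mask: k x n_max boolean, True at real (non-padding) positions.
    """
    sim = torch.einsum("qm,knm->kqn", E_q, D)        # k x N_q x n_max
    sim = sim.masked_fill(~mask.unsqueeze(1), -1e4)  # ignore padding
    scores = sim.max(dim=2).values.sum(dim=1)        # MaxSim, then sum: k scores
    return scores.sort(descending=True)              # (sorted scores, indices)
```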

Relative to existing neural rankers (especially, but not exclusively, BERT-based ones), this computation is so cheap that, in fact, its cost is dominated by the cost of gathering and transferring the pre-computed embeddings. To illustrate, ranking $k$ documents via typical BERT rankers requires feeding BERT $k$ different inputs, each of length $l = |q| + |d_i|$ for query $q$ and documents $d_i$, where attention has quadratic cost in the length of the sequence. In contrast, ColBERT feeds BERT only a single, much shorter sequence of length $l = |q|$. Consequently, ColBERT is not only cheaper, it also scales much better with $k$, as we examine in §4.2.

3.6. End-to-end Top-$k$ Retrieval with ColBERT

As mentioned before, ColBERT’s late-interaction operator is specifically designed to enable end-to-end retrieval from a large collection, largely to improve recall relative to term-based retrieval approaches. This section is concerned with cases where the number of documents to be ranked is too large for exhaustive evaluation of each possible candidate document, particularly when we are only interested in the highest scoring ones. Concretely, we focus here on retrieving the top-$k$ results directly from a large document collection with $N$ (e.g., $N =$ 10,000,000) documents, where $k \ll N$.

To do so, we leverage the pruning-friendly nature of the MaxSim operations at the backbone of late interaction. Instead of applying MaxSim between one of the query embeddings and all of one document’s embeddings, we can use fast vector-similarity data structures to efficiently conduct this search between the query embedding and all document embeddings across the full collection. For this, we employ an off-the-shelf library for large-scale vector-similarity search, namely faiss (Johnson et al., 2017) from Facebook (https://github.com/facebookresearch/faiss). In particular, at the end of offline indexing (§3.4), we maintain a mapping from each embedding to its document of origin and then index all document embeddings into faiss.

Subsequently, when serving queries, we use a two-stage procedure to retrieve the top-$k$ documents from the entire collection. Both stages rely on ColBERT’s scoring: the first is an approximate stage aimed at filtering while the second is a refinement stage. For the first stage, we concurrently issue $N_q$ vector-similarity queries (corresponding to each of the embeddings in $E_q$) onto our faiss index. This retrieves the top-$k'$ (e.g., $k' = k/2$) matches for that vector over all document embeddings. We map each of those to its document of origin, producing $N_q \times k'$ document IDs, only $K \leq N_q \times k'$ of which are unique. These $K$ documents likely contain one or more embeddings that are highly similar to the query embeddings. For the second stage, we refine this set by exhaustively re-ranking only those $K$ documents in the usual manner described in §3.5.
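
Putting the two stages together, the procedure can be sketched as follows, assuming a trained faiss index over all document embeddings, an `emb2did` array mapping each embedding row to its originating document ID (the mapping maintained in §3.4), and a `rerank` function implementing the exhaustive scoring of §3.5:

```python
import numpy as np

def end_to_end_retrieve(E_q, faiss_index, emb2did, rerank, k=1000):
    """Two-stage retrieval: approximate filtering, then exact re-ranking.

    E_q: N_q x m query embeddings (numpy float32).
    """
    k_prime = k // 2
    # Stage 1: one vector-similarity query per query embedding.
    _, ids = faiss_index.search(E_q, k_prime)     # N_q x k' embedding ids
    candidates = np.unique(emb2did[ids.ravel()])  # K <= N_q * k' unique docs
    # Stage 2: exhaustively re-rank only the K candidates (Section 3.5).
    return rerank(E_q, candidates)[:k]
```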

In our faiss-based implementation, we use an IVFPQ index (“inverted file with product quantization”). This index partitions the embedding space into $P$ (e.g., $P = 1000$) cells based on $k$-means clustering and then assigns each document embedding to its nearest cell based on the selected vector-similarity metric. For serving queries, when searching for the top-$k'$ matches for a single query embedding, only the nearest $p$ (e.g., $p = 10$) partitions are searched. To improve memory efficiency, every embedding is divided into $s$ (e.g., $s = 16$) sub-vectors, each represented using one byte. Moreover, the index conducts the similarity computations in this compressed domain, leading to cheaper computations and thus faster search.
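
For concreteness, constructing and querying such an index with faiss might look like the following sketch, using the example parameter values from the text and stand-in data:

```python
import faiss
import numpy as np

m, P, s, p = 128, 1000, 16, 10   # dim, partitions, sub-vectors, probed cells

quantizer = faiss.IndexFlatL2(m)                 # coarse k-means cell assignment
index = faiss.IndexIVFPQ(quantizer, m, P, s, 8)  # 8 bits (one byte) per sub-vector

doc_embeddings = np.random.rand(100_000, m).astype("float32")  # stand-in data
index.train(doc_embeddings)  # learn the P cells and the PQ codebooks
index.add(doc_embeddings)    # assign and compress every embedding

index.nprobe = p             # search only the p nearest partitions per query
distances, ids = index.search(doc_embeddings[:1], 1000)  # top-k' for one vector
```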

4. Experimental Evaluation

We now turn our attention to empirically testing ColBERT, addressing the following research questions.

RQ1: In a typical re-ranking setup, how well can ColBERT bridge the existing gap (highlighted in §1) between highly-efficient and highly-effective neural models? (§4.2)

RQ2: Beyond re-ranking, can ColBERT effectively support end-to-end retrieval directly from a large collection? (§4.3)

RQ3: What does each component of ColBERT (e.g., late interaction, query augmentation) contribute to its quality? (§4.4)

RQ4: What are ColBERT’s indexing-related costs in terms of offline computation and memory overhead? (§4.5)

Method | MRR@10 (Dev) | MRR@10 (Eval) | Re-ranking Latency (ms) | FLOPs/query
BM25 (official) | 16.7 | 16.5 | - | -
KNRM | 19.8 | 19.8 | 3 | 592M (0.085$\times$)
Duet | 24.3 | 24.5 | 22 | 159B (23$\times$)
fastText+ConvKNRM | 29.0 | 27.7 | 28 | 78B (11$\times$)
BERT$_{\text{base}}$ (Nogueira and Cho, 2019) | 34.7 | - | 10,700 | 97T (13,900$\times$)
BERT$_{\text{base}}$ (our training) | 36.0 | - | 10,700 | 97T (13,900$\times$)
BERT$_{\text{large}}$ (Nogueira and Cho, 2019) | 36.5 | 35.9 | 32,900 | 340T (48,600$\times$)
ColBERT (over BERT$_{\text{base}}$) | 34.9 | 34.9 | 61 | 7B (1$\times$)

Table 1. “Re-ranking” results on MS MARCO. Each neural model re-ranks the official top-1000 results produced by BM25. Latency is reported for re-ranking only. To obtain the end-to-end latency in Figure 1, we add the BM25 latency from Table 2.
Method | MRR@10 (Dev) | MRR@10 (Local Eval) | Latency (ms) | Recall@50 | Recall@200 | Recall@1000
BM25 (official) | 16.7 | - | - | - | - | 81.4
BM25 (Anserini) | 18.7 | 19.5 | 62 | 59.2 | 73.8 | 85.7
doc2query | 21.5 | 22.8 | 85 | 64.4 | 77.9 | 89.1
DeepCT | 24.3 | - | 62 (est.) | 69 (Dai and Callan, 2019a) | 82 (Dai and Callan, 2019a) | 91 (Dai and Callan, 2019a)
docTTTTTquery | 27.7 | 28.4 | 87 | 75.6 | 86.9 | 94.7
ColBERT$_{\text{L2}}$ (re-rank) | 34.8 | 36.4 | - | 75.3 | 80.5 | 81.4
ColBERT$_{\text{L2}}$ (end-to-end) | 36.0 | 36.7 | 458 | 82.9 | 92.3 | 96.8

Table 2. End-to-end retrieval results on MS MARCO. Each model retrieves the top-1000 documents per query directly from the entire 8.8M document collection.

4.1. Methodology

4.1.1. Datasets & Metrics

Similar to related work (Nogueira et al., 2019c; Dai and Callan, 2019a; Nogueira et al., 2019b), we conduct our experiments on the MS MARCO Ranking (Nguyen et al., 2016) (henceforth, MS MARCO) and TREC Complex Answer Retrieval (TREC-CAR) (Dietz et al., 2017) datasets. Both of these recent datasets provide large training data of the scale that facilitates training and evaluating deep neural networks. We describe both in detail below.

MS MARCO. MS MARCO is a dataset (and a corresponding competition) introduced by Microsoft in 2016 for reading comprehension and adapted in 2018 for retrieval. It is a collection of 8.8M passages from Web pages, which were gathered from Bing’s results to 1M real-world queries. Each query is associated with sparse relevance judgements of one (or very few) documents marked as relevant and no documents explicitly indicated as irrelevant. Per the official evaluation, we use MRR@10 to measure effectiveness.

We use three sets of queries for evaluation. The official development and evaluation sets contain roughly 7k queries. However, the relevance judgements of the evaluation set are held-out by Microsoft and effectiveness results can only be obtained by submitting to the competition’s organizers. We submitted our main re-ranking ColBERT model for the results in §4.2. In addition, the collection includes roughly 55k queries (with labels) that are provided as additional validation data. We re-purpose a random sample of 5k queries among those (i.e., ones not in our development or training sets) as a “local” evaluation set. Along with the official development set, we use this held-out set for testing our models as well as baselines in §4.3. We do so to avoid submitting multiple variants of the same model at once, as the organizers discourage too many submissions by the same team.

TREC CAR. Introduced by Dietz et al. (Dietz et al., 2017) in 2017, TREC CAR is a synthetic dataset based on Wikipedia that consists of about 29M passages. Similar to related work (Nogueira and Cho, 2019), we use the first four of five pre-defined folds for training and the fifth for validation. This amounts to roughly 3M queries generated by concatenating the title of a Wikipedia page with the heading of one of its sections. That section’s passages are marked as relevant to the corresponding query. Our evaluation is conducted on the test set used in TREC 2017 CAR, which contains 2,254 queries.

4.1.2. Implementation

Our ColBERT models are implemented using Python 3 and PyTorch 1. We use the popular transformers library (https://github.com/huggingface/transformers) for the pre-trained BERT model. Similar to (Nogueira and Cho, 2019), we fine-tune all ColBERT models with learning rate $3 \times 10^{-6}$ with a batch size of 32. We fix the number of embeddings per query at $N_q = 32$. We set our ColBERT embedding dimension $m$ to be 128; §4.5 demonstrates ColBERT’s robustness to a wide range of embedding dimensions.

For MS MARCO, we initialize the BERT components of the ColBERT query and document encoders using Google’s official pre-trained BERT$_{\text{base}}$ model. Further, we train all models for 200k iterations. For TREC CAR, we follow related work (Nogueira and Cho, 2019; Dai and Callan, 2019a) and use a different pre-trained model to the official ones. To explain, the official BERT models were pre-trained on Wikipedia, which is the source of TREC CAR’s training and test sets. To avoid leaking test data into train, Nogueira and Cho (Nogueira and Cho, 2019) pre-train a randomly-initialized BERT model on the Wiki pages corresponding to the training subset of TREC CAR. They release their BERT$_{\text{large}}$ pre-trained model, which we fine-tune for ColBERT’s experiments on TREC CAR. Since fine-tuning this model is significantly slower than BERT$_{\text{base}}$, we train on TREC CAR for only 125k iterations.

In our re-ranking results, unless stated otherwise, we use 4 bytes per dimension in our embeddings and employ cosine as our vector-similarity function. For end-to-end ranking, we use (squared) L2 distance, as we found our faiss index was faster at L2-based retrieval. For our faiss index, we set the number of partitions to $P =$ 2,000, and search the nearest $p = 10$ to each query embedding to retrieve $k' = k = 1000$ document vectors per query embedding. We divide each embedding into $s = 16$ sub-vectors, each encoded using one byte. To represent the index used for the second stage of our end-to-end retrieval procedure, we use 16-bit values per dimension.

4.1.3. Hardware & Time Measurements

To evaluate the latency of neural re-ranking models in §4.2, we use a single Tesla V100 GPU that has 32 GiBs of memory on a server with two Intel Xeon Gold 6132 CPUs, each with 14 physical cores (24 hyperthreads), and 469 GiBs of RAM. For the mostly CPU-based retrieval experiments in §4.3 and the indexing experiments in §4.5, we use another server with the same CPU and system memory specifications but which has four Titan V GPUs attached, each with 12 GiBs of memory. Across all experiments, only one GPU is dedicated per query for retrieval (i.e., for methods with neural computations) but we use up to all four GPUs during indexing.

4.2. Quality–Cost Tradeoff: Top-$k$ Re-ranking

In this section, we examine ColBERT’s efficiency and effectiveness at re-ranking the top-k𝑘k results extracted by a bag-of-words retrieval model, which is the most typical setting for testing and deploying neural ranking models. We begin with the MS MARCO dataset. We compare against KNRM, Duet, and fastText+ConvKNRM, a representative set of neural matching models that have been previously tested on MS MARCO. In addition, we compare against the natural adaptation of BERT for ranking by Nogueira and Cho (Nogueira and Cho, 2019), in particular, BERTbasebase{}_{\textnormal{base}} and its deeper counterpart BERTlargelarge{}_{\textnormal{large}}. We also report results for “BERTbasebase{}_{\textnormal{base}} (our training)”, which is based on Nogueira and Cho’s base model (including hyperparameters) but is trained with the same loss function as ColBERT (§3.3) for 200k iterations, allowing for a more direct comparison of the results.

We report the competition’s official metric, namely MRR@10, on the validation set (Dev) and the evaluation set (Eval). We also report the re-ranking latency, which we measure using a single Tesla V100 GPU, and the FLOPs per query for each neural ranking model. For ColBERT, our reported latency subsumes the entire computation from gathering the document representations, moving them to the GPU, and tokenizing then encoding the query, to applying late interaction to compute document scores. For the baselines, we measure the scoring computations on the GPU and exclude the CPU-based text preprocessing (similar to (Hofstätter and Hanbury, 2019)). In principle, the baselines can pre-compute the majority of this preprocessing (e.g., document tokenization) offline and parallelize the rest across documents online, leaving only a negligible cost. We estimate the FLOPs per query of each model using the torchprofile library (https://github.com/mit-han-lab/torchprofile).
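As a rough illustration of that measurement, the sketch below uses torchprofile’s profile_macs on a stand-in model; the model and inputs are placeholders, and the 1 MAC ≈ 2 FLOPs conversion is a common approximation rather than a detail stated here.

```python
# A hedged sketch of per-query FLOPs estimation with torchprofile.
import torch
from torchprofile import profile_macs

model = torch.nn.Linear(768, 768)        # placeholder for a neural ranking model
inputs = torch.randn(1, 768)             # placeholder input features
macs = profile_macs(model, inputs)       # multiply-accumulate ops, counted via tracing
flops = 2 * macs                         # 1 MAC is commonly counted as 2 FLOPs
print(f"~{flops / 1e6:.2f} MFLOPs per forward pass")
```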

We now proceed to study the results, which are reported in Table 1. To begin with, we notice the fast progress from KNRM in 2017 to the BERT-based models in 2019, manifesting itself in an increase of over 16% in MRR@10. As described in §1, the simultaneous increase in computational cost is difficult to miss. Judging by their rather monotonic pattern of increasingly larger cost and higher effectiveness, these results appear to paint a picture where expensive models are necessary for high-quality ranking.

In contrast with this trend, ColBERT (which employs late interaction over BERT-base) performs no worse than the original adaptation of BERT-base for ranking by Nogueira and Cho (Nogueira and Cho, 2019; Nogueira et al., 2019b) and is only marginally less effective than BERT-large and our training of BERT-base (described above). While highly competitive in effectiveness, ColBERT is orders of magnitude cheaper than BERT-base, in particular, by over 170× in latency and 13,900× in FLOPs. This highlights the expressiveness of our proposed late interaction mechanism, particularly when coupled with a powerful pre-trained LM like BERT. While ColBERT’s re-ranking latency is slightly higher than the non-BERT re-ranking models shown (i.e., by tens of milliseconds), this difference is explained by the time it takes to gather, stack, and transfer the document embeddings to the GPU. In particular, the query encoding and interaction in ColBERT consume only 13 milliseconds of its total execution time. We note that ColBERT’s latency and FLOPs can be considerably reduced by padding queries to a shorter length, using smaller vector dimensions (whose effect on MRR@10 is tested in §4.5), employing quantization of the document vectors, and storing the embeddings on GPU if sufficient memory exists. We leave these directions for future work.
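To make the cost structure concrete, here is a minimal PyTorch sketch of the late interaction (MaxSim) step over pre-computed document embeddings; the shapes, padding, and normalization details are assumptions for illustration, not the released implementation.

```python
# A minimal MaxSim sketch: for each query embedding, take the maximum
# cosine similarity over the document's embeddings, then sum over the query.
import torch

def late_interaction(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Q: (num_query_tokens, dim); D: (num_docs, num_doc_tokens, dim).
    Both are assumed L2-normalized, so inner products are cosine similarities."""
    sims = torch.einsum('qd,ntd->nqt', Q, D)   # all query/doc token similarities
    return sims.max(dim=2).values.sum(dim=1)   # MaxSim per query token, summed

Q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)         # one query
D = torch.nn.functional.normalize(torch.randn(1000, 180, 128), dim=-1)  # top-1000 docs
scores = late_interaction(Q, D)                # one relevance score per document
```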

Figure 4. FLOPs (in millions) and MRR@10 as functions of the re-ranking depth k. Since the official BM25 ranking is not ordered, the initial top-k retrieval is conducted with Anserini’s BM25.

Diving deeper into the quality–cost tradeoff between BERT and ColBERT, Figure 4 demonstrates the relationship between FLOPs and effectiveness (MRR@10) as a function of the re-ranking depth k when re-ranking the top-k results by BM25, comparing ColBERT and BERT-base (our training). We conduct this experiment on MS MARCO (Dev). We note here that as the official top-1000 ranking does not provide the BM25 order (and also lacks documents beyond the top-1000 per query), the models in this experiment re-rank the Anserini (Yang et al., 2018) toolkit’s BM25 output. Consequently, both MRR@10 values at k = 1000 are slightly higher than those reported in Table 1.

Studying the results in Figure 4, we notice that not only is ColBERT much cheaper than BERT for the same model size (i.e., a 12-layer “base” transformer encoder), it also scales better with the number of ranked documents. In part, this is because ColBERT only needs to process the query once, irrespective of the number of documents evaluated. For instance, at k = 10, BERT requires nearly 180× more FLOPs than ColBERT; at k = 1000, BERT’s overhead jumps to 13,900×. It then reaches 23,000× at k = 2000. In fact, our informal experimentation shows that this orders-of-magnitude gap in FLOPs makes it practical to run ColBERT entirely on the CPU, although CPU-based re-ranking lies outside our scope.

Method               MAP    MRR@10
BM25 (Anserini)      15.3   -
doc2query            18.1   -
DeepCT               24.6   33.2
BM25 + BERT-base     31.0   -
BM25 + BERT-large    33.5   -
BM25 + ColBERT       31.3   44.3
Table 3. Results on TREC CAR.

Having studied our results on MS MARCO, we now consider TREC CAR, whose official metric is MAP. Results are summarized in Table 3, which includes a number of important baselines (BM25, doc2query, and DeepCT) in addition to re-ranking baselines that have been tested on this dataset. These results directly mirror those with MS MARCO.

4.3. End-to-end Top-k Retrieval

Beyond cheap re-ranking, ColBERT is amenable to top-k retrieval directly from a full collection. Table 2 considers full retrieval, wherein each model retrieves the top-1000 documents directly from MS MARCO’s 8.8M documents per query. In addition to MRR@10 and latency in milliseconds, the table reports Recall@50, Recall@200, and Recall@1000, important metrics for a full-retrieval model that essentially filters down a large collection on a per-query basis.

We compare against BM25, in particular MS MARCO’s official BM25 ranking as well as a well-tuned baseline based on the Anserini toolkit (http://anserini.io/). While many other traditional models exist, we are not aware of any that substantially outperform Anserini’s BM25 implementation (e.g., see RM3 in (Nogueira et al., 2019c), LMDir in (Dai and Callan, 2019a), or Microsoft’s proprietary feature-based RankSVM on the leaderboard).

We also compare against doc2query, DeepCT, and docTTTTTquery. All three rely on a traditional bag-of-words model (primarily BM25) for retrieval. Crucially, however, they re-weigh the frequency of terms per document and/or expand the set of terms in each document before building the BM25 index. In particular, doc2query expands each document with a pre-defined number of synthetic queries generated by a seq2seq transformer model (which docTTTTTquery replaces with a pre-trained language model, T5 (Raffel et al., 2019)). In contrast, DeepCT uses BERT to produce the term frequency component of BM25 in a context-aware manner.

For the latency of Anserini’s BM25, doc2query, and docTTTTTquery, we use the authors’ (Nogueira et al., 2019c, a) Anserini-based implementation. While this implementation supports multi-threading, it only utilizes parallelism across different queries. We thus report single-threaded latency for these models, noting that simply parallelizing their computation over shards of the index can substantially decrease their already-low latency. For DeepCT, we only estimate its latency using that of BM25 (as denoted by (est.) in the table), since DeepCT re-weighs BM25’s term frequency without modifying the index otherwise. (In practice, a myriad of reasons could still cause DeepCT’s latency to differ slightly from BM25’s. For instance, the top-k pruning strategy employed, if any, could interact differently with a changed distribution of scores.)

As discussed in §4.1, we use ColBERT-L2 for end-to-end retrieval, which employs negative squared L2 distance as its vector-similarity function. For its latency, we measure the time for faiss-based candidate filtering and the subsequent re-ranking. In this experiment, faiss uses all available CPU cores.
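A sketch of this two-stage procedure, under assumed data structures (a mapping from faiss vector ids to document ids, and per-document embedding matrices held in memory), might look as follows; it is illustrative, not the released implementation.

```python
# Stage 1: approximate candidate filtering with faiss; Stage 2: exact scoring
# with negative squared L2 MaxSim over the candidates' full embedding matrices.
import torch

def end_to_end_retrieve(query_emb, faiss_index, emb_to_doc, doc_embeddings, k=1000):
    """query_emb: (Nq, dim) float32 numpy array of one query's embeddings;
    emb_to_doc[i] gives the document id owning faiss vector i (assumed);
    doc_embeddings[doc_id] is that document's (num_tokens, dim) tensor (assumed)."""
    _, ids = faiss_index.search(query_emb, k)             # k' = k vectors per embedding
    candidates = {emb_to_doc[i] for i in ids.ravel() if i != -1}
    Q = torch.from_numpy(query_emb)
    scored = []
    for doc_id in candidates:
        dists = torch.cdist(Q, doc_embeddings[doc_id]).pow(2)  # squared L2 distances
        scored.append((doc_id, (-dists).max(dim=1).values.sum().item()))
    return sorted(scored, key=lambda pair: -pair[1])[:k]  # highest MaxSim score first
```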

Looking at Table 2, we first see Anserini’s BM25 baseline at 18.7 MRR@10, noticing its very low latency as implemented in Anserini (which extends the well-known Lucene system), owing to both very cheap operations and decades of bag-of-words top-k retrieval optimizations. The three subsequent baselines, namely doc2query, DeepCT, and docTTTTTquery, each bring a decisive enhancement to effectiveness. These improvements come at negligible overheads in latency, since these baselines ultimately rely on BM25-based retrieval. The most effective among these three, docTTTTTquery, demonstrates a massive 9% gain over vanilla BM25 by fine-tuning the recent language model T5.

Shifting our attention to ColBERT’s end-to-end retrieval effectiveness, we see its major gains in MRR@10 over all of these end-to-end models. In fact, using ColBERT in the end-to-end setup is superior in terms of MRR@10 to re-ranking with the same model, owing to the improved recall. Moving beyond MRR@10, we also see large gains in Recall@k for k equal to 50, 200, and 1000. For instance, its Recall@50 actually exceeds the official BM25’s Recall@1000 and even all but docTTTTTquery’s Recall@200, emphasizing the value of end-to-end retrieval (instead of just re-ranking) with ColBERT.

4.4. Ablation Studies

Figure 5. Ablation results on MS MARCO (Dev). Between brackets is the number of BERT layers used in each model.

The results from §4.2 indicate that ColBERT is highly effective despite the low cost and simplicity of its late interaction mechanism. To better understand the source of this effectiveness, we examine a number of important details in ColBERT’s interaction and encoder architecture. For this ablation, we report MRR@10 on the validation set of MS MARCO in Figure 5, which shows our main re-ranking ColBERT model [E], with MRR@10 of 34.9%.

Due to the cost of training all models, we train a copy of our main model that retains only the first 5 layers of BERT out of 12 (i.e., model [D]) and similarly train all our ablation models for 200k iterations with five BERT layers. To begin with, we ask if the fine-granular interaction in late interaction is necessary. Model [A] tackles this question: it uses BERT to produce a single embedding vector for the query and another for the document, extracted from BERT’s [CLS] contextualized embedding and expanded through a linear layer to dimension 4096 (which equals N_q × 128 = 32 × 128). Relevance is estimated as the inner product of the query’s and the document’s embeddings, which we found to perform better than cosine similarity for single-vector re-ranking. As the results show, this model is considerably less effective than ColBERT, reinforcing the importance of late interaction.
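For reference, a hedged sketch of model [A]’s scoring head is shown below; the 768-dimensional input assumes BERT-base’s hidden size, and the rest follows the description above.

```python
# Ablation [A] sketch: one [CLS] vector per query/document, expanded through
# a linear layer to 4096 = 32 x 128 dimensions, scored by inner product.
import torch

expand = torch.nn.Linear(768, 32 * 128)     # 768 assumes BERT-base's hidden size

def single_vector_score(q_cls: torch.Tensor, d_cls: torch.Tensor) -> torch.Tensor:
    """q_cls, d_cls: (batch, 768) [CLS] embeddings for queries and documents."""
    q, d = expand(q_cls), expand(d_cls)     # (batch, 4096) single-vector representations
    return (q * d).sum(dim=-1)              # inner product (not cosine) per pair
```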

Subsequently, we ask if our MaxSim-based late interaction is better than other simple alternatives. We test a model [B] that replaces ColBERT’s maximum similarity with average similarity. The results suggest the importance of allowing individual terms in the query to pay special attention to particular terms in the document. Similarly, the figure emphasizes the importance of our query augmentation mechanism: without query augmentation [C], ColBERT has a noticeably lower MRR@10. Lastly, we see the impact of end-to-end retrieval not only on recall but also on MRR@10. By retrieving directly from the full collection, ColBERT is able to surface in its top-10 documents that BM25’s top-1000 misses entirely.

4.5. Indexing Throughput & Footprint

Figure 6. Effect of ColBERT’s indexing optimizations on the offline indexing throughput.

Lastly, we examine the indexing throughput and space footprint of ColBERT. Figure 6 reports indexing throughput on MS MARCO documents with ColBERT and four other ablation settings, which individually enable optimizations described in §3.4 on top of basic batched indexing. Based on these throughputs, ColBERT can index MS MARCO in about three hours. Note that any BERT-based model must incur the computational cost of processing each document at least once. While ColBERT encodes each document with BERT exactly once, existing BERT-based rankers would repeat similar computations on possibly hundreds of documents for each query.
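The following is a minimal sketch of what such batched offline encoding can look like; grouping documents by length to reduce padding waste is one plausible optimization of the kind alluded to above, while the encode_batch function, file layout, and half-precision storage are assumptions for illustration.

```python
# A hedged indexing sketch: length-sorted batches to limit padding, with
# each document's token embeddings saved to disk in half precision.
import os
import numpy as np

def index_collection(docs, encode_batch, batch_size=128, out_dir="index"):
    """docs: list of passage strings; encode_batch: callable returning one
    (num_tokens, dim) float numpy array per input document (assumed)."""
    os.makedirs(out_dir, exist_ok=True)
    # Sort by approximate length so each batch pads to a similar maximum.
    order = sorted(range(len(docs)), key=lambda i: len(docs[i].split()))
    for start in range(0, len(order), batch_size):
        batch_ids = order[start:start + batch_size]
        embeddings = encode_batch([docs[i] for i in batch_ids])
        for doc_id, emb in zip(batch_ids, embeddings):
            np.save(os.path.join(out_dir, f"doc_{doc_id}.npy"), emb.astype('float16'))
```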

Setting           Dimension (m)   Bytes/Dim   Space (GiBs)   MRR@10
Re-rank Cosine    128             4           286            34.9
End-to-end L2     128             2           154            36.0
Re-rank L2        128             2           143            34.8
Re-rank Cosine    48              4           54             34.4
Re-rank Cosine    24              2           27             33.9
Table 4. Space Footprint vs MRR@10 (Dev) on MS MARCO.

Table 4 reports the space footprint of ColBERT under various settings as we reduce the embedding dimension and/or the bytes per dimension. Interestingly, the most space-efficient setting, that is, re-ranking with cosine similarity over 24-dimensional vectors stored as 2-byte floats, is only 1% worse in MRR@10 than the most space-consuming one, while requiring only 27 GiBs to represent the MS MARCO collection.
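As a back-of-the-envelope check of that figure, assume roughly 600 million token embeddings across the 8.8M MS MARCO passages (an assumed average of about 68 tokens per passage; the exact count is not stated here):

```python
# Footprint estimate for Table 4's most compact setting (24 dims, 2 bytes/dim).
num_embeddings = 600_000_000                  # assumed total token embeddings
dim, bytes_per_dim = 24, 2
gib = num_embeddings * dim * bytes_per_dim / 2**30
print(f"~{gib:.0f} GiB")                      # ~27 GiB, consistent with Table 4
```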

5. Conclusions

In this paper, we introduced ColBERT, a novel ranking model that employs contextualized late interaction over deep LMs (in particular, BERT) for efficient retrieval. By independently encoding queries and documents into fine-grained representations that interact via cheap and pruning-friendly computations, ColBERT can leverage the expressiveness of deep LMs while greatly speeding up query processing. In addition, doing so allows using ColBERT for end-to-end neural retrieval directly from a large document collection. Our results show that ColBERT is more than 170× faster and requires 14,000× fewer FLOPs per query than existing BERT-based models, all while only minimally impacting quality and while outperforming every non-BERT baseline.

Acknowledgments. OK was supported by the Eltoukhy Family Graduate Fellowship at the Stanford School of Engineering. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Infosys, NEC, and VMware—as well as Cisco, SAP, and the NSF under CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

  • Abuzaid et al. (2019) Firas Abuzaid, Geet Sethi, Peter Bailis, and Matei Zaharia. 2019. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1250–1261.
  • Dai and Callan (2019a) Zhuyun Dai and Jamie Callan. 2019a. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. arXiv preprint arXiv:1910.10687 (2019).
  • Dai and Callan (2019b) Zhuyun Dai and Jamie Callan. 2019b. Deeper Text Understanding for IR with Contextual Neural Language Modeling. arXiv preprint arXiv:1905.09217 (2019).
  • Dai et al. (2018) Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the eleventh ACM international conference on web search and data mining. 126–134.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dietz et al. (2017) Laura Dietz, Manisha Verma, Filip Radlinski, and Nick Craswell. 2017. TREC Complex Answer Retrieval Overview. In TREC.
  • Guo et al. (2016) Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 55–64.
  • Guo et al. (2019) Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W Bruce Croft, and Xueqi Cheng. 2019. A deep look into neural ranking models for information retrieval. arXiv preprint arXiv:1903.06902 (2019).
  • Hofstätter and Hanbury (2019) Sebastian Hofstätter and Allan Hanbury. 2019. Let’s measure run time! Extending the IR replicability infrastructure to include performance aspects. arXiv preprint arXiv:1907.04614 (2019).
  • Hofstätter et al. (2019a) Sebastian Hofstätter, Navid Rekabsaz, Carsten Eickhoff, and Allan Hanbury. 2019a. On the effect of low-frequency terms on neural-IR models. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1137–1140.
  • Hofstätter et al. (2019b) Sebastian Hofstätter, Markus Zlabinger, and Allan Hanbury. 2019b. TU Wien @ TREC Deep Learning ’19 – Simple Contextualization for Re-ranking. arXiv preprint arXiv:1912.01385 (2019).
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2333–2338.
  • Ji et al. (2019) Shiyu Ji, Jinjin Shao, and Tao Yang. 2019. Efficient Interaction-based Neural Ranking with Locality Sensitive Hashing. In The World Wide Web Conference. ACM, 2858–2864.
  • Jiao et al. (2019) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351 (2019).
  • Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kohavi et al. (2013) Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online controlled experiments at large scale. In SIGKDD.
  • MacAvaney et al. (2019) Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized embeddings for document ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1101–1104.
  • Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? In Advances in Neural Information Processing Systems. 14014–14024.
  • Mitra and Craswell (2019) Bhaskar Mitra and Nick Craswell. 2019. An Updated Duet Model for Passage Re-ranking. arXiv preprint arXiv:1903.07666 (2019).
  • Mitra et al. (2018) Bhaskar Mitra, Nick Craswell, et al. 2018. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval 13, 1 (2018), 1–126.
  • Mitra et al. (2017) Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1291–1299.
  • Mitra et al. (2019) Bhaskar Mitra, Corby Rosset, David Hawking, Nick Craswell, Fernando Diaz, and Emine Yilmaz. 2019. Incorporating query term independence assumption for efficient retrieval and ranking using deep neural networks. arXiv preprint arXiv:1907.03693 (2019).
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human-Generated MAchine Reading COmprehension Dataset. (2016).
  • Nogueira and Cho (2019) Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).
  • Nogueira et al. (2019a) Rodrigo Nogueira, Jimmy Lin, and AI Epistemic. 2019a. From doc2query to docTTTTTquery. (2019).
  • Nogueira et al. (2019b) Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019b. Multi-Stage Document Ranking with BERT. arXiv preprint arXiv:1910.14424 (2019).
  • Nogueira et al. (2019c) Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019c. Document Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
  • Qiao et al. (2019) Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2019. Understanding the Behaviors of BERT in Ranking. arXiv preprint arXiv:1904.07531 (2019).
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
  • Robertson et al. (1995) Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication (1995).
  • Tang et al. (2019) Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136 (2019).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
  • Xiong et al. (2017) Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval. 55–64.
  • Yang et al. (2018) Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality (JDIQ) 10, 4 (2018), 1–20.
  • Yang et al. (2019) Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically Examining the “Neural Hype”: Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1129–1132.
  • Yilmaz et al. (2019) Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Cross-domain modeling of sentence-level evidence for document retrieval. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3481–3487.
  • Zafrir et al. (2019) Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188 (2019).
  • Zamani et al. (2018) Hamed Zamani, Mostafa Dehghani, W Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 497–506.
  • Zhao (2012) Le Zhao. 2012. Modeling and solving term mismatch for full-text retrieval. Ph.D. Dissertation. Carnegie Mellon University.