Introducing Contextual Retrieval
For an AI model to be useful in specific contexts, it often needs access to background knowledge. For example, customer support chatbots need knowledge about the specific business they're being used for, and legal analyst bots need to know about a vast array of past cases.
Developers typically enhance an AI model's knowledge using Retrieval-Augmented Generation (RAG). RAG is a method that retrieves relevant information from a knowledge base and appends it to the user's prompt, significantly enhancing the model's response. The problem is that traditional RAG solutions remove context when encoding information, which often results in the system failing to retrieve the relevant information from the knowledge base.
In this post, we outline a method that dramatically improves the retrieval step in RAG. The method is called “Contextual Retrieval” and uses two sub-techniques: Contextual Embeddings and Contextual BM25. This method can reduce the number of failed retrievals by 49% and, when combined with reranking, by 67%. These represent significant improvements in retrieval accuracy, which directly translates to better performance in downstream tasks.
You can easily deploy your own Contextual Retrieval solution with Claude with our cookbook.
A note on simply using a longer prompt
Sometimes the simplest solution is the best. If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can just include the entire knowledge base in the prompt that you give the model, with no need for RAG or similar methods.
A few weeks ago, we released prompt caching for Claude, which makes this approach significantly faster and more cost-effective. Developers can now cache frequently used prompts between API calls, reducing latency by > 2x and costs by up to 90% (you can see how it works by reading our prompt caching cookbook).
However, as your knowledge base grows, you'll need a more scalable solution. That’s where Contextual Retrieval comes in.
A primer on RAG: scaling to larger knowledge bases
For larger knowledge bases that don't fit within the context window, RAG is the typical solution. RAG works by preprocessing a knowledge base using the following steps:
1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens;
2. Use an embedding model to convert these chunks into vector embeddings that encode meaning;
3. Store these embeddings in a vector database that allows for searching by semantic similarity.
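The preprocessing steps above can be sketched end-to-end. This is a toy illustration under stated assumptions, not a production setup: the hashed bag-of-words `embed` function stands in for a real embedding model, and the in-memory store stands in for a vector database; all names here are hypothetical.

```python
import math

def chunk_text(text, max_words=200):
    """Naive chunker: split on whitespace into fixed-size pieces."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text, dim=256):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class InMemoryVectorStore:
    """Minimal stand-in for a vector database supporting similarity search."""
    def __init__(self):
        self.items = []  # list of (chunk_text, embedding) pairs

    def add(self, chunk):
        self.items.append((chunk, embed(chunk)))

    def search(self, query, k=3):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

# Index a document: chunk it, embed each chunk, store the vectors.
store = InMemoryVectorStore()
document = "The company's revenue grew by 3% over the previous quarter."
for chunk in chunk_text(document, max_words=4):
    store.add(chunk)
```

At query time, `store.search(query)` returns the chunks most semantically similar to the query, which are then added to the prompt.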
At runtime, when a user inputs a query to the model, the vector database is used to find the most relevant chunks based on semantic similarity to the query. Then, the most relevant chunks are added to the prompt sent to the generative model.
While embedding models excel at capturing semantic relationships, they can miss crucial exact matches. Fortunately, there’s an older technique that can assist in these situations. BM25 (Best Matching 25) is a ranking function that uses lexical matching to find precise word or phrase matches. It's particularly effective for queries that include unique identifiers or technical terms.
BM25 works by building upon the TF-IDF (Term Frequency-Inverse Document Frequency) concept. TF-IDF measures how important a word is to a document in a collection. BM25 refines this by considering document length and applying a saturation function to term frequency, which helps prevent common words from dominating the results.
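The BM25 formula is compact enough to sketch directly. The implementation below is a standard textbook version with the usual `k1` and `b` parameters, not tied to any particular search library; documents are assumed to be pre-tokenized into word lists.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N

    def idf(term):
        # Inverse document frequency: rare terms contribute more.
        n = sum(1 for d in docs if term in d)
        return math.log((N - n + 0.5) / (n + 0.5) + 1)

    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for term in query:
            f = tf[term]
            # Saturating term-frequency component, normalized by document length.
            s += idf(term) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores
```

The `f + k1 * (...)` denominator is what saturates term frequency, so a word repeated many times cannot dominate the score, and the `len(doc) / avgdl` factor penalizes matches in unusually long documents.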
Here’s how BM25 can succeed where semantic embeddings fail: Suppose a user queries "Error code TS-999" in a technical support database. An embedding model might find content about error codes in general, but could miss the exact "TS-999" match. BM25 looks for this specific text string to identify the relevant documentation.
RAG solutions can more accurately retrieve the most applicable chunks by combining the embeddings and BM25 techniques using the following steps:
1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens;
2. Create TF-IDF encodings and semantic embeddings for these chunks;
3. Use BM25 to find top chunks based on exact matches;
4. Use embeddings to find top chunks based on semantic similarity;
5. Combine and deduplicate results from (3) and (4) using rank fusion techniques;
6. Add the top-K chunks to the prompt to generate the response.
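The rank-fusion step can be done several ways; Reciprocal Rank Fusion (RRF) is one common choice (the post does not specify which fusion method was used, so treat this as an illustrative assumption). A minimal sketch that merges the BM25 and embedding rankings into one deduplicated list:

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of chunk IDs into one deduplicated list.

    `rankings` is a list of ranked lists (best first). The constant `k`
    damps the influence of any single list, per the standard RRF formula.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]
```

A chunk that ranks well in both lists (like "b" below) accumulates score from each, so it rises above chunks that appear in only one list.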
By leveraging both BM25 and embedding models, traditional RAG systems can provide more comprehensive and accurate results, balancing precise term matching with broader semantic understanding.
This approach allows you to cost-effectively scale to enormous knowledge bases, far beyond what could fit in a single prompt. But these traditional RAG systems have a significant limitation: they often destroy context.
The context conundrum in traditional RAG
In traditional RAG, documents are typically split into smaller chunks for efficient retrieval. While this approach works well for many applications, it can lead to problems when individual chunks lack sufficient context.
For example, imagine you had a collection of financial information (say, U.S. SEC filings) embedded in your knowledge base, and you received the following question: "What was the revenue growth for ACME Corp in Q2 2023?"
A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter." However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.
Introducing Contextual Retrieval
Contextual Retrieval solves this problem by prepending chunk-specific explanatory context to each chunk before embedding (“Contextual Embeddings”) and creating the BM25 index (“Contextual BM25”).
Let’s return to our SEC filings collection example. Here's an example of how a chunk might be transformed:
original_chunk = "The company's revenue grew by 3% over the previous quarter."
contextualized_chunk = "This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."
It is worth noting that other approaches to using context to improve retrieval have been proposed in the past. Other proposals include: adding generic document summaries to chunks (we experimented and saw very limited gains), hypothetical document embedding, and summary-based indexing (we evaluated and saw low performance). These methods differ from what is proposed in this post.
Implementing Contextual Retrieval
Of course, it would be far too much work to manually annotate the thousands or even millions of chunks in a knowledge base. To implement Contextual Retrieval, we turn to Claude. We’ve written a prompt that instructs the model to provide concise, chunk-specific context that explains the chunk using the context of the overall document. We used the following Claude 3 Haiku prompt to generate context for each chunk:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
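As a sketch of how this prompt might be wired up, the helper below fills the template and prepends the generated context to the chunk. The `generate` callable is a hypothetical stand-in for whatever model client you use; swap in a real API call there.

```python
CONTEXT_PROMPT = """<document>
{doc}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def contextualize_chunk(doc, chunk, generate):
    """Prepend model-generated context to a chunk before embedding/indexing.

    `generate` is a hypothetical callable that sends a prompt to your LLM
    of choice and returns its text response.
    """
    context = generate(CONTEXT_PROMPT.format(doc=doc, chunk=chunk))
    return context.strip() + " " + chunk
```

With prompt caching, the `{doc}` portion of the prompt would be cached once per document and reused across all of that document's chunks.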
The resulting contextual text, usually 50-100 tokens, is prepended to the chunk before embedding it and before creating the BM25 index.
Here’s what the preprocessing flow looks like in practice:
If you’re interested in using Contextual Retrieval, you can get started with our cookbook.
Using Prompt Caching to reduce the costs of Contextual Retrieval
Contextual Retrieval is uniquely possible at low cost with Claude, thanks to the special prompt caching feature we mentioned above. With prompt caching, you don’t need to pass in the reference document for every chunk. You simply load the document into the cache once and then reference the previously cached content. Assuming 800 token chunks, 8k token documents, 50 token context instructions, and 100 tokens of context per chunk, the one-time cost to generate contextualized chunks is $1.02 per million document tokens.
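That per-million-token figure can be roughly reconstructed from the stated assumptions. The prices below are assumed Haiku-era rates (regular input/output plus cache write/read) and may not match current pricing; with them, the arithmetic lands near the quoted $1.02.

```python
# Assumed Claude 3 Haiku prices in $ per million tokens -- illustrative only.
PRICE_IN, PRICE_OUT = 0.25, 1.25                   # regular input / output
PRICE_CACHE_WRITE, PRICE_CACHE_READ = 0.30, 0.03   # prompt-cache write / read

DOC_TOKENS, CHUNK_TOKENS = 8_000, 800       # sizes stated in the post
INSTRUCTION_TOKENS, CONTEXT_OUT_TOKENS = 50, 100

chunks_per_doc = DOC_TOKENS // CHUNK_TOKENS     # 10 chunks per document
docs_per_mtok = 1_000_000 / DOC_TOKENS          # 125 documents per million tokens

cost_per_doc = (
    DOC_TOKENS * PRICE_CACHE_WRITE / 1e6                           # cache the document once
    + (chunks_per_doc - 1) * DOC_TOKENS * PRICE_CACHE_READ / 1e6   # re-read it for later chunks
    + chunks_per_doc * (CHUNK_TOKENS + INSTRUCTION_TOKENS) * PRICE_IN / 1e6
    + chunks_per_doc * CONTEXT_OUT_TOKENS * PRICE_OUT / 1e6        # generated context
)
cost_per_million_doc_tokens = cost_per_doc * docs_per_mtok   # roughly $0.99
```

The dominant terms are the one-time cache write and the cheap cache reads; without caching, every chunk would pay full input price on the whole 8k-token document.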
Methodology
We experimented across various knowledge domains (codebases, fiction, ArXiv papers, Science Papers), embedding models, retrieval strategies, and evaluation metrics. We’ve included a few examples of the questions and answers we used for each domain in Appendix II.
The graphs below show the average performance across all knowledge domains with the top-performing embedding configuration (Gemini Text 004) and retrieving the top-20-chunks. We use 1 minus recall@20 as our evaluation metric, which measures the percentage of relevant documents that fail to be retrieved within the top 20 chunks. You can see the full results in the appendix - contextualizing improves performance in every embedding-source combination we evaluated.
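The 1 minus recall@20 metric is straightforward to compute. A minimal sketch, assuming you have a ranked result list and a gold set of relevant chunk IDs for each query:

```python
def retrieval_failure_rate(results, relevant, k=20):
    """1 - recall@k: fraction of relevant chunks missing from the top-k results.

    `results` maps each query to a ranked list of retrieved chunk IDs;
    `relevant` maps each query to the set of gold chunk IDs for it.
    """
    total = hits = 0
    for query, gold in relevant.items():
        top_k = set(results[query][:k])
        hits += len(gold & top_k)
        total += len(gold)
    return 1 - hits / total
```

A lower value is better: 0.057 corresponds to the 5.7% baseline failure rate reported below.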
Performance improvements
Our experiments showed that:
- Contextual Embeddings reduced the top-20-chunk retrieval failure rate by 35% (5.7% → 3.7%).
- Combining Contextual Embeddings and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 49% (5.7% → 2.9%).
Implementation considerations
When implementing Contextual Retrieval, there are a few considerations to keep in mind:
- Chunk boundaries: Consider how you split your documents into chunks. The choice of chunk size, chunk boundary, and chunk overlap can affect retrieval performance.
- Embedding model: Whereas Contextual Retrieval improves performance across all embedding models we tested, some models may benefit more than others. We found Gemini and Voyage embeddings to be particularly effective.
- Custom contextualizer prompts: While the generic prompt we provided works well, you may be able to achieve even better results with prompts tailored to your specific domain or use case (for example, including a glossary of key terms that might only be defined in other documents in the knowledge base).
- Number of chunks: Adding more chunks into the context window increases the chances that you include the relevant information. However, more information can be distracting for models, so there's a limit to this. We tried delivering 5, 10, and 20 chunks, and found using 20 to be the most performant of these options (see appendix for comparisons), but it's worth experimenting on your use case.
- Always run evals: Response generation may be improved by passing the model the contextualized chunk and distinguishing between what is context and what is the chunk.
Further boosting performance with Reranking
In a final step, we can combine Contextual Retrieval with another technique to give even more performance improvements. In traditional RAG, the AI system searches its knowledge base to find the potentially relevant information chunks. With large knowledge bases, this initial retrieval often returns a lot of chunks—sometimes hundreds—of varying relevance and importance.
Reranking is a commonly used filtering technique to ensure that only the most relevant chunks are passed to the model. Reranking provides better responses and reduces cost and latency because the model is processing less information. The key steps are:
1. Perform initial retrieval to get the top potentially relevant chunks (we used the top 150);
2. Pass the top-N chunks, along with the user's query, through the reranking model;
3. Using the reranking model, give each chunk a score based on its relevance and importance to the prompt, then select the top-K chunks (we used the top 20);
4. Pass the top-K chunks into the model as context to generate the final result.
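The two-stage flow can be sketched with a pluggable scorer standing in for a hosted reranking model (no real reranker API is reproduced here; `score` and `first_stage_search` are hypothetical callables).

```python
def rerank(query, chunks, score, top_k=20):
    """Score each candidate chunk against the query; keep the top_k best.

    `score(query, chunk)` is any relevance scorer -- in practice a hosted
    reranking model; here it is pluggable so the flow is easy to test.
    """
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]

def retrieve_then_rerank(query, first_stage_search, score, n=150, top_k=20):
    """Two-stage retrieval: broad first pass (top-n), then rerank to top-k."""
    candidates = first_stage_search(query, n)
    return rerank(query, candidates, score, top_k)
```

The first stage casts a wide net (e.g. 150 chunks) cheaply; the reranker then spends more compute per chunk to pick the 20 that actually reach the generation model.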
Performance improvements
There are several reranking models on the market. We ran our tests with the Cohere reranker. Voyage also offers a reranker, though we did not have time to test it. Our experiments showed that, across various domains, adding a reranking step further optimizes retrieval.
Specifically, we found that Reranked Contextual Embedding and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 67% (5.7% → 1.9%).
Cost and latency considerations
One important consideration with reranking is the impact on latency and cost, especially when reranking a large number of chunks. Because reranking adds an extra step at runtime, it inevitably adds a small amount of latency, even though the reranker scores all the chunks in parallel. There is an inherent trade-off between reranking more chunks for better performance vs. reranking fewer for lower latency and cost. We recommend experimenting with different settings on your specific use case to find the right balance.
Conclusion
We ran a large number of tests, comparing different combinations of all the techniques described above (embedding model, use of BM25, use of contextual retrieval, use of a reranker, and total # of top-K results retrieved), all across a variety of different dataset types. Here’s a summary of what we found:
- Embeddings+BM25 is better than embeddings on their own;
- Voyage and Gemini have the best embeddings of the ones we tested;
- Passing the top-20 chunks to the model is more effective than just the top-10 or top-5;
- Adding context to chunks improves retrieval accuracy a lot;
- Reranking is better than no reranking;
- All these benefits stack: to maximize performance improvements, we can combine contextual embeddings (from Voyage or Gemini) with contextual BM25, plus a reranking step, and add the 20 chunks to the prompt.
We encourage all developers working with knowledge bases to use our cookbook to experiment with these approaches to unlock new levels of performance.
Appendix I
Below is a breakdown of results across datasets, embedding providers, use of BM25 in addition to embeddings, use of contextual retrieval, and use of reranking for Retrievals @ 20.
See Appendix II for the breakdowns for Retrievals @ 10 and @ 5 as well as example questions and answers for each dataset.