Boosting Long-Context Management via Query-Guided Activation Refilling
Hongjin Qian$^{1}$, Zheng Liu$^{1*}$, Peitian Zhang$^{2}$, Zhicheng Dou$^{2}$, Defu Lian$^{3}$
$^{1}$ Beijing Academy of Artificial Intelligence
$^{2}$ Gaoling School of Artificial Intelligence, Renmin University of China
$^{3}$ University of Science and Technology of China
{chienqhj,zhengliu1026}@gmail.com
Abstract
Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context-window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query's information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs.
In this paper, we propose a method for processing long-context information-seeking tasks via query-guided ACtivation REfilling (ACRE). ACRE constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache compactly captures global information, and the layer-2 (L2) cache provides detailed and localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thus improving answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE's effectiveness, achieving improvements in both performance and efficiency. We will release our source code in this repository.
1 Introduction
Recently, large language models (LLMs) such as ChatGPT (OpenAI, 2023) have become widely used for daily information-seeking tasks. However, their capabilities are inherently limited by the difficulty of updating parametric knowledge. To address this, incorporating external knowledge as context has become a common approach (Zhao et al., 2024). In practice, this external knowledge
Figure 1: Comparison of ACRE, standard RAG, and efficient long LLMs for information-seeking tasks. Standard RAG retrieves evidence without full-context perception, and long LLMs struggle with contexts exceeding their native window. ACRE overcomes these limitations with a resource-efficient bi-layer KV cache and query-guided refilling, capturing both global and local information while enhancing performance.
often involves long contexts, such as long documents or novels, which pose significant challenges due to the large KV activations accumulated during inference, demanding substantial computational resources and reducing efficiency (Xu et al., 2023; Bai et al., 2024b; Zhang et al., 2024c).
To address the challenges posed by excessive KV activations, previous works have proposed various strategies: reducing the precision of activation tensors (Liu et al., 2024; Xu et al., 2024), dividing long contexts into smaller chunks for independent processing (Lee et al., 2024; Yoon et al., 2024), or compressing KV activations into shorter representations through selection or sparse attention (Zhang et al., 2023; Li et al., 2024; Xiao et al., 2024; Jiang et al., 2024). Retrieval-Augmented Generation (RAG) has also emerged as a promising approach, retrieving precise evidence from long contexts to support answer generation (Gao et al., 2024).
However, most existing methods follow a unilateral strategy: either compromising the semantic richness of KV activations to create compact global representations, such as with quantized activations (Liu et al., 2024), or concentrating solely on detailed local information, such as RAG methods (Gao et al., 2024). Moreover, most lightweight KV methods remain constrained by the native context length limit, leading to significant performance degradation when processing contexts that exceed this limit (Zhang et al., 2024b).
In information-seeking tasks, we argue that the information needs of a user query can dynamically range from localized details to a global perspective, depending on the query's complexity. For instance, given a novel, the query "What are the main characters' names?" involves localized information needs and can be answered using specific local evidence. In contrast, the query "How do the main characters drive the story's development?" requires a global understanding of the entire book.
To address dynamic information needs in information-seeking tasks, we propose ACRE, a method that employs a bilateral strategy to capture a global perspective across the full context and enhance local details using query-guided activation refilling. Figure 1 presents an overview of ACRE's framework along with a comparison against efficient long LLMs and RAG methods.
Specifically, ACRE constructs a bi-layer KV activation cache for long contexts, comprising an L1 cache and an L2 cache. The L1 cache captures compact yet global information from the full context, while the L2 cache retains localized, detailed information. Notably, the L1 cache is significantly smaller than the L2 cache. During the forward pass of the LLM, the L1 and L2 caches are interleaved into a nested structure, with each L1 tensor optimized to proxy the semantics of its corresponding L2 cache. To enhance efficiency, we replace the original full attention mechanism, in which each token attends to all preceding tokens, with a tailored selective attention mechanism. In this approach, tokens perform full attention on recent L1 and L2 tokens but only attend to distant L1 tokens. This selective attention mechanism significantly reduces computational costs, enabling ACRE to process long contexts more efficiently.
After the forward pass, the nested KV cache is decomposed back into separate L1 and L2 caches. For an input query, ACRE first uses the query to attend to the compact L1 cache. Based on the resulting attention score distribution, ACRE selectively refills key entries of the L1 cache with the corresponding L2 cache entries, thereby enriching local details. This process is referred to as query-guided activation refilling.
ACRE is trained through an efficient two-stage process. The first stage focuses on constructing the bi-layer KV cache, while the second stage targets query-guided activation refilling. Throughout both stages, ACRE updates only a small subset of model parameters, ensuring training efficiency.
We evaluate ACRE across a wide range of long-context information-seeking tasks (Bai et al., 2024b; Zhang et al., 2024c; Qian et al., 2024b). The experimental results confirm the effectiveness of ACRE. Our key contributions are summarized as follows: (1) We design a flexible and efficient bi-layer KV activation cache mechanism for long contexts, which captures compact global information while preserving local details. (2) We introduce ACRE, a method that leverages the bi-layer KV activation cache with a query-guided activation refilling mechanism to efficiently handle long-context information-seeking tasks. (3) We demonstrate that ACRE achieves superior performance on long-context information-seeking tasks, effectively handling contexts much longer than LLMs' typical context limits, while substantially reducing computational resources and latency.
2 Method
2.1 Preliminary
The process of solving information-seeking tasks using LLMs can be succinctly described as $\mathcal{Y}=\mathcal{M}(\mathcal{X})$, where $\mathcal{M}(\cdot)$ denotes the LLM, $\mathcal{Y}$ represents the output answer, and $\mathcal{X}$ represents the input sequence. $\mathcal{X}$ can take various forms, ranging from a standalone query to a complex instruction prompt. In this paper, we focus on information-seeking tasks with long contexts. Therefore, we define the input sequence $\mathcal{X}$ as comprising a query $q$ and a long context $\mathcal{C}$, denoted by $\mathcal{X}=(\mathcal{C}, q)$.
For the input $\mathcal{X}$, a Transformer-based LLM computes multi-head attention (MHA) as follows:

$$\boldsymbol{Q}=\boldsymbol{X} \cdot \boldsymbol{W}_{Q}, \quad \boldsymbol{K}=\boldsymbol{X} \cdot \boldsymbol{W}_{K}, \quad \boldsymbol{V}=\boldsymbol{X} \cdot \boldsymbol{W}_{V},$$

$$\mathcal{A}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\operatorname{softmax}\left(\frac{\boldsymbol{Q} \boldsymbol{K}^{\top}}{\sqrt{d}}\right) \boldsymbol{V}, \quad (4)$$

where $\boldsymbol{X}$ represents the hidden states of the input sequence $\mathcal{X}$, and $\boldsymbol{W}_{Q}$, $\boldsymbol{W}_{K}$, and $\boldsymbol{W}_{V}$ are the projection weight matrices for the query $\boldsymbol{Q}$, key $\boldsymbol{K}$, and value $\boldsymbol{V}$, respectively (Vaswani et al., 2023). The attention function $\mathcal{A}(\cdot)$ is applied iteratively at every Transformer layer.
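For concreteness, a minimal single-head version of this attention computation might look as follows. This is a sketch in PyTorch, not taken from ACRE's implementation; tensor names simply follow the equations above.

```python
import torch
import torch.nn.functional as F

def attention(X, W_Q, W_K, W_V):
    """Causal scaled dot-product attention over hidden states X (seq_len x d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # projections
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
    # causal mask: each token attends only to itself and preceding tokens
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ V              # Eq. (4)
```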
Figure 2 (a): Bi-layer KV Cache.
The inference process of LLMs can be divided into two stages: (1) prefilling and (2) decoding (Liu et al., 2024). During the prefilling stage, the input sequence $\mathcal{X}$ is processed through each layer using MHA, and the layer-wise key-value activations $[\boldsymbol{K}, \boldsymbol{V}]$ are cached. These cached activations are reused in the decoding stage to avoid redundant computations, enabling efficient processing. However, as MHA computation has quadratic complexity with respect to the sequence length $n$, handling long contexts becomes computationally expensive. This often results in slow processing speeds and out-of-memory issues, particularly when dealing with long input contexts (Dong et al., 2023).
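To make the two stages concrete, the following sketch uses the Hugging Face transformers API (the model name is only an example) to prefill a context once and then reuse its cached activations during decoding:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# example model; any causal LM with KV caching behaves similarly
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

context_ids = tok("A very long context ...", return_tensors="pt").input_ids

# (1) prefilling: process the context once and cache [K, V] for every layer
with torch.no_grad():
    out = model(context_ids, use_cache=True)
past_kv = out.past_key_values  # layer-wise KV cache

# (2) decoding: feed one token at a time, reusing the cached activations
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
with torch.no_grad():
    out = model(next_token, past_key_values=past_kv, use_cache=True)
```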
To address the challenges posed by oversized KV caches for long contexts, we propose ACRE, a framework that constructs a Bi-layer KV Cache and employs a Query-Guided Refilling mechanism to enable a flexible KV cache that captures both global context and query-specific local details, ensuring efficient and high-quality answer decoding.
2.2 Overview of ACRE
Figure 2 provides an overview of ACRE. Specifically, for an information-seeking task with a long context $\mathcal{C}$, ACRE organizes the long context into a bi-layer KV activation cache during the pre-filling stage, as shown in Figure 2 (a).
The construction of the Bi-layer KV Cache begins by interleaving newly introduced L1 tokens into the input context. Through model forwarding, a nested KV cache $[\tilde{\boldsymbol{K}}, \tilde{\boldsymbol{V}}]$ is obtained. This nested KV cache is then decomposed into a Bi-layer KV cache: the layer-1 (L1) cache, which is compact and stores global information from the full long context, and the layer-2 (L2) cache, which holds detailed and localized information. Each tensor in the L1 cache serves as a semantic proxy for a corresponding sequence of tensors in the L2 cache.
We denote the L1 KV cache as $[\boldsymbol{K}^{L1}, \boldsymbol{V}^{L1}] \in \mathbb{R}^{m \times d}$ and the L2 KV cache as $[\boldsymbol{K}^{L2}, \boldsymbol{V}^{L2}] \in \mathbb{R}^{n \times d}$. Here, the length of the L1 KV cache, $m$, is significantly smaller than $n$, the length of the L2 KV cache. To optimize memory usage, the L2 cache can be offloaded to CPU memory, while the L1 cache is retained in GPU memory as a constant cache after constructing the bi-layer KV cache. This design significantly improves memory efficiency in practical applications.
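As a rough sketch of this memory layout (our own illustrative decomposition, not ACRE's released code), the nested cache can be split by L1 positions, keeping the small L1 cache on GPU and offloading the L2 cache to CPU:

```python
import torch

def split_bilayer_cache(nested_keys, nested_values, l1_mask):
    """Decompose a nested KV cache into an L1 (GPU-resident) and an L2 (CPU-offloaded) cache.

    nested_keys / nested_values: tensors of shape (seq_len, d)
    l1_mask: bool tensor of shape (seq_len,), True at L1-token positions
    (names and shapes are illustrative, not ACRE's actual implementation)
    """
    k_l1, v_l1 = nested_keys[l1_mask], nested_values[l1_mask]      # m x d
    k_l2, v_l2 = nested_keys[~l1_mask], nested_values[~l1_mask]    # n x d
    # keep the compact L1 cache resident on GPU ...
    l1_cache = (k_l1.cuda(), v_l1.cuda())
    # ... and offload the much larger L2 cache to CPU memory
    l2_cache = (k_l2.cpu(), v_l2.cpu())
    return l1_cache, l2_cache
```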
The Bi-layer KV Cache is constructed exclusively for input contexts, enabling it to be reused across different information-seeking tasks that share the same context. Given an input query $q$, ACRE utilizes $q$ to attend to the L1 cache, computing attention scores. Based on these scores, ACRE selectively refills the L1 cache by retrieving the most informative entries from the L2 cache, which are proxied by the corresponding most attentive L1 cache tensors. This process recovers a partial nested cache to support answer decoding and is referred to as query-guided activation refilling, which is shown in Figure 2 (b).
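A simplified, single-head view of the refilling step might look as follows; the top-$k$ selection, the interval-based span lookup, and the plain concatenation are our own illustrative simplifications (in particular, positional re-interleaving of the refilled entries is omitted):

```python
import torch
import torch.nn.functional as F

def query_guided_refill(q_states, l1_cache, l2_cache, interval, top_k=4):
    """Refill the compact L1 cache with query-relevant L2 entries.

    q_states: (q_len, d) hidden states of the query
    l1_cache: (K_l1, V_l1), each (m, d); l2_cache: (K_l2, V_l2), each (n, d)
    interval: number of L2 tokens proxied by each L1 token (the L1/L2 interval)
    """
    K_l1, V_l1 = l1_cache
    K_l2, V_l2 = l2_cache
    # score each L1 entry by the attention mass it receives from the query tokens
    scores = F.softmax(q_states @ K_l1.T / K_l1.shape[-1] ** 0.5, dim=-1)
    l1_importance = scores.sum(dim=0)                                  # (m,)
    selected = torch.topk(l1_importance, k=min(top_k, K_l1.shape[0])).indices
    # refill: pull in the L2 spans proxied by the most attentive L1 entries
    refilled_K, refilled_V = [K_l1], [V_l1]
    for i in selected.tolist():
        start, end = i * interval, min((i + 1) * interval, K_l2.shape[0])
        refilled_K.append(K_l2[start:end].to(K_l1.device))
        refilled_V.append(V_l2[start:end].to(V_l1.device))
    return torch.cat(refilled_K), torch.cat(refilled_V)
```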
By leveraging both the L1 KV cache and the query-specific L2 KV cache, the final KV cache captures global information from the full long context while preserving local details. This design significantly enhances the performance of long-context information-seeking tasks. In the following sections, we provide the technical details of ACRE.
2.3 Bi-Layer KV Cache
To construct the bi-layer KV cache, we introduce a new type of token, called L1 tokens, denoted as $\mathcal{X}^{L1}=(x_{1}^{L1}, \cdots, x_{m}^{L1})$. The original tokens of the input sequence are referred to as L2 tokens, denoted as $\mathcal{X}^{L2}=(x_{1}, \cdots, x_{n})$. By interleaving the L1 and L2 tokens, the input sequence $\mathcal{X}$ is transformed into a nested sequence $\tilde{\mathcal{X}}$:

$$\tilde{\mathcal{X}}=\left(x_{1}, \cdots, x_{l}, x_{1}^{L1},\ x_{l+1}, \cdots, x_{2l}, x_{2}^{L1},\ \cdots,\ x_{n}, x_{m}^{L1}\right),$$
where each L1 token is inserted after every $l$ L2 tokens, acting as a semantic proxy for the preceding $l$ L2 tokens. We refer to $l$ as the L1/L2 interval. For the L1 tokens, we initialize an additional set of trainable weight matrices $\boldsymbol{W}_{Q}^{L1}$, $\boldsymbol{W}_{K}^{L1}$, and $\boldsymbol{W}_{V}^{L1}$, while keeping the original weight matrices for L2 tokens frozen.
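As a minimal illustration of this interleaving (the L1 token id and the interval value are placeholders, not ACRE's actual settings):

```python
def interleave_l1_tokens(token_ids, l1_token_id, interval):
    """Insert one L1 proxy token after every `interval` L2 (original) tokens."""
    nested = []
    for i, tok in enumerate(token_ids, start=1):
        nested.append(tok)
        if i % interval == 0:
            nested.append(l1_token_id)
    # if the last group is shorter than `interval`, still close it with an L1 token
    if len(token_ids) % interval != 0:
        nested.append(l1_token_id)
    return nested

# usage: interleave_l1_tokens(list(range(10)), l1_token_id=-1, interval=4)
```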
After constructing the nested sequence $\tilde{\mathcal{X}}$, we adapt the attention computation defined in Eq. (4). Specifically, for the key $\boldsymbol{K}$, the original projection $\boldsymbol{K}=\boldsymbol{X} \cdot \boldsymbol{W}_{K}$ is replaced with:
$$\boldsymbol{K}= \begin{cases}\boldsymbol{x} \cdot \boldsymbol{W}_{K}^{L1}, & \text{if } x \text{ is an L1 token,} \\ \boldsymbol{x} \cdot \boldsymbol{W}_{K}, & \text{if } x \text{ is an L2 token,}\end{cases}$$
where $\boldsymbol{x} \in \boldsymbol{X}$. Through multi-head attention, this modification yields the nested key activations:

$$\tilde{\boldsymbol{K}}=\left(\underbrace{\boldsymbol{k}_{1}, \cdots, \boldsymbol{k}_{l}}_{\boldsymbol{k}_{1}^{L1}}, \boldsymbol{k}_{1}^{L1}, \underbrace{\boldsymbol{k}_{l+1}, \cdots, \boldsymbol{k}_{2l}}_{\boldsymbol{k}_{2}^{L1}}, \boldsymbol{k}_{2}^{L1}, \cdots\right),$$
where $\underbrace{\boldsymbol{k}_{1}, \cdots, \boldsymbol{k}_{l}}_{\boldsymbol{k}_{1}^{L1}}$ represents the proxying relationship between the L1 cache and the L2 cache.
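In code, the token-type-dependent key projection above could be sketched as follows (a single-layer, single-head illustration with hypothetical names):

```python
import torch

def project_keys(hidden, is_l1, W_K, W_K_l1):
    """Project hidden states to keys, routing L1 tokens through the trainable
    L1 projection and L2 tokens through the frozen original projection.

    hidden:  (seq_len, d) hidden states of the nested sequence
    is_l1:   (seq_len,) bool mask marking L1-token positions
    W_K:     frozen original key projection; W_K_l1: trainable L1 projection
    """
    return torch.where(is_l1.unsqueeze(-1), hidden @ W_K_l1, hidden @ W_K)
```

The query and value projections follow the same pattern with their respective L1 matrices.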
As previously mentioned, directly computing full attention over the long sequence $\mathcal{X}$ is both computationally expensive and resource-intensive. To efficiently construct the bi-layer KV cache, we propose a selective attention mechanism. This mechanism maintains a relatively small working context window $\mathcal{W}$, enabling current tokens to perform full attention on recent L1 and L2 tokens while only attending to distant L1 tokens. For instance, when computing KV activations at step $n$, we prune the previous KV cache $[\tilde{\boldsymbol{K}}, \tilde{\boldsymbol{V}}]$ by retaining all L1 entries together with the most recent L1 and L2 entries and evicting distant L2 entries, subject to the constraints $|\tilde{\boldsymbol{K}}| \leq \mathcal{W}$ and $|\tilde{\boldsymbol{V}}| \leq \mathcal{W}$. Through this mechanism, we sequentially process the full sequence $\tilde{\mathcal{X}}$ into KV activations using a short working context window, achieving both high computational efficiency and economical memory usage.
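The pruning rule can be sketched as follows; the list-of-entries structure and the eviction loop are our own interpretation of the selective-attention window described above, not ACRE's actual implementation:

```python
def prune_kv_window(kv_entries, window):
    """Prune a nested KV cache so it fits a working window of size `window`.

    kv_entries: list of dicts like {"k": ..., "v": ..., "is_l1": bool},
                ordered from oldest to newest (illustrative structure).
    Distant L2 entries are evicted first, so the retained cache contains all
    L1 entries plus the most recent L1/L2 entries.
    """
    if len(kv_entries) <= window:
        return kv_entries
    pruned = list(kv_entries)
    i = 0
    # walk from the oldest entry and drop L2 entries until the window fits
    while len(pruned) > window and i < len(pruned):
        if not pruned[i]["is_l1"]:
            pruned.pop(i)        # evict a distant L2 entry
        else:
            i += 1               # keep L1 proxies
    return pruned
```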
2.4 Query-Guided Activation Refilling
After constructing the bi-layer KV cache for the context, we obtain the L1 KV cache $[\boldsymbol{K}^{L1}, \boldsymbol{V}^{L1}]$, which serves as a global yet compact representation of the full long context, and the L2 KV cache $[\boldsymbol{K}^{L2}, \boldsymbol{V}^{L2}]$, which provides detailed but memory-intensive representations. To optimize memory usage, the L1 KV cache is retained as a constant cache in GPU memory, while the L2 KV cache is offloaded to CPU memory.