
Boosting Long-Context Management via Query-Guided Activation Refilling

Hongjin Qian$^{1}$, Zheng Liu$^{1*}$, Peitian Zhang$^{2}$, Zhicheng Dou$^{2}$, Defu Lian$^{3}$

$^{1}$ Beijing Academy of Artificial Intelligence
$^{2}$ Gaoling School of Artificial Intelligence, Renmin University of China
$^{3}$ University of Science and Technology of China
{chienqhj,zhengliu1026}@gmail.com

Abstract

Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context-window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query’s information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs.

In this paper, we propose a method for processing long-context information-seeking tasks via query-guided ACtivation REfilling (ACRE). ACRE constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache compactly captures global information, and the layer-2 (L2) cache provides detailed and localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thus improving answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE's effectiveness, achieving improvements in both performance and efficiency. We will release our source code in this repository.

1 Introduction

Recently, large language models (LLMs), such as ChatGPT (OpenAI, 2023), have become widely used for daily information-seeking tasks. However, their capabilities are inherently limited by the difficulty of updating parametric knowledge. To address this, incorporating external knowledge as context has become a common approach (Zhao et al., 2024). In practice, this external knowledge often involves long contexts, such as long documents or novels, which pose significant challenges due to the large KV activations accumulated during inference, demanding substantial computational resources and reducing efficiency (Xu et al., 2023; Bai et al., 2024b; Zhang et al., 2024c).

Figure 1: Comparison of ACRE, standard RAG, and efficient long LLMs for information-seeking tasks. Standard RAG retrieves evidence without full-context perception, and long LLMs struggle with contexts exceeding their native window. ACRE overcomes these limitations with a resource-efficient bi-layer KV cache and query-guided refilling, capturing both global and local information while enhancing performance.
To address the challenges posed by excessive KV activations, previous works have proposed various strategies: reducing the precision of activation tensors (Liu et al., 2024; Xu et al., 2024), dividing long contexts into smaller chunks for independent processing (Lee et al., 2024; Yoon et al., 2024), or compressing KV activations into shorter representations through selection or sparse attention (Zhang et al., 2023; Li et al., 2024; Xiao et al., 2024; Jiang et al., 2024). Retrieval-Augmented Generation (RAG) has also emerged as a promising approach, retrieving precise evidence from long contexts to support answer generation (Gao et al., 2024).
However, most existing methods follow a unilateral strategy: either compromising the semantic richness of KV activations to create compact global representations, such as with quantized activations (Liu et al., 2024), or concentrating solely on detailed local information, such as RAG methods (Gao et al., 2024). Moreover, most lightweight KV methods remain constrained by the native context length limit, leading to significant performance degradation when processing contexts that exceed this limit (Zhang et al., 2024b).
In information-seeking tasks, we argue that the information needs of a user query can dynamically range from localized details to a global perspective, depending on the query’s complexity. For instance, given a novel, the query “What are the main characters’ names?” involves localized information needs and can be answered using specific local evidence. In contrast, the query “How do the main characters drive the story’s development?” requires a global understanding of the entire book.
To address dynamic information needs in information-seeking tasks, we propose ACRE, a method that employs a bilateral strategy to capture a global perspective across the full context and enhance local details using query-guided activation refilling. Figure 1 presents an overview of ACRE’s framework along with a comparison against efficient long LLMs and RAG methods.
Specifically, ACRE constructs a bi-layer KV activation cache for long contexts, comprising an L1 cache and an L2 cache. The L1 cache captures compact yet global information from the full context, while the L2 cache retains localized, detailed information. Notably, the L1 cache is significantly smaller than the L2 cache. During the forward pass of the LLM, the L1 and L2 caches are interleaved into a nested structure, with each L1 tensor optimized to proxy the semantics of its corresponding L2 cache. To enhance efficiency, we replace the original full attention mechanism, in which each token attends to all preceding tokens, with a tailored selective attention mechanism. In this approach, tokens perform full attention on recent L1 and L2 tokens but only attend to distant L1 tokens. This selective attention mechanism significantly reduces computational costs, enabling ACRE to process long contexts more efficiently.
After the forward pass, the nested KV cache is decomposed back into separate L1 and L2 caches. For an input query, ACRE first uses the query to attend to the compact L1 cache. Based on the resulting attention score distribution, ACRE selectively refills key entries of the L1 cache with the corresponding L2 cache entries, thereby enriching local details. This process is referred to as query-guided activation refilling.
ACRE is trained through an efficient two-stage process. The first stage focuses on constructing the bi-layer KV cache, while the second stage targets query-guided activation refilling. Throughout both stages, ACRE updates only a small subset of model parameters, ensuring training efficiency.
We evaluate ACRE across a wide range of long-context information-seeking tasks (Bai et al., 2024b; Zhang et al., 2024c; Qian et al., 2024b). The experimental results confirm the effectiveness of ACRE. Our key contributions are summarized as follows: (1) We design a flexible and efficient bi-layer KV activation cache mechanism for long contexts, which captures compact global information while preserving local details. (2) We introduce ACRE, a method that leverages the bi-layer KV activation cache with a query-guided activation refilling mechanism to efficiently handle longcontext information-seeking tasks. (3) We demonstrate that ACRE achieves superior performance on long-context information-seeking tasks, effectively handling contexts much longer than LLMs’ typical context limits, while substantially reducing computational resources and latency.

2 Method

2.1 Preliminary

The process of solving information-seeking tasks using LLMs can be succinctly described as $\mathcal{Y}=\mathcal{M}(\mathcal{X})$, where $\mathcal{M}(\cdot)$ denotes the LLM, $\mathcal{Y}$ represents the output answer, and $\mathcal{X}$ represents the input sequence. $\mathcal{X}$ can take various forms, ranging from a standalone query to a complex instruction prompt. In this paper, we focus on information-seeking tasks with long contexts. Therefore, we define the input sequence $\mathcal{X}$ as comprising a query $q$ and a long context $\mathcal{C}$, denoted by $\mathcal{X}=(\mathcal{C}, q)$.
For the input $\mathcal{X}$, a Transformer-based LLM computes multi-head attention (MHA) as follows:
$$
\begin{aligned}
\boldsymbol{Q} &= \boldsymbol{X} \cdot \boldsymbol{W}_{Q} \\
\boldsymbol{K} &= \boldsymbol{X} \cdot \boldsymbol{W}_{K} \\
\boldsymbol{V} &= \boldsymbol{X} \cdot \boldsymbol{W}_{V} \\
\mathcal{A}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) &= \operatorname{softmax}\left(\frac{\boldsymbol{Q} \cdot \boldsymbol{K}^{\top}}{\sqrt{d}}\right) \cdot \boldsymbol{V}
\end{aligned}
$$
where $\boldsymbol{X}$ represents the hidden states of the input sequence $\mathcal{X}$, and $\boldsymbol{W}_{Q}$, $\boldsymbol{W}_{K}$, and $\boldsymbol{W}_{V}$ are the projection weight matrices for the query $\boldsymbol{Q}$, key $\boldsymbol{K}$, and value $\boldsymbol{V}$, respectively (Vaswani et al., 2023). The attention function $\mathcal{A}(\cdot)$ is applied iteratively across multiple layers and attention heads. For simplicity, we omit the layer and head indices.

Figure 2: Overview of ACRE. (a) ACRE constructs the Bi-layer KV cache from a long context. (b) For an input query, ACRE refills the L1 KV cache with query-relevant entries from the L2 KV cache and decodes the final answer based on the refilled cache. (c) The two-stage optimization process used to train ACRE is illustrated.
The inference process of LLMs can be divided into two stages: (1) prefilling and (2) decoding (Liu et al., 2024). During the prefilling stage, the input sequence $\mathcal{X}$ is processed through each layer using MHA, and the layer-wise key-value activations $[\boldsymbol{K}, \boldsymbol{V}]$ are cached. These cached activations are reused in the decoding stage to avoid redundant computations, enabling efficient processing. However, as MHA computation has quadratic complexity with respect to the sequence length $n$, handling long contexts becomes computationally expensive. This often results in slow processing speeds and out-of-memory issues, particularly when dealing with long input contexts (Dong et al., 2023).
To address the challenges posed by oversized KV caches for long contexts, we propose ACRE, a framework that constructs a Bi-layer KV Cache and employs a Query-Guided Refilling mechanism to enable a flexible KV cache that captures both global context and query-specific local details, ensuring efficient and high-quality answer decoding.

2.2 Overview of ACRE

Figure 2 provides an overview of ACRE. Specifically, for an information-seeking task with a long context $\mathcal{C}$, ACRE organizes the long context into a bi-layer KV activation cache during the pre-filling stage, as shown in Figure 2 (a).
The construction of the Bi-layer KV Cache begins by interleaving newly introduced L1 tokens into the input context. Through model forwarding, a nested KV cache $[\tilde{\boldsymbol{K}}, \tilde{\boldsymbol{V}}]$ is obtained. This nested KV cache is then decomposed into a Bi-layer KV cache: the layer-1 (L1) cache, which is compact and stores global information from the full long context, and the layer-2 (L2) cache, which holds detailed and localized information. Each tensor in the L1 cache serves as a semantic proxy for a corresponding sequence of tensors in the L2 cache.
We denote the L1 KV cache as $[\boldsymbol{K}^{L1}, \boldsymbol{V}^{L1}] \in \mathbb{R}^{m \times d}$ and the L2 KV cache as $[\boldsymbol{K}^{L2}, \boldsymbol{V}^{L2}] \in \mathbb{R}^{n \times d}$. Here, the length of the L1 KV cache, $m$, is significantly smaller than $n$, the length of the L2 KV cache. To optimize memory usage, the L2 cache can be offloaded to CPU memory, while the L1 cache is retained in GPU memory as a constant cache after constructing the bi-layer KV cache. This design significantly improves memory efficiency in practical applications.
The Bi-layer KV Cache is constructed exclusively for input contexts, enabling it to be reused across different information-seeking tasks that share the same context. Given an input query $q$, ACRE utilizes $q$ to attend to the L1 cache, computing attention scores. Based on these scores, ACRE selectively refills the L1 cache by retrieving the most informative entries from the L2 cache, which are proxied by the corresponding most attentive L1 cache tensors. This process recovers a partial nested cache to support answer decoding and is referred to as query-guided activation refilling, which is shown in Figure 2 (b).
By leveraging both the L1 KV cache and the query-specific L2 KV cache, the final KV cache captures global information from the full long context while preserving local details. This design significantly enhances the performance of long-context information-seeking tasks. In the following sections, we provide the technical details of ACRE.

2.3 Bi-Layer KV Cache

To construct the bi-layer KV cache, we introduce a new type of token, called L1 tokens, denoted as $\mathcal{X}^{L1}=(x_{1}^{L1}, \cdots, x_{m}^{L1})$. The original tokens of the input sequence are referred to as L2 tokens, denoted as $\mathcal{X}^{L2}=(x_{1}, \cdots, x_{n})$. By interleaving the L1 and L2 tokens, the input sequence $\mathcal{X}$ is transformed into a nested sequence $\tilde{\mathcal{X}}$:
$$
\tilde{\mathcal{X}}=\left(x_{1}, \cdots, x_{l}, x_{1}^{L1}, x_{l+1}, \cdots, x_{n}, x_{m}^{L1}\right)
$$
where each L1 token is inserted after every $l$ L2 tokens, acting as a semantic proxy for the preceding $l$ L2 tokens. We refer to $l$ as the L1/L2 interval. For the L1 tokens, we initialize an additional set of trainable weight matrices $\boldsymbol{W}_{Q}^{L1}$, $\boldsymbol{W}_{K}^{L1}$, and $\boldsymbol{W}_{V}^{L1}$, while keeping the original weight matrices for L2 tokens frozen.
After constructing the nested sequence $\tilde{\mathcal{X}}$, we adapt the attention computation defined in Eq. (4). Specifically, for the key $\boldsymbol{K}$, the original projection $\boldsymbol{K}=\boldsymbol{X} \cdot \boldsymbol{W}_{K}$ is replaced with:
$$
\boldsymbol{K}= \begin{cases}\boldsymbol{x} \cdot \boldsymbol{W}_{K}^{L1}, & \text{if } x \text{ is an L1 token} \\ \boldsymbol{x} \cdot \boldsymbol{W}_{K}, & \text{if } x \text{ is an L2 token}\end{cases}
$$
where $\boldsymbol{x} \in \boldsymbol{X}$. Through multi-head attention, this modification yields the nested key activations:
$$
\tilde{\boldsymbol{K}}=\left[\boldsymbol{k}_{1}, \cdots, \boldsymbol{k}_{l}, \boldsymbol{k}_{1}^{L1}, \cdots, \boldsymbol{k}_{n}, \boldsymbol{k}_{m}^{L1}\right]
$$
Similarly, the nested value activations $\tilde{\boldsymbol{V}}$ are computed as:
$$
\tilde{\boldsymbol{V}}=\left[\boldsymbol{v}_{1}, \cdots, \boldsymbol{v}_{l}, \boldsymbol{v}_{1}^{L1}, \cdots, \boldsymbol{v}_{n}, \boldsymbol{v}_{m}^{L1}\right]
$$
By decomposing the nested KV cache, we obtain the bi-layer KV cache as follows:
$$
\begin{gathered}
\boldsymbol{K}^{L1}=\left[\boldsymbol{k}_{1}^{L1}, \cdots, \boldsymbol{k}_{m}^{L1}\right] \\
\boldsymbol{V}^{L1}=\left[\boldsymbol{v}_{1}^{L1}, \cdots, \boldsymbol{v}_{m}^{L1}\right] \\
\boldsymbol{K}^{L2}=[\underbrace{\boldsymbol{k}_{1}, \cdots, \boldsymbol{k}_{l}}_{\boldsymbol{k}_{1}^{L1}}, \cdots, \underbrace{\boldsymbol{k}_{n-l}, \cdots, \boldsymbol{k}_{n}}_{\boldsymbol{k}_{m}^{L1}}] \\
\boldsymbol{V}^{L2}=[\underbrace{\boldsymbol{v}_{1}, \cdots, \boldsymbol{v}_{l}}_{\boldsymbol{v}_{1}^{L1}}, \cdots, \underbrace{\boldsymbol{v}_{n-l}, \cdots, \boldsymbol{v}_{n}}_{\boldsymbol{v}_{m}^{L1}}]
\end{gathered}
$$
where $\underbrace{\boldsymbol{k}_{1}, \cdots, \boldsymbol{k}_{l}}_{\boldsymbol{k}_{1}^{L1}}$ represents the proxying relationship between the L1 cache and the L2 cache.
As previously mentioned, directly computing full attention over the long sequence $\mathcal{X}$ is both computationally expensive and resource-intensive. To efficiently construct the bi-layer KV cache, we propose a selective attention mechanism. This mechanism maintains a relatively small working context window $\mathcal{W}$, enabling current tokens to perform full attention on recent L1 and L2 tokens while only attending to distant L1 tokens. For instance, when computing KV activations at step $n$, we prune the previous KV cache $[\tilde{\boldsymbol{K}}, \tilde{\boldsymbol{V}}]$ as follows:
$$
\begin{aligned}
\tilde{\boldsymbol{K}} &= [\boldsymbol{k}_{1}^{L1}, \cdots, \boldsymbol{k}_{i}^{L1}, \boldsymbol{k}_{j}, \cdots, \boldsymbol{k}_{n}, \boldsymbol{k}_{m}^{L1}] \\
\tilde{\boldsymbol{V}} &= [\underbrace{\boldsymbol{v}_{1}^{L1}, \cdots, \boldsymbol{v}_{i}^{L1}}_{\text{distant L1 tokens}}, \underbrace{\boldsymbol{v}_{j}, \cdots, \boldsymbol{v}_{n}, \boldsymbol{v}_{m}^{L1}}_{\text{recent L1/L2 tokens}}],
\end{aligned}
$$
subject to the constraints $|\tilde{\boldsymbol{K}}| \leq \mathcal{W}$ and $|\tilde{\boldsymbol{V}}| \leq \mathcal{W}$. Through this mechanism, we sequentially process the full sequence $\tilde{\mathcal{X}}$ into KV activations using a short working context window, achieving both high computational efficiency and economical memory usage.

2.4 Query-Guided Activation Refilling

After constructing the bi-layer KV cache for the context, we obtain the L1 KV cache $[\boldsymbol{K}^{L1}, \boldsymbol{V}^{L1}]$, which serves as a global yet compact representation of the full long context, and the L2 KV cache $[\boldsymbol{K}^{L2}, \boldsymbol{V}^{L2}]$, which provides detailed but memory-intensive representations. To optimize memory usage, the L1 KV cache is retained as a constant cache in GPU memory, while the L2 KV cache is offloaded to CPU memory.
For an input query $q$, relying solely on the L1 KV cache is feasible but lacks query-specific detailed information. To address this limitation, ACRE refills the compact L1 KV cache with selected entries from the L2 KV cache that are most relevant for answering the query. Specifically, the query state $\boldsymbol{Q}_{q}$ for the input query $q$ is computed as $\boldsymbol{Q}_{q}=\boldsymbol{q} \cdot \boldsymbol{W}_{Q}$. Using this query state, the attention distribution is calculated as $\boldsymbol{A}=\operatorname{softmax}\left(\frac{\boldsymbol{Q}_{q} \cdot \boldsymbol{K}^{L1\top}}{\sqrt{d}}\right)$, where $\boldsymbol{A} \in \mathbb{R}^{h \times m \times t}$, $h$ is the number of attention heads, $m$ is the length of the L1 cache, and $t$ is the length of the query $q$. The attention scores $\mathcal{S}$ are then obtained by applying mean pooling:
$$
\mathcal{S}=\operatorname{Pool}_{\mathrm{dim}=0,2}(\boldsymbol{A}), \quad \mathcal{S} \in \mathbb{R}^{m},
$$
where $\mathcal{S}$ serves as a guiding signal to select relevant entries from the L2 KV cache. The selection process is defined as:
$$
\begin{gathered}
\mathcal{I}=\arg \operatorname{top}_{k}(\mathcal{S}), \\
k=\left\lfloor\frac{\min (\mathcal{W}-m, \eta)}{l}\right\rfloor,
\end{gathered}
$$
where $k$ is dynamically determined based on the maximum length of the predefined working context window $\mathcal{W}$ or the maximum refilling length $\eta$, and $\mathcal{I}$ represents the set of selected indices.
After selection, the L1 KV cache is refilled with the chosen entries from the L2 KV cache. For example, if $\mathcal{I}=\{2\}$, the refilled KV cache becomes:
$$
\begin{aligned}
\boldsymbol{K} &= \left[\boldsymbol{k}_{1}^{L1}, \boldsymbol{k}_{l+1}, \cdots, \boldsymbol{k}_{2l}, \boldsymbol{k}_{2}^{L1}, \cdots, \boldsymbol{k}_{m}^{L1}\right], \\
\boldsymbol{V} &= [\boldsymbol{v}_{1}^{L1}, \underbrace{\boldsymbol{v}_{l+1}, \cdots, \boldsymbol{v}_{2l}}_{\text{refilled L2 KV cache}}, \boldsymbol{v}_{2}^{L1}, \cdots, \boldsymbol{v}_{m}^{L1}].
\end{aligned}
$$
This refilling process is performed independently for each layer. With the refilled KV cache, ACRE decodes the final answer $\mathcal{Y}$ in a standard autoregressive manner.

2.5 Model Optimization

ACRE is characterized by its Bi-layer KV Cache structure and Query-Guided Activation Refilling mechanism. Its effectiveness relies on two key abilities: (1) the L1 KV activations must faithfully represent the L2 KV activations, and (2) given an input query $q$, the most relevant L2 KV activations must be efficiently retrieved. To optimize these abilities, we employ a two-stage optimization strategy.
In stage 1, the objective is to maximize the semantic volume of the L1 KV activations to effectively represent the corresponding L2 KV activations. This is achieved by predicting the next token using the previously accumulated L1 tokens and the recent L2 tokens. The optimization can be expressed through a cross-entropy loss:
$$
\mathcal{L}_{\text{stage-1}}=-\sum_{t=1}^{T} \log \mathcal{P}\left(x_{t} \mid x_{[1:i]}^{L1}, x_{[j:t-1]}\right),
$$
where $x_{[1:i]}^{L1}$ denotes the accumulated L1 tokens, and $x_{[j:t-1]}$ denotes the recent L2 tokens.
In stage 2, the objective is to enable ACRE to retrieve the most relevant L2 KV activations for refilling the L1 KV cache based on an input query $q$. Since the L2 KV cache is proxied by the L1 KV cache, accurately attending to the most useful L1 KV activations allows retrieval of the corresponding L2 KV activations via the proxying relationship. To achieve this, we optimize ACRE using task-specific data comprising long contexts and input queries. The optimization employs the following loss function:
$$
\mathcal{L}_{\text{stage-2}}=-\sum_{t=1}^{T} \log \mathcal{P}\left(y_{t} \mid \mathcal{X}^{L2}, q\right),
$$
where $y$ represents the ground-truth answer, and $q$ is the input query. This loss ensures that ACRE learns to produce accurate answers solely based on the L1 KV cache while maintaining its ability to retrieve the most relevant L2 KV activations.

3 Experiments

3.1 Dataset

We evaluate ACRE and all baseline models across 12 information-seeking tasks from three public long-context benchmarks: LongBench (Bai et al., 2024b), InfiniteBench (Zhang et al., 2024c), and UltraDomain (Qian et al., 2024b). These 12 datasets are categorized as follows: (1) Complex QA (Qian et al., 2024b): Financial, Legal, Physics, Biology, Math, and CS. These tasks involve practical, high-level queries with extra-long contexts spanning specialized domains. Many queries demand a global and in-depth understanding of the full context, making them especially challenging. (2) Single-Document QA: NarrativeQA (Kociský et al., 2018), Qasper (Dasigi et al., 2021), MultiFieldQA (Bai et al., 2024b), and En.QA (Zhang et al., 2024c). (3) Multi-Document QA: 2WikiMQA (Ho et al., 2020) and MuSiQue (Trivedi et al., 2022).
Table 1: Main experimental results. The best results are in bold, and the second-best are underlined. All methods use Qwen2.5-3B-Instruct as the underlying LLM. Baselines in the second block directly process the full context, while those in the third block divide the context into chunks and find evidence using a retriever. In the second row, ave($|\mathcal{C}|$) (k) means the average context length.
The average context length ave($|\mathcal{C}|$) in thousands of tokens is shown in parentheses after each dataset name; nar, fin, legal, phy, bio, en.qa, math, and CS exceed 16K, while qas, mul, 2wiki, and mus are below 16K.

| Method | nar (18.4) | fin (40.6) | legal (51.4) | phy (105.8) | bio (125.3) | en.qa (192.6) | math (197.9) | CS (215.9) | qas (3.6) | mul (4.6) | 2wiki (4.9) | mus (11.2) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Original | 22.0 | 36.8 | 42.6 | 38.2 | 35.8 | <u>20.1</u> | 36.3 | 35.6 | 37.4 | 48.5 | 36.3 | 22.1 |
| KIVI | 21.1 | 27.0 | 39.5 | 35.3 | 33.2 | 15.6 | 32.1 | 33.4 | 37.1 | 46.1 | 35.0 | 22.1 |
| Beacon | 20.2 | 37.8 | 43.9 | 37.1 | 33.7 | 18.3 | 31.8 | 32.3 | 30.4 | 35.6 | 24.7 | 24.7 |
| SelfExtend | 20.8 | 37.5 | 40.0 | 29.1 | 29.9 | 11.4 | 31.6 | 30.4 | 36.0 | 49.6 | 37.1 | 25.1 |
| StreamingLLM | 18.8 | 27.3 | 26.2 | 31.4 | 27.4 | 8.3 | 30.0 | 26.9 | 33.4 | 38.6 | 32.1 | 12.2 |
| MInference | 22.2 | 35.6 | 37.2 | 32.9 | 28.5 | 8.9 | 30.3 | 27.1 | 36.2 | 48.6 | 36.0 | 23.5 |
| RAG | 18.9 | 36.9 | 38.6 | 22.1 | 18.4 | 11.3 | 19.2 | 19.3 | 38.6 | 46.6 | 37.8 | 20.8 |
| RQRAG | 19.0 | 37.0 | 39.0 | 28.0 | 23.0 | 12.0 | 26.1 | 24.1 | 37.6 | 47.3 | 37.4 | 21.8 |
| MemoRAG | <u>24.0</u> | 41.5 | 44.8 | 36.9 | 33.2 | 13.2 | 33.1 | 33.4 | 34.1 | 49.1 | **38.0** | <u>26.0</u> |
| ACRE | **27.8** | **46.4** | **47.7** | **41.6** | **38.3** | **23.6** | **41.9** | **45.9** | **39.6** | **50.0** | 36.4 | **26.2** |

3.2 Baseline Models

We compare ACRE with the following baselines: Original: Directly fits the maximum context length of the underlying LLMs. KIVI (Liu et al., 2024): Quantizes KV activations into 4-bit precision. Beacon (Zhang et al., 2024a): Compresses the full KV activations into beacon activations. SelfExtend (Jin et al., 2024): Applies hierarchical positional encoding to extend the model’s context window. MInference (Jiang et al., 2024): Dynamically applies different sparse attention mechanisms across all attention heads. StreamingLLM (Xiao et al., 2024): Attends only to recent tokens and sink tokens. RAG: Uses standard RAG pipelines to retrieve relevant evidence from the full context. RQRAG (Chan et al., 2024): Rewrites the input query into sub-queries and retrieves evidence for each sub-query. MemoRAG (Qian et al., 2024b): Applies a memory model to form a compact global memory over the full context, providing answer clues that assist the retrieval process for better evidence retrieval.
In the main experiments (Section 3.3), we use Qwen2.5-3B-Instruct as the underlying model. To analyze the impact of using different underlying models, we also experiment with Llama3.2-3B-Instruct and Qwen2.5-7B-Instruct in Section 3.4. All three LLMs have a native context window of 128K (Yang et al., 2024; MetaAI, 2024). The implementation details of ACRE and all baselines are in Appendix A.

3.3 Main Results

In Table 1, we present the results of the main experiments, demonstrating that ACRE outperforms all baselines across most datasets. These results highlight the effectiveness of ACRE's design. Specifically, we derive the following findings: (1) ACRE consistently outperforms the baseline approach of feeding the full context directly into LLMs. This improvement stems not only from ACRE's ability to process contexts exceeding the native LLM's context window but also from its precise focus on query-relevant local information, effectively filtering out irrelevant details through query-guided activation refilling. (2) Baselines in the second block generally perform worse than directly feeding the full context into LLMs. This is attributed to semantic loss caused by compressing full KV activations. In contrast, ACRE leverages its bi-layer KV cache and query-guided activation refilling to recover local detailed semantics from the L2 cache that are absent in the L1 cache, resulting in superior performance. (3) Baselines in the third block use retrieval tools to extract precise evidence from long contexts. While effective for queries with clear information needs, these methods struggle with complex queries that require a higher-level understanding of the full context. ACRE overcomes this limitation by utilizing the global information in the L1 cache and dynamically refilling it with query-relevant local details from the L2 cache, thereby adapting to the varying information needs of different queries.

Figure 3: Ablation Study on Model Design Variations Across Different LLMs.

Figure 4: Analysis of the maximum refilling length $\eta$ (left) and the impact of the L1/L2 interval $l$ (right).

3.4 Ablation Study

To thoroughly validate the effectiveness of our method design, we perform detailed ablation studies as follows:

(1) Method Design and Model Selection: Figure 3 presents ablation results across different LLMs and variations in model design. First, we evaluate the role of training stages in model performance. Without the two-stage training process, ACRE reverts to a vanilla LLM, which performs significantly worse than ACRE. Stage-1 training enables ACRE to construct the bi-layer KV activation cache, thereby improving its long-context processing capabilities. When both stages are applied, ACRE achieves the best performance, demonstrating the effectiveness of its optimization design.
Second, to determine if ACRE's effectiveness stems from its training data, we fine-tune a vanilla model using ACRE's training data via SFT, producing SFT Vanilla. While SFT improves the vanilla model by enhancing its QA capabilities, it still underperforms compared to ACRE. This highlights the unique advantages of ACRE's design.
Lastly, we replace ACRE’s underlying LLM with Qwen2.5-7B (a scaled-up version of the same model) and Llama3.2-3B (a model of similar scale but different architecture). As shown in Figure 3, ACRE’s design consistently proves effective across models of varying scales and architectures, confirming its generalizability.

(2) Impact of Parameter Choice: As described in Section 2, ACRE's performance may be influenced by two hyperparameters: the maximum refilling length of KV activations $\eta$ and the L1/L2 interval $l$. To investigate their impact, we conduct experiments with different values of $\eta$ and $l$. Figure 4 presents the results of this analysis.
Specifically, in the left figure, we observe that the impact of the refilled activation length varies by task. For tasks with queries requiring explicit information (e.g., nar and en.qa), answer decoding relies on precise local information. Here, ACRE’s performance peaks at a reasonable refilled length but declines as excessive refilling introduces noise, which biases the decoding process. Conversely, for tasks with queries requiring the integration of global information, ACRE’s performance consistently improves with longer refilled lengths. This is because the L1 cache already provides global information, and additional refilled activations enhance local context.
The right figure shows the impact of the L1/L2 interval. We find that ACRE’s performance generally decreases as the L1/L2 interval increases. Larger intervals require L1 tokens to summarize more semantics from subsequent L2 tokens, potentially overloading the L1 cache. However, larger intervals result in a compact L1 KV cache, offering efficiency. In practical applications, users can adjust parameters to balance efficiency and effectiveness based on available resources.
In summary, ACRE outperforms directly using vanilla LLMs in most parameter settings, requiring significantly fewer computational resources while achieving higher efficiency.

3.5 Efficiency Analysis

To evaluate ACRE’s efficiency compared to baselines in processing long contexts at different scales, we conduct comparative experiments using the vanilla LLM, the efficient attention method MInference, and ACRE.
The results, presented in Table 2, lead to the following conclusions: (1) ACRE consistently processes long contexts at different scales with comparable or lower GPU resource usage. This efficiency is attributed to the bi-layer KV activation design, which avoids directly processing the full KV activations. (2) ACRE's efficiency advantage becomes more pronounced with extremely long contexts (e.g., over 512K), where the vanilla LLM runs out of memory and MInference faces a high risk of running out of memory while requiring longer latency than ACRE. (3) Thanks to its query-guided activation refilling mechanism, ACRE utilizes only the compact L1 KV activations and query-relevant L2 KV activations for answer decoding. This enables ACRE to process contexts longer than the native window of the LLM while maintaining answer quality. In contrast, baseline models generate nonsensical answers when exceeding the LLM's native context length.

Table 2: Efficiency comparison of Vanilla LLM, MInference, and ACRE. Peak GPU memory (mem, GiB), time latency (lat, seconds/query), and answer readability (rdbl) are evaluated using 20 samples with contexts over 1024K, truncated to target lengths, and a max generation length of 100 tokens. Tests are conducted on a single NVIDIA A800 80G GPU. Average scores are reported, with the best in each block highlighted in bold. Each cell shows mem / lat / rdbl.

| Method | 64K | 128K | 256K | 512K | 1024K |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Qwen2.5-3B-Instruct-128K | | | | | |
| Vanilla | 18.5 / 12.1 / ✓ | 27.9 / 36.3 / ✓ | 49.1 / 103.2 / ✗ | OOM / - / ✗ | OOM / - / ✗ |
| MInfer. | **15.5** / 29.2 / ✓ | **22.0** / 33.6 / ✓ | 28.0 / 57.1 / ✗ | **39.1** / 58.9 / ✗ | 47.2 / 79.6 / ✗ |
| ACRE | 20.8 / **8.4** / ✓ | 23.0 / **14.3** / ✓ | **27.6** / **28.1** / ✓ | 44.3 / **48.2** / ✓ | **46.8** / **53.6** / ✓ |
| Qwen2.5-7B-Instruct-128K | | | | | |
| Vanilla | 31.9 / 21.2 / ✓ | 46.1 / 45.3 / ✓ | 78.3 / 129.6 / ✗ | OOM / - / ✗ | OOM / - / ✗ |
| MInfer. | **27.9** / 29.1 / ✓ | **34.3** / 35.6 / ✓ | 48.1 / 81.2 / ✗ | 74.2 / 132.7 / ✗ | OOM / - / ✗ |
| ACRE | 31.3 / **10.5** / ✓ | 35.1 / **18.0** / ✓ | **43.0** / **37.1** / ✓ | **72.1** / **85.6** / ✓ | **75.6** / **90.4** / ✓ |
In summary, ACRE demonstrates significant advantages in handling long contexts efficiently and reliably compared to baseline methods.
4 Related Work

Long-context processing is a critical capability of LLMs (Zhao et al., 2024). The most fundamental approach to enhancing this ability is training LLMs on long texts, either sampled from raw corpora or synthesized (Xiong et al., 2024; Mohtashami and Jaggi, 2024; Fu et al., 2024; Bai et al., 2024a). Consequently, the native context window of popular LLMs has increased significantly, from the earlier 4K to the current 128K (Peng et al., 2023; Touvron et al., 2023; Yang et al., 2024).
In addition to directly increasing the context window, some methods employ strategic positional encoding to enable LLMs to process contexts longer than their native window, as demonstrated by (Chen et al., 2023b; Song et al., 2023; Liu et al., 2023; Jin et al., 2024). However, when processing long contexts, LLMs generate large key-value (KV) activations, which consume substantial resources and reduce efficiency. To address this, many works aim to make KV activations more compact and lightweight (Liu et al., 2024; Xu et al., 2024). For example, KIVI reduces the precision of KV activations to 2-bit, resulting in significantly lighter KV representations (Liu et al., 2024). Other methods selectively attend to a small portion of the KV activations through compression or sparse attention mechanisms. For instance, StreamingLLM proposes attending only to recent tokens and sink tokens to maintain compact KV activations (Xiao et al., 2024), a similar idea also adopted by (Li et al., 2024; Zhang et al., 2023; Jiang et al., 2024; Zhang et al., 2024a). Beyond optimizing KV activations, alternative methods such as agent-based approaches (Qian et al., 2024a; Lee et al., 2024) and retrieval-augmented generation (Xu et al., 2023; Zhu et al., 2024) have been applied to facilitate long-context processing. These methods split the long context into chunks and retrieve evidence using retrievers or agents. They work well for explicit queries but struggle with implicit ones that require full-context aggregation.
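To make the sink-plus-recent-window idea concrete, the following minimal sketch prunes a cached list of per-layer key/value tensors down to a few sink positions plus the most recent window. It is not tied to any particular library's implementation; the function name and tensor layout are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of the sink-plus-recent-window KV
# pruning idea behind StreamingLLM-style methods: keep only the first few
# "sink" positions and the most recent ones, discarding everything in between.
from typing import List, Tuple
import torch

def prune_kv_sink_window(
    past_key_values: List[Tuple[torch.Tensor, torch.Tensor]],
    num_sink: int = 8,
    window: int = 4096,
) -> List[Tuple[torch.Tensor, torch.Tensor]]:
    """Each layer stores (key, value) of shape [batch, heads, seq_len, head_dim]."""
    pruned = []
    for key, value in past_key_values:
        seq_len = key.shape[2]
        if seq_len <= num_sink + window:
            pruned.append((key, value))
            continue
        keep = torch.cat(
            [torch.arange(num_sink), torch.arange(seq_len - window, seq_len)]
        )
        pruned.append((key[:, :, keep, :], value[:, :, keep, :]))
    return pruned

# Example: a toy 2-layer cache with 10k cached positions is reduced to 8 + 4096.
toy_cache = [(torch.randn(1, 4, 10_000, 64), torch.randn(1, 4, 10_000, 64)) for _ in range(2)]
print(prune_kv_sink_window(toy_cache)[0][0].shape)  # torch.Size([1, 4, 4104, 64])
```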
Most existing methods either compact global KV activations into a lightweight form or prune them into shorter forms, often failing to balance a global perspective with local informativeness. This limitation can compromise performance in information-seeking scenarios, where information needs may dynamically range from global to local.

5 Conclusion

In this paper, we propose a method, ACRE, designed to adapt to the dynamic information needs of long-context information-seeking tasks. ACRE constructs a bi-layer KV activation cache structure for long contexts, where the L1 KV cache stores compact, global information, and the L2 KV cache captures detailed, local information. Using query-guided activation refilling, ACRE identifies query-specific evidence from the L2 KV cache and refills this local information into the L1 KV cache, resulting in nested KV activations that effectively combine a global perspective with local details. Through experiments on a wide range of information-seeking datasets, we demonstrate the effectiveness of ACRE in simultaneously improving the performance and efficiency of long-context processing for information-seeking tasks.
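As an illustration of the refilling step described above, the sketch below scores L2 segments through their L1 proxy keys and splices the top-scoring L2 entries back for decoding. It is a simplified, single-head toy version with hypothetical shapes, not the paper's actual implementation.

```python
# A minimal, illustrative sketch (ours, not the paper's implementation) of
# query-guided refilling: score L2 segments through their L1 proxy keys,
# then gather the corresponding L2 entries up to a refilling budget.
import torch

def refill(query, l1_keys, l2_keys, l2_values, interval=16, budget=4096):
    # query: [dim]; l1_keys: [n_seg, dim]; l2_keys / l2_values: [n_seg * interval, dim]
    scores = l1_keys @ query                       # one score per L1 proxy / segment
    n_keep = max(1, budget // interval)
    top_segments = scores.topk(min(n_keep, l1_keys.shape[0])).indices
    token_idx = (top_segments[:, None] * interval
                 + torch.arange(interval)).flatten().sort().values
    return l2_keys[token_idx], l2_values[token_idx]

# Toy example: 512 segments of 16 tokens each, refilled up to a 4,096-token budget.
q = torch.randn(64)
k1 = torch.randn(512, 64)
k2, v2 = torch.randn(512 * 16, 64), torch.randn(512 * 16, 64)
rk, rv = refill(q, k1, k2, v2)
print(rk.shape)  # torch.Size([4096, 64])
```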

Limitations

In this paper, we propose ACRE, a method designed to adapt to the dynamic information needs of long-context information-seeking tasks. ACRE constructs a bi-layer KV activation cache to balance global context perception and local detail preservation, leveraging query-guided activation refilling to enhance performance and efficiency. While ACRE demonstrates significant advancements, several limitations are worth noting:

(1) Our method is primarily designed for information-seeking tasks, a major subset of long-context processing. This focus is largely driven by the availability of training data, as information-seeking tasks benefit from abundant QA datasets. While ACRE has the potential to adapt to general long-context tasks, further exploration with diverse task-specific data would be necessary to validate its broader applicability.

(2) ACRE introduces additional parameters for constructing the bi-layer KV cache, increasing the model size. For example, with Qwen2.5-3B-Instruct, ACRE adds approximately 17.2% more parameters, requiring additional GPU memory to load the model. However, in long-context tasks, the majority of GPU memory is consumed by KV activations rather than model parameters. Our efficiency analysis confirms that ACRE reduces overall GPU memory consumption when processing long contexts, mitigating this limitation to some extent.

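A rough back-of-the-envelope calculation illustrates this point. The snippet below assumes an fp16 GQA layout roughly in line with a Qwen2.5-3B-class model (36 layers, 2 KV heads, head dimension 128); these configuration values are assumptions for illustration, not figures taken from the paper.

```python
# Back-of-the-envelope check (our own arithmetic, with an assumed GQA layout
# roughly like Qwen2.5-3B's) of why KV activations, not the ~503M extra
# parameters, dominate GPU memory at long context lengths.
layers, kv_heads, head_dim = 36, 2, 128      # assumed model configuration
bytes_per_value = 2                          # fp16
context_tokens = 128 * 1024

kv_per_token = layers * 2 * kv_heads * head_dim * bytes_per_value  # K and V
kv_total_gb = kv_per_token * context_tokens / 1024**3
extra_param_gb = 503e6 * bytes_per_value / 1024**3

print(f"KV cache at 128K tokens: ~{kv_total_gb:.1f} GB")    # ~4.5 GB
print(f"Extra ACRE parameters:   ~{extra_param_gb:.1f} GB") # ~0.9 GB
```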
(3) A portion of our training data is synthetically generated by commercial LLMs (e.g., GPT-4), which may introduce biases inherited from the original corpus or the LLMs used. While such biases could impact performance, many current commercial LLMs incorporate robust safeguards that help mitigate these issues. Nonetheless, addressing potential biases in synthetic data remains an area for future improvement.

References

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024a. Longalign: A recipe for long context alignment of large language models. arXiv preprint arXiv:2401.18058.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024b. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 3119-3137. Association for Computational Linguistics.
Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: learning to refine queries for retrieval augmented generation. CoRR, abs/2404.00610.
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2023a. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. Preprint, arXiv:2309.07597.
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023b. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024. Longlora: Efficient fine-tuning of long-context large language models. Preprint, arXiv:2309.12307.
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599-4610.
Zican Dong, Tianyi Tang, Lunyi Li, and Wayne Xin Zhao. 2023. A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502.
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. 2024. Data engineering for scaling language models to 128k context. Preprint, arXiv:2402.10171.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey. Preprint, arXiv:2312.10997.
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multihop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609-6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490.
Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. 2024. Llm maybe longlm: Self-extend llm context window without tuning. Preprint, arXiv:2401.01325.
Tomás Kociský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. Trans. Assoc. Comput. Linguistics, 6:317-328.
Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John F. Canny, and Ian Fischer. 2024. A human-inspired reading agent with gist memory of very long contexts. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, and Dahua Lin. 2023. Scaling laws of rope-based extrapolation. In The Twelfth International Conference on Learning Representations.
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. In Forty-first International Conference on Machine Learning.
MetaAI. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
Amirkeivan Mohtashami and Martin Jaggi. 2024. Random-access infinite context length for transformers. Advances in Neural Information Processing Systems, 36.
OpenAI. 2023. Gpt-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations.
Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Yujia Zhou, Xu Chen, and Zhicheng Dou. 2024a. Are long-llms a necessity for long-context tasks? Preprint, arXiv:2405.15318.
Hongjin Qian, Peitian Zhang, Zheng Liu, Kelong Mao, and Zhicheng Dou. 2024b. Memorag: Moving towards next-gen rag via memory-inspired knowledge discovery. Preprint, arXiv:2409.05591.
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. 2023. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, and Jinwoo Shin. 2023. Hierarchical context merging: Better long context understanding for pre-trained llms. In The Twelfth International Conference on Learning Representations.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539-554.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention is all you need. Preprint, arXiv:1706.03762.
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. Preprint, arXiv:2309.17453.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. 2024. Effective long-context scaling of foundation models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 4643-4663. Association for Computational Linguistics.
Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Retrieval meets long context large language models. arXiv preprint.
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, and Doyen Sahoo. 2024. Think: Thinner key cache by query-driven pruning. arXiv preprint arXiv:2407.21018.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, and Jaewoo Kang. 2024. Compact: Compressing retrieved documents actively for question answering. Preprint, arXiv:2407.09014.
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. 2024a. Soaring from 4k to 400k: Extending llm's context with activation beacon. arXiv preprint arXiv:2401.03462.
Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, and Zhicheng Dou. 2024b. Extending llama-3’s context ten-fold overnight. Preprint, arXiv:2404.19553.
Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. 2024c. ∞Bench: Extending long context evaluation beyond 100k tokens. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15262-15277. Association for Computational Linguistics.
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Preprint, arXiv:2306.14048.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2024. A survey of large language models. Preprint, arXiv:2303.18223.
Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zhicheng Dou, and Ji-Rong Wen. 2024. Large language models for information retrieval: A survey. Preprint, arXiv:2308.07107.

A Implementation details

For ACRE training, in stage 1, we sample long text spans from the RedPajama (Soboleva et al., 2023) dataset to create a training set of 2 billion tokens. The sampled text lengths are limited to a minimum of 4K and a maximum of 64K tokens. We randomly choose the L1/L2 interval from $l \in \{8, 16, 32, 64, 128\}$. The model is trained for one epoch with a batch size of 8 and a learning rate of $5 \times 10^{-5}$. In stage 2, we collect 28,400 QA SFT data points from LongAlpaca (Chen et al., 2024) and synthetic data from (Zhang et al., 2024a; Qian et al., 2024b). We apply the same L1 token insertion strategy during training. The model is trained for three epochs with a batch size of 8 and a learning rate of $1 \times 10^{-5}$ for two epochs. Stage-1 training takes around 7 hours, while stage-2 training takes around 13 hours.
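The stage-1 sampling procedure described above can be summarized as follows; `corpus_iter` and `tokenize` are hypothetical placeholders for the actual data pipeline.

```python
# Sketch of stage-1 data sampling (illustrative; `corpus_iter` and `tokenize`
# stand in for the real data pipeline): keep spans between 4K and 64K tokens
# and pair each one with an L1/L2 interval drawn from {8, 16, 32, 64, 128}.
import random

INTERVALS = [8, 16, 32, 64, 128]
MIN_LEN, MAX_LEN = 4 * 1024, 64 * 1024

def make_stage1_examples(corpus_iter, tokenize):
    for text in corpus_iter:
        tokens = tokenize(text)
        if not (MIN_LEN <= len(tokens) <= MAX_LEN):
            continue
        yield {"input_ids": tokens, "l1_l2_interval": random.choice(INTERVALS)}
```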
During the two-stage training process, we optimize only the newly initialized parameters, keeping the original parameters frozen. The number of trainable parameters varies depending on the model. For instance: (1) with Qwen2.5-3B-Instruct, ACRE has around 503M trainable parameters, accounting for 17.2% of the original parameters; (2) with Llama3.2-3B-Instruct, ACRE has around 780M trainable parameters, accounting for 25.6% of the original parameters. This difference arises from variations in the implementation of multi-head attention.
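The freeze-and-count pattern this describes looks roughly like the sketch below. The added `nn.Linear` modules are purely illustrative stand-ins for ACRE's new parameters, so the printed percentage will not match the 17.2% reported above.

```python
# A generic sketch of the freeze-and-count pattern described above: the base
# model's weights are frozen and only newly added modules (here a placeholder
# nn.Linear per layer, purely illustrative) would receive gradients.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", torch_dtype="auto")
for p in model.parameters():          # freeze every original parameter
    p.requires_grad = False

# Placeholder stand-in for ACRE's newly initialized parameters (hypothetical).
new_modules = nn.ModuleList(
    nn.Linear(model.config.hidden_size, model.config.hidden_size, bias=False)
    for _ in range(model.config.num_hidden_layers)
)

trainable = sum(p.numel() for p in new_modules.parameters())
frozen = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.0f}M ({100 * trainable / frozen:.1f}% of base)")
```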

Prompt for Bi-Layer KV Cache Construction

You are provided with a long article. Read the article carefully.

After reading, you will be asked to perform specific tasks based on the content of the article.

Now, the article begins:

Article Content: [context]

The article ends here.

Next, follow the instructions provided to complete the tasks.
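For reference, the template above can be filled programmatically as in the following sketch; the constant and function names are ours, and `{context}` stands in for the [context] placeholder.

```python
# Filling the bi-layer KV cache construction prompt shown above. The template
# string mirrors the prompt box verbatim; the names here are illustrative.
BI_LAYER_PROMPT = (
    "You are provided with a long article. Read the article carefully.\n\n"
    "After reading, you will be asked to perform specific tasks based on the "
    "content of the article.\n\n"
    "Now, the article begins:\n\n"
    "Article Content: {context}\n\n"
    "The article ends here.\n\n"
    "Next, follow the instructions provided to complete the tasks."
)

def build_cache_prompt(context: str) -> str:
    return BI_LAYER_PROMPT.format(context=context)

print(build_cache_prompt("(the long document goes here)")[:80])
```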
For the main experiments, we configure ACRE with an L1/L2 interval $l$ of 16, a maximum refilling length $\eta$ of 4,096, and a maximum working context window $\mathcal{W}$ of 32K tokens. For the Bi-Layer KV Cache construction, we utilize the prompt shown above. During the Query-Guided Activation Refilling process, we adopt task-specific prompts from the official benchmark repositories, without inserting the context into the task prompt.
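For convenience, these settings can be gathered into a single (hypothetical) configuration object:

```python
# The main-experiment settings above, collected into a hypothetical config
# object for reference; field names are ours.
from dataclasses import dataclass

@dataclass
class ACREInferenceConfig:
    l1_l2_interval: int = 16         # l: one L1 token proxies every 16 L2 tokens
    max_refill_tokens: int = 4096    # eta: budget for refilled L2 activations
    working_window: int = 32 * 1024  # W: maximum working context window

print(ACREInferenceConfig())
```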
For RAG, RQ-RAG, and MemoRAG, we employ BGE-M3 (Chen et al., 2023a) as the retriever and set the hit number to 5. For methods that divide the long context into chunks, we use the semantic-text-splitter tool, chunking the context to a maximum length of 512 tokens.
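A minimal sketch of this retrieval pipeline is given below. The whitespace chunker and the sentence-transformers loading of BAAI/bge-m3 are stand-ins for the semantic-text-splitter tool and the exact BGE-M3 retriever interface used in the experiments.

```python
# A sketch of the retrieval baseline setup: the context is split into <=512-token
# chunks and the top-5 chunks by dense similarity are retrieved. The naive
# whitespace chunker and sentence-transformers are stand-ins, not the exact tools.
from sentence_transformers import SentenceTransformer
import numpy as np

def naive_chunks(text: str, max_tokens: int = 512):
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def retrieve(query: str, context: str, hits: int = 5):
    model = SentenceTransformer("BAAI/bge-m3")
    chunks = naive_chunks(context)
    emb = model.encode([query] + chunks, normalize_embeddings=True)
    scores = emb[1:] @ emb[0]                # cosine similarity to the query
    top = np.argsort(-scores)[:hits]
    return [chunks[i] for i in top]
```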
For KIVI, we quantize the KV activations to 4-bit precision. For Beacon, we use the official training code to fine-tune Qwen2.5-3B-Instruct, setting the compression ratio to 8 during inference. For SelfExtend, we set the group size to 32 and the window size to 2048, approximating the officially recommended strategy. For StreamingLLM, we use the SinkCache implementation from Transformers, configuring the window size to 4096 and the number of sink tokens to 8. Lastly, for MemoRAG, we utilize the officially released memorag-qwen2-7b-inst as the memory model.
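The StreamingLLM baseline configuration maps onto Transformers' SinkCache roughly as follows; exact availability and signatures depend on the Transformers version, so treat this as a sketch rather than a pinned recipe.

```python
# How the StreamingLLM baseline configuration above maps onto Transformers'
# SinkCache (availability and exact signature depend on the Transformers
# version; this is a sketch, not a pinned recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

name = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

inputs = tok("Summarize the article: ...", return_tensors="pt").to(model.device)
cache = SinkCache(window_length=4096, num_sink_tokens=8)   # settings used above
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```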
All methods are evaluated using the task prompts provided in the official repositories of their corresponding benchmarks¹. Additionally, we use the same (task-dependent) generation hyper-parameters for ACRE and all baseline models.
All training and evaluation experiments were conducted using 8 NVIDIA A800-80G GPUs.

* Corresponding author.
¹ LongBench: https://github.com/THUDM/LongBench, InfiniteBench: https://github.com/OpenBMB/