Core Context Aware Transformers for Long Context Language Modeling
Abstract
Transformer-based Large Language Models (LLMs) have exhibited remarkable success in extensive tasks primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute attention.
However, when the context length becomes very large (e.g., 128K), the amount of potentially redundant information in the context tends to increase.
The redundant context not only hampers the modeling representation performance but also incurs unnecessary computational and storage overhead.
In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling, comprising two complementary modules:
1) Globality-aware pooling module groups input tokens and dynamically compresses each group into one core token based on their significance.
In this way, our method automatically focuses and strengthens core context while diminishing redundancy during the learning process, leading to effective long-term dependency modeling.
2) Locality-preserving module incorporates neighboring tokens to preserve local context for detailed representation.
Notably, our CCA-Attention is able to replace the self-attention module in existing LLMs with minimal fine-tuning cost.
Extensive experimental results show the superiority of our method in both long-context modeling and computational efficiency over state-of-the-art methods.
1 Introduction
Large language models (LLMs) (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023a; Liu et al., 2024a) have demonstrated exceptional proficiency across various applications by effectively modeling extended contexts, particularly in tasks involving natural language understanding and generation (Ouyang et al., 2022; Chang et al., 2024).
The remarkable success of Transformer-based LLMs is predominantly credited to the self-attention mechanism (Vaswani et al., 2017), which requires each token to incorporate all preceding tokens as its context for attention calculation. In this mechanism, the context of a token refers to the sequence of tokens that precede it. By leveraging self-attention, LLMs are able to capture long-range dependencies and generate coherent and contextually relevant outputs.
The ability to process longer contexts has been a key factor in improving the performance of LLMs across a wide range of tasks, particularly beneficial for tasks requiring document-level understanding, such as summarization of extended texts, and multi-turn dialogue (Kitaev et al., 2019; Wei et al., 2022; Jiang et al., 2024).
More importantly, recent advancements in LLMs, such as OpenAI-o1 (OpenAI, 2023) and DeepSeek-R1 (Guo et al., 2025), have demonstrated that extended contexts significantly enhance reasoning capabilities, enabling models to solve intricate problems that require multi-step inference and contextual understanding.
This trend underscores the importance of extending the context length in LLMs to enhance modeling capabilities.
However, as the context length scales to extremely large magnitudes (e.g., 128K), it becomes impractical for a token to maintain significant semantic connections with all tokens within such an extensive context.
From this perspective, it is natural to consider that not all parts of the context contribute equally to a token’s representation.
Instead, the context can be viewed as comprising two primary aspects: core context, which captures essential semantic connections, and redundant context, which contains less critical or repetitive information.
The redundant context may hamper LLMs from capturing dependencies among crucial tokens, degrading representation performance.
In self-attention, this redundancy manifests as a highly sparse distribution of attention scores, with a substantial proportion disproportionately assigned to a limited number of tokens.
Such sparsity in attention score distribution has been observed across different attention heads in most layers of LLMs, as shown in recent studies (Beltagy et al., 2020; Zaheer et al., 2020; Xiao et al., 2024b).
This sparsity in attention scores introduces unnecessary computational and storage overhead, especially for extremely long contexts.
To address the above issues, numerous studies have been proposed to eliminate the redundancy and enhance the computational efficiency of attention.
StreamingLLM (Xiao et al., 2024b) and LM-Infinite (Han et al., 2023) simply maintain the attention over only the initial and last several tokens, ignoring the attention connection among remaining tokens.
Besides, MInference (Jiang et al., 2024) introduces an efficient mixed attention mechanism comprising A-shape, vertical-slash, and block-sparse attentions, with the mixed attention patterns determined offline based on some samples.
These methodologies typically involve computing only a portion of the attention to approximate full attention, thus compromising the connections among different tokens.
In question-answering tasks, the crucial information can be located across any position in the input tokens.
Consequently, it is crucial for the model to have the capability to leverage information from any position within the input text (Liu et al., 2024b).
In this sense, these methods with fixed sparsity patterns may lead to incomplete comprehension of the long context.
Therefore, how to ensure the information exchange among tokens in the attention while reducing the context redundancy is still an open question.
In this paper, we propose an efficient Core Context Aware (CCA) Attention mechanism, which is designed to efficiently capture both global and local dependencies within long contexts.
Specifically, our CCA-Attention consists of two complementary components:
1) Globality-aware pooling module first partitions the input tokens into groups and derives core tokens by compressing the input tokens in each group based on their significance.
We perform attention on these core tokens instead of original input tokens to efficiently extract long-term contextual information.
These number-reduced core tokens are more compact representations than the original ones, which enables our attention method to automatically focus on the core context.
In this way, our method is able to
eliminate the context redundancy and reduce unnecessary computational overhead.
However, the globality-aware pooling module is only able to capture long-range and coarse-grained information.
2) To address this issue, we propose a Locality-preserving module that captures the local and fine-grained context by focusing on neighboring tokens, serving as a complement for the globality-aware pooling module.
By fusing the insights from both these two modules, our method not only excels in long context modeling but also achieves this with a significant reduction in computational costs and storage demands.
Our contributions are as follows:
- We propose a plug-and-play Core Context Aware Attention for efficient long-context modeling. Our CCA-Attention reduces the computational complexity to linear by taking a set of core tokens as efficient proxies for attention. Unlike traditional efficient attention methods that require extensive retraining, our CCA-Attention can be easily integrated into pretrained LLMs with minimal fine-tuning effort.
- We develop a dynamic globality-aware pooling module that adaptively derives core tokens based on their importance. By compressing input tokens into core tokens, our method captures essential information more effectively than static or random token selection approaches. Our strategy focuses on the most relevant global context, leading to more accurate and effective long-term dependency modeling.
- We achieve significant improvements over other baseline methods in both long-context modeling performance and computational efficiency. Our experimental results show that CCA-Attention not only outperforms existing efficient attention mechanisms in long-context modeling but also achieves a 7.9× speedup compared to full self-attention when processing 128K-token contexts, demonstrating substantial efficiency gains with comparable accuracy.
Figure 1: Illustration of core context and redundant context. We show the attention scores of the last token w.r.t. the other tokens in LLaMA2-7B (darker shading indicates higher attention scores). The last token exhibits high attention scores over the core context. The remaining tokens are considered redundant context, introducing unnecessary computational overhead for attention.
2 Related Work
Efficient Attention. Self-attention is a fundamental module in Transformer-based Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023a). It captures the global relationships among all tokens throughout the input sequence. However, the computational complexity of self-attention increases quadratically with the sequence length, thereby limiting the application of LLMs to long documents.
Various works have sought to mitigate this complexity through approaches such as sparse attention (Beltagy et al., 2020; Zaheer et al., 2020; Ding et al., 2023) and linear attention approximations (Choromanski et al., 2020; Katharopoulos et al., 2020; Sun et al., 2023).
Specifically, Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) employ sparse attention mechanisms to handle long sequences by utilizing strided attention patterns, where attention is only paid at fixed intervals. Linear Transformer (Katharopoulos et al., 2020) and RetNet (Sun et al., 2023) reformulate self-attention as a linear dot-product of kernel feature maps and leverage the associativity property of matrix products to achieve linear complexity.
Recently, LongLoRA (Chen et al., 2024) designs a shifted sparse attention mechanism that computes attention among grouped input tokens. To facilitate communication between groups, this approach shifts the group partition by half the group size.
StreamingLLM (Xiao et al., 2024b) and LM-Infinite (Han et al., 2023) prioritize attention on the initial and final tokens, effectively disregarding the intermediate tokens.
InfLLM (Xiao et al., 2024a) employs a sliding window attention mechanism and a block-level context memory to selectively attend to relevant context information, avoiding noise and reducing computational costs.
MInference (Jiang et al., 2024) accelerates long-context LLM inference by applying three distinct sparse attention patterns
with optimized GPU kernels.
However, these methods can not ensure that each token has access to all preceding tokens, leading to inferior performance in tasks requiring comprehensive long-context understanding.
Instead, we propose a globality-aware pooling module in which each token can communicate with preceding tokens via a reduced number of core tokens.
Long-context Large Language Models (LLMs). LLMs are often pretrained with a relatively small and predefined context length due to computational cost constraints, such as 4K for LLaMA2 (Touvron et al., 2023b).
This limitation restricts the applicability of LLMs to tasks with long documents.
Recently, several attempts have been made to extend the context length of LLMs through continuous training.
Position Interpolation (Chen et al., 2023) addresses this by linearly down-scaling the input position indices to fit within the original context window size, thereby extending the context length of RoPE-based LLMs.
Furthermore, YaRN (Peng et al., 2024) enhances performance by combining interpolation techniques with dynamic scaling. Beyond modifications to position embeddings, other efforts focus on designing more efficient attention mechanisms (Chen et al., 2024; Dao et al., 2022; Dao, 2024) for context window extension.
Our method is orthogonal to position embedding methods.
During inference, our approach accelerates the forward propagation process, which cannot be achieved through position embedding modifications alone.
Some context compression works attempt to achieve long context modeling by either compressing features via auxiliary networks (Rae et al., 2020) or compressing context with extra specific tokens (Mu et al., 2023; Qin & Van Durme, 2023; Mohtashami & Jaggi, 2023; Zhang et al., 2024; Qin et al., 2024). In contrast, our method dynamically identifies and enhances task-relevant core-context while suppressing redundant information.
Unlike sequence-length-oriented compression techniques, our core-context-aware mechanism optimizes redundancy directly within self-attention computation, enabling more effective long-context modeling.
3 Core Context Aware Attention
Figure 2: Illustration of CCA-Attention, which comprises two components: 1) the globality-aware pooling module encapsulates the input tokens X into core tokens C according to their importance (Eqn. (2)); the core tokens C serve as representative proxies of X for attention, thereby reducing the computational cost; 2) the locality-preserving module incorporates the local context of neighboring tokens as a complement to the globality-aware pooling module. We fuse the two modules according to Eqn. (5) to produce the final output O.
3.1 Motivation and Method Overview
Most existing attention-based large language models (LLMs), such as GPT (Brown et al., 2020; OpenAI, 2023) and LLaMA (Touvron et al., 2023a), employ the next-token prediction (Vaswani et al., 2017) paradigm to generate text. Given a sequence of tokens $X=[x_1,\ldots,x_t]$, where each token $x_i \in \mathbb{R}^{d}$, the model predicts the next token $x_{t+1}$ by conditioning on all preceding tokens as its context. Specifically, the model generates the next token with the highest probability as:
$$x_{t+1} = \arg\max_{x}\; P\left(x \mid x_1, \ldots, x_t\right) \qquad (1)$$
However, as the context length grows, the context inevitably exhibits redundant information.
This redundancy stems from the inherent nature of natural language: not all contextual information is equally important for the representation of the target token.
Such redundant context (e.g., the lightly shaded tokens w.r.t. the last token in Figure 1) has weak semantic relevance to the target token, while introducing significant computational overhead.
In contrast, the core context (e.g., the darkly shaded tokens w.r.t. the last token in Figure 1) refers to the contextual information that is highly relevant to the target token. This information is crucial for the token's representation. Therefore, in long-context modeling, the model should prioritize the core context over the redundant parts.
To this end, most existing methods (Beltagy et al., 2020; Ding et al., 2023; Chen et al., 2024) employ sparse attention with predefined and fixed patterns. Unfortunately, they often overlook the importance of maintaining comprehensive information exchange among tokens, which may hinder the performance of long-context modeling tasks.
In this paper, we seek to reduce the context redundancy associated with full self-attention.
To achieve this, we propose a Core Context Aware Attention (CCA-Attention), which employs globality-aware pooling and locality-preserving modules to capture both global and local dependencies within a long context.
As shown in Figure 2, the globality-aware pooling module operates by generating representative core tokens from segmented groups of the input sequence. It then computes attention using these reduced-number core tokens, thereby reducing the context redundancy and computational cost (see Section 3.2).
However, the globality-aware pooling module mainly focuses on long-range and coarse-grained information and overlooks local context.
To address this limitation, the locality-preserving module is responsible for capturing the local information of the neighborhood tokens to ensure comprehensive coverage (see Section 3.3).
Furthermore, we devise a differentiable fusion strategy to combine the insights from global and local modules (see Section 3.4).
This is crucial as it retains the comprehensive understanding ability of the full self-attention within our CCA-Attention.
The pseudo-code for our proposed CCA-Attention is presented in Algorithm 1.
3.2 Globality-aware Pooling Module
The context redundancy aforementioned indicates that computational resources can be dynamically allocated to core contexts while reducing emphasis on the remaining ones.
This could approximate the full self-attention with both reduced redundancy and computational overhead.
Motivated by this, we propose a globality-aware pooling module that dynamically identifies prominent tokens and encapsulates them into a smaller set of core tokens for attention.
Given an input sequence of tokens $X=[x_1,\ldots,x_n]^{\top}$, we segment the input sequence $X$ into groups, each group containing $g$ tokens, in total $m=n/g$ groups. For simplicity, we denote the $i$-th group by $X_i=[x_{(i-1)g+1},\ldots,x_{ig}]^{\top}$, with $x_{ig}$ denoting the last token in the $i$-th group.
To identify prominent tokens in the $i$-th group, we devise a group-wise weighted pooling strategy that employs the last token $x_{ig}$ to evaluate the importance.
This is inspired by attention map visualizations (Section C.4), which show that important tokens consistently receive high attention scores from subsequent tokens, indicating their significant influence regardless of position within the group.
Formally, we derive one core token $c_i$ from each group by
$$c_i = \mathrm{softmax}\!\left(\frac{q_{ig}\,K_i^{\top}}{\sqrt{d}}\right) X_i \qquad (2)$$
where $q_{ig}=x_{ig}W_Q$ is the query vector for the last token in the $i$-th group $X_i$ of $X$, $K_i = X_i W_K$, and $W_Q$ and $W_K$ are learnable parameters. In this way, the core token $c_i$ encapsulates crucial information of the corresponding group. With $m$ groups in the input sequence $X$, we derive $m$ core tokens in total, i.e., $C=[c_1,\ldots,c_m]^{\top}$.
Algorithm 1: Core Context Aware Attention pipeline. Input: tokens $X$, parameters $W_Q$, $W_K$, $W_V$, group size $g$, and local window size $s$. The pipeline computes the queries $Q = XW_Q$ and the number of groups $m=n/g$, derives the core tokens and the global key/value matrices via the globality-aware pooling module, computes the local key/value matrices via the locality-preserving module, and returns the representations $O$ of the tokens.
To reduce the redundancy, we use the sequence of core tokens $C$ instead of the original tokens $X$ for attention computation. This substitution reduces the dimensionality from $n$ to $m$, thereby reducing the computational and storage complexity.
For each query $q_i = x_i W_Q$, tokens that are distant from it typically exhibit lower relevance and are more likely to be redundant.
Formally, we adopt core tokens to calculate the key and value matrices $K^{g}_i$, $V^{g}_i$ for these tokens as follows
$$K^{g}_i = [c_1,\ldots,c_{m_i}]^{\top} W_K, \qquad V^{g}_i = [c_1,\ldots,c_{m_i}]^{\top} W_V \qquad (3)$$
where $W_K$ and $W_V$ are learnable parameters.
In contrast, tokens in close proximity to the query $q_i$ are likely to be more relevant. We retain the nearest $s$ tokens for a fine-grained attention computation (discussed in detail in Section 3.3).
Thus, the index $m_i$ in Eqn. (3) can be calculated as $m_i = \lfloor (i-s)/g \rfloor$.
When the context is short (i.e., $m_i < 1$), the key $K^{g}_i$ and value $V^{g}_i$ would be excluded from the attention calculation since the redundancy in the context is negligible.
During inference, as tokens are generated sequentially, we derive a new core token via Eqn. (2) once the number of generated tokens reaches $g$.
Different from the full self-attention, we cache $K^{g}$ and $V^{g}$ for inference.
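To make the grouping and pooling steps above concrete, the following is a minimal PyTorch sketch of the globality-aware pooling idea in Eqns. (2)-(3). It assumes a single attention head, ignores batching and causal-masking details, and uses hypothetical helper names (`derive_core_tokens`, `global_key_value`); it illustrates the weighted pooling, not the authors' Triton implementation.

```python
# Minimal sketch of globality-aware pooling (Eqns. (2)-(3)); single head, no batching.
import torch
import torch.nn.functional as F

def derive_core_tokens(X, W_q, W_k, g):
    """Compress every group of g tokens into one core token via weighted pooling,
    using the query of the last token in each group to score token importance."""
    n, d = X.shape
    m = n // g                                    # number of complete groups
    groups = X[: m * g].view(m, g, d)             # (m, g, d)
    q_last = groups[:, -1, :] @ W_q               # query of each group's last token, (m, d)
    keys = groups @ W_k                           # (m, g, d)
    scores = torch.einsum("md,mgd->mg", q_last, keys) / d ** 0.5
    weights = F.softmax(scores, dim=-1)           # importance of tokens within each group
    core = torch.einsum("mg,mgd->md", weights, groups)   # (m, d) core tokens
    return core

def global_key_value(core, W_k, W_v):
    """Keys/values for the global branch come from core tokens only (n/g of them)."""
    return core @ W_k, core @ W_v
```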
3.3 Locality-preserving Module
As mentioned above, the globality-aware pooling module effectively captures long-range dependencies by compressing input tokens into core tokens. It focuses on coarse-grained global information, potentially overlooking fine-grained local context. However, recent studies (Manakul & Gales, 2021; Yang et al., 2021) demonstrate that local context plays a critical role in many language modeling tasks. To address this, we introduce a locality-preserving module that complements the globality-aware pooling module by focusing on neighboring tokens to capture detailed local dependencies.
To be specific, the locality-preserving module ensures that each query $q_i$ attends to at least the preceding $s$ tokens to capture local dependencies. During the generation process, it is challenging to maintain the number of tokens as a multiple of the group size $g$. To address this, we set the local window size $w_i$ so that it covers the most recent $s$ tokens together with the tokens of the current, incomplete group. This strategy ensures that each key token participates in the attention computation with the query $q_i$. Consequently, the key and value matrices for a specific query $q_i$ in the locality-preserving module are defined as follows:
$$K^{l}_i = [x_{i-w_i+1},\ldots,x_i]^{\top} W_K, \qquad V^{l}_i = [x_{i-w_i+1},\ldots,x_i]^{\top} W_V \qquad (4)$$
where $w_i \geq s$ denotes the local window size for position $i$.
Note that the locality-preserving module shares the projection parameters $W_Q$, $W_K$, and $W_V$ with the globality-aware pooling module, thereby incurring no additional projection parameters.
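As a rough illustration of Eqn. (4), the sketch below builds the local key and value matrices for one query position from its most recent tokens. The exact window-size rule depends on the group boundary as described above; here we simply take the last `s` tokens as a simplifying assumption, and `local_key_value` is a hypothetical helper name.

```python
# Minimal sketch of the locality-preserving branch (Eqn. (4)) for one query position.
import torch

def local_key_value(X, W_k, W_v, i, s):
    """Keys/values for query position i (0-indexed), restricted to its local window."""
    start = max(0, i + 1 - s)
    window = X[start : i + 1]           # the s most recent tokens (fewer at the start)
    return window @ W_k, window @ W_v   # shares W_k, W_v with the global branch
```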
3.4 Differentiable Fusion of Global and Local Modules
Both globality-aware pooling and locality-preserving modules involve only a portion of tokens in the attention computation, leading to a limited comprehensive understanding of the context.
To address this limitation, we seek to combine the involved tokens of these two attentions to integrate the insights they provide.
Specifically, we concatenate the key and value matrices from both attentions, i.e., $[K^{g}_i; K^{l}_i]$ and $[V^{g}_i; V^{l}_i]$, to leverage the combined information. Formally, the proposed CCA-Attention is computed as follows:
$$o_i = \mathrm{softmax}\!\left(\frac{q_i\,[K^{g}_i; K^{l}_i]^{\top}}{\sqrt{d}}\right)[V^{g}_i; V^{l}_i] \qquad (5)$$
We represent the final output of our CCA-Attention as $O=[o_1,\ldots,o_n]^{\top}$. In practice, we implement our attention mechanism with Triton (Tillet et al., 2019) to accelerate both training and inference processes. This implementation enables parallel computation of each $o_i$ with high computational efficiency. After integrating the global and local attention, we can formalize $o_i$ in Eqn. (5) element-wise into the structure of full attention, as detailed in Proposition 1 in the supplementary material A. Proposition 1 shows that each token accesses all preceding tokens, ensuring full information exchange among tokens and thus enhancing the capture of long-range dependencies.
More importantly, our CCA-Attention demonstrates dynamic flexibility through the adjustable group size $g$ and local window size $s$ during inference. This architectural flexibility allows the generation of multiple model variants tailored to varying user traffic, offering a substantial advantage over the full self-attention mechanism in real-world deployment scenarios (see results and analysis in Section 4.3).
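The fusion in Eqn. (5) amounts to concatenating the global (core-token) and local keys/values and applying standard softmax attention. The sketch below shows this for a single query position and a single head; `cca_attention_one_query` is a hypothetical name, and the actual implementation computes all positions in parallel with Triton kernels.

```python
# Minimal sketch of the fused CCA-Attention output for one query position (Eqn. (5)).
import torch
import torch.nn.functional as F

def cca_attention_one_query(q_i, K_g, V_g, K_l, V_l):
    K = torch.cat([K_g, K_l], dim=0)              # concatenated keys, (m_i + w_i, d)
    V = torch.cat([V_g, V_l], dim=0)              # concatenated values
    scores = (q_i @ K.T) / K.shape[-1] ** 0.5     # scaled dot-product scores
    return F.softmax(scores, dim=-1) @ V          # o_i, shape (d,)
```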
3.5 Training Strategies of CCA Transformers
Our proposed CCA-Attention is fully compatible with existing attention-based LLMs, such as the LLaMA series models (Touvron et al., 2023b; AI, 2024), serving as a plug-and-play module that can replace the full self-attention. CCA-Attention maintains alignment with full self-attention in terms of input, output, and parameter dimensions.
This ensures that only a minimal training cost is able to preserve long-context modeling capabilities while reducing computational costs. In contrast, existing linear attention approaches (Katharopoulos et al., 2020; Sun et al., 2023) introduce kernel functions for attention and necessitate training from scratch, making them less practical for real-world applications due to their inability to leverage the extensive knowledge embedded in pretrained LLMs.
We replace the self-attention module in existing attention-based LLMs with CCA-Attention, enabling compatibility with three different training strategies:
1) Training from Scratch: This strategy involves training CCA-Attention from scratch on large-scale corpora.
While this approach may yield superior performance by fully adapting the model to the proposed attention mechanism, it is computationally intensive and requires significant resources.
2) Full Finetuning: A more efficient alternative is to finetune all parameters of the model based on the parameters of existing pretrained LLMs. This strategy leverages the pre-trained knowledge while allowing the model to adapt fully to our attention mechanism.
3) Partial Finetuning: For scenarios where efficiency is critical, we can finetune only the learnable parameters of CCA-Attention, i.e., the projection matrices $W_Q$, $W_K$, and $W_V$. This strategy requires only a modest finetuning effort on a small number of corpora, making it computationally efficient for maintaining long-context modeling capabilities.
We provide empirical evidence to guide the selection of the most appropriate finetuning strategy in Table 10.
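As a rough sketch of the partial-finetuning strategy, one might freeze the pretrained backbone and leave only the CCA-Attention projection parameters trainable. The module-name filter (`"cca_attn"`) below is a hypothetical naming convention, not the repository's actual parameter names.

```python
# Hedged sketch: freeze everything except the CCA-Attention projections (W_Q, W_K, W_V).
def mark_trainable_for_partial_finetuning(model):
    """Expects a torch.nn.Module; toggles requires_grad based on a name filter."""
    for name, param in model.named_parameters():
        param.requires_grad = "cca_attn" in name   # hypothetical module name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable} / {total}")
```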
Table 1: Comparisons of different models on LongBench-E (Bai et al., 2023). 95% of the test samples in LongBench-E are shorter than 31K. "FTL" denotes the latency of generating the first token in the prefilling stage. "Mem." denotes the memory footprint. "S. QA" denotes single-document QA, while "M. QA" denotes multi-document QA. We report the latency and memory footprint of LLaMA2-7B-32K and LLaMA2-7B-80K within contexts of 32K and 64K on A800 GPUs, respectively.

| Methods | S. QA | M. QA | Sum. | FS. Learning | Synthetic | Code | Avg. | FTL (s) | Mem. (GB) |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B-32K (Vanilla Self-Attention) | 2.75 | 1.85 | 12.43 | 66.28 | 0.34 | 48.99 | 22.11 | 9.15 | 35.58 |
| StreamingLLM (Xiao et al., 2024b) | 4.75 | 2.94 | 2.97 | 48.20 | 0.66 | 30.16 | 14.95 | 5.75 (1.6×) | 22.94 (35%↓) |
| LM-Infinite (Han et al., 2023) | 2.04 | 2.33 | 1.98 | 57.45 | 0.3 | 48.46 | 18.76 | 4.72 (1.9×) | 26.35 (26%↓) |
| MInference (Jiang et al., 2024) | 3.68 | 3.05 | 10.97 | 66.26 | 0.61 | 42.30 | 21.14 | 4.20 (2.2×) | 33.52 (6%↓) |
| CCA-LLM (Ours) | 3.63 | 3.98 | 7.79 | 61.79 | 2.64 | 51.36 | 21.86 | 2.59 (3.5×) | 19.12 (46%↓) |
| LLaMA2-7B-80K (Vanilla Self-Attention) | 3.22 | 2.71 | 3.90 | 64.98 | 0.56 | 59.16 | 22.42 | 32.43 | 60.03 |
| StreamingLLM (Xiao et al., 2024b) | 2.07 | 2.32 | 0.37 | 45.03 | 2.67 | 37.17 | 14.94 | 9.04 (3.6×) | 37.45 (37%↓) |
| LM-Infinite (Han et al., 2023) | 2.54 | 1.53 | 2.22 | 61.29 | 1.08 | 58.54 | 21.20 | 8.27 (3.9×) | 41.54 (31%↓) |
| MInference (Jiang et al., 2024) | 2.44 | 3.49 | 4.41 | 64.26 | 0.28 | 57.60 | 22.08 | 8.14 (4.0×) | 54.09 (10%↓) |
| CCA-LLM (Ours) | 5.62 | 4.34 | 8.99 | 59.60 | 0.48 | 54.40 | 22.24 | 6.42 (5.7×) | 33.86 (44%↓) |
Table 2: Comparisons of different methods with recent models on LongBench-E (Bai et al., 2023). We report the latency and memory footprint of LLaMA3.1-8B-Instruct-128K (abbreviated as "LLaMA3.1-8B-128K" in the table) and Qwen2.5-7B-128K within a 32K context.

| Methods | S. QA | M. QA | Sum. | FS. Learning | Synthetic | Code | Avg. | FTL (s) | Mem. (GB) |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA3.1-8B-128K (Vanilla Self-Attention) | 16.71 | 10.75 | 20.32 | 68.75 | 48.93 | 62.10 | 37.93 | 9.55 | 40.38 |
| MInference (Jiang et al., 2024) | 16.33 | 10.71 | 20.44 | 68.41 | 48.06 | 62.50 | 37.74 | 4.93 (1.9×) | 35.95 (11%↓) |
| CCA-LLM (Ours) | 17.90 | 16.41 | 19.63 | 67.20 | 43.76 | 61.98 | 37.81 | 3.08 (3.1×) | 20.63 (49%↓) |
| Qwen2.5-7B-128K (Vanilla Self-Attention) | 16.67 | 18.18 | 18.70 | 66.81 | 45.34 | 64.56 | 38.38 | 10.58 | 35.11 |
| MInference (Jiang et al., 2024) | 16.20 | 17.21 | 18.59 | 67.10 | 38.28 | 62.95 | 36.72 | 4.86 (2.2×) | 32.40 (8%↓) |
| CCA-LLM (Ours) | 16.91 | 17.07 | 18.60 | 66.89 | 45.50 | 63.52 | 38.08 | 2.74 (3.9×) | 19.31 (45%↓) |
3.6 Computational and Storage Complexity Analysis
Compared with the full self-attention, our CCA-Attention offers significant benefits in terms of computational complexity and key-value cache storage, as analyzed below.
Acceleration via Reduced Computational Complexity.
Our CCA-Attention exhibits varying computational complexities depending on the type of task.
For tasks with fixed-length sequences (such as multi-choice question answering), our CCA-Attention exhibits a linear computational complexity of $\mathcal{O}(nm+ns)$, marking a significant improvement over the full self-attention with a complexity of $\mathcal{O}(n^{2})$.
Here, we define the number of groups $m$ as a constant.
For the globality-aware pooling module, the query and key matrices encompass $n$ and $m$ tokens, respectively, resulting in a computational complexity of $\mathcal{O}(nm)$.
Regarding the locality-preserving module, each token only attends to at most the preceding $s$ tokens.
With $n$ tokens in total, the upper bound of the complexity amounts to $\mathcal{O}(ns)$.
For tasks with variable-length sequences (such as open-ended question answering), models generate subsequent tokens in an autoregressive manner.
In this case, we set the group size $g$ as a constant, ensuring that our CCA-Attention is able to leverage key-value caching during autoregressive token generation.
Once one group has accumulated $g$ tokens, the corresponding core token is determined and cached.
Thus, our CCA-Attention achieves a computational complexity of $\mathcal{O}(n^{2}/g+ns)$.
The complexity analysis follows a similar pattern to the tasks with fixed-length sequences.
Acceleration through Reduced Key-Value (KV) Cache.
In attention-based LLMs, the KV cache leverages the autoregressive nature to store and reuse key-value pairs, thereby significantly boosting the efficiency.
The size of the KV cache scales linearly with the length of the input sequence, consuming a major part of the memory footprint during inference.
The expanded KV cache would consume considerable memory and significant memory IO resources.
Compared with full attention's storage complexity of $\mathcal{O}(n)$, our CCA-Attention has a storage complexity of $\mathcal{O}(n/g+s)$.
For the globality-aware pooling module, we only retain the key and value matrices for core tokens, rather than for all original tokens. This reduces the memory requirement to $\mathcal{O}(n/g)$.
Besides, the locality-preserving module only maintains the key and value matrices for the preceding $s$ tokens. The storage complexity for this component is $\mathcal{O}(s)$.
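To make the storage analysis concrete, the following back-of-the-envelope sketch compares the number of cached key/value entries per layer under the stated complexities, using example values for the group size and local window (g = 16 and s = 4096, as in Section 4.3); the exact savings depend on the configuration.

```python
# Rough comparison of per-layer KV-cache entries: full self-attention caches O(n)
# key/value pairs, while CCA-Attention caches roughly n/g core-token pairs plus s local pairs.
def kv_entries(n, g, s):
    full = n                # full self-attention
    cca = n // g + s        # globality-aware (core tokens) + locality-preserving window
    return full, cca

full, cca = kv_entries(n=128_000, g=16, s=4096)
print(full, cca, f"{full / cca:.1f}x fewer cached entries")   # 128000 12096 ~10.6x
```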
4 Experiments
4.1 Experimental Setup
We apply our CCA-Attention and compared efficient attention methods to existing pretrained LLMs. We report the performance in long context modeling and computational efficiency.
We put more implementation details and ablation studies of our method in the supplementary materials.
Dataset & Evaluation Metrics. We quantitatively evaluate our models and compare them with other considered models on the following benchmark and metric:
1) LongBench (Bai et al., 2023) is a pioneering benchmark for the bilingual, multi-task, and comprehensive assessment of large language models’ long context understanding capabilities.
It covers multiple languages like Chinese and English, consisting of 6 major categories and 21 tasks involving various application areas.
2) Exact Match Score (EM Score) (Liu et al., 2024b) is a metric for measuring the model’s ability to find the key information within a long context in a multi-document question-answering task.
In this task, each test sample comprises a certain number of documents to reach the specified context length, followed by a question about the key information inserted in context.
Table 3: Comparisons of different models in terms of the multi-document EM score (Liu et al., 2024b) with evaluation lengths from 4K to 128K. "FTL" denotes the latency of generating the first token in the prefilling stage. We report the latency within a 128K context on two A800 GPUs.

| Methods | 4K | 8K | 16K | 32K | 64K | 128K | Avg. | FTL (s) |
|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B-80K (Vanilla Self-Attention) | 39.4 | 37.8 | 37.6 | 36.2 | 34.6 | 30.3 | 36.0 | 124.85 |
| StreamingLLM (Xiao et al., 2024b) | 33.6 | 26.0 | 32.2 | 30.6 | 27.4 | 25.1 | 29.2 | 34.74 (3.6×) |
| LM-Infinite (Han et al., 2023) | 31.6 | 25.6 | 32.4 | 32.2 | 28.2 | 26.3 | 29.4 | 32.57 (3.8×) |
| MInference (Jiang et al., 2024) | 39.0 | 32.4 | 37.4 | 36.0 | 32.3 | 28.9 | 34.3 | 20.18 (6.2×) |
| CCA-LLM (Ours) | 39.3 | 33.2 | 35.4 | 31.4 | 35.3 | 32.0 | 34.4 | 15.89 (7.9×) |
Implementation Details. The source code for this project is publicly available at https://github.com/chenyaofo/CCA-Attention.
We apply our proposed CCA-Attention to LLaMA2-7B-32K, LLaMA2-7B-80K (Fu et al., 2024), LLaMA3.1-8B-128K (AI, 2024) and Qwen2.5-7B-128K (Yang et al., 2024) models.
We replace the full self-attention in the above LLMs with our proposed CCA-Attention.
In the continuous finetuning, we adopt the SlimPajama (Cerebras, 2024) dataset, an open-source replication of the LLaMA pretraining data mixture.
The number of groups in the globality-aware pooling module is shared across different model sizes. We finetune the full model on A800 GPUs using a micro-batch size of 1 and a gradient accumulation of 8, with a total of 1000 training steps. This training configuration is applicable to all model sizes and context lengths.
To scale the models to long contexts, we modified the “base frequency” in RoPE from 10,000 to 500,000, following (Cerebras, 2024; Xiong et al., 2024). See Section B.2 for more implementation details.
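As a hedged sketch of the RoPE adjustment described above, the Hugging Face `rope_theta` configuration field can be raised from 10,000 to 500,000 before continued finetuning; the model identifier and target context length below are placeholders rather than the exact training setup.

```python
# Sketch: raise the RoPE base frequency when extending the context window.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")   # placeholder checkpoint
config.rope_theta = 500_000.0            # was 10,000 in the original LLaMA2 config
config.max_position_embeddings = 32768   # example target context length (32K)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)
```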
Compared Methods. We conduct comprehensive comparisons between our proposed CCA-Attention and several state-of-the-art methods, including LLaMA-2 with vanilla attention, StreamingLLM (Xiao et al., 2024b), LM-infinite (Han et al., 2023), and MInference (Jiang et al., 2024), across the LongBench (Bai et al., 2023) and RULER (Hsieh et al., 2024) benchmarks. Our experiments are based on the LLaMA-2 7B model fine-tuned on sequences of length 32K and 80K (Fu et al., 2024).
For StreamingLLM (Xiao et al., 2024b), we use the official implementation, adjusting the attention sink to 4 and setting the attention context size to 2000. Similarly, for LM-infinite (Han et al., 2023), we follow the official code, configuring the local branch size to 1024 and the global branch size to 16. In the case of MInference (Jiang et al., 2024), we also employ the official code implementations, configured with the official settings.
4.2 Comparisons on Long Context Modeling
Comparisons on Longbench-E.
We conduct experiments on Longbench-E (Bai et al., 2023) using our CCA-Attention and baseline methods, including StreamingLLM (Xiao et al., 2024b), LM-Infinite (Han et al., 2023), and MInference (Jiang et al., 2024). As shown in Table 1, our CCA-LLM attains the highest average score on Longbench-E, outperforming other efficient attention methods. For LLaMA-7B-32K, the average score of our CCA-LLM is higher than that of LM-Infinite
(21.86 vs. 18.76) and MInference (21.86 vs. 21.14).
For the LLaMA-7B-80K model, our method consistently shows superior performance compared to alternative approaches.
For instance, our CCA-LLM yields a higher average score than LM-Infinite (22.24 vs. 21.20) and MInference (22.24 vs. 22.08).
This performance advantage primarily stems from our globality-aware pooling module, which effectively reduces context redundancy in long input sequences.
Consequently, our CCA-LLM model demonstrates enhanced capability in identifying and focusing on core contextual elements, thereby facilitating more accurate extraction of crucial information required for question-answering tasks within the Longbench-E benchmark.
Notably, our CCA-LLM achieves the lowest inference latency and KV cache usage among all compared methods. Our CCA-LLM is able to reduce the KV cache usage, while the best counterpart MInference only accelerates the prefilling stage and does not reduce KV cache storage.
To further validate the effectiveness of our method across different model architectures, we conduct extensive experiments on more recent foundation models: LLaMA3.1-8B-Instruct-128K and Qwen2.5-7B-128K. As shown in Table 2, our CCA-Attention demonstrates superior performance over the state-of-the-art baseline MInference across three key metrics: computational efficiency, memory reduction, and model performance.
The consistent improvements across different model architectures suggest that our approach is not limited to specific model designs.
Figure 3: Illustration of inference flexibility by adjusting the group size g and the local window size s to generate CCA-LLM variants with different latency and accuracy at test time. This architectural flexibility allows for precise control over the trade-off between inference latency and accuracy, which is particularly beneficial for real-world applications with varying user traffic patterns.
Figure 4: Comparisons with state-of-the-art methods in terms of computational and memory overhead on LLaMA2-7B-80K. "FTL" (first-token latency) refers to the time required to generate the first token after receiving the input in the prefilling stage. "ITL" (inter-token latency) is the time delay between generating consecutive tokens (except for the first token) during the decoding stage.
Comparisons on Long-document QA.
We evaluate the performance of our CCA-Attention against other methods, including StreamingLLM (Xiao et al., 2024b), LM-Infinite (Han et al., 2023), and MInference (Jiang et al., 2024), on LLaMA2-7B-80K models using the multi-document EM Score metric. As shown in Table 3, we conduct comparisons across a range of context lengths: 4K, 8K, 16K, 32K, 64K, and 128K.
For the short context (e.g., 4K), our CCA-LLM consistently achieves the highest EM score among all efficient attention methods, showcasing significant capability for short sequence modeling.
For example, our CCA-LLM outperforms StreamingLLM (39.3 vs. 33.6) and MInference (39.3 vs. 39.0) in terms of EM score.
The reason is that our CCA-Attention captures context dependencies without discarding crucial tokens. In contrast, StreamingLLM (Xiao et al., 2024b) prioritizes attention on the initial and final tokens, effectively discarding intermediate tokens, which may contain essential information. Similarly, MInference (Jiang et al., 2024) employs predefined sparse attention patterns, selectively attending to tokens and potentially overlooking critical parts of the input sequence.
Both approaches risk losing important contextual information, leading to suboptimal performance in tasks requiring comprehensive understanding.
Our method, by preserving both local and global contexts, ensures that no critical information is overlooked, thereby achieving superior performance.
For the extremely long contexts (e.g., 64K and 128K), our CCA-LLM shows much better performance than vanilla self-attention in terms of EM score (35.3 vs. 34.6 at 64K) and achieves a 7.9× inference speedup with a context length of 128K. The advantages of our method become more prominent as the length of the context increases, while the performance of vanilla self-attention may even decrease when the context length becomes very large.
The reason is that in an extremely long context, non-core contexts (i.e., the irrelevant context) will be compressed by the proposed weighted pooling. In this way, CCA-LLM alleviates the redundant context issue and improves the long-context modeling performance.
4.3 Demonstration of Inference Flexibility
In real-world applications, user traffic exhibits diurnal variations. During peak traffic periods, the full self-attention struggles with throughput limitations, necessitating additional servers to accommodate the increased demand.
Instead, our CCA-LLM can dynamically adjust the group size $g$ and the local window size $s$ during inference to improve throughput.
During off-peak traffic periods, our CCA-LLM is able to enhance model performance, albeit with a minor compromise in throughput. This strategy exhibits remarkable elasticity to address the variable user traffic challenges.
To verify this, we conduct experiments with different group sizes $g$ and local window sizes $s$ during inference.
In Figure 3, our CCA-LLM demonstrates an improved LongBench-E score with a concomitant increase in computational overhead as they decrease.
With a reduction in the group size $g$ from 16 to 2 and the local window size $s$ from 4096 to 1024, the LongBench-E score escalates from 20.54 to 22.69, while the latency rises from 2.80 seconds to 3.89 seconds.
Despite training CCA-LLM only once, we obtain a spectrum of models, each with distinct performance and computational demands.
This flexibility comes from two complementary modules: the globality-aware pooling module dynamically compresses core tokens based on semantic relevance via intra-group attention, enabling adaptation to different group sizes.
The locality-preserving module uses rotary embeddings to encode relative positions, achieving translation invariance and maintaining local context across scales.
4.4 Comparisons on Computational Efficiency
We compare our CCA-LLM with LLaMA2-7B-80K equipped with full self-attention, LM-Infinite (Han et al., 2023), and MInference (Jiang et al., 2024) in terms of inference latency and memory footprint during forward-propagation on a single NVIDIA A800 GPU.
The efficiency was assessed across a range of input sequence lengths, i.e., 4K, 8K, 16K, 32K, 64K, and 128K.
In Figure 4, our CCA-Attention achieves a 7.9× inference speedup over LLaMA2-7B-80K with full self-attention (15.89s vs. 124.85s in a 128K context) and is also faster than MInference (15.89s vs. 44.50s in a 128K context) in the pre-filling stage.
Our CCA-Attention also exhibits lower KV cache memory usage than LLaMA2-7B-80K (4.5GB vs. 64GB in a 128K context) and MInference (4.5GB vs. 64GB in a 128K context).
Note that MInference (Jiang et al., 2024) only accelerates the pre-filling stage and adopts full self-attention in the decoding stage.
Moreover, it does not reduce the KV cache, leading to the same memory usage as the full self-attention.
Conversely, our CCA-LLM is able to accelerate both pre-filling and decoding stages with reduced KV cache, which is more practical in real-world applications.
5 Conclusion
In this paper, we proposed a Core Context Aware Attention (CCA-Attention) for long-context language modeling with reduced computational overhead compared with vanilla self-attention.
Our CCA-Attention includes two components: 1) the globality-aware pooling module exploits the importance of input tokens to encapsulate them into core tokens and employs these core tokens for attention, capturing global coarse-grained information;
2) the locality-preserving module focuses on neighboring tokens to capture local fine-grained context, serving as a complement to the global module.
Our proposed attention is able to replace the full self-attention in existing LLMs with minimal finetuning effort.
Experimental results show the effectiveness of our CCA-Attention with promising performance and decreased computational cost.
Acknowledgments
This work was partially supported by the Joint Funds of the National Natural Science Foundation of China (Grant No.U24A20327), Key-Area Research and Development Program Guangdong Province 2018B010107001,
the Major Key Project of Peng Cheng Laboratory (PCL) PCL2023A08, and TCL Science and Technology Innovation Fund, China.
Impact Statement
Our work addresses the critical challenge of computational inefficiency in long-context modeling for large language models.
By introducing a core context aware attention mechanism, we achieve a reduction in attention complexity to linear time while preserving essential semantic interactions. This breakthrough enables LLMs to process contexts of up to 128K tokens with a 7.9× speedup over full self-attention, without compromising accuracy. This promotes a more environmentally friendly and sustainable approach to AI development, aligning with the growing demand for greener and more energy-efficient technologies in the field of AI.
References
- AI (2024) AI, M. Introducing meta llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/, 2024. Accessed: 2025-01-20.
- Azerbayev et al. (2022) Azerbayev, Z., Ayers, E., and Piotrowski, B. Proof-pile. In Available online: https://github.com/zhangir-azerbayev/proof-pile., 2022.
- Bai et al. (2023) Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
- Beltagy et al. (2020) Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
- Cerebras (2024) Cerebras. Slimpajama: A 627b token cleaned and deduplicated version of redpajama. https://www.cerebras.net, 2024. Accessed: 2025-01-20.
- Chang et al. (2024) Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., and Xie, X. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):39:1–39:45, 2024.
- Chen et al. (2023) Chen, S., Wong, S., Chen, L., and Tian, Y. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- Chen et al. (2024) Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., and Jia, J. LongLoRA: Efficient fine-tuning of long-context large language models. International Conference on Learning Representations, 2024.
- Choromanski et al. (2020) Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Belanger, D., Colwell, L., et al. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555, 2020.
- Dao (2024) Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024.
- Dao et al. (2022) Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems, 2022.
- Ding et al. (2023) Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., Zheng, N., and Wei, F. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023.
- Fu et al. (2024) Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024.
- Guo et al. (2025) Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Han et al. (2023) Han, C., Wang, Q., Xiong, W., Chen, Y., Ji, H., and Wang, S. LM-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023.
- Hendrycks et al. (2021) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.
- Hsieh et al. (2024) Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- Jiang et al. (2024) Jiang, H., LI, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C.-Y., Yang, Y., and Qiu, L. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In Advances in Neural Information Processing Systems, 2024.
- Katharopoulos et al. (2020) Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.
- Kitaev et al. (2019) Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2019.
- Liu et al. (2024a) Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024a.
- Liu et al. (2024b) Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024b.
- Manakul & Gales (2021) Manakul, P. and Gales, M. J. F. Long-span summarization via local attention and content selection. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Association for Computational Linguistics, pp. 6026–6041, 2021.
- Mohtashami & Jaggi (2023) Mohtashami, A. and Jaggi, M. Random-access infinite context length for transformers. Advances in Neural Information Processing Systems, 36:54567–54585, 2023.
- Mu et al. (2023) Mu, J., Li, X., and Goodman, N. D. Learning to compress prompts with gist tokens. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- OpenAI (2023) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- Peng et al. (2024) Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large language models. In International Conference on Learning Representations, 2024.
- Qin & Van Durme (2023) Qin, G. and Van Durme, B. Nugget: Neural agglomerative embeddings of text. In International Conference on Machine Learning, pp. 28337–28350. PMLR, 2023.
- Qin et al. (2024) Qin, G., Rosset, C., Chau, E., Rao, N., and Van Durme, B. Dodo: Dynamic contextual compression for decoder-only lms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9961–9975, 2024.
- Rae et al. (2020) Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020.
- Sun et al. (2023) Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- Tillet et al. (2019) Tillet, P., Kung, H., and Cox, D. D. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the International Workshop on Machine Learning and Programming Languages, pp. 10–19, 2019.
- Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
- Wei et al. (2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, pp. 24824–24837, 2022.
- Xiao et al. (2024a) Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. InfLLM: Training-free long-context extrapolation for llms with an efficient context memory. In Advances in Neural Information Processing Systems, 2024a.
- Xiao et al. (2024b) Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024b.
- Xiong et al. (2024) Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., Hou, R., Martin, L., Rungta, R., Sankararaman, K. A., Oguz, B., Khabsa, M., Fang, H., Mehdad, Y., Narang, S., Malik, K., Fan, A., Bhosale, S., Edunov, S., Lewis, M., Wang, S., and Ma, H. Effective long-context scaling of foundation models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4643–4663, 2024.
- Yang et al. (2024) Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Yang et al. (2021) Yang, B., Wang, L., Wong, D. F., Shi, S., and Tu, Z. Context-aware self-attention networks for natural language processing. Neurocomputing, 458:157–169, 2021.
- Zaheer et al. (2020) Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- Zhang et al. (2024) Zhang, P., Liu, Z., Xiao, S., Shao, N., Ye, Q., and Dou, Z. Long context compression with activation beacon. In International Conference on Learning Representations, 2024.
Supplementary Materials
Appendix A Theoretical Analysis on Reachability for CCA-Attention
In self-attention, the calculation can be formulated as $\mathbf{O} = \mathrm{softmax}\!\left(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}\right)\mathbf{V}$, where $\mathbf{Q} = \mathbf{X}\mathbf{W}_Q$, $\mathbf{K} = \mathbf{X}\mathbf{W}_K$, $\mathbf{V} = \mathbf{X}\mathbf{W}_V$, and $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$ are learnable parameters.
For convenience in analyzing the attention mechanism, we denote the attention weight as $\mathbf{A} = \mathrm{softmax}\!\left(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}\right)$, where the element in the $i$-th row and $j$-th column of $\mathbf{A}$ is represented as $A_{i,j}$.
We first give the definition of reachability, then show that our method guarantees the reachability of information among all tokens, and finally give the concrete expression of the attention output.
Definition 1.
(Reachability)
We say the token $x_j$ is reachable from the token $x_i$ ($j \le i$) in the attention map $\mathbf{A}$ if and only if the attention weight from the token $x_i$ to $x_j$ is positive, i.e., $A_{i,j} > 0$.
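As a brief illustration using the standard softmax notation introduced above (a generic argument about causal softmax attention, not specific to Eqn. (5)), strict positivity of the softmax already guarantees reachability for full self-attention:

```latex
% Every unmasked entry of a causally masked softmax row is strictly positive,
% since exp(.) > 0 and the normalizer is a finite sum of positive terms.
A_{i,j}
  \;=\;
  \frac{\exp\!\left(\mathbf{q}_i^{\top}\mathbf{k}_j / \sqrt{d}\right)}
       {\sum_{l \le i} \exp\!\left(\mathbf{q}_i^{\top}\mathbf{k}_l / \sqrt{d}\right)}
  \;>\; 0,
  \qquad \forall\, j \le i .
```

Thus every earlier token is reachable from every later token under full self-attention; Proposition 1 below establishes the same guarantee for the combined globality-aware and locality-preserving scores of CCA-Attention.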
Proposition 1.
The attention score $\mathbf{A}$ with causal masking in the CCA-Attention mechanism fully satisfies the reachability condition from earlier tokens to later tokens in the sequence at each Transformer layer. Moreover, the final output representation $\mathbf{o}_i$ for the $i$-th token in Eqn. (5) can be given by
(6)
where $\mathbf{o}_i$ is the $i$-th row of the output $\mathbf{O}$, $w_j$ is the weight of all core tokens in Eqn. (2), and $A_{i,j}$ denotes the attention score of CCA-Attention in Eqn. (5).
Proof.
We decompose the $i$-th row of the attention scores $\mathbf{A}$ into two parts:
(7)
where $\mathbf{A}^{g}_{i}$ and $\mathbf{A}^{l}_{i}$ denote the $i$-th row of the attention scores from the globality-aware pooling module and the locality-preserving module, respectively. We aim to use these two terms to formalize $\mathbf{A}_i$ element by element into the structure of full attention. For simplicity, we expand each element in these two terms with the weight $w_j$, although $w_j$ is not directly weighted on the attention:
(8)
Based on the property of the softmax attention scores, for $j \le i$, we have $A_{i,j} > 0$, satisfying the condition of reachability. Next, we derive the output representation $\mathbf{o}_i$ of a token.
When computing the output of the $i$-th token, for each $j \le i$, we can use both the attention score of the locality-preserving module and that of the globality-aware pooling module to obtain
(9)
Appendix B More Implementation Details
B.1 More Details on Dataset and Evaluation Metrics
SlimPajama (Cerebras, 2024) dataset is an open-source reproduction of the data mixture used to pretrain the LLaMA models.
It consists of 82% web data, 4.5% code from GitHub, 4.5% Wikipedia, 4.5% books, 2.5% arXiv, and 2.0% StackExchange, and is used for extending the context lengths of LLMs to 128K tokens through careful data engineering techniques such as per-source length upsampling.
We use the SlimPajama dataset (Cerebras, 2024) as our training dataset.
LongBench (Bai et al., 2023) is a pioneering benchmark designed for bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of LLMs. It covers two languages, Chinese and English, thereby facilitating a more thorough evaluation of the multilingual proficiency of large models in long-context scenarios.
Moreover, LongBench is structured with 6 major categories and 21 distinct tasks, spanning crucial long-text application areas such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
LongBench contains 14 English tasks, 5 Chinese tasks, and 2 code tasks. The average length of most tasks falls within the range of 5K to 15K tokens, and the benchmark comprises a total of 4,750 test samples.
For detailed statistics and construction methodologies of the LongBench tasks, we refer readers to the original paper (Bai et al., 2023). Additionally, LongBench-E is a test set with a more evenly distributed range of lengths, constructed through uniform sampling.
It contains comparable data quantities in the 0-4K, 4K-8K, and 8K+ length intervals, enabling an in-depth analysis of the model’s performance fluctuations across different input lengths.
We conduct the experiments on LongBench-E to verify the long context understanding capability of models in Section 4.2.
Exact Match Score (EM Score) (Liu et al., 2024b) measures the model’s ability to find the key information within a long context in a multi-document question-answering task.
In this task, each test sample comprises a certain number of documents to reach the specified context length (20 for 4K, 48 for 8K, 96 for 16K, 190 for 32K, 378 for 64K, 755 for 128K), followed by a question.
We evaluate the EM score with the multi-document question-answering dataset from Lost in the Middle (Liu et al., 2024b), which is collected from NaturalQuestions-Open and Wikipedia. In Section 4, we use the exact match score as the evaluation metric, judging whether any of the correct answers appear in the predicted output.
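As an illustrative sketch of this judgment (not the official evaluation script; the normalization choices below are assumptions), the exact match score for a single sample can be computed as follows:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_score(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if any gold answer appears in the predicted output, else 0.0."""
    pred = normalize(prediction)
    return float(any(normalize(ans) in pred for ans in gold_answers))

# Example: the prediction contains one of the correct answers, so the score is 1.0.
print(exact_match_score("The answer is Paris, France.", ["Paris", "Lyon"]))
```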
Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2021) dataset is designed to assess the capabilities of language models across a wide array of subjects, delving deeper into their academic and professional understanding. The MMLU benchmark spans 57 diverse subjects, ranging from elementary mathematics to professional law.
The questions are designed to test both world knowledge and problem-solving abilities, challenging models with content from elementary to advanced professional levels.
We use the MMLU metric to evaluate the model’s proficiency across a diverse set of language-understanding tasks.
It tests the model’s ability to apply its knowledge to a broad spectrum of topics and question types, reflecting its generalization capability in real-world scenarios.
The MMLU metric (Hendrycks et al., 2021) tests world knowledge and problem-solving abilities in zero-shot and few-shot settings, spanning 57 subjects across disciplines such as STEM, the humanities, and the social sciences. We evaluate it in a 5-shot setting to verify the commonsense generalization ability of models in Section C.3 of the supplementary materials.
Perplexity (PPL) quantifies how effectively a model can predict the context. It is calculated as the exponentiated average negative log-likelihood of a sequence, offering a statistical measure of language modeling performance.
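Concretely, for a tokenized sequence $x_1, \dots, x_N$ and a model $p_\theta$, perplexity follows the standard definition:

```latex
% Exponentiated average negative log-likelihood of the sequence.
\mathrm{PPL}(x_{1:N})
  \;=\;
  \exp\!\left( -\frac{1}{N} \sum_{t=1}^{N} \log p_{\theta}\!\left(x_t \mid x_{<t}\right) \right).
```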
Proof-pile (Azerbayev et al., 2022) is a 13GB high-quality dataset of mathematical text and code that comprises 8.3 billion tokens (measured by the gpt-neox tokenizer).
The dataset is composed of diverse sources of both informal and formal mathematics and the raw data are downloaded from the web.
We report PPL on the test dataset.
We use the test set of Proof-pile to verify the long-context language modeling ability of models in Section C.5 of the supplementary materials.
B.2 More Experimental Protocols
CCA-Attention (Ours).
For the continuous pretraining, we adopt the SlimPajama (Cerebras, 2024) dataset, an open-source replication of the LLaMA pretraining data mixture.
This dataset comprises 82% web data, split between 67% from CommonCrawl and 15% from C4, alongside 4.5% from GitHub code, 4.5% from Wikipedia, 4.5% from books, 2.5% from Arxiv, and 2.0% from Stack Exchange.
We replace the full self-attention in LLaMA2 with our proposed CCA-Attention. The number of groups in globality-aware attention is shared across different model sizes. Training is conducted on 8 A800 GPUs using a micro-batch size of 1 and a gradient accumulation of 8, with a total of 1000 training steps.
This training configuration is applicable to all model sizes and context lengths.
Our method requires finetuning on a modest number of tokens to extend the long-context capabilities of LLMs, enabling efficient attention computation.
Specifically, we require only 2.10 billion tokens for 32K and 5 billion tokens for 80K, which is significantly lower than the token requirements for retraining a large language model.
To scale the models to long contexts, we modified the “base frequency” in RoPE from 10,000 to 500,000, following (Cerebras, 2024; Xiong et al., 2024).
In the globality-aware attention, we set the position embedding of each core token to the position embedding of the token at the middle position of its corresponding group, ensuring that our attention maintains positional awareness.
Following FlashAttention (Dao, 2024), we implement our CCA-Attention by leveraging Triton (Tillet et al., 2019) to perform low-level operator fusion between our globality-aware pooling and locality-preserving modules.
This enables us to integrate our CCA-Attention as a standalone, cache-friendly operator, effectively eliminating redundant computations.
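As a rough sketch of the RoPE base-frequency change described above (assuming a Hugging Face-style LLaMA configuration; the model identifier and the 32K window below are illustrative, not our exact training setup):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Load the base config and raise the RoPE base frequency ("rope_theta")
# from the default 10,000 to 500,000 before long-context finetuning.
config = LlamaConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_theta = 500_000.0             # base frequency used for long-context scaling
config.max_position_embeddings = 32_768   # extend the nominal context window (example value)

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    config=config,
)
```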
Compared Methods. We conduct comprehensive comparisons between our proposed CCA-Attention and several state-of-the-art methods, including LLaMA-2 with vanilla attention, StreamingLLM (Xiao et al., 2024b), LM-infinite (Han et al., 2023), and MInference (Jiang et al., 2024), across the LongBench (Bai et al., 2023) and RULER (Hsieh et al., 2024) benchmarks. Our experiments are based on the LLaMA-2 7B model fine-tuned on sequences of length 32K and 80K (Fu et al., 2024).
For StreamingLLM (Xiao et al., 2024b), we use the official implementation, adjusting the attention sink to 4 and setting the attention context size to 2000. Similarly, for LM-infinite (Han et al., 2023), we follow the official code, configuring the local branch size to 1024 and the global branch size to 16. In the case of MInference (Jiang et al., 2024), we also employ the official code implementations, configured with the official settings.
Appendix C More Experimental Results
C.1 Experiments on More Long Context Benchmarks
We further evaluate the long-context modeling performance of our method on RULER (Hsieh et al., 2024) using the LLaMA2-7B-80K model across various context lengths (ranging from 8K to 64K). As shown in Table 4, our approach achieves a higher average score than MInference (Jiang et al., 2024) and outperforms it at most context lengths, demonstrating its superiority in long-context modeling.
The performance gains primarily stem from two key innovations in our method: 1) the globality-aware pooling module dynamically identifies and pools the task-relevant context into core tokens, which effectively reduces redundancy while preserving essential information; 2) the locality-preserving module supplements local information and ensures comprehensive information interaction across all context segments, rather than simply discarding tokens.
Table 4: Comparisons on RULER with context lengths from 8K to 64K.

| Methods | 8K | 16K | 32K | 64K | Avg. ↑ |
|---|---|---|---|---|---|
| LLaMA2-7B-80K (Vanilla Self-Attention) | 71.90 | 66.26 | 61.54 | 55.15 | 63.71 |
| MInference (Jiang et al., 2024) | 67.78 | 65.32 | 61.43 | 52.77 | 61.83 |
| CCA-LLM (Ours) | 68.15 | 66.31 | 60.89 | 54.88 | 62.56 |
C.2 Comparisons with More Efficient Attention Methods
To further evaluate our method, we compare it with LongLoRA (Chen et al., 2024), a recently proposed training-based approach that uses positional interpolation (PI). As shown in Table 5, on LongBench-E, our CCA-LLM achieves superior performance in terms of modeling accuracy, inference speed, and memory efficiency.
Notably, while the S2-Attention proposed in LongLoRA is used only during training, the model defaults to full self-attention during inference, resulting in computational and memory overhead comparable to standard attention. In contrast, CCA-LLM achieves a 3.5× reduction in first-token latency and a 46% decrease in memory usage, demonstrating its effectiveness for efficient long-context modeling.
Table 5: Comparisons on LongBench-E. "FTL" denotes the latency of generating the first token.

| Model | Avg. Score | FTL (s) | Memory (GB) |
|---|---|---|---|
| LLaMA2-7B-32K | 22.11 | 9.15 | 35.58 |
| LongLoRA | 21.58 | 9.15 | 35.58 (↓0%) |
| CCA-LLM (Ours) | 21.86 | 2.59 | 19.12 (↓46%) |
C.3 More Comparisons on MMLU with Multi-choice QA
When applied to tasks with a fixed input length, such as multi-choice QA, we set the number of groups to a constant value. This ensures that the overall computational complexity of our method is $\mathcal{O}(n)$, where $n$ denotes the input sequence length. In this section, we compare the performance of our method with full self-attention on the MMLU dataset for multi-choice QA. As shown in Table 6, our method consistently outperforms the traditional self-attention method, demonstrating the effectiveness of our approach.
Figure 5: Visualization of the attention scores of a 32-token input sentence in LLaMA2-7B. The attention map reveals a distinct pattern: most tokens receive low attention scores, whereas a small number of tokens are associated with significantly higher attention scores. This trend is observed consistently from the shallow to the deeper layers of the model.
Table 6: Comparisons on MMLU with multi-choice QA. "Full" denotes full self-attention and "Ours" denotes CCA-Attention.

| Method | LLaMA2-7B-8K (Full) | LLaMA2-7B-8K (Ours) | LLaMA2-7B-16K (Full) | LLaMA2-7B-16K (Ours) | LLaMA2-13B-16K (Full) | LLaMA2-13B-16K (Ours) | LLaMA2-13B-32K (Full) | LLaMA2-13B-32K (Ours) |
|---|---|---|---|---|---|---|---|---|
| MMLU | 33.34 | 37.55 | 28.19 | 39.71 | 27.17 | 48.11 | 26.72 | 47.93 |
C.4 Statistical Results of Sparse Attention Scores
We visualize the attention scores of LLaMA2-7B on a 32-token sentence in Figure 5 as a supplement to Figure 1. As shown in the figure, the attention scores exhibit consistent sparsity from shallow to deep layers, in line with the observations reported in existing methods (Beltagy et al., 2020; Xiao et al., 2024b).
Based on these observations, our CCA-Attention strategy of assessing token importance within each group using the attention from the group's last token is both reasonable and effective.
The attention map visualization reveals a distinct pattern where tokens that are important to the query receive consistently high attention scores from all subsequent tokens.
This indicates that important tokens, regardless of their position within a group, have a notable influence on attention distribution, suggesting that our method of importance assessment is capable of capturing these crucial tokens.
C.5 More Ablation Studies
Effect of Group-wise Pooling Strategy.
For computational efficiency, we conduct the ablations with LLaMA2-7B-16K and adopt perplexity (PPL) to evaluate our CCA-Attention models. To investigate the effect of different pooling strategies, we compare max pooling, mean pooling, and our weighted pooling in Eqn. (2).
In Table 7, our CCA-Attention with the group-wise attention pooling strategy achieves superior results in terms of PPL (e.g., 2.85 vs. 2.99).
This advantage arises since max pooling retains only the token with the highest response, thereby discarding the semantic importance of the remaining tokens.
Mean pooling averages all tokens within a group, which substantially dilutes the semantic significance of critical tokens.
In contrast, our CCA-Attention dynamically assigns an aggregation weight to each token, facilitating a more comprehensive and efficient fusion.
Table 7: Comparisons of different group-wise pooling strategies in terms of PPL.

| Strategy | Mean Pooling | Max Pooling | CCA-Attention (Ours) |
|---|---|---|---|
| PPL | 2.99 | 2.99 | 2.85 |
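To make the group-wise weighted pooling concrete, below is a minimal PyTorch sketch of the idea behind our weighted pooling (a simplification, not the fused Triton kernel; the use of key vectors and the exact scaling are assumptions): each group is compressed into one core token using weights derived from the attention of the group's last token to the tokens in its group.

```python
import torch
import torch.nn.functional as F

def weighted_group_pooling(keys: torch.Tensor, group_size: int) -> torch.Tensor:
    """
    Compress each group of `group_size` token vectors into one core token.
    keys: (seq_len, dim); seq_len is assumed divisible by group_size for simplicity.
    Weights come from the attention of each group's last token to the tokens in its group.
    """
    seq_len, dim = keys.shape
    groups = keys.view(seq_len // group_size, group_size, dim)    # (G, g, d)
    last = groups[:, -1, :]                                       # (G, d) last token of each group
    scores = torch.einsum("gd,gkd->gk", last, groups) / dim**0.5  # (G, g) similarity scores
    weights = F.softmax(scores, dim=-1)                           # importance of tokens within a group
    core = torch.einsum("gk,gkd->gd", weights, groups)            # (G, d) one core token per group
    return core

# Example: 16 tokens of dimension 8 with group size 4 -> 4 core tokens.
core_tokens = weighted_group_pooling(torch.randn(16, 8), group_size=4)
print(core_tokens.shape)  # torch.Size([4, 8])
```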
Effect of Group Size $g$. To investigate the effect of different group sizes $g$, we implement the proposed CCA-Attention with different values of $g$. In Table 8, as $g$ increases, the computational efficiency improves while the PPL increases. Upon closer examination, the smallest group size captures the most comprehensive information, which translates to the highest computational cost but also the best PPL. Conversely, an excessively large $g$ leads to an overemphasis on globality-aware attention, compressing information to the point where crucial semantic nuances may be overlooked, thereby curtailing performance. To strike a balance between computational efficiency and model performance, we select $g=16$ as the default training setting.
Table 8: Effect of the group size $g$ in terms of PPL and latency.

| Group size $g$ | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| PPL | 2.75 | 2.78 | 2.81 | 2.86 | 2.90 | 2.93 |
| Latency (ms) | 546.7 | 503.5 | 479.6 | 462.8 | 457.74 | 456.0 |
Effect of Local Window Size $s$. To systematically evaluate the influence of different local window sizes $s$, we implement the proposed CCA-Attention across a range of $s$. In Table 9, an increase in $s$ correlates with lower PPL, but this is counterbalanced by a rise in computational cost. A larger $s$ captures more contextual information from neighboring tokens but also increases computational demands. Conversely, a smaller $s$, indicative of a limited receptive field, constrains the exchange of information within the locality-preserving module, resulting in diminished performance. Striking a balance between computational efficiency and model efficacy, we opt for $s=1024$ as the default training setting.
Table 9: Effect of the local window size $s$ in terms of PPL and latency.

| Local window size $s$ | 256 | 512 | 1024 | 2048 | 4096 |
|---|---|---|---|---|---|
| PPL | 2.98 | 2.92 | 2.86 | 2.79 | 2.73 |
| Latency (ms) | 457.4 | 460.1 | 461.4 | 462.8 | 473.1 |
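As a back-of-the-envelope illustration of how $g$ and $s$ jointly control the cost (using $n = 16\mathrm{K}$, $g = 16$, and $s = 1024$ as example values drawn from the tables above; the fused kernel's exact cost model may differ), each query attends to roughly $n/g$ core tokens plus $s$ neighboring tokens instead of all $n$ tokens:

```latex
% Approximate number of keys attended per query under CCA-Attention vs. full self-attention.
\underbrace{\frac{n}{g}}_{\text{core tokens}} \;+\; \underbrace{s}_{\text{local window}}
  \;=\; \frac{16384}{16} + 1024
  \;=\; 2048
  \;\ll\; n = 16384 .
```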
Effect of Different Updating Strategies. As mentioned in Section 3.4, we have two updating strategies: 1) updating all the parameters during finetuning (full finetuning) and 2) only updating the parameters $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$ (partial finetuning). In Table 10, we compare these two variants of our method on the LongBench-E benchmark. The variant that updates all parameters during finetuning achieves better performance because it allows the model to fully adapt to our proposed attention. In contrast, partial finetuning, while limiting the model's adaptability due to fixed pre-trained features, still achieves competitive performance. This makes partial finetuning a practical choice in scenarios requiring rapid training or where computational resources are limited. Despite its constraints, partial finetuning can deliver performance close to that of full finetuning, offering a balanced trade-off between efficiency and accuracy.
Table 10: Comparisons of different updating strategies on LongBench-E.

| Strategies | Single-Doc. QA | Multi-Doc. QA | Sum. | FS. Learning | Synthetic | Code | Avg. |
|---|---|---|---|---|---|---|---|
| Partial Finetuning | 5.39 | 3.62 | 9.21 | 60.41 | 1.34 | 51.77 | 21.96 |
| Full Finetuning | 5.62 | 4.34 | 8.99 | 59.60 | 0.48 | 54.40 | 22.24 |
C.6 Training Convergence Curve
In the experiments, we finetune the LLaMA2-7B-32K and LLaMA2-7B-80K models equipped with our CCA-Attention on the SlimPajama (Cerebras, 2024) dataset for 1,000 iterations. We show the training convergence curves of both models in Figure 6. By minimizing the training loss, both LLaMA2-7B-32K and LLaMA2-7B-80K converge very quickly: the perplexity rapidly converges within approximately the first 100 iterations and remains stable over the full 1,000 iterations. These results not only demonstrate the effectiveness and training stability of our proposed CCA-Attention, but also establish its potential as a plug-and-play attention module for existing LLMs. Notably, the initial training loss of LLaMA2-7B-32K is higher than that of LLaMA2-7B-80K. This difference arises because LLaMA2-7B-32K is finetuned from the official LLaMA2-7B-4K model, which has a shorter context window and thus requires more significant adjustments to adapt to longer sequences. In contrast, LLaMA2-7B-80K is finetuned from a model pre-trained by Fu et al. (2024).
Figure 6: Training convergence curves of LLaMA2-7B-32K and LLaMA2-7B-80K equipped with our CCA-Attention.