
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI
research@deepseek.com

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.

Figure 1 | (a) MMLU accuracy vs. activated parameters, among different open-source models. (b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2.

Contents

1 Introduction
2 Architecture
 2.1 Multi-Head Latent Attention: Boosting Inference Efficiency
  2.1.1 Preliminaries: Standard Multi-Head Attention
  2.1.2 Low-Rank Key-Value Joint Compression
  2.1.3 Decoupled Rotary Position Embedding
  2.1.4 Comparison of Key-Value Cache
 2.2 DeepSeekMoE: Training Strong Models at Economical Costs
  2.2.1 Basic Architecture
  2.2.2 Device-Limited Routing
  2.2.3 Auxiliary Loss for Load Balance
  2.2.4 Token-Dropping Strategy
3 Pre-Training
 3.1 Experimental Setups
  3.1.1 Data Construction
  3.1.2 Hyper-Parameters
  3.1.3 Infrastructures
  3.1.4 Long Context Extension
 3.2 Evaluations
  3.2.1 Evaluation Benchmarks
  3.2.2 Evaluation Results
  3.2.3 Training and Inference Efficiency
4 Alignment
 4.1 Supervised Fine-Tuning
 4.2 Reinforcement Learning
 4.3 Evaluation Results
 4.4 Discussion
5 Conclusion, Limitation, and Future Work
A Contributions and Acknowledgments
B Full Formulas of MLA
C Ablation of Attention Mechanisms
 C.1 Ablation of MHA, GQA, and MQA
 C.2 Comparison Between MLA and MHA
D Discussion About Pre-Training Data Debiasing
E Additional Evaluations on Math and Code
F Evaluation Formats

1. Introduction

In the past few years, Large Language Models (LLMs) (Anthropic, 2023; Google, 2023; OpenAI, 2022, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework (Vaswani et al., 2017) with our proposed Multi-head Latent Attention (MLA) and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache of the Multi-Head Attention (MHA) (Vaswani et al., 2017) poses a significant obstacle to the inference efficiency of LLMs. Various approaches have been explored to address this issue, including Grouped-Query Attention (GQA) (Ainslie et al., 2023) and Multi-Query Attention (MQA) (Shazeer, 2019). However, these methods often compromise performance in their attempt to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, and meanwhile significantly reduces the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1(a)), economical training costs, and efficient inference throughput (Figure 1(b)), simultaneously.
We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect 1.5M conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 Chat (SFT). Finally, we follow DeepSeekMath (Shao et al., 2024) to employ Group Relative Policy Optimization (GRPO) to further align the model with human preference and produce DeepSeek-V2 Chat (RL).
We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and compare it with representative open-source models. Evaluation results show that even with only 21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source models and becomes the strongest open-source MoE language model. Figure 1(a) highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1(b), compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves a 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), an 8.97 overall score on MT-Bench (Zheng et al., 2023), and a 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that in Chinese, DeepSeek-V2 Chat (RL) outperforms all open-source models, and even beats most closed-source models.

Figure 2 | Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.
In the rest of this paper, we first provide a detailed description of the model architecture of DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the training data construction, hyper-parameter settings, infrastructures, long context extension, and the evaluation of model performance and efficiency (Section 3). Following this, we demonstrate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future work (Section 5).

2. Architecture

By and large, DeepSeek-V2 still adopts the Transformer architecture (Vaswani et al., 2017), where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other minor details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024).

2.1. Multi-Head Latent Attention: Boosting Inference Efficiency

Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache becomes a bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) are proposed. They require a smaller magnitude of KV cache, but their performance does not match MHA (we provide the ablation of MHA, GQA and MQA in Appendix C.1).
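The cache saving behind GQA and MQA can be sketched numerically. The dimensions below are toy assumptions for illustration, not any model's real configuration:

```python
import numpy as np

# Toy dimensions (illustrative, not any model's real configuration).
n_h, d_h = 8, 4        # query heads and per-head dimension
n_g = 2                # KV groups in GQA (MQA is the special case n_g = 1)
T, d = 5, 32           # sequence length and model width

rng = np.random.default_rng(0)
H = rng.standard_normal((T, d))

# GQA projects keys for only n_g heads; each group of n_h // n_g
# query heads shares the same key/value head.
W_K = rng.standard_normal((n_g * d_h, d))
k_cache = H @ W_K.T                    # shape (T, n_g * d_h)

# Per-token, per-layer KV-cache elements (keys + values):
mha_cache = 2 * n_h * d_h              # one K/V head per query head
gqa_cache = 2 * n_g * d_h              # one K/V head per group
print(mha_cache, gqa_cache)            # 64 16: GQA stores n_h/n_g times less
```

The cache shrinks by the factor n_h / n_g, which is exactly why a smaller number of shared key/value heads tends to trade quality for memory.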
For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix C.2.

2.1.1. Preliminaries: Standard Multi-Head Attention

We first introduce the standard MHA mechanism as background. Let $d$ be the embedding dimension, $n_h$ be the number of attention heads, $d_h$ be the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$, respectively:

$$\mathbf{q}_t = W^{Q} \mathbf{h}_t, \qquad \mathbf{k}_t = W^{K} \mathbf{h}_t, \qquad \mathbf{v}_t = W^{V} \mathbf{h}_t .$$

Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:

$$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t, \qquad [\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; \ldots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t, \qquad [\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; \ldots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t ,$$

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top} \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i}, \qquad \mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}] ,$$
Figure 3 | Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference.
where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively; $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate inference, so MHA needs to cache $2 n_h d_h l$ elements for each token. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.
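The preliminaries above can be sketched at toy scale in numpy. All dimensions here are illustrative assumptions, and the causal mask stands in for the Softmax over $j \le t$:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes (illustrative assumptions, not the paper's configuration).
d, n_h, d_h, T = 16, 4, 4, 6
rng = np.random.default_rng(0)
H = rng.standard_normal((T, d))                        # inputs h_1 .. h_T

W_Q, W_K, W_V = (rng.standard_normal((n_h * d_h, d)) for _ in range(3))
W_O = rng.standard_normal((d, n_h * d_h))

Q, K, V = H @ W_Q.T, H @ W_K.T, H @ W_V.T              # each (T, n_h*d_h)
# Slice into heads: (n_h, T, d_h).
Qh, Kh, Vh = (X.reshape(T, n_h, d_h).transpose(1, 0, 2) for X in (Q, K, V))

scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)     # (n_h, T, T)
causal = np.triu(np.ones((T, T), dtype=bool), k=1)     # mask positions j > t
scores = np.where(causal, -np.inf, scores)
O = softmax(scores) @ Vh                               # per-head outputs o_{t,i}
U = O.transpose(1, 0, 2).reshape(T, n_h * d_h) @ W_O.T # final outputs u_t

# Per token and per layer, MHA must cache 2 * n_h * d_h key/value elements.
print(U.shape, 2 * n_h * d_h)                          # (6, 16) 32
```

With $l$ layers, the per-token cache grows to $2 n_h d_h l$ elements, which is the quantity the following subsections set out to shrink.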

2.1.2. Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce the KV cache:

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t, \qquad \mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV}, \qquad \mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV},$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$, and $W^{UV}$ can be absorbed into $W^{O}$, we even do not need to compute keys and values out for attention. Figure 3 intuitively illustrates how the joint compression in MLA reduces the KV cache.
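A minimal sketch of the joint compression, assuming small toy dimensions: only the latent vector is cached, and the key up-projection can be folded into the query side so that attention scores are computed directly against the cache:

```python
import numpy as np

# Toy sizes: d_c << n_h * d_h is what makes the cache small.
d, n_h, d_h, d_c, T = 32, 8, 8, 4, 10
rng = np.random.default_rng(0)
H = rng.standard_normal((T, d))                # inputs h_1 .. h_T

W_DKV = rng.standard_normal((d_c, d))          # down-projection
W_UK = rng.standard_normal((n_h * d_h, d_c))   # key up-projection
W_UV = rng.standard_normal((n_h * d_h, d_c))   # value up-projection

# During inference, only the latent vectors are cached.
C_KV = H @ W_DKV.T                             # (T, d_c)

# Keys and values can be re-materialized from the cache on demand ...
K = C_KV @ W_UK.T
V = C_KV @ W_UV.T

# ... but W_UK can also be folded into the query side, so scores are
# computed directly against the cached latents:
q = rng.standard_normal(n_h * d_h)             # some (already projected) query
direct = K @ q                                 # scores via explicit keys
absorbed = C_KV @ (W_UK.T @ q)                 # scores via the cache alone
assert np.allclose(direct, absorbed)

print(C_KV.size, K.size + V.size)              # 40 cached vs. 1280 elements
```

Caching the 40-element latent instead of the 1280 explicit key/value elements mirrors the $d_c l$ vs. $2 n_h d_h l$ comparison, at toy scale.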
Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even if it cannot reduce the KV cache:

$$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t, \qquad \mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q},$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.

2.1.3. Decoupled Rotary Position Embedding

Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression.
Attention Mechanism              KV Cache per Token (# Element)                   Capability
Multi-Head Attention (MHA)       $2 n_h d_h l$                                    Strong
Grouped-Query Attention (GQA)    $2 n_g d_h l$                                    Moderate
Multi-Query Attention (MQA)      $2 d_h l$                                        Weak
MLA (Ours)                       $(d_c + d_h^{R}) l \approx \frac{9}{2} d_h l$    Stronger

Table 1 | Comparison of the KV cache per token among different attention mechanisms. $n_h$ denotes the number of attention heads, $d_h$ denotes the dimension per attention head, $l$ denotes the number of layers, $n_g$ denotes the number of groups in GQA, and $d_c$ and $d_h^{R}$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_c$ is set to $4 d_h$ and $d_h^{R}$ is set to $\frac{d_h}{2}$. So, its KV cache is equal to GQA with only 2.25 groups, but its performance is stronger than MHA.
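The counts in Table 1 can be checked with a few lines. The `n_h` and `n_g` values below are illustrative assumptions (this section does not state DeepSeek-V2's head count), while the MLA entry uses the stated settings $d_c = 4 d_h$ and $d_h^R = d_h / 2$:

```python
# Per-token KV-cache sizes from Table 1, expressed in units of d_h per layer.
# n_h and n_g below are illustrative assumptions; the MLA entry uses the
# stated DeepSeek-V2 settings d_c = 4*d_h and d_h^R = d_h/2.
def kv_cache_per_token(n_h, n_g):
    return {
        "MHA": 2 * n_h,    # 2 * n_h * d_h
        "GQA": 2 * n_g,    # 2 * n_g * d_h
        "MQA": 2,          # 2 * d_h
        "MLA": 4 + 0.5,    # (d_c + d_h^R) / d_h = 4.5
    }

sizes = kv_cache_per_token(n_h=128, n_g=8)
# MLA's 4.5*d_h matches a hypothetical GQA cache 2*n_g*d_h with n_g = 2.25:
equiv_groups = sizes["MLA"] / 2
print(sizes["MHA"], sizes["MLA"], equiv_groups)  # 256 4.5 2.25
```

Setting $2 n_g d_h = 4.5 d_h$ gives $n_g = 2.25$, reproducing the 2.25-group equivalence stated in the caption.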
To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys $\mathbf{k}_t^{C}$, $W^{UK}$ in Equation 10 will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ cannot be absorbed into $W^{Q}$ any more during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$, and matrix multiplication does not obey a commutative law. As a result, we must recompute the keys for all the prefix tokens during inference, which will significantly hinder the inference efficiency.
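The incompatibility can be demonstrated numerically: without RoPE, all scores against the cached latents can be computed from a single pre-folded vector, but a position-dependent rotation between the query and the up-projection breaks this. All dimensions and values below are toy assumptions:

```python
import numpy as np

def rot(theta):
    # A 2x2 rotation block; RoPE applies such position-dependent rotations.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
d_c, d_k = 3, 2                                      # toy latent / key dims
W_UK = rng.standard_normal((d_k, d_c))               # key up-projection
c_kv = [rng.standard_normal(d_c) for _ in range(4)]  # cached latents
q = rng.standard_normal(d_k)                         # a query vector

# Without RoPE, every score only needs the single pre-folded vector W_UK^T q:
folded = W_UK.T @ q
plain = [q @ (W_UK @ c) for c in c_kv]
assert np.allclose(plain, [folded @ c for c in c_kv])

# With RoPE on the keys, a rotation R_j that depends on position j sits
# between q and W_UK, so no single folded vector reproduces all scores:
roped = [q @ rot(0.1 * (j + 1)) @ W_UK @ c for j, c in enumerate(c_kv)]
assert not np.allclose(plain, roped)
```

Because each position carries a different rotation, the product $R_j W^{UK}$ changes with $j$ and cannot be pre-merged into the query projection once and for all.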
As a solution, we propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^{R}}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^{R}}$ to carry RoPE, where $d_h^{R}$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

$$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}), \qquad \mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_t),$$

$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}], \qquad \mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}],$$

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^{R}}}\right) \mathbf{v}_{j,i}^{C}, \qquad \mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$$
where $W^{QR} \in \mathbb{R}^{d_h^{R} n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^{R} \times d}$ are matrices to produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot ; \cdot]$ denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing $(d_c + d_h^{R}) l$ elements.
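A toy sketch of the decoupled strategy: the `rope` helper below is a minimal stand-in for a real RoPE implementation, and all dimensions are illustrative. The concatenated score splits into a position-free part from the compressed keys and a RoPE part from the shared decoupled key:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Minimal RoPE stand-in: rotate paired halves of x by position-dependent angles.
    half = x.shape[-1] // 2
    theta = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

rng = np.random.default_rng(0)
d_h, d_hR, T = 8, 4, 5                       # toy per-head / decoupled dims
q_C = rng.standard_normal((T, d_h))          # compressed (position-free) queries
k_C = rng.standard_normal((T, d_h))          # keys re-materialized from c^KV
q_R = np.stack([rope(rng.standard_normal(d_hR), t) for t in range(T)])
k_R = np.stack([rope(rng.standard_normal(d_hR), t) for t in range(T)])  # shared RoPE key

# The concatenated score splits into a position-free part and a RoPE part:
t, j = 4, 2
score = np.concatenate([q_C[t], q_R[t]]) @ np.concatenate([k_C[j], k_R[j]])
assert np.isclose(score, q_C[t] @ k_C[j] + q_R[t] @ k_R[j])

# Per token and per layer, the cache holds c^KV plus the decoupled key:
d_c = 4 * d_h                                # DeepSeek-V2 setting d_c = 4*d_h
print(d_c + d_hR)                            # (d_c + d_h^R) elements
```

Because RoPE touches only the small decoupled components, the absorption trick for the compressed part is preserved while positional information still enters the score.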