DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.
Figure 1 | (a) MMLU accuracy vs. activated parameters, among different open-source models. (b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2.
Contents

1 Introduction
2 Architecture
2.1 Multi-Head Latent Attention: Boosting Inference Efficiency
2.1.1 Preliminaries: Standard Multi-Head Attention
2.1.2 Low-Rank Key-Value Joint Compression
2.1.3 Decoupled Rotary Position Embedding
2.1.4 Comparison of Key-Value Cache
2.2 DeepSeekMoE: Training Strong Models at Economical Costs
2.2.1 Basic Architecture
2.2.2 Device-Limited Routing
2.2.3 Auxiliary Loss for Load Balance
2.2.4 Token-Dropping Strategy
3 Pre-Training
3.1 Experimental Setups
3.1.1 Data Construction
3.1.2 Hyper-Parameters
3.1.3 Infrastructures
3.1.4 Long Context Extension
3.2 Evaluations
3.2.1 Evaluation Benchmarks
3.2.2 Evaluation Results
3.2.3 Training and Inference Efficiency
4 Alignment
4.1 Supervised Fine-Tuning
4.2 Reinforcement Learning
4.3 Evaluation Results
4.4 Discussion
5 Conclusion, Limitation, and Future Work
A Contributions and Acknowledgments
B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE
B.1 Model Description
B.2 Performance Evaluation
C Full Formulas of MLA
D Ablation of Attention Mechanisms
D.1 Ablation of MHA, GQA, and MQA
D.2 Comparison Between MLA and MHA
E Discussion About Pre-Training Data Debiasing
F Additional Evaluations on Math and Code
G Evaluation Formats
1. Introduction
In the past few years, Large Language Models (LLMs) (Anthropic, 2023; Google, 2023; OpenAI, 2022, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework (Vaswani et al., 2017) with our proposed Multi-head Latent Attention (MLA) and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache of the Multi-Head Attention (MHA) (Vaswani et al., 2017) poses a significant obstacle to the inference efficiency of LLMs. Various approaches have been explored to address this issue, including Grouped-Query Attention (GQA) (Ainslie et al., 2023) and Multi-Query Attention (MQA) (Shazeer, 2019). However, these methods often compromise performance in their attempt to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, and meanwhile significantly reduces the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1(a)), economical training costs, and efficient inference throughput (Figure 1(b)), simultaneously.
We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect 1.5M conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 Chat (SFT). Finally, we follow DeepSeekMath (Shao et al., 2024) to employ Group Relative Policy Optimization (GRPO) to further align the model with human preference and produce DeepSeek-V2 Chat (RL).
We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and compare it with representative open-source models. Evaluation results show that even with only 21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source models and becomes the strongest open-source MoE language model. Figure 1(a) highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1(b), compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and
Figure 2 | Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.
DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves a 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), an 8.97 overall score on MT-Bench (Zheng et al., 2023), and a 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that, in Chinese, DeepSeek-V2 Chat (RL) outperforms all open-source models and even beats most closed-source models.
In order to facilitate further research and development on MLA and DeepSeekMoE, we also release DeepSeek-V2-Lite, a smaller model equipped with MLA and DeepSeekMoE, for the open-source community. It has a total of 15.7B parameters, of which 2.4B are activated for each token. Detailed descriptions of DeepSeek-V2-Lite can be found in Appendix B.
In the rest of this paper, we first provide a detailed description of the model architecture of DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the training data construction, hyper-parameter settings, infrastructures, long context extension, and the evaluation of model performance and efficiency (Section 3). Following this, we demonstrate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future work (Section 5).
2. Architecture
By and large, DeepSeek-V2 is still based on the Transformer architecture (Vaswani et al., 2017), where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other minor details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024).
2.1. Multi-Head Latent Attention: Boosting Inference Efficiency

Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache becomes a bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) were proposed. They require a smaller amount of KV cache, but their performance does not match MHA (we provide an ablation of MHA, GQA, and MQA in Appendix D.1).
For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, while requiring a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix D.2.
2.1.1. Preliminaries: Standard Multi-Head Attention
We first introduce the standard MHA mechanism as background. Let $d$ denote the embedding dimension, $n_h$ the number of attention heads, $d_h$ the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$, respectively:

$$\mathbf{q}_t = W^{Q} \mathbf{h}_t, \qquad \mathbf{k}_t = W^{K} \mathbf{h}_t, \qquad \mathbf{v}_t = W^{V} \mathbf{h}_t.$$
Figure 3 | Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference.
Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:

$$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t, \qquad [\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; \ldots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t, \qquad [\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; \ldots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t,$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i},$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$$
where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively, and $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate inference, so MHA needs to cache $2 n_h d_h l$ elements for each token, where $l$ denotes the number of layers. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.
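To make the cache arithmetic concrete, the following is a minimal NumPy sketch of one MHA layer in decode mode with an explicit per-layer KV cache. The dimensions, weight names (`W_Q`, `W_K`, `W_V`, `W_O`), and the `mha_decode_step` helper are illustrative choices for this sketch, not DeepSeek-V2's actual configuration or code; the point is simply that each generated token appends $2 n_h d_h$ numbers to the cache of every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h = 1024, 16, 64                # illustrative sizes, not DeepSeek-V2's
W_Q = rng.standard_normal((n_h * d_h, d))
W_K = rng.standard_normal((n_h * d_h, d))
W_V = rng.standard_normal((n_h * d_h, d))
W_O = rng.standard_normal((d, n_h * d_h))

def mha_decode_step(h_t, k_cache, v_cache):
    """One decoding step of standard MHA with a per-layer KV cache."""
    q = (W_Q @ h_t).reshape(n_h, d_h)
    k = (W_K @ h_t).reshape(n_h, d_h)
    v = (W_V @ h_t).reshape(n_h, d_h)
    k_cache.append(k)                     # the cache grows by 2 * n_h * d_h
    v_cache.append(v)                     # numbers per token for this layer
    K = np.stack(k_cache, axis=1)         # (n_h, t, d_h)
    V = np.stack(v_cache, axis=1)
    scores = np.einsum('hd,htd->ht', q, K) / np.sqrt(d_h)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    o = np.einsum('ht,htd->hd', attn, V)
    return W_O @ o.reshape(-1)

k_cache, v_cache = [], []
for _ in range(5):                        # decode five tokens
    _ = mha_decode_step(rng.standard_normal(d), k_cache, v_cache)

print(2 * n_h * d_h)                      # 2048 cached numbers per token per layer
```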
2.1.2. Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce the KV cache:

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t,$$
$$\mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV},$$
$$\mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV},$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$ and $W^{UV}$ can be absorbed into $W^{O}$, we do not even need to compute keys and values explicitly for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache.
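The absorption claim can be checked numerically. The sketch below, under the same illustrative conventions (invented dimensions and weight names, not the paper's implementation), verifies that per-head attention logits computed from explicitly reconstructed keys $W^{UK}\mathbf{c}_j^{KV}$ equal the logits computed with $W^{UK}$ folded into the query side, so only the $d_c$-dimensional latent ever has to be cached.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h, d_c = 1024, 16, 64, 128           # illustrative; d_c << n_h * d_h

W_DKV = rng.standard_normal((d_c, d))          # down-projection
W_UK  = rng.standard_normal((n_h * d_h, d_c))  # key up-projection
W_Q   = rng.standard_normal((n_h * d_h, d))

h_j, h_t = rng.standard_normal(d), rng.standard_normal(d)
c_j = W_DKV @ h_j                              # the only thing MLA caches for token j
q = (W_Q @ h_t).reshape(n_h, d_h)
k = (W_UK @ c_j).reshape(n_h, d_h)             # keys reconstructed from the latent

# Per-head logits computed the naive way, with keys materialized ...
naive = np.einsum('hd,hd->h', q, k)

# ... and with W_UK absorbed into the query side, so keys are never materialized.
W_UK_heads = W_UK.reshape(n_h, d_h, d_c)
absorbed_q = np.einsum('hdc,hd->hc', W_UK_heads, q)   # (n_h, d_c)
absorbed = absorbed_q @ c_j

print(np.allclose(naive, absorbed))            # True
print(d_c, "cached numbers per token per layer vs", 2 * n_h * d_h, "for MHA")
```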
Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even though it cannot reduce the KV cache:

$$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t,$$
$$\mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q},$$
where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.
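For completeness, a brief sketch of the query path under the same illustrative conventions (assumed dimensions and names); it merely restates the two formulas above and is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h, d_c_q = 1024, 16, 64, 192         # illustrative; d_c' << n_h * d_h

W_DQ = rng.standard_normal((d_c_q, d))         # query down-projection
W_UQ = rng.standard_normal((n_h * d_h, d_c_q)) # query up-projection

h_t = rng.standard_normal(d)
c_q = W_DQ @ h_t                               # compressed query latent (activation)
q_C = (W_UQ @ c_q).reshape(n_h, d_h)           # per-head content queries

# Queries are not cached at inference, so this compression leaves the KV cache
# unchanged; it only shrinks the query activation kept during training.
print(c_q.shape, q_C.shape)                    # (192,) (16, 64)
```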
2.1.3. Decoupled Rotary Position Embedding
Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression. To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys $\mathbf{k}_t^{C}$, then $W^{UK}$ in the key up-projection will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ can no longer be absorbed into $W^{Q}$ during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$, and matrix multiplication does not obey the commutative law. As a result, we would have to recompute the keys for all the prefix tokens during inference, which would significantly hinder inference efficiency.
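The incompatibility can be illustrated with a toy rotation. The sketch below uses a simplified block-diagonal rotation in the spirit of RoPE (not the exact RoPE layout) to show that the product $R(\mathrm{pos})\,W^{UK}$ changes with position, so no single position-independent matrix can stand in for the absorbed product.

```python
import numpy as np

def rope_rotation(pos, dim, base=10000.0):
    """Simplified block-diagonal rotation in the spirit of RoPE (dim must be even)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos / (base ** (2 * i / dim))
        c, s = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d_h, d_c = 8, 16                               # tiny illustrative sizes
W_UK_i = rng.standard_normal((d_h, d_c))       # one head's key up-projection

# With RoPE applied to the keys, a score involves q^T R(j) W_UK_i c_j.  Folding
# W_UK_i into the query side would require one fixed matrix equal to
# R(j) @ W_UK_i for every position j, but that product depends on j:
print(np.allclose(rope_rotation(3, d_h) @ W_UK_i,
                  rope_rotation(7, d_h) @ W_UK_i))   # False
```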
As a solution, we propose the decoupled RoPE strategy, which uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^R}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^R}$ to carry RoPE, where $d_h^R$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

$$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}),$$
$$\mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_t),$$
$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],$$
$$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}],$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^{C},$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$$
where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ are the matrices that produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot\,;\cdot]$ denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing $(d_c + d_h^R) l$ elements.
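Putting the pieces together, here is a sketch of a single decoupled-RoPE MLA step under the same illustrative conventions. The `rope` helper is a placeholder (identity) so the sketch stays runnable, and all dimensions are assumptions rather than DeepSeek-V2's settings; the final line counts the $(d_c + d_h^R)$ numbers that actually need to be cached per token per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h = 1024, 16, 64
d_c, d_c_q, d_h_R = 128, 192, 32                # illustrative compression / RoPE dims

W_DKV = rng.standard_normal((d_c, d))
W_UK  = rng.standard_normal((n_h * d_h, d_c))
W_DQ  = rng.standard_normal((d_c_q, d))
W_UQ  = rng.standard_normal((n_h * d_h, d_c_q))
W_QR  = rng.standard_normal((n_h * d_h_R, d_c_q))   # decoupled query projection
W_KR  = rng.standard_normal((d_h_R, d))              # shared decoupled key projection

def rope(x, pos):
    """Placeholder for RoPE: identity here, only to keep the sketch runnable."""
    return x

h_t, pos = rng.standard_normal(d), 5
c_kv, c_q = W_DKV @ h_t, W_DQ @ h_t

q_C = (W_UQ @ c_q).reshape(n_h, d_h)                 # content queries
q_R = rope(W_QR @ c_q, pos).reshape(n_h, d_h_R)      # decoupled (RoPE) queries
k_C = (W_UK @ c_kv).reshape(n_h, d_h)                # content keys
k_R = rope(W_KR @ h_t, pos)                          # shared decoupled key

q = np.concatenate([q_C, q_R], axis=-1)              # (n_h, d_h + d_h_R)
k = np.concatenate([k_C, np.broadcast_to(k_R, (n_h, d_h_R))], axis=-1)
logits = np.einsum('hd,hd->h', q, k) / np.sqrt(d_h + d_h_R)

# Only c_kv and k_R are cached: (d_c + d_h_R) numbers per token per layer.
print(logits.shape, c_kv.size + k_R.size)            # (16,) 160
```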
In order to demonstrate the complete computation process of MLA, we also organize and provide its full formulas in Appendix C.
2.1.4. Comparison of Key-Value Cache
We demonstrate a comparison of the per-token KV cache among different attention mechanisms in Table 1. MLA requires only a small amount of KV cache, equal to that of GQA with only 2.25 groups, yet achieves stronger performance than MHA.
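As a sanity check on the 2.25-group figure, the snippet below compares per-token, per-layer KV cache element counts for the four mechanisms. The choices $d_c = 4 d_h$ and $d_h^R = d_h / 2$ are assumptions that reproduce the reported equivalence ($4.5\, d_h = 2.25 \times 2 d_h$); the head count and head dimension are likewise illustrative.

```python
# Per-token, per-layer KV cache element counts for the attention variants compared
# in Table 1.  n_g is the number of GQA groups; d_c = 4*d_h and d_h_R = d_h // 2 are
# the assumed MLA settings that yield the "2.25 groups" equivalence.
def kv_cache_per_token(n_h, d_h, n_g, d_c, d_h_R):
    return {
        "MHA": 2 * n_h * d_h,
        "GQA": 2 * n_g * d_h,
        "MQA": 2 * d_h,
        "MLA": d_c + d_h_R,
    }

n_h, d_h = 128, 128                                  # illustrative head count / dim
sizes = kv_cache_per_token(n_h, d_h, n_g=8, d_c=4 * d_h, d_h_R=d_h // 2)
print(sizes)
print("MLA equals GQA with", sizes["MLA"] / (2 * d_h), "groups")   # 2.25
```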