
License: arXiv.org perpetual non-exclusive license
arXiv:2405.04434v5 [cs.CL] 19 Jun 2024

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI
research@deepseek.com
Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.


Figure 1: (a) MMLU accuracy vs. activated parameters, among different open-source models. (b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2.

1 Introduction

In the past few years, Large Language Models (LLMs) (OpenAI, 2022, 2023; Anthropic, 2023; Google, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.

We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework (Vaswani et al., 2017) with our proposed Multi-head Latent Attention (MLA) and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache of the Multi-Head Attention (MHA) (Vaswani et al., 2017) poses a significant obstacle to the inference efficiency of LLMs. Various approaches have been explored to address this issue, including Grouped-Query Attention (GQA) (Ainslie et al., 2023) and Multi-Query Attention (MQA) (Shazeer, 2019). However, these methods often compromise performance in their attempt to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, and meanwhile significantly reduces the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1), economical training costs, and efficient inference throughput (Figure 1), simultaneously.

Figure 2: Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.

We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect 1.5M conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 Chat (SFT). Finally, we follow DeepSeekMath (Shao et al., 2024) to employ Group Relative Policy Optimization (GRPO) to further align the model with human preference and produce DeepSeek-V2 Chat (RL).

We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and compare it with representative open-source models. Evaluation results show that even with only 21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source models and becomes the strongest open-source MoE language model. Figure 1 highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1, compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall score on MT-Bench (Zheng et al., 2023), and 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that in Chinese, DeepSeek-V2 Chat (RL) outperforms all of open-source models, and even beats most of closed-source models.

In order to facilitate further research and development on MLA and DeepSeekMoE, we also release DeepSeek-V2-Lite, a smaller model equipped with MLA and DeepSeekMoE, for the open-source community. It has a total of 15.7B parameters, where 2.4B are activated for each token. Detailed descriptions about DeepSeek-V2-Lite can be found in Appendix B.

In the rest of this paper, we first provide a detailed description of the model architecture of DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the training data construction, hyper-parameter settings, infrastructures, long context extension, and the evaluation of model performance and efficiency (Section 3). Following this, we demonstrate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future work (Section 5).

2 Architecture

By and large, DeepSeek-V2 is still based on the Transformer architecture (Vaswani et al., 2017), where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other minor details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024).

2.1 Multi-Head Latent Attention: Boosting Inference Efficiency

Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache becomes a bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) have been proposed. They require a smaller magnitude of KV cache, but their performance does not match MHA (we provide the ablation of MHA, GQA, and MQA in Appendix D.1).

For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix D.2.

2.1.1 Preliminaries: Standard Multi-Head Attention

We first introduce the standard MHA mechanism as background. Let $d$ be the embedding dimension, $n_h$ be the number of attention heads, $d_h$ be the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$, respectively:

$\mathbf{q}_t = W^{Q} \mathbf{h}_t,$  (1)
$\mathbf{k}_t = W^{K} \mathbf{h}_t,$  (2)
$\mathbf{v}_t = W^{V} \mathbf{h}_t,$  (3)

Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:

$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; ...; \mathbf{q}_{t,n_h}] = \mathbf{q}_t,$  (4)
$[\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; ...; \mathbf{k}_{t,n_h}] = \mathbf{k}_t,$  (5)
$[\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; ...; \mathbf{v}_{t,n_h}] = \mathbf{v}_t,$  (6)
$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i},$  (7)
$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; ...; \mathbf{o}_{t,n_h}],$  (8)

where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively; $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate inference, so MHA needs to cache $2 n_h d_h l$ elements for each token. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.
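To make the cache accounting above concrete, the following is a minimal PyTorch sketch (ours, not the released implementation) of single-token MHA decoding with an explicit KV cache. The sizes `d`, `n_h`, and `d_h` are illustrative placeholders, and batching, causal masking, and RoPE are omitted.

```python
import torch

d, n_h, d_h = 1024, 16, 64           # embedding dim, heads, per-head dim (illustrative)
W_Q = torch.randn(n_h * d_h, d) / d ** 0.5
W_K = torch.randn(n_h * d_h, d) / d ** 0.5
W_V = torch.randn(n_h * d_h, d) / d ** 0.5
W_O = torch.randn(d, n_h * d_h) / d ** 0.5

k_cache, v_cache = [], []            # grows by 2 * n_h * d_h elements per token per layer

def mha_decode_step(h_t):            # h_t: (d,) hidden state of the current token
    q = (W_Q @ h_t).view(n_h, d_h)                          # Eqs. (1) and (4)
    k_cache.append((W_K @ h_t).view(n_h, d_h))              # Eqs. (2) and (5), cached
    v_cache.append((W_V @ h_t).view(n_h, d_h))              # Eqs. (3) and (6), cached
    K = torch.stack(k_cache, dim=0)                         # (t, n_h, d_h)
    V = torch.stack(v_cache, dim=0)                         # (t, n_h, d_h)
    scores = torch.einsum("hd,thd->ht", q, K) / d_h ** 0.5  # per-head logits, Eq. (7)
    attn = torch.softmax(scores, dim=-1)
    o = torch.einsum("ht,thd->hd", attn, V)                 # weighted sum of values
    return W_O @ o.reshape(n_h * d_h)                       # Eq. (8)

for _ in range(4):                   # decode a few tokens
    u_t = mha_decode_step(torch.randn(d))
print("cached elements per token per layer:", 2 * n_h * d_h)
```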

Figure 3: Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference.

2.1.2 Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce KV cache:

$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t,$  (9)
$\mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV},$  (10)
$\mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV},$  (11)

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$ and $W^{UV}$ can be absorbed into $W^{O}$, we do not even need to compute explicit keys and values for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache.
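As an illustration of Equations (9)-(11) and of the matrix-absorption trick mentioned above, the following NumPy sketch (ours; all sizes are illustrative) checks that, when RoPE is not involved, the attention logit can be computed directly from the cached latent $\mathbf{c}_j^{KV}$ by folding $W^{UK}$ into the query side.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h, d_c = 1024, 16, 64, 256     # d_c << n_h * d_h (illustrative sizes)
W_DKV = rng.standard_normal((d_c, d)) / np.sqrt(d)
W_UK  = rng.standard_normal((n_h * d_h, d_c)) / np.sqrt(d_c)
W_Q   = rng.standard_normal((n_h * d_h, d)) / np.sqrt(d)

h_t, h_j = rng.standard_normal(d), rng.standard_normal(d)
c_j = W_DKV @ h_j                        # Eq. (9): cached latent, only d_c elements
k_j = W_UK @ c_j                         # Eq. (10): up-projected key (need not be cached)
q_t = W_Q @ h_t                          # standard query projection

# Attention logit computed the naive way vs. with W_UK absorbed into the query side.
score_naive    = q_t @ k_j
score_absorbed = (W_UK.T @ q_t) @ c_j    # reads only the cached latent c_j
assert np.allclose(score_naive, score_absorbed)

print("per-token cache: MHA =", 2 * n_h * d_h, "elements; MLA latent =", d_c)
```

The value side works analogously: $W^{UV}$ can be folded into $W^{O}$, so only the $d_c$-dimensional latent is ever read from the cache.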

Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even if it cannot reduce the KV cache:

$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t,$  (12)
$\mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q},$  (13)

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$, $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.

2.1.3 Decoupled Rotary Position Embedding

Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression. To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE for the keys $\mathbf{k}_t^{C}$, $W^{UK}$ in Equation 10 will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ cannot be absorbed into $W^{Q}$ any more during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$ and matrix multiplication does not obey a commutative law. As a result, we must recompute the keys for all the prefix tokens during inference, which will significantly hinder the inference efficiency.

As a solution, we propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^{R}}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^{R}}$ to carry RoPE, where $d_h^{R}$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; ...; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}),$  (14)
$\mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_t),$  (15)
$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],$  (16)
$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}],$  (17)
$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^{R}}}\right) \mathbf{v}_{j,i}^{C},$  (18)
$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; ...; \mathbf{o}_{t,n_h}],$  (19)

where $W^{QR} \in \mathbb{R}^{d_h^{R} n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^{R} \times d}$ are the matrices that produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing $(d_c + d_h^{R}) l$ elements.
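The sketch below (ours) walks through the decoupled path of Equations (14)-(18) for a single head: the compressed "content" components carry no positional information, while a small rotary query per head and one rotary key shared across heads carry RoPE, and only the latent plus the shared rotary key enter the cache. The RoPE helper uses the common rotate-half formulation; the dimensions are illustrative, with $d_c = 4 d_h$ and $d_h^{R} = d_h / 2$ following the DeepSeek-V2 settings noted in Table 1 below.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate-half RoPE on the last dimension (must be even); one common formulation."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., : dim // 2], x[..., dim // 2:]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

rng = np.random.default_rng(0)
d_h, d_hR = 64, 32                       # content and rotary per-head dims (illustrative)

# Per-head "content" parts come from the compressed path (Eqs. 10 and 13);
# random stand-ins are used here for brevity.
q_C, k_C = rng.standard_normal(d_h), rng.standard_normal(d_h)

# Decoupled rotary parts: one query per head (Eq. 14), one key shared by all heads (Eq. 15).
q_R = rope(rng.standard_normal(d_hR), pos=7)
k_R = rope(rng.standard_normal(d_hR), pos=3)

q = np.concatenate([q_C, q_R])           # Eq. (16)
k = np.concatenate([k_C, k_R])           # Eq. (17)
score = q @ k / np.sqrt(d_h + d_hR)      # scaling in Eq. (18)

d_c = 4 * d_h                            # DeepSeek-V2 setting from Table 1
print("per-token KV cache elements per layer:", d_c + d_hR)   # (d_c + d_h^R)
```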

In order to demonstrate the complete computation process of MLA, we also organize and provide its full formulas in Appendix C.

| Attention Mechanism | KV Cache per Token (# Element) | Capability |
| --- | --- | --- |
| Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong |
| Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate |
| Multi-Query Attention (MQA) | $2 d_h l$ | Weak |
| MLA (Ours) | $(d_c + d_h^{R}) l \approx \frac{9}{2} d_h l$ | Stronger |

Table 1: Comparison of the KV cache per token among different attention mechanisms. $n_h$ denotes the number of attention heads, $d_h$ denotes the dimension per attention head, $l$ denotes the number of layers, $n_g$ denotes the number of groups in GQA, and $d_c$ and $d_h^{R}$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_c$ is set to $4 d_h$ and $d_h^{R}$ is set to $\frac{d_h}{2}$. So, its KV cache is equal to GQA with only 2.25 groups, but its performance is stronger than MHA.

2.1.4 Comparison of Key-Value Cache

We demonstrate a comparison of the KV cache per token among different attention mechanisms in Table 1. MLA requires only a small amount of KV cache, equal to GQA with only 2.25 groups, but can achieve stronger performance than MHA.
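As a quick sanity check of Table 1, the snippet below tabulates the per-token KV cache element counts for the four mechanisms. The head count, head dimension, group count, and layer count are illustrative placeholders rather than the actual DeepSeek-V2 configuration, while $d_c$ and $d_h^{R}$ follow the relations given in Table 1.

```python
n_h, d_h, n_g, l = 16, 64, 4, 30          # heads, head dim, GQA groups, layers (illustrative)
d_c, d_hR = 4 * d_h, d_h // 2             # MLA settings from Table 1: d_c = 4*d_h, d_h^R = d_h/2

cache_per_token = {
    "MHA": 2 * n_h * d_h * l,
    "GQA": 2 * n_g * d_h * l,
    "MQA": 2 * d_h * l,
    "MLA": (d_c + d_hR) * l,              # = (9/2) * d_h * l, i.e. GQA with 2.25 groups
}
for name, n in cache_per_token.items():
    print(f"{name}: {n} elements per token")
```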

2.2 DeepSeekMoE: Training Strong Models at Economical Costs

2.2.1 Basic Architecture

For FFNs, we employ the DeepSeekMoE architecture (Dai et al., 2024). DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization and more accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge redundancy among routed experts. With the same number of activated and total expert parameters, DeepSeekMoE can outperform conventional MoE architectures like GShard (Lepikhin et al., 2021) by a large margin.

Let $\mathbf{u}_t$ be the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:

$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \operatorname{FFN}^{(s)}_i(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}^{(r)}_i(\mathbf{u}_t),$  (20)
$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{ s_{j,t} \,|\, 1 \leqslant j \leqslant N_r \}, K_r), \\ 0, & \text{otherwise}, \end{cases}$  (21)
$s_{i,t} = \operatorname{Softmax}_i(\mathbf{u}_t^{T} \mathbf{e}_i),$  (22)

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}^{(s)}_i(\cdot)$ and $\operatorname{FFN}^{(r)}_i(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gate value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid of the $i$-th routed expert in this layer; and $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts.
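A compact PyTorch sketch of Equations (20)-(22) is given below (ours, for illustration only; the released model uses a far more optimized implementation and different hyper-parameters). Shared experts are always applied, routed experts are selected per token by the top-$K_r$ affinities, and the gate value of a selected expert is its raw affinity $s_{i,t}$.

```python
import torch
import torch.nn as nn

class DeepSeekMoESketch(nn.Module):
    """Illustrative sketch of Eqs. (20)-(22); not the released implementation."""
    def __init__(self, d, d_ff, n_shared, n_routed, k_routed):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d, d_ff), nn.SiLU(), nn.Linear(d_ff, d))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.centroids = nn.Parameter(torch.randn(n_routed, d))   # e_i in Eq. (22)
        self.k = k_routed                                          # K_r

    def forward(self, u):                      # u: (T, d) token representations
        s = torch.softmax(u @ self.centroids.T, dim=-1)           # s_{i,t}, Eq. (22)
        topv, topi = torch.topk(s, self.k, dim=-1)                 # g_{i,t}, Eq. (21)
        out = u.clone()                                            # residual term in Eq. (20)
        for expert in self.shared:                                 # shared experts: always on
            out = out + expert(u)
        for t in range(u.shape[0]):                                # routed experts (naive loop)
            for g, i in zip(topv[t], topi[t]):
                out[t] = out[t] + g * self.routed[int(i)](u[t])
        return out

moe = DeepSeekMoESketch(d=64, d_ff=128, n_shared=2, n_routed=8, k_routed=2)
y = moe(torch.randn(5, 64))                    # 5 tokens
```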

2.2.2 Device-Limited Routing

We design a device-limited routing mechanism to bound MoE-related communication costs. When expert parallelism is employed, the routed experts will be distributed across multiple devices. For each token, its MoE-related communication frequency is proportional to the number of devices covered by its target experts. Due to the fine-grained expert segmentation in DeepSeekMoE, the number of activated experts can be large, so the MoE-related communication will be more costly if we apply expert parallelism.

For DeepSeek-V2, beyond the naive top-K selection of routed experts, we additionally ensure that the target experts of each token will be distributed on at most $M$ devices. To be specific, for each token, we first select the $M$ devices that contain the experts with the highest affinity scores. Then, we perform top-K selection among the experts on these $M$ devices. In practice, we find that when $M \geqslant 3$, the device-limited routing can achieve a good performance roughly aligned with the unrestricted top-K routing.
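A minimal sketch of this two-step selection, assuming a static expert-to-device placement, is given below; the helper is illustrative rather than the deployed routing kernel.

```python
import numpy as np

def device_limited_topk(s, expert_to_device, k_r, m):
    """Select K_r experts for one token while touching at most M devices (sketch).

    s:                (N_r,) affinity scores s_{i,t} for the token
    expert_to_device: (N_r,) device id of each routed expert
    """
    # Step 1: keep the M devices whose best-scoring expert is highest.
    best = {d: s[expert_to_device == d].max() for d in np.unique(expert_to_device)}
    chosen = sorted(best, key=best.get, reverse=True)[:m]
    # Step 2: ordinary top-K selection restricted to experts on those devices.
    candidates = np.where(np.isin(expert_to_device, chosen))[0]
    return candidates[np.argsort(s[candidates])[-k_r:]]
```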

2.2.3 Auxiliary Loss for Load Balance

We take load balance into consideration for automatically learned routing strategies. Firstly, an unbalanced load will raise the risk of routing collapse (Shazeer et al., 2017), preventing some experts from being fully trained and utilized. Secondly, when expert parallelism is employed, an unbalanced load will diminish computation efficiency. During the training of DeepSeek-V2, we design three kinds of auxiliary losses, for controlling expert-level load balance ($\mathcal{L}_{\mathrm{ExpBal}}$), device-level load balance ($\mathcal{L}_{\mathrm{DevBal}}$), and communication balance ($\mathcal{L}_{\mathrm{CommBal}}$), respectively.

Expert-Level Balance Loss.

We use an expert-level balance loss (Fedus et al., 2021; Lepikhin et al., 2021) to mitigate the risk of routing collapse:

\mathcal{L}_{\mathrm{ExpBal}} = \alpha_{1} \sum_{i=1}^{N_{r}} f_{i} P_{i},  (23)
f_{i} = \frac{N_{r}}{K_{r} T} \sum_{t=1}^{T} \mathds{1}(\text{Token } t \text{ selects Expert } i),  (24)
P_{i} = \frac{1}{T} \sum_{t=1}^{T} s_{i,t},  (25)

where $\alpha_{1}$ is a hyper-parameter called the expert-level balance factor; $\mathds{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence.
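As a small illustration, Equations (23)–(25) could be evaluated for one sequence as in the sketch below, given the per-token affinities and the selected experts; the names are ours and the shapes are assumptions.

```python
import numpy as np

def expert_balance_loss(s, topk_idx, alpha1):
    """Illustrative sketch of the expert-level balance loss (Eqs. 23-25).

    s:        (T, N_r) affinities s_{i,t}
    topk_idx: (T, K_r) indices of the experts selected for each token
    """
    T, N_r = s.shape
    K_r = topk_idx.shape[1]
    counts = np.zeros(N_r)
    np.add.at(counts, topk_idx.ravel(), 1.0)   # sum_t 1(token t selects expert i)
    f = N_r / (K_r * T) * counts               # Eq. (24)
    P = s.mean(axis=0)                         # Eq. (25)
    return alpha1 * float(np.sum(f * P))       # Eq. (23)
```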

Device-Level Balance Loss.

In addition to the expert-level balance loss, we additionally design a device-level balance loss to ensure balanced computation across different devices. In the training process of DeepSeek-V2, we partition all routed experts into $D$ groups $\{\mathcal{E}_{1}, \mathcal{E}_{2}, \ldots, \mathcal{E}_{D}\}$, and deploy each group on a single device. The device-level balance loss is computed as follows:

\mathcal{L}_{\mathrm{DevBal}} = \alpha_{2} \sum_{i=1}^{D} f_{i}^{\prime} P_{i}^{\prime},  (26)
f_{i}^{\prime} = \frac{1}{|\mathcal{E}_{i}|} \sum_{j \in \mathcal{E}_{i}} f_{j},  (27)
P_{i}^{\prime} = \sum_{j \in \mathcal{E}_{i}} P_{j},  (28)

where $\alpha_{2}$ is a hyper-parameter called the device-level balance factor.

Communication Balance Loss.

Finally, we introduce a communication balance loss to ensure that the communication of each device is balanced. Although the device-limited routing mechanism guarantees that the sending communication of each device is bounded, if a certain device receives more tokens than other devices, the practical communication efficiency will also be affected. In order to mitigate this issue, we design a communication balance loss as follows:

\mathcal{L}_{\mathrm{CommBal}} = \alpha_{3} \sum_{i=1}^{D} f_{i}^{\prime\prime} P_{i}^{\prime\prime},  (29)
f_{i}^{\prime\prime} = \frac{D}{MT} \sum_{t=1}^{T} \mathds{1}(\text{Token } t \text{ is sent to Device } i),  (30)
P_{i}^{\prime\prime} = \sum_{j \in \mathcal{E}_{i}} P_{j},  (31)

where $\alpha_{3}$ is a hyper-parameter called the communication balance factor. The device-limited routing mechanism operates on the principle of ensuring that each device transmits at most $MT$ hidden states to other devices. Simultaneously, the communication balance loss is employed to encourage each device to receive around $MT$ hidden states from other devices. The communication balance loss guarantees a balanced exchange of information among devices, promoting efficient communications.
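The device-level and communication balance losses reuse the per-expert quantities $f_i$ and $P_i$ from Equations (24)–(25). A compact sketch, under the assumption that the expert groups and per-device token counts are already available, could look as follows (names and shapes are ours):

```python
import numpy as np

def device_and_comm_balance(f, P, groups, sent_counts, alpha2, alpha3, M, T):
    """Illustrative sketch of Eqs. (26)-(31).

    f, P:        (N_r,) per-expert quantities from Eqs. (24)-(25)
    groups:      list of expert-index arrays, one per device
    sent_counts: (D,) number of tokens dispatched to each device
    """
    D = len(groups)
    f_dev = np.array([f[g].mean() for g in groups])     # Eq. (27)
    P_dev = np.array([P[g].sum() for g in groups])      # Eq. (28), identical to Eq. (31)
    l_dev = alpha2 * float(np.sum(f_dev * P_dev))       # Eq. (26)
    f_comm = D / (M * T) * np.asarray(sent_counts)      # Eq. (30)
    l_comm = alpha3 * float(np.sum(f_comm * P_dev))     # Eq. (29)
    return l_dev, l_comm
```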

2.2.4 Token-Dropping Strategy

While balance losses aim to encourage a balanced load, it is important to acknowledge that they cannot guarantee a strict load balance. In order to further mitigate the computation wastage caused by unbalanced load, we introduce a device-level token-dropping strategy during training. This approach first computes the average computational budget for each device, which means that the capacity factor for each device is equivalent to 1.0. Then, inspired by Riquelme et al. (2021), we drop tokens with the lowest affinity scores on each device until reaching the computational budget. In addition, we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped. In this way, we can flexibly decide whether to drop tokens during inference according to the efficiency requirements, and always ensure consistency between training and inference.
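As a rough illustration of this device-level token-dropping, the sketch below keeps each device within its average computational budget by discarding the lowest-affinity tokens first; the bookkeeping for protected sequences is simplified, and all names are hypothetical.

```python
import numpy as np

def drop_to_budget(token_ids, scores, budget, protected_ids):
    """Illustrative sketch: keep at most `budget` tokens on a device (capacity factor 1.0),
    dropping the lowest-affinity tokens first and never dropping protected tokens."""
    keep = list(token_ids)
    for idx in np.argsort(scores):               # lowest affinity first
        if len(keep) <= budget:
            break
        tok = token_ids[idx]
        if tok in protected_ids:
            continue                             # tokens from ~10% of sequences are never dropped
        keep.remove(tok)
    return keep
```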

3 Pre-Training

3.1 Experimental Setups

3.1.1 Data Construction

While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024), we extend the amount of data and elevate the data quality. In order to enlarge our pre-training corpus, we explore the potential of the internet data and optimize our cleaning processes, thus recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese data, aiming to better leverage the corpus available on the Chinese internet. In addition to the amount of data, we also focus on the data quality. We enrich our pre-training corpus with high-quality data from various sources, and meanwhile improve the quality-based filtering algorithm. The improved algorithm ensures that a large amount of non-beneficial data will be removed, while the valuable data will be mostly retained. In addition, we filter out the contentious content from our pre-training corpus to mitigate the data bias introduced from specific regional cultures. A detailed discussion about the influence of this filtering strategy is presented in Appendix E.

We adopt the same tokenizer as used in DeepSeek 67B, which is built based on the Byte-level Byte-Pair Encoding (BBPE) algorithm and has a vocabulary size of 100K. Our tokenized pre-training corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.

3.1.2 Hyper-Parameters

Model Hyper-Parameters.

We set the number of Transformer layers to 60 and the hidden dimension to 5120. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads $n_{h}$ to 128 and the per-head dimension $d_{h}$ to 128. The KV compression dimension $d_{c}$ is set to 512, and the query compression dimension $d_{c}^{\prime}$ is set to 1536. For the decoupled queries and key, we set the per-head dimension $d_{h}^{R}$ to 64. Following Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training. Under this configuration, DeepSeek-V2 comprises 236B total parameters, of which 21B are activated for each token.

Training Hyper-Parameters.

We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to $\beta_{1} = 0.9$, $\beta_{2} = 0.95$, and $\mathrm{weight\_decay} = 0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy (DeepSeek-AI, 2024). Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training about 60% of tokens, and again by 0.316 after training about 90% of tokens. The maximum learning rate is set to $2.4 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We also use a batch size scheduling strategy, where the batch size is gradually increased from 2304 to 9216 in the training of the first 225B tokens, and then kept at 9216 for the remaining training. We set the maximum sequence length to 4K, and train DeepSeek-V2 on 8.1T tokens. We leverage pipeline parallelism to deploy different layers of a model on different devices, and for each layer, the routed experts will be uniformly deployed on 8 devices ($D = 8$). As for the device-limited routing, each token will be sent to at most 3 devices ($M = 3$). As for balance losses, we set $\alpha_{1}$ to 0.003, $\alpha_{2}$ to 0.05, and $\alpha_{3}$ to 0.02. We employ the token-dropping strategy during training for acceleration, but do not drop any tokens for evaluation.
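The warmup-and-step-decay schedule can be summarized as in the sketch below. Only the published values (2K warmup steps, peak learning rate $2.4 \times 10^{-4}$, and two ×0.316 decays at roughly 60% and 90% of the training tokens) are taken from the text; the exact step-to-token mapping in the code is an assumption for illustration.

```python
def learning_rate(step, tokens_seen, total_tokens=8.1e12,
                  max_lr=2.4e-4, warmup_steps=2000):
    """Illustrative sketch of the warmup-and-step-decay schedule."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps      # linear warmup from 0 to the peak
    if tokens_seen < 0.6 * total_tokens:
        return max_lr
    if tokens_seen < 0.9 * total_tokens:
        return max_lr * 0.316                    # first decay after ~60% of the tokens
    return max_lr * 0.316 * 0.316                # second decay after ~90% of the tokens
```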

3.1.3 Infrastructures

DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer, 2023), an efficient and light-weight training framework developed internally by our engineers. It employs a 16-way zero-bubble pipeline parallelism (Qi et al., 2023), an 8-way expert parallelism (Lepikhin et al., 2021), and ZeRO-1 data parallelism (Rajbhandari et al., 2020). Given that DeepSeek-V2 has relatively few activated parameters, and a portion of the operators are recomputed to save activation memory, it can be trained without the necessity of tensor parallelism, thereby decreasing the communication overhead. Moreover, in order to further improve the training efficiency, we overlap the computation of shared experts with the expert parallel all-to-all communication. We also customize faster CUDA kernels for communications, routing algorithms, and fused linear computations across different experts. In addition, MLA is also optimized based on an improved version of FlashAttention-2 (Dao, 2023).

We conduct all experiments on a cluster equipped with NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected using NVLink and NVSwitch within nodes. Across nodes, InfiniBand interconnects are utilized to facilitate communications.

Figure 4: Evaluation results on the “Needle In A Haystack” (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.

3.1.4 Long Context Extension

After the initial pre-training of DeepSeek-V2, we employ YaRN (Peng et al., 2023) to extend the default context window length from 4K to 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}^{R}_{t}$ as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy. The factor $\sqrt{t}$ is computed as $\sqrt{t} = 0.0707 \ln s + 1$, aiming at minimizing the perplexity.
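Plugging the chosen scale $s = 40$ into this formula gives a concrete value for the length scaling factor; the snippet below is a quick numerical check, not additional methodology.

```python
import math

s = 40                                   # YaRN scale used for DeepSeek-V2
sqrt_t = 0.0707 * math.log(s) + 1        # length scaling factor from the formula above
print(round(sqrt_t, 3))                  # ~1.261
```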

We additionally train the model for 1000 steps, with a sequence length of 32K and a batch size of 576 sequences. Although the training is conducted solely at the sequence length of 32K, the model still demonstrates robust performance when being evaluated at a context length of 128K. As shown in Figure 4, the results on the “Needle In A Haystack” (NIAH) tests indicate that DeepSeek-V2 performs well across all context window lengths up to 128K.

3.2 Evaluations

3.2.1 Evaluation Benchmarks

DeepSeek-V2 is pretrained on a bilingual corpus, so we evaluate it on a series of benchmarks in English and Chinese. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Included benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese:

Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).

Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).

Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).

Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019), and CMRC (Cui et al., 2019).

Reference disambiguation datasets include WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020).

Language modeling datasets include Pile (Gao et al., 2020).

Chinese understanding and culture datasets include CHID (Zheng et al., 2019) and CCPM (Li et al., 2021).

Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023).

Code datasets include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).

Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.

Following our previous work (DeepSeek-AI, 2024), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, ARC-Easy, ARC-Challenge, CHID, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models with different tokenizers.

For an intuitive overview of these benchmarks, we additionally provide our evaluation formats for each benchmark in Appendix G.

3.2.2 Evaluation Results

Benchmark (Metric) | # Shots | DeepSeek 67B | Qwen1.5 72B | Mixtral 8x22B | LLaMA 3 70B | DeepSeek-V2
Architecture | - | Dense | Dense | MoE | Dense | MoE
# Activated Params | - | 67B | 72B | 39B | 70B | 21B
# Total Params | - | 67B | 72B | 141B | 70B | 236B
English:
Pile-test (BPB) | - | 0.642 | 0.637 | 0.623 | 0.602 | 0.606
BBH (EM) | 3-shot | 68.7 | 59.9 | 78.9 | 81.0 | 78.9
MMLU (Acc.) | 5-shot | 71.3 | 77.2 | 77.6 | 78.9 | 78.5
DROP (F1) | 3-shot | 69.7 | 71.5 | 80.4 | 82.5 | 80.1
ARC-Easy (Acc.) | 25-shot | 95.3 | 97.1 | 97.3 | 97.9 | 97.6
ARC-Challenge (Acc.) | 25-shot | 86.4 | 92.8 | 91.2 | 93.3 | 92.4
HellaSwag (Acc.) | 10-shot | 86.3 | 85.8 | 86.6 | 87.9 | 84.2
PIQA (Acc.) | 0-shot | 83.6 | 83.3 | 83.6 | 85.0 | 83.7
WinoGrande (Acc.) | 5-shot | 84.9 | 82.4 | 83.7 | 85.7 | 84.9
RACE-Middle (Acc.) | 5-shot | 69.9 | 63.4 | 73.3 | 73.3 | 73.1
RACE-High (Acc.) | 5-shot | 50.7 | 47.0 | 56.7 | 57.9 | 52.7
TriviaQA (EM) | 5-shot | 78.9 | 73.1 | 82.1 | 81.6 | 79.9
NaturalQuestions (EM) | 5-shot | 36.6 | 35.6 | 39.6 | 40.2 | 38.7
AGIEval (Acc.) | 0-shot | 41.3 | 64.4 | 43.4 | 49.8 | 51.2
Code:
HumanEval (Pass@1) | 0-shot | 45.1 | 43.9 | 53.1 | 48.2 | 48.8
MBPP (Pass@1) | 3-shot | 57.4 | 53.6 | 64.2 | 68.6 | 66.6
CRUXEval-I (Acc.) | 2-shot | 42.5 | 44.3 | 52.4 | 49.4 | 52.8
CRUXEval-O (Acc.) | 2-shot | 41.0 | 42.3 | 52.8 | 54.3 | 49.8
Math:
GSM8K (EM) | 8-shot | 63.4 | 77.9 | 80.3 | 83.0 | 79.2
MATH (EM) | 4-shot | 18.7 | 41.4 | 42.5 | 42.2 | 43.6
CMath (EM) | 3-shot | 63.0 | 77.8 | 72.3 | 73.9 | 78.7
Chinese:
CLUEWSC (EM) | 5-shot | 81.0 | 80.5 | 77.5 | 78.3 | 82.2
C-Eval (Acc.) | 5-shot | 66.1 | 83.7 | 59.6 | 67.5 | 81.7
CMMLU (Acc.) | 5-shot | 70.8 | 84.3 | 60.0 | 69.3 | 84.0
CMRC (EM) | 1-shot | 73.4 | 66.6 | 73.1 | 73.3 | 77.5
C3 (Acc.) | 0-shot | 75.3 | 78.2 | 71.4 | 74.0 | 77.4
CHID (Acc.) | 0-shot | 92.1 | - | 57.0 | 83.2 | 92.7
CCPM (Acc.) | 0-shot | 88.5 | 88.1 | 61.0 | 68.1 | 93.1
Table 2: Comparison among DeepSeek-V2 and other representative open-source models. All models are evaluated in our internal framework and share the same evaluation setting. Bold denotes the best and underline denotes the second-best. Scores with a gap smaller than 0.3 are regarded as at the same level. With only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models.

In Table 2, we compare DeepSeek-V2 with several representative open-source models, including DeepSeek 67B (DeepSeek-AI, 2024) (our previous release), Qwen1.5 72B (Bai et al., 2023), LLaMA3 70B (AI@Meta, 2024), and Mixtral 8x22B (Mistral, 2024). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Overall, with only 21B activated parameters, DeepSeek-V2 significantly outperforms DeepSeek 67B on almost all benchmarks, and achieves top-tier performance among open-source models.

Further, we compare DeepSeek-V2 with its open-source counterparts in detail, one by one. (1) Compared with Qwen1.5 72B, another model that supports both Chinese and English, DeepSeek-V2 demonstrates overwhelming advantages on the majority of English, code, and math benchmarks. As for Chinese benchmarks, Qwen1.5 72B shows better performance on multi-subject multiple-choice tasks while DeepSeek-V2 is comparable or better on the others. Note that for the CHID benchmark, the tokenizer of Qwen1.5 72B encounters errors in our evaluation framework, so we leave the CHID score blank for Qwen1.5 72B. (2) Compared with Mixtral 8x22B, DeepSeek-V2 achieves comparable or better English performance, except for TriviaQA, NaturalQuestions, and HellaSwag, which are closely related to English commonsense knowledge. Notably, DeepSeek-V2 outperforms Mixtral 8x22B on MMLU. On code and math benchmarks, DeepSeek-V2 demonstrates comparable performance with Mixtral 8x22B. Since Mixtral 8x22B is not specifically trained on Chinese data, its Chinese capability lags far behind DeepSeek-V2. (3) Compared with LLaMA3 70B, DeepSeek-V2 is trained on fewer than a quarter of the English tokens. Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities compared with LLaMA3 70B. However, even with far fewer training tokens and activated parameters, DeepSeek-V2 still demonstrates comparable code and math capability to LLaMA3 70B. Also, as a bilingual language model, DeepSeek-V2 outperforms LLaMA3 70B overwhelmingly on Chinese benchmarks.

Finally, it is worth mentioning that certain prior studies (Hu et al., 2024) incorporate SFT data during the pre-training stage, whereas DeepSeek-V2 has never been exposed to SFT data during pre-training.

3.2.3 Training and Inference Efficiency

Training Costs.

Since DeepSeek-V2 activates fewer parameters for each token and requires fewer FLOPs than DeepSeek 67B, training DeepSeek-V2 will be more economical than training DeepSeek 67B theoretically. Although training an MoE model will introduce additional communication overheads, through our operator and communication optimizations, the training for DeepSeek-V2 can attain a relatively high Model FLOPs Utilization (MFU). During our practical training on the H800 cluster, for training on each trillion tokens, DeepSeek 67B requires 300.6K GPU hours, while DeepSeek-V2 needs only 172.8K GPU hours, i.e., sparse DeepSeek-V2 can save 42.5% training costs compared with dense DeepSeek 67B.
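The 42.5% figure follows directly from the reported per-trillion-token GPU-hour numbers; the snippet below is only an arithmetic check.

```python
dense_gpu_hours = 300.6e3    # DeepSeek 67B, GPU hours per trillion training tokens
sparse_gpu_hours = 172.8e3   # DeepSeek-V2, GPU hours per trillion training tokens
print(f"{1 - sparse_gpu_hours / dense_gpu_hours:.1%}")   # -> 42.5%
```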

Inference Efficiency.

In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average. Benefiting from MLA and these optimizations, actually deployed DeepSeek-V2 requires significantly less KV cache than DeepSeek 67B, and thus can serve a much larger batch size. We evaluate the generation throughput of DeepSeek-V2 based on the prompt and generation length distribution from the actually deployed DeepSeek 67B service. On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.

4 Alignment

4.1 Supervised Fine-Tuning

Building upon our prior research (DeepSeek-AI, 2024), we curate our instruction tuning datasets to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory responses and enhance writing proficiency. We fine-tune DeepSeek-V2 with 2 epochs, and the learning rate is set to $5 \times 10^{-6}$. For the evaluation of DeepSeek-V2 Chat (SFT), we mainly include generation-based benchmarks, except for several representative multiple-choice tasks (MMLU and ARC). We also conduct an instruction-following evaluation (IFEval) (Zhou et al., 2023) for DeepSeek-V2 Chat (SFT), using prompt-level loose accuracy as the metric. Moreover, we employ LiveCodeBench (Jain et al., 2024) questions from September 1st, 2023 to April 1st, 2024 to evaluate chat models. In addition to the standard benchmarks, we further evaluate our model on open-ended conversation benchmarks including MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and AlignBench (Liu et al., 2023). For comparison, we also evaluate Qwen1.5 72B Chat, LLaMA-3-70B Instruct, and Mistral-8x22B Instruct in our evaluation framework and settings. As for DeepSeek 67B Chat, we directly refer to the evaluation results reported in our previous release.

4.2 Reinforcement Learning

In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we conduct Reinforcement Learning (RL) to adjust its preference.

Reinforcement Learning Algorithm.

In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1}, o_{2}, \cdots, o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:

\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\, \{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right] \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_{\theta}(o_{i} \mid q)}{\pi_{\theta_{old}}(o_{i} \mid q)} A_{i},\ \operatorname{clip}\left( \frac{\pi_{\theta}(o_{i} \mid q)}{\pi_{\theta_{old}}(o_{i} \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_{i} \right) - \beta\, \mathbb{D}_{KL}\left(\pi_{\theta} \,\|\, \pi_{ref}\right) \right),  (32)
\mathbb{D}_{KL}\left(\pi_{\theta} \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_{i} \mid q)}{\pi_{\theta}(o_{i} \mid q)} - \log \frac{\pi_{ref}(o_{i} \mid q)}{\pi_{\theta}(o_{i} \mid q)} - 1,  (33)

where $\varepsilon$ and $\beta$ are hyper-parameters; and $A_{i}$ is the advantage, computed using a group of rewards $\{r_{1}, r_{2}, \ldots, r_{G}\}$ corresponding to the outputs within each group:

A_{i} = \frac{r_{i} - \operatorname{mean}(\{r_{1}, r_{2}, \cdots, r_{G}\})}{\operatorname{std}(\{r_{1}, r_{2}, \cdots, r_{G}\})}.  (34)
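A minimal sketch of this group-wise advantage normalization is shown below; whether the population or sample standard deviation is used is an implementation detail we assume here, and the example rewards are made up.

```python
import numpy as np

def grpo_advantages(rewards):
    """Illustrative sketch: standardize the rewards of one sampled group {r_1, ..., r_G} (Eq. 34)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

print(grpo_advantages([0.2, 0.8, 0.5, 0.9]))   # advantages for G = 4 sampled outputs
```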
Training Strategy.

In our preliminary experiments, we find that the RL training on reasoning data, such as code and math prompts, exhibits unique characteristics that are distinct from the training on general data. For example, the mathematical and coding abilities of our model can keep improving over a longer period of training steps. Therefore, we employ a two-stage RL training strategy, which first performs reasoning alignment, and then performs human preference alignment. In the first reasoning alignment stage, we train a reward model $RM_{reasoning}$ for code and math reasoning tasks, and optimize the policy model with the feedback of $RM_{reasoning}$:

r_{i} = RM_{reasoning}(o_{i}).  (35)

In the second human preference alignment stage, we adopt a multi-reward framework, which acquires rewards from a helpful reward model $RM_{helpful}$, a safety reward model $RM_{safety}$, and a rule-based reward model $RM_{rule}$. The final reward of a response $o_{i}$ is

r_{i} = c_{1} \cdot RM_{helpful}(o_{i}) + c_{2} \cdot RM_{safety}(o_{i}) + c_{3} \cdot RM_{rule}(o_{i}),  (36)

where $c_{1}$, $c_{2}$, and $c_{3}$ are corresponding coefficients.
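A sketch of this weighted combination is given below, with the three reward models passed in as callables; the coefficient values are not disclosed in the text, so the defaults are placeholders.

```python
def final_reward(o_i, rm_helpful, rm_safety, rm_rule, c1=1.0, c2=1.0, c3=1.0):
    """Illustrative sketch: combine helpfulness, safety, and rule-based rewards (Eq. 36)."""
    return c1 * rm_helpful(o_i) + c2 * rm_safety(o_i) + c3 * rm_rule(o_i)
```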

In order to obtain reliable reward models that play crucial roles in the RL training, we carefully collect preference data, and meticulously conduct quality filtering and proportion adjustments. We obtain code preference data based on compiler-feedback, and mathematical preference data based on the ground-truth labels. For reward model training, we initialize the reward models with DeepSeek-V2 Chat (SFT) and train them with either a point-wise or a pair-wise loss. In our experiments, we observe that the RL training can fully tap into and activate the potential of our model, enabling it to select the correct and satisfactory answer from possible responses.

Optimizations for Training Efficiency.

Conducting RL training on extremely large models places high demands on the training framework. It requires careful engineering optimization to manage the GPU memory and RAM pressure, and meanwhile maintain a fast training speed. For this goal, we implement the following engineering optimizations. (1) Firstly, we propose a hybrid engine that adopts different parallel strategies for training and inference respectively to achieve higher GPU utilization. (2) Secondly, we leverage vLLM (Kwon et al., 2023) with large batch sizes as our inference backend to accelerate the inference speed. (3) Thirdly, we carefully design a scheduling strategy for offloading models to CPUs and loading models back to GPUs, which achieves a near-optimal balance between the training speed and memory consumption.

4.3 Evaluation Results

Evaluations on Standard Benchmarks. Initially, we evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on standard benchmarks. Notably, DeepSeek-V2 Chat (SFT) demonstrates substantial improvements in GSM8K, MATH, and HumanEval evaluations compared with its base version. This progress can be attributed to the inclusion of our SFT data, which comprises a considerable volume of math and code related content. In addition, DeepSeek-V2 Chat (RL) further boosts the performance on math and code benchmarks. We show more code and math evaluations in Appendix F.

As for the comparisons with other models, we first compare DeepSeek-V2 Chat (SFT) with Qwen1.5 72B Chat, and find that DeepSeek-V2 Chat (SFT) surpasses Qwen1.5 72B Chat on almost all of English, math, and code benchmarks. On Chinese benchmarks, DeepSeek-V2 Chat (SFT) demonstrates slightly lower scores than Qwen1.5 72B Chat on multi-subject multiple-choice tasks, consistent with the performance observed from their base versions. When compared with the state-of-the-art open-source MoE model, Mixtral 8x22B Instruct, DeepSeek-V2 Chat (SFT) exhibits better performance on most benchmarks, except for NaturalQuestions and IFEval. Furthermore, in comparison to the state-of-the-art open-source model LLaMA3 70B Chat, DeepSeek-V2 Chat (SFT) shows similar performance in code and math related benchmarks. LLaMA3 70B Chat exhibits better performance on MMLU and IFEval, while DeepSeek-V2 Chat (SFT) showcases stronger performance on Chinese tasks. Ultimately, DeepSeek-V2 Chat (RL) demonstrates further enhanced performance in both mathematical and coding tasks compared with DeepSeek-V2 Chat (SFT). These comparisons highlight the strengths of DeepSeek-V2 Chat in relation to other language models in various domains and languages.

Benchmark | # Shots