

License: arXiv.org perpetual non-exclusive license
arXiv:2405.04434v5 [cs.CL] 19 Jun 2024
Report Number: 001

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI
research@deepseek.com
Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.


Figure 1: (a) MMLU accuracy vs. activated parameters, among different open-source models. (b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2.

1 Introduction

In the past few years, Large Language Models (LLMs) (OpenAI, 2022, 2023; Anthropic, 2023; Google, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.

We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework (Vaswani et al., 2017) with our proposed Multi-head Latent Attention (MLA) and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache of the Multi-Head Attention (MHA) (Vaswani et al., 2017) poses a significant obstacle to the inference efficiency of LLMs. Various approaches have been explored to address this issue, including Grouped-Query Attention (GQA) (Ainslie et al., 2023) and Multi-Query Attention (MQA) (Shazeer, 2019). However, these methods often compromise performance in their attempt to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, and meanwhile significantly reduces the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1), economical training costs, and efficient inference throughput (Figure 1), simultaneously.

Figure 2: Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.

We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect 1.5M conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 Chat (SFT). Finally, we follow DeepSeekMath (Shao et al., 2024) to employ Group Relative Policy Optimization (GRPO) to further align the model with human preference and produce DeepSeek-V2 Chat (RL).

We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and compare it with representative open-source models. Evaluation results show that even with only 21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source models and becomes the strongest open-source MoE language model. Figure 1 highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1, compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves a 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), an 8.97 overall score on MT-Bench (Zheng et al., 2023), and a 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that in Chinese, DeepSeek-V2 Chat (RL) outperforms all open-source models and even beats most closed-source models.

In order to facilitate further research and development on MLA and DeepSeekMoE, we also release DeepSeek-V2-Lite, a smaller model equipped with MLA and DeepSeekMoE, for the open-source community. It has a total of 15.7B parameters, where 2.4B are activated for each token. Detailed descriptions about DeepSeek-V2-Lite can be found in Appendix B.

In the rest of this paper, we first provide a detailed description of the model architecture of DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the training data construction, hyper-parameter settings, infrastructures, long context extension, and the evaluation of model performance and efficiency (Section 3). Following this, we demonstrate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future work (Section 5).

2 Architecture

By and large, DeepSeek-V2 is still in the Transformer architecture (Vaswani et al., 2017), where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other tiny details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024).

2.1 Multi-Head Latent Attention: Boosting Inference Efficiency

Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, the heavy Key-Value (KV) cache of MHA becomes a bottleneck that limits inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) have been proposed. They require a smaller amount of KV cache, but their performance does not match MHA (we provide an ablation of MHA, GQA, and MQA in Appendix D.1).
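
To make the cache-size contrast above concrete, the following minimal sketch (PyTorch, with purely illustrative head counts and dimensions not taken from this paper) shows how GQA and MQA shrink the cached tensors relative to MHA simply by letting several query heads share one key/value head:

```python
import torch

n_h, d_h, seq_len = 32, 128, 4096        # illustrative: query heads, head dim, cached tokens

def kv_cache_shape(n_kv_heads):
    """Per-layer KV cache: one key tensor and one value tensor with n_kv_heads heads."""
    return (2, seq_len, n_kv_heads, d_h)

mha = kv_cache_shape(n_kv_heads=n_h)     # MHA: every query head has its own K/V head
gqa = kv_cache_shape(n_kv_heads=8)       # GQA: query heads share n_g = 8 K/V groups
mqa = kv_cache_shape(n_kv_heads=1)       # MQA: one K/V head shared by all query heads

for name, shape in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(name, shape, "->", torch.Size(shape).numel(), "cached elements per layer")
```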

For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix D.2.

2.1.1 Preliminaries: Standard Multi-Head Attention

We first introduce the standard MHA mechanism as background. Let $d$ be the embedding dimension, $n_h$ be the number of attention heads, $d_h$ be the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$, respectively:

$$\mathbf{q}_t = W^{Q} \mathbf{h}_t, \qquad (1)$$
$$\mathbf{k}_t = W^{K} \mathbf{h}_t, \qquad (2)$$
$$\mathbf{v}_t = W^{V} \mathbf{h}_t, \qquad (3)$$

Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:

$$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t, \qquad (4)$$
$$[\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; \ldots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t, \qquad (5)$$
$$[\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; \ldots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t, \qquad (6)$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i}, \qquad (7)$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}], \qquad (8)$$

where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively; $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate inference, so MHA needs to cache $2 n_h d_h l$ elements for each token. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.
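
The per-token cost quoted above can be read off directly from a naive decoding loop. The sketch below (a minimal PyTorch illustration with hypothetical dimensions, not the paper's implementation) appends one key and one value per head at every step, i.e. $2 n_h d_h$ new cached elements per token per layer:

```python
import torch

d, n_h, d_h = 1024, 8, 128            # hypothetical embedding dim, heads, per-head dim
W_Q = torch.randn(n_h * d_h, d)       # W^Q
W_K = torch.randn(n_h * d_h, d)       # W^K
W_V = torch.randn(n_h * d_h, d)       # W^V
W_O = torch.randn(d, n_h * d_h)       # W^O
k_cache, v_cache = [], []             # grows by 2 * n_h * d_h elements per generated token

def mha_decode_step(h_t):
    """One decoding step of standard MHA with a naive per-head KV cache (Eqs. 1-8)."""
    q = (W_Q @ h_t).view(n_h, d_h)
    k_cache.append((W_K @ h_t).view(n_h, d_h))
    v_cache.append((W_V @ h_t).view(n_h, d_h))
    K = torch.stack(k_cache, dim=1)                               # (n_h, t, d_h)
    V = torch.stack(v_cache, dim=1)                               # (n_h, t, d_h)
    attn = torch.softmax(torch.einsum('hd,htd->ht', q, K) / d_h ** 0.5, dim=-1)
    o = torch.einsum('ht,htd->hd', attn, V)                       # per-head outputs o_{t,i}
    return W_O @ o.reshape(n_h * d_h)                             # u_t

u_t = mha_decode_step(torch.randn(d))
```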

Figure 3: Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference.

2.1.2 Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce KV cache:

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t, \qquad (9)$$
$$\mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV}, \qquad (10)$$
$$\mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV}, \qquad (11)$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$ and $W^{UV}$ can be absorbed into $W^{O}$, we do not even need to compute keys and values explicitly for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache.
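
A minimal sketch of this compression path (hypothetical dimensions, illustrative code rather than the released implementation) makes explicit that only the $d_c$-dimensional latent is appended to the cache, while the full keys and values of Equations (10)-(11) can be re-expanded, or absorbed into $W^{Q}$ and $W^{O}$, on demand:

```python
import torch

d, n_h, d_h, d_c = 1024, 8, 128, 128       # hypothetical dims; d_c << n_h * d_h
W_DKV = torch.randn(d_c, d)                # down-projection W^{DKV}
W_UK  = torch.randn(n_h * d_h, d_c)        # up-projection  W^{UK}
W_UV  = torch.randn(n_h * d_h, d_c)        # up-projection  W^{UV}
c_kv_cache = []                            # only d_c elements are cached per token per layer

def compress_kv(h_t):
    """Cache only the latent c_t^{KV} of Eq. (9)."""
    c_kv = W_DKV @ h_t
    c_kv_cache.append(c_kv)
    return c_kv

def expand_kv(c_kv):
    """Recover k_t^C and v_t^C via Eqs. (10)-(11); in deployment this step is absorbed away."""
    k_c = (W_UK @ c_kv).view(n_h, d_h)
    v_c = (W_UV @ c_kv).view(n_h, d_h)
    return k_c, v_c

k, v = expand_kv(compress_kv(torch.randn(d)))
```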

Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even if it cannot reduce the KV cache:

$$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t, \qquad (12)$$
$$\mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q}, \qquad (13)$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c^{\prime}}$ is the compressed latent vector for queries; $d_c^{\prime} (\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c^{\prime} \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c^{\prime}}$ are the down-projection and up-projection matrices for queries, respectively.

2.1.3 Decoupled Rotary Position Embedding

Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression. To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys $\mathbf{k}_t^{C}$, then $W^{UK}$ in Equation 10 will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ can no longer be absorbed into $W^{Q}$ during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$, and matrix multiplication does not obey the commutative law. As a result, we would have to recompute the keys for all the prefix tokens during inference, which would significantly hinder the inference efficiency.
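
The absorption argument can be sketched in one line of algebra. Writing $W^{UQ}_{i}$ and $W^{UK}_{i}$ for the per-head slices of the up-projection matrices and $R_{p}$ for the RoPE rotation at position $p$ (notation introduced here only for this illustration), the content attention score without RoPE is

$$\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i} = \left(W^{UQ}_{i} \mathbf{c}_t^{Q}\right)^{T} \left(W^{UK}_{i} \mathbf{c}_j^{KV}\right) = \mathbf{c}_t^{Q\,T} \left(W^{UQ\,T}_{i} W^{UK}_{i}\right) \mathbf{c}_j^{KV},$$

so the product $W^{UQ\,T}_{i} W^{UK}_{i}$ can be precomputed once and the keys never need to be materialized. If RoPE were applied to the per-head queries and keys, the score would instead become

$$\left(R_{t} W^{UQ}_{i} \mathbf{c}_t^{Q}\right)^{T} \left(R_{j} W^{UK}_{i} \mathbf{c}_j^{KV}\right) = \mathbf{c}_t^{Q\,T} W^{UQ\,T}_{i} R_{j-t} W^{UK}_{i} \mathbf{c}_j^{KV},$$

where $R_{t}^{T} R_{j} = R_{j-t}$ depends on the token positions, so the middle factor changes with every query position and $W^{UK}$ can no longer be folded into the query-side projection.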

As a solution, we propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^{R}}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^{R}}$ to carry RoPE, where $d_h^{R}$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

$$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}), \qquad (14)$$
$$\mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_t), \qquad (15)$$
$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}], \qquad (16)$$
$$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}], \qquad (17)$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^{R}}}\right) \mathbf{v}_{j,i}^{C}, \qquad (18)$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}], \qquad (19)$$

where $W^{QR} \in \mathbb{R}^{d_h^{R} n_h \times d_c^{\prime}}$ and $W^{KR} \in \mathbb{R}^{d_h^{R} \times d}$ are the matrices that produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot; \cdot]$ denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing $(d_c + d_h^{R}) l$ elements.
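
Putting the pieces together, the following sketch (hypothetical dimensions and a simplified rotate-half RoPE, intended only to mirror Equations (12)-(19) rather than to reproduce the released implementation) runs one MLA decoding step while caching only the latent $\mathbf{c}_t^{KV}$ and the shared decoupled key $\mathbf{k}_t^{R}$:

```python
import torch

d, n_h, d_h, d_c, d_cq, d_hR = 1024, 8, 128, 128, 384, 32      # hypothetical dims
W_DQ, W_UQ = torch.randn(d_cq, d), torch.randn(n_h * d_h, d_cq)
W_DKV = torch.randn(d_c, d)
W_UK, W_UV = torch.randn(n_h * d_h, d_c), torch.randn(n_h * d_h, d_c)
W_QR, W_KR = torch.randn(n_h * d_hR, d_cq), torch.randn(d_hR, d)
W_O = torch.randn(d, n_h * d_h)

def rope(x, pos):
    """Simplified rotate-half RoPE over the last dimension (illustrative only)."""
    half = x.shape[-1] // 2
    angle = pos / (10000.0 ** (torch.arange(half) / half))
    cos, sin = angle.cos(), angle.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def mla_decode_step(h_t, pos, c_kv_cache, k_r_cache):
    c_q = W_DQ @ h_t                                         # Eq. (12)
    q_c = (W_UQ @ c_q).view(n_h, d_h)                        # Eq. (13)
    q_r = rope((W_QR @ c_q).view(n_h, d_hR), pos)            # Eq. (14)
    c_kv_cache.append(W_DKV @ h_t)                           # cached latent, Eq. (9)
    k_r_cache.append(rope(W_KR @ h_t, pos))                  # cached shared key, Eq. (15)
    C = torch.stack(c_kv_cache)                              # (t, d_c)
    K_c = (C @ W_UK.T).view(-1, n_h, d_h)                    # Eq. (10) for all prefix tokens
    V_c = (C @ W_UV.T).view(-1, n_h, d_h)                    # Eq. (11)
    K_r = torch.stack(k_r_cache)                             # (t, d_hR), shared across heads
    scores = torch.einsum('hd,thd->ht', q_c, K_c) + q_r @ K_r.T   # concatenated q/k, Eqs. (16)-(17)
    attn = torch.softmax(scores / (d_h + d_hR) ** 0.5, dim=-1)    # Eq. (18)
    o = torch.einsum('ht,thd->hd', attn, V_c)
    return W_O @ o.reshape(n_h * d_h)                        # Eq. (19)

u_t = mla_decode_step(torch.randn(d), pos=0, c_kv_cache=[], k_r_cache=[])
```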

In order to demonstrate the complete computation process of MLA, we also organize and provide its full formulas in Appendix C.

| Attention Mechanism | KV Cache per Token (# Element) | Capability |
| --- | --- | --- |
| Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong |
| Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate |
| Multi-Query Attention (MQA) | $2 d_h l$ | Weak |
| MLA (Ours) | $(d_c + d_h^{R}) l \approx \frac{9}{2} d_h l$ | Stronger |

Table 1: Comparison of the KV cache per token among different attention mechanisms. $n_h$ denotes the number of attention heads, $d_h$ denotes the dimension per attention head, $l$ denotes the number of layers, $n_g$ denotes the number of groups in GQA, and $d_c$ and $d_h^{R}$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_c$ is set to $4 d_h$ and $d_h^{R}$ is set to $\frac{d_h}{2}$. So, its KV cache is equal to GQA with only 2.25 groups, but its performance is stronger than MHA.
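
To spell out the "2.25 groups" comparison in the caption, substituting the DeepSeek-V2 settings $d_c = 4 d_h$ and $d_h^{R} = \frac{d_h}{2}$ into the MLA cache size gives

$$(d_c + d_h^{R}) \, l = \left(4 d_h + \tfrac{1}{2} d_h\right) l = \tfrac{9}{2} d_h l,$$

which matches the GQA cache $2 n_g d_h l$ exactly when $n_g = \tfrac{9}{4} = 2.25$.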

2.1.4 Comparison of Key-Value Cache

We demonstrate a comparison of the KV cache per token among different attention mechanisms in Table 1. MLA requires only a small amount of KV cache, equal to GQA with only 2.25 groups, but can achieve stronger performance than MHA.

2.2 DeepSeekMoE: Training Strong Models at Economical Costs

2.2.1 Basic Architecture

For FFNs, we employ the DeepSeekMoE architecture (Dai et al., 2024). DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization and more accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge redundancy among routed experts. With the same number of activated and total expert parameters, DeepSeekMoE can outperform conventional MoE architectures like GShard (Lepikhin et al., 2021) by a large margin.

Let $\mathbf{u}_t$ be the FFN input of the $t$-th token. We compute the FFN output $\mathbf{h}_t^{\prime}$ as follows:

$$\mathbf{h}_t^{\prime} = \mathbf{u}_t + \sum_{i=1}^{N_s} \operatorname{FFN}^{(s)}_{i}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}^{(r)}_{i}(\mathbf{u}_t), \qquad (20)$$
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \leqslant j \leqslant N_r\}, K_r), \\ 0, & \text{otherwise}, \end{cases} \qquad (21)$$
$$s_{i,t} = \operatorname{Softmax}_i(\mathbf{u}_t^{T} \mathbf{e}_i), \qquad (22)$$

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}^{(s)}_{i}(\cdot)$ and $\operatorname{FFN}^{(r)}_{i}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gate value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid of the $i$-th routed expert in this layer; and $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts.
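
The gating in Equations (20)-(22) can be illustrated with a short sketch. The toy single-layer "experts" below stand in for the actual expert FFNs, and all sizes are hypothetical; the point is only the control flow of shared experts, affinity scores, and sparse top-K gating:

```python
import torch

d, N_s, N_r, K_r = 1024, 2, 8, 2                   # hypothetical expert counts and top-K
shared_experts = [torch.nn.Linear(d, d) for _ in range(N_s)]   # stand-ins for FFN^(s)_i
routed_experts = [torch.nn.Linear(d, d) for _ in range(N_r)]   # stand-ins for FFN^(r)_i
E = torch.randn(N_r, d)                            # expert centroids e_i

def deepseek_moe_ffn(u_t):
    s = torch.softmax(E @ u_t, dim=-1)             # token-to-expert affinity, Eq. (22)
    topk = torch.topk(s, K_r).indices              # Top-K routed experts, Eq. (21)
    g = torch.zeros(N_r)
    g[topk] = s[topk]                              # gate values g_{i,t}; zero elsewhere
    h = u_t.clone()                                # residual term u_t in Eq. (20)
    for ffn in shared_experts:                     # shared experts are always active
        h = h + ffn(u_t)
    for i in topk.tolist():                        # routed experts are sparsely activated
        h = h + g[i] * routed_experts[i](u_t)
    return h                                       # h_t', Eq. (20)

h_t = deepseek_moe_ffn(torch.randn(d))
```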

2.2.2 Device-Limited Routing

We design a device-limited routing mechanism to bound MoE-related communication costs. When expert parallelism is employed, the routed experts will be distributed across multiple devices. For each token, its MoE-related communication frequency is proportional to the number of devices covered by its target experts. Due to the fine-grained expert segmentation in DeepSeekMoE, the number of activated experts can be large, so the MoE-related communication will be more costly if we apply expert parallelism.

For DeepSeek-V2, beyond the naive top-K selection of routed experts, we additionally ensure that the target experts of each token will be distributed on at most $M$ devices. To be specific, for each token, we first select the $M$ devices that contain the experts with the highest affinity scores. Then, we perform the top-K selection among the experts on these $M$ devices. In practice, we find that when $M \geqslant 3$, the device-limited routing can achieve good performance roughly aligned with the unrestricted top-K routing.
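The selection described above can be sketched per token as follows, assuming that each routed expert has a fixed device assignment; the helper name and tensor layout are hypothetical, and the production routing kernel is fused and batched.

```python
import torch

def device_limited_topk(scores, expert_to_device, num_devices, m_devices, k_experts):
    """scores: (N_r,) affinities s_{i,t} of one token; expert_to_device: (N_r,) device id
    of each routed expert. Returns the indices of the K_r selected experts, restricted
    to experts living on at most M devices."""
    # best affinity reachable on each device
    device_best = torch.full((num_devices,), float("-inf"))
    device_best.scatter_reduce_(0, expert_to_device, scores, reduce="amax")
    # keep the M devices whose best expert affinity is highest
    allowed_devices = device_best.topk(m_devices).indices
    allowed = torch.isin(expert_to_device, allowed_devices)
    # ordinary top-K selection, but only among experts on the allowed devices
    masked_scores = scores.masked_fill(~allowed, float("-inf"))
    return masked_scores.topk(k_experts).indices
```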

2.2.3 Auxiliary Loss for Load Balance

We take the load balance into consideration for automatically learned routing strategies. Firstly, an unbalanced load will raise the risk of routing collapse (Shazeer et al., 2017), preventing some experts from being fully trained and utilized. Secondly, when expert parallelism is employed, an unbalanced load will diminish computation efficiency. During the training of DeepSeek-V2, we design three kinds of auxiliary losses, for controlling the expert-level load balance ($\mathcal{L}_{\mathrm{ExpBal}}$), the device-level load balance ($\mathcal{L}_{\mathrm{DevBal}}$), and the communication balance ($\mathcal{L}_{\mathrm{CommBal}}$), respectively.

Expert-Level Balance Loss.

We use an expert-level balance loss (Fedus et al., 2021; Lepikhin et al., 2021) to mitigate the risk of routing collapse:

$$\mathcal{L}_{\mathrm{ExpBal}} = \alpha_{1} \sum_{i=1}^{N_{r}} f_{i} P_{i}, \qquad (23)$$

$$f_{i} = \frac{N_{r}}{K_{r} T} \sum_{t=1}^{T} \mathds{1}(\text{Token } t \text{ selects Expert } i), \qquad (24)$$

$$P_{i} = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}, \qquad (25)$$

where $\alpha_{1}$ is a hyper-parameter called the expert-level balance factor; $\mathds{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence.
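In code, the loss can be computed directly from the affinity matrix and the top-K routing decisions; below is a sketch for a single sequence of $T$ tokens, with the tensor layout assumed for illustration.

```python
import torch

def expert_balance_loss(affinity, topk_idx, alpha1):
    """affinity: (T, N_r) softmax affinities s_{i,t}; topk_idx: (T, K_r) selected expert ids.
    Implements Eqs. (23)-(25)."""
    T, n_routed = affinity.shape
    k_routed = topk_idx.shape[1]
    selected = torch.zeros_like(affinity)
    selected.scatter_(1, topk_idx, 1.0)                   # indicator: token t selects expert i
    f = selected.sum(dim=0) * n_routed / (k_routed * T)   # f_i, Eq. (24)
    P = affinity.mean(dim=0)                              # P_i, Eq. (25)
    return alpha1 * (f * P).sum()                         # Eq. (23)
```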

Device-Level Balance Loss.

In addition to the expert-level balance loss, we additionally design a device-level balance loss to ensure balanced computation across different devices. In the training process of DeepSeek-V2, we partition all routed experts into $D$ groups $\{\mathcal{E}_{1}, \mathcal{E}_{2}, \ldots, \mathcal{E}_{D}\}$, and deploy each group on a single device. The device-level balance loss is computed as follows:

$$\mathcal{L}_{\mathrm{DevBal}} = \alpha_{2} \sum_{i=1}^{D} f_{i}^{\prime} P_{i}^{\prime}, \qquad (26)$$

$$f_{i}^{\prime} = \frac{1}{|\mathcal{E}_{i}|} \sum_{j \in \mathcal{E}_{i}} f_{j}, \qquad (27)$$

$$P_{i}^{\prime} = \sum_{j \in \mathcal{E}_{i}} P_{j}, \qquad (28)$$

where $\alpha_{2}$ is a hyper-parameter called the device-level balance factor.

Communication Balance Loss.

Finally, we introduce a communication balance loss to ensure that the communication of each device is balanced. Although the device-limited routing mechanism guarantees that the sending communication of each device is bounded, if a certain device receives more tokens than other devices, the practical communication efficiency will also be affected. In order to mitigate this issue, we design a communication balance loss as follows:

$$\mathcal{L}_{\mathrm{CommBal}} = \alpha_{3} \sum_{i=1}^{D} f_{i}^{\prime\prime} P_{i}^{\prime\prime}, \qquad (29)$$

$$f_{i}^{\prime\prime} = \frac{D}{M T} \sum_{t=1}^{T} \mathds{1}(\text{Token } t \text{ is sent to Device } i), \qquad (30)$$

$$P_{i}^{\prime\prime} = \sum_{j \in \mathcal{E}_{i}} P_{j}, \qquad (31)$$

where $\alpha_{3}$ is a hyper-parameter called the communication balance factor. The device-limited routing mechanism operates on the principle of ensuring that each device transmits at most $MT$ hidden states to other devices. Simultaneously, the communication balance loss is employed to encourage each device to receive around $MT$ hidden states from other devices. The communication balance loss guarantees a balanced exchange of information among devices, promoting efficient communications.
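Continuing the sketch above, the device-level and communication balance losses reduce to aggregations of the per-expert statistics $f_i$ and $P_i$ over the expert groups $\mathcal{E}_i$; the grouping and routing records passed in below are assumed inputs, not the actual training interfaces.

```python
import torch

def device_balance_loss(f, P, expert_groups, alpha2):
    """Eqs. (26)-(28). f, P: (N_r,) per-expert statistics from Eqs. (24)-(25);
    expert_groups: list of D index tensors, one per device (the groups E_i)."""
    loss = 0.0
    for group in expert_groups:
        loss = loss + f[group].mean() * P[group].sum()     # f'_i * P'_i
    return alpha2 * loss

def communication_balance_loss(token_to_devices, P, expert_groups, num_devices, m, alpha3):
    """Eqs. (29)-(31). token_to_devices: (T, M) device ids each token is sent to after
    device-limited routing; m is the device limit M."""
    T = token_to_devices.shape[0]
    received = torch.zeros(num_devices)
    received.scatter_add_(0, token_to_devices.reshape(-1),
                          torch.ones(token_to_devices.numel()))
    f2 = received * num_devices / (m * T)                          # f''_i, Eq. (30)
    P2 = torch.stack([P[group].sum() for group in expert_groups])  # P''_i, Eq. (31)
    return alpha3 * (f2 * P2).sum()                                # Eq. (29)
```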

2.2.4 Token-Dropping Strategy

While balance losses aim to encourage a balanced load, it is important to acknowledge that they cannot guarantee a strict load balance. In order to further mitigate the computation wastage caused by unbalanced load, we introduce a device-level token-dropping strategy during training. This approach first computes the average computational budget for each device, which means that the capacity factor for each device is equivalent to 1.0. Then, inspired by Riquelme et al. (2021), we drop tokens with the lowest affinity scores on each device until reaching the computational budget. In addition, we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped. In this way, we can flexibly decide whether to drop tokens during inference according to the efficiency requirements, and always ensure consistency between training and inference.
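A sketch of the dropping step for the tokens assigned to one device is given below; the boolean `protected` mask (marking tokens from the roughly 10% of training sequences that are never dropped) and the function signature are illustrative assumptions.

```python
import torch

def drop_tokens_to_capacity(scores, protected, capacity):
    """scores: (n,) affinity scores of tokens routed to this device; protected: (n,) bool;
    capacity: the device's average computational budget (capacity factor 1.0).
    Returns a boolean keep mask."""
    keep = protected.clone()
    budget_left = max(capacity - int(protected.sum()), 0)
    unprotected = (~protected).nonzero(as_tuple=True)[0]
    if budget_left > 0 and unprotected.numel() > 0:
        # keep the highest-affinity unprotected tokens until the budget is exhausted
        order = scores[unprotected].argsort(descending=True)
        keep[unprotected[order[:budget_left]]] = True
    return keep
```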

3 Pre-Training

3.1 Experimental Setups

3.1.1 Data Construction

While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024), we extend the amount of data and elevate the data quality. In order to enlarge our pre-training corpus, we explore the potential of the internet data and optimize our cleaning processes, thus recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese data, aiming to better leverage the corpus available on the Chinese internet. In addition to the amount of data, we also focus on the data quality. We enrich our pre-training corpus with high-quality data from various sources, and meanwhile improve the quality-based filtering algorithm. The improved algorithm ensures that a large amount of non-beneficial data will be removed, while the valuable data will be mostly retained. In addition, we filter out the contentious content from our pre-training corpus to mitigate the data bias introduced from specific regional cultures. A detailed discussion about the influence of this filtering strategy is presented in Appendix E.

We adopt the same tokenizer as used in DeepSeek 67B, which is built based on the Byte-level Byte-Pair Encoding (BBPE) algorithm and has a vocabulary size of 100K. Our tokenized pre-training corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.

3.1.2 Hyper-Parameters

Model Hyper-Parameters.

We set the number of Transformer layers to 60 and the hidden dimension to 5120. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads $n_{h}$ to 128 and the per-head dimension $d_{h}$ to 128. The KV compression dimension $d_{c}$ is set to 512, and the query compression dimension $d_{c}^{\prime}$ is set to 1536. For the decoupled queries and key, we set the per-head dimension $d_{h}^{R}$ to 64. Following Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training. Under this configuration, DeepSeek-V2 comprises 236B total parameters, of which 21B are activated for each token.
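For reference, the architectural hyper-parameters quoted above can be gathered into a single configuration object; the class and field names below are invented for illustration and do not correspond to the released checkpoint format.

```python
from dataclasses import dataclass

@dataclass
class DeepSeekV2ConfigSketch:
    num_layers: int = 60
    hidden_size: int = 5120
    num_attention_heads: int = 128        # n_h
    head_dim: int = 128                   # d_h
    kv_compression_dim: int = 512         # d_c
    query_compression_dim: int = 1536     # d_c'
    rope_head_dim: int = 64               # d_h^R for the decoupled queries and key
    num_shared_experts: int = 2
    num_routed_experts: int = 160
    num_activated_routed_experts: int = 6
    expert_intermediate_size: int = 1536
    init_std: float = 0.006
```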

Training Hyper-Parameters.

We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to $\beta_{1} = 0.9$, $\beta_{2} = 0.95$, and $\mathrm{weight\_decay} = 0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy (DeepSeek-AI, 2024). Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training about 60% of tokens, and again by 0.316 after training about 90% of tokens. The maximum learning rate is set to $2.4 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We also use a batch size scheduling strategy, where the batch size is gradually increased from 2304 to 9216 in the training of the first 225B tokens, and then keeps 9216 in the remaining training. We set the maximum sequence length to 4K, and train DeepSeek-V2 on 8.1T tokens. We leverage pipeline parallelism to deploy different layers of a model on different devices, and for each layer, the routed experts will be uniformly deployed on 8 devices ($D = 8$). As for the device-limited routing, each token will be sent to at most 3 devices ($M = 3$). As for balance losses, we set $\alpha_{1}$ to 0.003, $\alpha_{2}$ to 0.05, and $\alpha_{3}$ to 0.02. We employ the token-dropping strategy during training for acceleration, but do not drop any tokens for evaluation.
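The warmup-and-step-decay schedule can be expressed as a small helper; the warmup length, decay points (about 60% and 90% of trained tokens), decay factor 0.316, and peak learning rate follow the text, while the function signature is an illustrative assumption.

```python
def learning_rate(step, tokens_seen, total_tokens, max_lr=2.4e-4, warmup_steps=2000):
    """Warmup-and-step-decay learning rate schedule (illustrative sketch)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps      # linear warmup over the first 2K steps
    frac = tokens_seen / total_tokens
    if frac < 0.6:
        return max_lr
    if frac < 0.9:
        return max_lr * 0.316                    # first decay after about 60% of tokens
    return max_lr * 0.316 * 0.316                # second decay after about 90% of tokens
```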

3.1.3 Infrastructures

DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer, 2023), an efficient and light-weight training framework developed internally by our engineers. It employs a 16-way zero-bubble pipeline parallelism (Qi et al., 2023), an 8-way expert parallelism (Lepikhin et al., 2021), and ZeRO-1 data parallelism (Rajbhandari et al., 2020). Given that DeepSeek-V2 has relatively few activated parameters, and a portion of the operators are recomputed to save activation memory, it can be trained without the necessity of tensor parallelism, thereby decreasing the communication overhead. Moreover, in order to further improve the training efficiency, we overlap the computation of shared experts with the expert parallel all-to-all communication. We also customize faster CUDA kernels for communications, routing algorithms, and fused linear computations across different experts. In addition, MLA is also optimized based on an improved version of FlashAttention-2 (Dao, 2023).

We conduct all experiments on a cluster equipped with NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected using NVLink and NVSwitch within nodes. Across nodes, InfiniBand interconnects are utilized to facilitate communications.

Figure 4: Evaluation results on the “Needle In A Haystack” (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.

3.1.4 Long Context Extension

After the initial pre-training of DeepSeek-V2, we employ YaRN (Peng et al., 2023) to extend the default context window length from 4K to 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}^{R}_{t}$ as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy. The factor $\sqrt{t}$ is computed as $\sqrt{t} = 0.0707 \ln s + 1$, aiming at minimizing the perplexity.
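The adjusted length scaling factor is a one-line computation; for the reported scale $s = 40$ it evaluates to $\sqrt{t} \approx 1.26$.

```python
import math

def yarn_attention_scale(s: float) -> float:
    """Length scaling factor sqrt(t) = 0.0707 * ln(s) + 1 used to modulate attention entropy."""
    return 0.0707 * math.log(s) + 1.0

print(yarn_attention_scale(40))  # ~1.26 for the scale used in DeepSeek-V2
```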

We additionally train the model for 1000 steps, with a sequence length of 32K and a batch size of 576 sequences. Although the training is conducted solely at the sequence length of 32K, the model still demonstrates robust performance when being evaluated at a context length of 128K. As shown in Figure 4, the results on the “Needle In A Haystack” (NIAH) tests indicate that DeepSeek-V2 performs well across all context window lengths up to 128K.

3.2 Evaluations

3.2.1 Evaluation Benchmarks

DeepSeek-V2 is pretrained on a bilingual corpus, so we evaluate it on a series of benchmarks in English and Chinese. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Included benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese:

Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).

Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).

Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).

Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019), and CMRC (Cui et al., 2019).

Reference disambiguation datasets include WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020).

Language modeling datasets include Pile (Gao et al., 2020).

Chinese understanding and culture datasets include CHID (Zheng et al., 2019) and CCPM (Li et al., 2021).

Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023).

Code datasets include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).

Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.

Following our previous work (DeepSeek-AI, 2024), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, ARC-Easy, ARC-Challenge, CHID, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models with different tokenizers.

For an intuitive overview of these benchmarks, we additionally provide our evaluation formats for each benchmark in Appendix G.

3.2.2 Evaluation Results

| Benchmark (Metric) | # Shots | DeepSeek 67B | Qwen1.5 72B | Mixtral 8x22B | LLaMA 3 70B | DeepSeek-V2 |
|---|---|---|---|---|---|---|
| Architecture | - | Dense | Dense | MoE | Dense | MoE |
| # Activated Params | - | 67B | 72B | 39B | 70B | 21B |
| # Total Params | - | 67B | 72B | 141B | 70B | 236B |
| English | | | | | | |
| Pile-test (BPB) | - | 0.642 | 0.637 | 0.623 | 0.602 | 0.606 |
| BBH (EM) | 3-shot | 68.7 | 59.9 | 78.9 | 81.0 | 78.9 |
| MMLU (Acc.) | 5-shot | 71.3 | 77.2 | 77.6 | 78.9 | 78.5 |
| DROP (F1) | 3-shot | 69.7 | 71.5 | 80.4 | 82.5 | 80.1 |
| ARC-Easy (Acc.) | 25-shot | 95.3 | 97.1 | 97.3 | 97.9 | 97.6 |
| ARC-Challenge (Acc.) | 25-shot | 86.4 | 92.8 | 91.2 | 93.3 | 92.4 |
| HellaSwag (Acc.) | 10-shot | 86.3 | 85.8 | 86.6 | 87.9 | 84.2 |
| PIQA (Acc.) | 0-shot | 83.6 | 83.3 | 83.6 | 85.0 | 83.7 |
| WinoGrande (Acc.) | 5-shot | 84.9 | 82.4 | 83.7 | 85.7 | 84.9 |
| RACE-Middle (Acc.) | 5-shot | 69.9 | 63.4 | 73.3 | 73.3 | 73.1 |
| RACE-High (Acc.) | 5-shot | 50.7 | 47.0 | 56.7 | 57.9 | 52.7 |
| TriviaQA (EM) | 5-shot | 78.9 | 73.1 | 82.1 | 81.6 | 79.9 |
| NaturalQuestions (EM) | 5-shot | 36.6 | 35.6 | 39.6 | 40.2 | 38.7 |
| AGIEval (Acc.) | 0-shot | 41.3 | 64.4 | 43.4 | 49.8 | 51.2 |
| Code | | | | | | |
| HumanEval (Pass@1) | 0-shot | 45.1 | 43.9 | 53.1 | 48.2 | 48.8 |
| MBPP (Pass@1) | 3-shot | 57.4 | 53.6 | 64.2 | 68.6 | 66.6 |
| CRUXEval-I (Acc.) | 2-shot | 42.5 | 44.3 | 52.4 | 49.4 | 52.8 |
| CRUXEval-O (Acc.) | 2-shot | 41.0 | 42.3 | 52.8 | 54.3 | 49.8 |
| Math | | | | | | |
| GSM8K (EM) | 8-shot | 63.4 | 77.9 | 80.3 | 83.0 | 79.2 |
| MATH (EM) | 4-shot | 18.7 | 41.4 | 42.5 | 42.2 | 43.6 |
| CMath (EM) | 3-shot | 63.0 | 77.8 | 72.3 | 73.9 | 78.7 |
| Chinese | | | | | | |
| CLUEWSC (EM) | 5-shot | 81.0 | 80.5 | 77.5 | 78.3 | 82.2 |
| C-Eval (Acc.) | 5-shot | 66.1 | 83.7 | 59.6 | 67.5 | 81.7 |
| CMMLU (Acc.) | 5-shot | 70.8 | 84.3 | 60.0 | 69.3 | 84.0 |
| CMRC (EM) | 1-shot | 73.4 | 66.6 | 73.1 | 73.3 | 77.5 |
| C3 (Acc.) | 0-shot | 75.3 | 78.2 | 71.4 | 74.0 | 77.4 |
| CHID (Acc.) | 0-shot | 92.1 | - | 57.0 | 83.2 | 92.7 |
| CCPM (Acc.) | 0-shot | 88.5 | 88.1 | 61.0 | 68.1 | 93.1 |
Table 2: Comparison among DeepSeek-V2 and other representative open-source models. All models are evaluated in our internal framework and share the same evaluation setting. Bold denotes the best and underline denotes the second-best. Scores with a gap smaller than 0.3 are regarded as at the same level. With only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models.

In Table 2, we compare DeepSeek-V2 with several representative open-source models, including DeepSeek 67B (DeepSeek-AI, 2024) (our previous release), Qwen1.5 72B (Bai et al., 2023), LLaMA3 70B (AI@Meta, 2024), and Mixtral 8x22B (Mistral, 2024). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Overall, with only 21B activated parameters, DeepSeek-V2 significantly outperforms DeepSeek 67B on almost all benchmarks, and achieves top-tier performance among open-source models.

Further, we elaborately compare DeepSeek-V2 with its open-source counterparts one by one. (1) Compared with Qwen1.5 72B, another model that supports both Chinese and English, DeepSeek-V2 demonstrates overwhelming advantages on the majority of English, code, and math benchmarks. As for Chinese benchmarks, Qwen1.5 72B shows better performance on multi-subject multiple-choice tasks while DeepSeek-V2 is comparable or better on others. Note that for the CHID benchmark, the tokenizer of Qwen1.5 72B will encounter errors in our evaluation framework, so we leave the CHID score blank for Qwen1.5 72B. (2) Compared with Mixtral 8x22B, DeepSeek-V2 achieves comparable or better English performance, except for TriviaQA, NaturalQuestions, and HellaSwag, which are closely related to English commonsense knowledge. Notably, DeepSeek-V2 outperforms Mixtral 8x22B on MMLU. On code and math benchmarks, DeepSeek-V2 demonstrates comparable performance with Mixtral 8x22B. Since Mixtral 8x22B is not specifically trained on Chinese data, its Chinese capability lags far behind DeepSeek-V2. (3) Compared with LLaMA3 70B, DeepSeek-V2 is trained on fewer than a quarter of English tokens. Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B. However, even with much fewer training tokens and activated parameters, DeepSeek-V2 still demonstrates comparable code and math capability with LLaMA3 70B. Also, as a bilingual language model, DeepSeek-V2 outperforms LLaMA3 70B overwhelmingly on Chinese benchmarks.

Finally, it is worth mentioning that certain prior studies (Hu et al., 2024) incorporate SFT data during the pre-training stage, whereas DeepSeek-V2 has never been exposed to SFT data during pre-training.

3.2.3 Training and Inference Efficiency

Training Costs.

Since DeepSeek-V2 activates fewer parameters for each token and requires fewer FLOPs than DeepSeek 67B, training DeepSeek-V2 will be more economical than training DeepSeek 67B theoretically. Although training an MoE model will introduce additional communication overheads, through our operator and communication optimizations, the training for DeepSeek-V2 can attain a relatively high Model FLOPs Utilization (MFU). During our practical training on the H800 cluster, for training on each trillion tokens, DeepSeek 67B requires 300.6K GPU hours, while DeepSeek-V2 needs only 172.8K GPU hours, i.e., sparse DeepSeek-V2 can save 42.5% training costs compared with dense DeepSeek 67B.

Inference Efficiency.

In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average. Benefiting from MLA and these optimizations, actually deployed DeepSeek-V2 requires significantly less KV cache than DeepSeek 67B, and thus can serve a much larger batch size. We evaluate the generation throughput of DeepSeek-V2 based on the prompt and generation length distribution from the actually deployed DeepSeek 67B service. On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.

4 Alignment

4.1 Supervised Fine-Tuning

Building upon our prior research (DeepSeek-AI, 2024), we curate our instruction tuning datasets to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory responses and enhance writing proficiency. We fine-tune DeepSeek-V2 with 2 epochs, and the learning rate is set to $5 \times 10^{-6}$. For the evaluation of DeepSeek-V2 Chat (SFT), we mainly include generation-based benchmarks, except for several representative multiple-choice tasks (MMLU and ARC). We also conduct an instruction-following evaluation (IFEval) (Zhou et al., 2023) for DeepSeek-V2 Chat (SFT), using prompt-level loose accuracy as the metric. Moreover, we employ LiveCodeBench (Jain et al., 2024) questions from September 1st, 2023 to April 1st, 2024 to evaluate chat models. In addition to the standard benchmarks, we further evaluate our model on open-ended conversation benchmarks including MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and AlignBench (Liu et al., 2023). For comparison, we also evaluate Qwen1.5 72B Chat, LLaMA-3-70B Instruct, and Mistral-8x22B Instruct in our evaluation framework and settings. As for DeepSeek 67B Chat, we directly refer to the evaluation results reported in our previous release.

4.2 Reinforcement Learning

In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we conduct Reinforcement Learning (RL) to adjust its preference.

Reinforcement Learning Algorithm.

In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1}, o_{2}, \cdots, o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q), \{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)\right] \frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_{i}|q)}{\pi_{\theta_{old}}(o_{i}|q)}A_{i},\ \operatorname{clip}\left(\frac{\pi_{\theta}(o_{i}|q)}{\pi_{\theta_{old}}(o_{i}|q)},\, 1-\varepsilon,\, 1+\varepsilon\right)A_{i}\right) - \beta\, \mathbb{D}_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right)\right), \qquad (32)$$

$$\mathbb{D}_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right) = \frac{\pi_{ref}(o_{i}|q)}{\pi_{\theta}(o_{i}|q)} - \log\frac{\pi_{ref}(o_{i}|q)}{\pi_{\theta}(o_{i}|q)} - 1, \qquad (33)$$

where $\varepsilon$ and $\beta$ are hyper-parameters; and $A_{i}$ is the advantage, computed using a group of rewards $\{r_{1}, r_{2}, \ldots, r_{G}\}$ corresponding to the outputs within each group:

$$A_{i} = \frac{r_{i} - \operatorname{mean}(\{r_{1}, r_{2}, \cdots, r_{G}\})}{\operatorname{std}(\{r_{1}, r_{2}, \cdots, r_{G}\})}. \qquad (34)$$
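A compact sketch of the GRPO objective for a single question follows, with the group-relative advantage of Equation (34) and the KL estimate of Equation (33); the sequence-level log-probability inputs and the hyper-parameter values are assumptions for illustration.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """logp_new / logp_old / logp_ref: (G,) log-probabilities of the G sampled outputs
    under the current, old, and reference policies; rewards: (G,) scalar rewards."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)    # Eq. (34)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1            # Eq. (33)
    # Eq. (32): maximize the objective, i.e. minimize the negative group mean
    return -(surrogate - beta * kl).mean()
```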
Training Strategy.

In our preliminary experiments, we find that the RL training on reasoning data, such as code and math prompts, exhibits unique characteristics that are distinct from the training on general data. For example, the mathematical and coding abilities of our model can keep improving over a longer period of training steps. Therefore, we employ a two-stage RL training strategy, which first performs reasoning alignment, and then performs human preference alignment. In the first reasoning alignment stage, we train a reward model $RM_{reasoning}$ for code and math reasoning tasks, and optimize the policy model with the feedback of $RM_{reasoning}$:

$$r_{i} = RM_{reasoning}(o_{i}). \qquad (35)$$

In the second human preference alignment stage, we adopt a multi-reward framework, which acquires rewards from a helpful reward model $RM_{helpful}$, a safety reward model $RM_{safety}$, and a rule-based reward model $RM_{rule}$. The final reward of a response $o_{i}$ is

$$r_{i} = c_{1} \cdot RM_{helpful}(o_{i}) + c_{2} \cdot RM_{safety}(o_{i}) + c_{3} \cdot RM_{rule}(o_{i}), \qquad (36)$$

where $c_{1}$, $c_{2}$, and $c_{3}$ are corresponding coefficients.
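A minimal sketch of the multi-reward combination in Equation (36); the reward-model callables and coefficient values are placeholders, since the coefficients are not reported.

```python
def final_reward(rm_helpful, rm_safety, rm_rule, response, c1=1.0, c2=1.0, c3=1.0):
    """Eq. (36): weighted sum of helpful, safety, and rule-based rewards (placeholder weights)."""
    return (c1 * rm_helpful(response)
            + c2 * rm_safety(response)
            + c3 * rm_rule(response))
```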

In order to obtain reliable reward models that play crucial roles in the RL training, we carefully collect preference data, and meticulously conduct quality filtering and proportion adjustments. We obtain code preference data based on compiler-feedback, and mathematical preference data based on the ground-truth labels. For reward model training, we initialize the reward models with DeepSeek-V2 Chat (SFT) and train them with either a point-wise or a pair-wise loss. In our experiments, we observe that the RL training can fully tap into and activate the potential of our model, enabling it to select the correct and satisfactory answer from possible responses.

Optimizations for Training Efficiency.

Conducting RL training on extremely large models places high demands on the training framework. It requires careful engineering optimization to manage the GPU memory and RAM pressure, and meanwhile maintain a fast training speed. For this goal, we implement the following engineering optimizations. (1) Firstly, we propose a hybrid engine that adopts different parallel strategies for training and inference respectively to achieve higher GPU utilization. (2) Secondly, we leverage vLLM (Kwon et al., 2023) with large batch sizes as our inference backend to accelerate the inference speed. (3) Thirdly, we carefully design a scheduling strategy for offloading models to CPUs and loading models back to GPUs, which achieves a near-optimal balance between the training speed and memory consumption.

4.3 Evaluation Results

Evaluations on Standard Benchmarks. Initially, we evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on standard benchmarks. Notably, DeepSeek-V2 Chat (SFT) demonstrates substantial improvements in GSM8K, MATH, and HumanEval evaluations compared with its base version. This progress can be attributed to the inclusion of our SFT data, which comprises a considerable volume of math and code related content. In addition, DeepSeek-V2 Chat (RL) further boosts the performance on math and code benchmarks. We show more code and math evaluations in Appendix F.

As for the comparisons with other models, we first compare DeepSeek-V2 Chat (SFT) with Qwen1.5 72B Chat, and find that DeepSeek-V2 Chat (SFT) surpasses Qwen1.5 72B Chat on almost all of English, math, and code benchmarks. On Chinese benchmarks, DeepSeek-V2 Chat (SFT) demonstrates slightly lower scores than Qwen1.5 72B Chat on multi-subject multiple-choice tasks, consistent with the performance observed from their base versions. When compared with the state-of-the-art open-source MoE model, Mixtral 8x22B Instruct, DeepSeek-V2 Chat (SFT) exhibits better performance on most benchmarks, except for NaturalQuestions and IFEval. Furthermore, in comparison to the state-of-the-art open-source model LLaMA3 70B Chat, DeepSeek-V2 Chat (SFT) shows similar performance in code and math related benchmarks. LLaMA3 70B Chat exhibits better performance on MMLU and IFEval, while DeepSeek-V2 Chat (SFT) showcases stronger performance on Chinese tasks. Ultimately, DeepSeek-V2 Chat (RL) demonstrates further enhanced performance in both mathematical and coding tasks compared with DeepSeek-V2 Chat (SFT). These comparisons highlight the strengths of DeepSeek-V2 Chat in relation to other language models in various domains and languages.

| Benchmark | # Shots | DeepSeek 67B Chat | Qwen 1.5 72B Chat | LLaMA3 70B Inst. | Mixtral 8x22B Inst. | DeepSeek-V2 Chat (SFT) | DeepSeek-V2 Chat (RL) |
|---|---|---|---|---|---|---|---|
| Context Length | - | 4K | 32K | 8K | 64K | 128K | 128K |
| Architecture | - | Dense | Dense | Dense | MoE | MoE | MoE |
| # Activated Params | - | 67B | 72B | 70B | 39B | 21B | 21B |
| # Total Params | - | 67B | 72B | 70B | 141B | 236B | 236B |
| English | | | | | | | |
| TriviaQA | 5-shot | 81.5 | 79.6 | 69.1 | 80.0 | 85.4 | 86.7 |
| NaturalQuestions | 5-shot | 47.0 | 46.9 | 44.6 | 54.9 | 51.9 | 53.4 |
| MMLU | 5-shot | 71.1 | 76.2 | 80.3 | 77.8 | 78.4 | 77.8 |
| ARC-Easy | 25-shot | 96.6 | 96.8 | 96.9 | 97.1 | 97.6 | 98.1 |
| ARC-Challenge | 25-shot | 88.9 | 91.7 | 92.6 | 90.0 | 92.5 | 92.3 |
| BBH | 3-shot | 71.7 | 65.9 | 80.1 | 78.4 | 81.3 | 79.7 |
| AGIEval | 0-shot | 46.4 | 62.8 | 56.6 | 41.4 | 63.2 | 61.4 |
| IFEval | 0-shot | 55.5 | 57.3 | 79.7 | 72.1 | 64.1 | 63.8 |
| Code | | | | | | | |
| HumanEval | 0-shot | 73.8 | 68.9 | 76.2 | 75.0 | 76.8 | 81.1 |
| MBPP | 3-shot | 61.4 | 52.2 | 69.8 | 64.4 | 70.4 | 72.0 |
| CRUXEval-I-COT | 2-shot | 49.1 | 51.4 | 61.1 | 59.4 | 59.5 | 61.5 |
| CRUXEval-O-COT | 2-shot | 50.9 | 56.5 | 63.6 | 63.6 | 60.7 | 63.0 |
| LiveCodeBench | 0-shot | 18.3 | 18.8 | 30.5 | 25.0 | 28.7 | 32.5 |
| Math | | | | | | | |
| GSM8K | 8-shot | 84.1 | 81.9 | 93.2 | 87.9 | 90.8 | 92.2 |
| MATH | 4-shot | 32.6 | 40.6 | 48.5 | 49.8 | 52.7 | 53.9 |
| CMath | 0-shot | 80.3 | 82.8 | 79.2 | 75.1 | 82.0 | 81.9 |
| Chinese | | | | | | | |
| CLUEWSC | 5-shot | 78.5 | 90.1 | 85.4 | 75.8 | 88.6 | 89.9 |
| C-Eval | 5-shot | 65.2 | 82.2 | 67.9 | 60.0 | 80.9 | 78.0 |
| CMMLU | 5-shot | 67.8 | 82.9 | 70.7 | 61.0 | 82.4 | 81.6 |
Table 3: Comparison among DeepSeek-V2 Chat (SFT), DeepSeek-V2 Chat (RL), and other representative open-source chat models. Regarding TriviaQA and NaturalQuestions, it is worth noting that chat models, such as LLaMA3 70B Instruct, might not strictly adhere to the format constraints typically specified in the few-shot setting. Consequently, this can lead to underestimation of certain models in our evaluation framework.

Evaluations on Open-Ended Generation. We proceed with additional evaluations of our models on open-ended conversation benchmarks. For English open-ended conversation generation, we utilize MT-Bench and AlpacaEval 2.0 as the benchmarks. Evaluation results presented in Table 4 demonstrate a significant performance advantage of DeepSeek-V2 Chat (RL) over DeepSeek-V2 Chat (SFT). This outcome showcases the effectiveness of our RL training in achieving improved alignment. In comparison to other open-source models, DeepSeek-V2 Chat (RL) demonstrates superior performance over Mistral 8x22B Instruct and Qwen1.5 72B Chat on both benchmarks. When compared with LLaMA3 70B Instruct, DeepSeek-V2 Chat (RL) showcases competitive performance on MT-Bench and notably outperforms it on AlpacaEval 2.0. These results highlight the strong performance of DeepSeek-V2 Chat (RL) in generating high-quality and contextually relevant responses, particularly in instruction-based conversation tasks.

Model MT-Bench AlpacaEval 2.0
DeepSeek 67B Chat 8.35 16.6
Mistral 8x22B Instruct v0.1 8.66 30.9
Qwen1.5 72B Chat 8.61 36.6
LLaMA3 70B Instruct 8.95 34.4
DeepSeek-V2 Chat (SFT) 8.62 30.0
DeepSeek-V2 Chat (RL) 8.97 38.9
Table 4: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.

In addition, we evaluate the Chinese open-ended generation capability based on AlignBench. As presented in Table 5, DeepSeek-V2 Chat (RL) exhibits a slight advantage over DeepSeek-V2 Chat (SFT). Notably, DeepSeek-V2 Chat (SFT) surpasses all open-source Chinese models by a significant margin. It significantly outperforms the second-best open-source model, Qwen1.5 72B Chat on both Chinese reasoning and language. Moreover, both DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) outperform GPT-4-0613 and ERNIEBot 4.0, solidifying the position of our models in the top-tier LLMs that support Chinese. Specifically, DeepSeek-V2 Chat (RL) shows remarkable performance in Chinese language understanding, which outperforms all models including GPT-4-Turbo-1106-Preview. On the other hand, the reasoning capability of DeepSeek-V2 Chat (RL) still lags behind giant models, such as Erniebot-4.0 and GPT-4s.

Model Overall Reasoning Language
  Avg. Math. Logi. Avg. Fund. Chi. Open. Writ. Role. Pro.
(Column abbreviations: under Reasoning, Avg. = average score, Math. = mathematical calculation, Logi. = logical reasoning; under Language, Avg. = average score, Fund. = fundamental tasks, Chi. = Chinese understanding, Open. = open-ended QA, Writ. = text writing, Role. = role playing, Pro. = professional capability.)
GPT-4-1106-Preview GPT-4-1106-预览 8.01 7.73 7.80 7.66 8.29 7.99 7.33 8.61 8.67 8.47 8.65
DeepSeek-V2 Chat (RL) DeepSeek-V2 聊天 (RL) 7.91 7.45 7.77 7.14 8.36 8.10 8.28 8.37 8.53 8.33 8.53
ERNIEBot-4.0-202404*(文心一言) 7.89 7.61 7.81 7.41 8.17 7.56 8.53 8.13 8.45 8.24 8.09
DeepSeek-V2 Chat (SFT) DeepSeek-V2 聊天 (SFT) 7.74 7.30 7.34 7.26 8.17 8.04 8.26 8.13 8.00 8.10 8.49
GPT-4-0613 7.53 7.47 7.56 7.37 7.59 7.81 6.93 7.42 7.93 7.51 7.94
ERNIEBot-4.0-202312*(文心一言) 7.36 6.84 7.00 6.67 7.88 7.47 7.88 8.05 8.19 7.84 7.85
Moonshot-v1-32k-202404*(月之暗面) 7.22 6.42 6.41 6.43 8.02 7.82 7.58 8.00 8.22 8.19 8.29
Qwen1.5-72B-Chat* Qwen1.5-72B-聊天* 7.19 6.45 6.58 6.31 7.93 7.38 7.77 8.15 8.02 8.05 8.24
DeepSeek-67B-Chat DeepSeek-67B-聊天 6.43 5.75 5.71 5.79 7.11 7.12 6.52 7.58 7.20 6.91 7.37
ChatGLM-Turbo(智谱清言) 6.24 5.00 4.74 5.26 7.49 6.82 7.17 8.16 7.77 7.76 7.24
ERNIEBot-3.5(文心一言) 6.14 5.15 5.03 5.27 7.13 6.62 7.60 7.26 7.56 6.83 6.90
Yi-34B-Chat* Yi-34B-聊天* 6.12 4.86 4.97 4.74 7.38 6.72 7.28 7.76 7.44 7.58 7.53
GPT-3.5-Turbo-0613 6.08 5.35 5.68 5.02 6.82 6.71 5.81 7.29 7.03 7.28 6.77
ChatGLM-Pro(智谱清言) 5.83 4.65 4.54 4.75 7.01 6.51 6.76 7.47 7.07 7.34 6.89
SparkDesk-V2(讯飞星火) 5.74 4.73 4.71 4.74 6.76 5.84 6.97 7.29 7.18 6.92 6.34
Qwen-14B-Chat Qwen-14B-聊天 5.72 4.81 4.91 4.71 6.63 6.90 6.36 6.74 6.64 6.59 6.56
Baichuan2-13B-Chat 百川2-13B-聊天 5.25 3.92 3.76 4.07 6.59 6.22 6.05 7.11 6.97 6.75 6.43
ChatGLM3-6B 聊天GLM3-6B 4.97 3.85 3.55 4.14 6.10 5.75 5.29 6.71 6.83 6.28 5.73
Baichuan2-7B-Chat 百川2-7B-聊天 4.97 3.66 3.56 3.75 6.28 5.81 5.50 7.13 6.84 6.53 5.84
InternLM-20B 4.96 3.66 3.39 3.92 6.26 5.96 5.50 7.18 6.19 6.49 6.22
Qwen-7B-Chat Qwen-7B-聊天 4.91 3.73 3.62 3.83 6.09 6.40 5.74 6.26 6.31 6.19 5.66
ChatGLM2-6B 聊天GLM2-6B 4.48 3.39 3.16 3.61 5.58 4.91 4.52 6.66 6.25 6.08 5.08
InternLM-Chat-7B 实习生LM-Chat-7B 3.65 2.56 2.45 2.66 4.75 4.34 4.09 5.82 4.89 5.32 4.06
Chinese-LLaMA-2-7B-Chat 中文-LLaMA-2-7B-聊天 3.57 2.68 2.29 3.07 4.46 4.31 4.26 4.50 4.63 4.91 4.13
LLaMA-2-13B-Chinese-Chat LLaMA-2-13B-中文聊天 3.35 2.47 2.21 2.73 4.23 4.13 3.31 4.79 3.93 4.53 4.71
Table 5: AlignBench leaderboard rated by GPT-4-0613. Models are ranked in descending order based on the overall score. Models marked with * represent that we evaluate them through their API service or open-weighted model, instead of referring to the results reported in their original papers. Suffixes of Erniebot-4.0 and Moonshot denote the timestamps when we called their API.

4.4 Discussion

Amount of SFT Data.

The discussion surrounding the necessity of a large SFT corpus has been a topic of intense debate. Previous works (Young et al., 2024; Zhou et al., 2024) argue that fewer than 10K instances of SFT data are enough to produce satisfactory results. However, in our experiments, we observe a significant performance decline on the IFEval benchmark if we use fewer than 10K instances. A possible explanation is that, a language model necessitates a certain amount of data to develop specific skills. Although the requisite data amount may diminish with the model size increasing, it cannot be entirely eliminated. Our observation underscores the critical need for sufficient data to equip an LLM with desired capabilities. Moreover, the quality of SFT data is also crucial, especially for tasks involving writing or open-ended questions.

Alignment Tax of Reinforcement Learning.

During human preference alignment, we observe a significant performance enhancement on the open-ended generation benchmarks, in terms of the scores rated by both AI and human evaluators. However, we also notice a phenomenon of “alignment tax” (Ouyang et al., 2022), i.e., the alignment process can negatively impact the performance on some standard benchmarks such as BBH. In order to alleviate the alignment tax, during the RL stage, we make significant efforts in data processing and improving training strategies, finally achieving a tolerable trade-off between the performance on standard and open-ended benchmarks. Exploring how to align a model with human preferences without compromising its general performance presents a valuable direction for future research.

Online Reinforcement Learning.

In our preference alignment experiments, we find that the online approach significantly outperforms the offline approach. Therefore, we invest tremendous efforts in implementing an online RL framework for aligning DeepSeek-V2. The conclusion about online or offline preference alignment can vary in different contexts, and we reserve a more thorough comparison and analysis between them for future work.

5 Conclusion, Limitation, and Future Work

In this paper, we introduce DeepSeek-V2, a large MoE language model that supports 128K context length. In addition to strong performance, it is also characterized by economical training and efficient inference, benefiting from its innovative architecture including MLA and DeepSeekMoE. In practice, compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. Evaluation results further demonstrate that with only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models and becomes the strongest open-source MoE model.

DeepSeek-V2 and its chat versions share the acknowledged limitations commonly found in other LLMs, including the lack of ongoing knowledge updates after pre-training, the possibility of generating non-factual information such as unverified advice, and a chance to produce hallucinations. In addition, since our data primarily consist of Chinese and English content, our model may exhibit limited proficiency in other languages. In scenarios beyond Chinese and English, it should be used with caution.

DeepSeek will continuously invest in open-source large models with longtermism, aiming to progressively approach the goal of artificial general intelligence.

  • In our ongoing exploration, we are dedicated to devising methods that enable further scaling up MoE models while maintaining economical training and inference costs. The goal of our next step is to achieve performance on par with GPT-4 in our upcoming release.


  • Our alignment team continuously strives to enhance our models, aiming to develop a model that is not only helpful but also honest and safe for worldwide users. Our ultimate objective is to align the values of our model with human values, while minimizing the need for human supervision. By prioritizing ethical considerations and responsible development, we are dedicated to creating a positive and beneficial impact on society.


  • Currently, DeepSeek-V2 is designed to support the text modality exclusively. In our forward-looking agenda, we intend to enable our model to support multiple modalities, enhancing its versatility and utility in a wider range of scenarios.



References 

  • AI@Meta (2024) 人工智能@元 (2024) AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
    人工智能@元。 Llama 3 模型卡,2024 年。URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
  • Ainslie et al. (2023) 安斯利等人。 (2023) J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
    J. Ainslie、J. Lee-Thorp、M. de Jong、Y. Zemlyanskiy、F. Lebrón 和 S. Sanghai。 Gqa:从多头检查点训练广义多查询变压器模型。 arXiv 预印本 arXiv:2305.13245,2023年。
  • Anthropic (2023) 人择 (2023) Anthropic. Introducing Claude, 2023. URL https://www.anthropic.com/index/introducing-claude.
    人为的。克劳德简介,2023。URL https://www.anthropic.com/index/introducing-claude
  • Austin et al. (2021) 奥斯汀等人。 (2021) J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
    J. Austin、A. Odena、M. Nye、M. Bosma、H. Michalewski、D. Dohan、E. Jiang、C. Cai、M. Terry、Q. Le 等人。使用大型语言模型进行程序综合。 arXiv 预印本 arXiv:2108.07732,2021年。
  • Bai et al. (2023) 白等人。 (2023) J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
    J. Bai,S. Bai,Y. Chu,Z. Cui,K. Dang,X. Deng,Y. Fan,W. Ge,Y. Han,F. Huang,B. Hui,L. Ji,M.李,J. Lin,R. Lin,D. Liu,G. Liu,C. Lu,K. Lu,J. Ma,R. Men,X. Ren,X. Ren,C. Tan,S. Tan, J. Tu,P. Wang,S. Wang,W. Wang,S. Wu,B. Xu,J. Xu,A. Yang,H. Yang,J. Yang,S. Yang,Y. Yao,B.于,袁宏,袁志,张建,张X,张Y,张Z,周C,周J,周X,朱T。 Qwen技术报告。 arXiv 预印本 arXiv:2309.16609 , 2023。
  • Bisk et al. (2020) 比斯克等人。 (2020) Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432–7439. AAAI Press, 2020. 10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.
    Y. Bisk、R. Zellers、RL Bras、J. 高和 Y. Choi。 PIQA:用自然语言推理物理常识。第三十四届 AAAI 人工智能大会,AAAI 2020,第三十二届人工智能创新应用大会,IAAI 2020,第十届 AAAI 人工智能教育进展研讨会,EAAI 2020,美国纽约州纽约市,2 月2020 年 7-12 日,第 7432-7439 页。 AAAI 出版社,2020 年。10.1609/aaai.v34i05.6239网址https://doi.org/10.1609/aaai.v34i05.6239
  • Chen et al. (2021) 陈等人。 (2021) M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
    M. Chen、J. Tworek、H. Jun、Q. Yuan、H. P. de Oliveira Pinto、J. Kaplan、H. Edwards、Y. Burda、N. Joseph、G. Brockman、A. Ray、R. Puri、G克鲁格、M. 彼得罗夫、H. 克拉夫、G. 萨斯特里、P. 米什金、B. 陈、S. 格雷、N. 莱德、M. 巴甫洛夫、A. 鲍尔、L. 凯撒、M. 巴伐利亚、C. 温特、P. Tillet、F. P. Such、D. Cummings、M. Plapert、F. Chantzis、E. Barnes、A. Herbert-Voss、W. H. Guss、A. Nichol、A. Paino、N. Tezak、J. Tang、I巴布施金、S.巴拉吉、S.贾恩、W.桑德斯、C.黑塞、A.N.卡尔、J.雷克、J.阿奇姆、V.米斯拉、E.森川、A.雷德福德、M.奈特、M.布伦戴奇、 M. Murati、K. Mayer、P. Welinder、B. McGrew、D. Amodei、S. McCandlish、I. Sutskever 和 W. Zaremba。评估在代码上训练的大型语言模型。 CoRR ,abs/2107.03374,2021。URL https://arxiv.org/abs/2107.03374
  • Clark et al. (2018) 克拉克等人。 (2018) P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
    P. Clark、I. Cowhey、O. Etzioni、T. Khot、A. Sabharwal、C. Schoenick 和 O. Tafjord。您认为您已经解决了问答问题吗?尝试 arc,AI2 推理挑战赛。 CoRR ,abs/1803.05457,2018。URL http://arxiv.org/abs/1803.05457
  • Cobbe et al. (2021) 科布等人。 (2021) K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
    K. Cobbe、V. Kosaraju、M. Bavarian、M. Chen、H. Jun、L. Kaiser、M. Plapert、J. Tworek、J. Hilton、R. Nakano 等。培训验证员解决数学应用题。 arXiv 预印本 arXiv:2110.14168,2021年。
  • Cui et al. (2019) 崔等人。 (2019) Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu. A span-extraction dataset for Chinese machine reading comprehension. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5883–5889, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. 10.18653/v1/D19-1600. URL https://aclanthology.org/D19-1600.
    Y. Cui、T. Liu、W. Che、L. Shaw、Z. Chen、W. Ma、S. Wang 和 G. Hu。用于中文机器阅读理解的跨度提取数据集。 K. Inui、J. Jiang、V. Ng 和 X. Wan,编辑, 2019 年自然语言处理经验方法会议和第九届自然语言处理国际联合会议 (EMNLP-IJCNLP) 会议记录,第 5883 页–5889,中国香港,2019 年 11 月。计算语言学协会。 10.18653/v1/D19-1600网址https://aclanthology.org/D19-1600
  • Dai et al. (2024) 戴等人。 (2024) D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024. URL https://doi.org/10.48550/arXiv.2401.06066.
    戴东、邓长、赵长、徐荣兴、高浩、陈东、李建、曾伟、于晓、吴勇、谢子、李玉坤、黄平、 F. 罗,C. 阮,Z. 隋,和 W. 梁。 Deepseekmoe:迈向混合专家语言模型的终极专家专业化。 CoRR ,abs/2401.06066,2024。URL https://doi.org/10.48550/arXiv.2401.06066
  • Dao (2023) 道 (2023) T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
    T.道。 FlashAttention-2:更快的注意力,更好的并行性和工作分区,2023 年。
  • DeepSeek-AI (2024) DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. URL https://doi.org/10.48550/arXiv.2401.02954.
    DeepSeek-AI。 Deepseek LLM :以长期主义扩展开源语言模型。 CoRR ,abs/2401.02954,2024。URL https://doi.org/10.48550/arXiv.2401.02954
  • Dua et al. (2019) 杜阿等人。 (2019) D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2368–2378. Association for Computational Linguistics, 2019. 10.18653/V1/N19-1246. URL https://doi.org/10.18653/v1/n19-1246.
    D. Dua、Y. Wang、P. Dasigi、G. Stanovsky、S. Singh 和 M. Gardner。 DROP:阅读理解基准,需要对段落进行离散推理。 J. Burstein、C. Doran 和 T. Solorio,编辑, 2019 年计算语言学协会北美分会会议记录:人类语言技术,NAACL-HLT 2019,美国明尼苏达州明尼阿波利斯,6 月 2 日-7,2019 年,第 1 卷(长论文和短论文) ,第 2368–2378 页。计算语言学协会,2019。10.18653 /V1/N19-1246网址https://doi.org/10.18653/v1/n19-1246
  • Dubois et al. (2024) 杜布瓦等人。 (2024) Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
    Y.Dubois、B.Galambosi、P.Liang 和 TB Hashimoto。长度控制的 alpacaeval:一种消除自动评估器偏差的简单方法。 arXiv 预印本 arXiv:2404.04475,2024年。
  • Fedus et al. (2021) 费杜斯等人。 (2021) W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.
    W. Fedus、B. Zoph 和 N. Shazeer。开关变压器:通过简单高效的稀疏性扩展到万亿参数模型。 CoRR ,abs/2101.03961,2021。URL https://arxiv.org/abs/2101.03961
  • Gao et al. (2020) 高等人。 (2020) L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
    L. 高、S. Biderman、S. Black、L. Golding、T. Hoppe、C. Foster、J. Phang、H. He、A. Thite、N. Nabeshima 等。 The Pile:用于语言建模的 800GB 不同文本数据集。 arXiv 预印本 arXiv:2101.00027,2020年。
  • Google (2023) 谷歌 (2023) Google. Introducing gemini: our largest and most capable ai model, 2023. URL https://blog.google/technology/ai/google-gemini-ai/.
    谷歌。介绍 Gemini:我们最大、最有能力的人工智能模型,2023 年。URL https://blog.google/technology/ai/google-gemini-ai/
  • Gu et al. (2024) 顾等人。 (2024) A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang. Cruxeval: A benchmark for code reasoning, understanding and execution, 2024.
    A. Gu、B. Rozière、H. Leather、A. Solar-Lezama、G. Synnaeve 和 SI Wang。 Cruxeval:代码推理、理解和执行的基准,2024 年。
  • Hendrycks et al. (2020) 亨德里克斯等人。 (2020) D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
    D. Hendrycks、C. Burns、S. Basart、A. Zou、M. Mazeika、D. Song 和 J. Steinhardt。测量大规模多任务语言理解。 arXiv 预印本 arXiv:2009.03300 , 2020。
  • Hendrycks et al. (2021) 亨德里克斯等人。 (2021) D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
    D. Hendrycks、C. Burns、S. Kadavath、A. Arora、S. Basart、E. Tang、D. Song 和 J. Steinhardt。使用数学数据集衡量数学问题的解决能力。 arXiv 预印本 arXiv:2103.03874,2021年。
  • High-flyer (2023) 雄心勃勃 (2023) High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm.
    雄心勃勃。 Hai- llm :高效且轻量的大模型训练工具,2023。URL https://www.high-flyer.cn/en/blog/hai- llm
  • Hooper et al. (2024) 胡珀等人。 (2024) C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami. Kvquant: Towards 10 million context length LLM inference with KV cache quantization. CoRR, abs/2401.18079, 2024. URL https://doi.org/10.48550/arXiv.2401.18079.
    C. Hooper、S. Kim、H. Mohammadzadeh、MW Mahoney、YS Shao、K. Keutzer 和 A. Gholami。 Kvquant:通过 KV 缓存量化实现 1000 万上下文长度的LLM推理。 CoRR ,abs/2401.18079,2024。URL https://doi.org/10.48550/arXiv.2401.18079
  • Hu et al. (2024) 胡等人。 (2024) S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
    S.胡,Y.涂,X.韩,C.何,G.崔,X.龙,Z.郑,Y.方,Y.黄,W.赵,等。 Minicpm:通过可扩展的训练策略揭示小语言模型的潜力。 arXiv 预印本 arXiv:2404.06395,2024年。
  • Huang et al. (2023) 黄等人。 (2023) Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
    黄Y.黄,Y.白,Z.朱,J.张,J.张,苏子,J.刘,C.Lv,Y.张,J.雷,等。 C-Eval:多层次、多学科的中文基础模型评估套件。 arXiv 预印本 arXiv:2305.08322,2023年。
  • Jain et al. (2024) 贾恩等人。 (2024) N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
    N. Jain、K. Han、A. Gu、W.-D。 Li、F. Yan、T. 张、S. Wang、A. Solar-Lezama、K. Sen 和 I. Stoica。 Livecodebench:对代码的大型语言模型进行全面且无污染的评估。 arXiv 预印本 arXiv:2403.07974,2024年。
  • Joshi et al. (2017) 乔希等人。 (2017) M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
    M. Joshi、E. Choi、D. Weld 和 L. Zettlemoyer。 TriviaQA:用于阅读理解的大规模远程监督挑战数据集。在 R. Barzilay 和 M.-Y. Kan,编辑,计算语言学协会第 55 届年会论文集(第一卷:长论文) ,第 1601–1611 页,加拿大温哥华,2017 年 7 月。计算语言学协会。 10.18653/v1/P17-1147网址https://aclanthology.org/P17-1147
  • Kwiatkowski et al. (2019)
    科维亚特科夫斯基等人。 (2019)
    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452–466, 2019. 10.1162/tacl_a_00276. URL https://doi.org/10.1162/tacl_a_00276.
    T. Kwiatkowski、J. Palomaki、O. Redfield、M. Collins、AP Parikh、C. Alberti、D. Epstein、I. Polosukhin、J. Devlin、K. Lee、K. Toutanova、L. Jones、M. Kelcey 、M. Chang、AM Dai、J. Uszkoreit、Q. Le 和 S. Petrov。自然问题:问答研究的基准。跨。副教授。计算。语言学,7:452–466,2019。10.1162 /tacl_a_00276网址https://doi.org/10.1162/tacl_a_00276
  • Kwon et al. (2023) 权等人。 (2023) W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
    W. Kwon、Z. Li、S. Zhuang、Y. Shen、L. Cheng、CH Yu、JE Gonzalez、H. Zhang 和 I. Stoica。通过 pagedattention 服务的大型语言模型的高效内存管理。收录于 2023 年 ACM SIGOPS 第 29 届操作系统原理研讨会论文集
  • Lai et al. (2017) 赖等人。 (2017) G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. RACE: large-scale reading comprehension dataset from examinations. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 785–794. Association for Computational Linguistics, 2017. 10.18653/V1/D17-1082. URL https://doi.org/10.18653/v1/d17-1082.
    G. Lai、Q. Xie、H. Liu、Y. Yang 和 EH Hovy。 RACE:来自考试的大规模阅读理解数据集。 M. Palmer、R. Hwa 和 S. Riedel,编辑, 2017 年自然语言处理经验方法会议论文集,EMNLP 2017,丹麦哥本哈根,2017 年 9 月 9-11 日,第 785-794 页。计算语言学协会,2017。10.18653 /V1/D17-1082网址https://doi.org/10.18653/v1/d17-1082
  • Lepikhin et al. (2021) 莱皮欣等人。 (2021) D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.
    D. Lepikhin、H. Lee、Y. Xu、D. Chen、O. Firat、Y. Huang、M. Krikun、N. Shazeer 和 Z. Chen。 Gshard:通过条件计算和自动分片扩展巨型模型。第九届学习表征国际会议,ICLR 2021 。 OpenReview.net,2021。网址https://openreview.net/forum?id=qrwe7XHTmYb
  • Li et al. (2023) 李等人。 (2023) H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.
    H.Li、Y.Zhang、F.Koto、Y.Yang、H.Zhao、Y.Gong、N.Duan 和 T.Baldwin。 CMMLU:测量中文的大规模多任务语言理解。 arXiv 预印本 arXiv:2306.09212 , 2023。
  • Li et al. (2021) 李等人。 (2021) W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021.
    W. Li、F. Qi、M. Sun、X. Yi 和 J. 张。 CCPM:中国古典诗歌匹配数据集,2021。
  • Liu et al. (2023) 刘等人。 (2023) X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. Alignbench: Benchmarking chinese alignment of large language models. CoRR, abs/2311.18743, 2023. 10.48550/ARXIV.2311.18743. URL https://doi.org/10.48550/arXiv.2311.18743.
    刘X.,雷X.,王S.,黄Y.,冯Z.冯,B.文,J.程,P.柯,Y.徐,WL Tam,X.张,L.孙,H.王、J. 张、M. 黄、Y. 董和 J. 唐。 Alignbench:大型语言模型的中文对齐基准测试。 CoRR ,abs/2311.18743,2023。10.48550 /ARXIV.2311.18743网址https://doi.org/10.48550/arXiv.2311.18743
  • Loshchilov and Hutter (2017)
    洛希洛夫与哈特 (2017)
    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
    I. 洛希洛夫和 F. 哈特。解耦权重衰减正则化。 arXiv 预印本 arXiv: 1711.05101,2017 年。
  • Mistral (2024) 米斯塔尔 (2024) Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of ai and making it accessible to all, 2024. URL https://mistral.ai/news/mixtral-8x22b.
    米斯特拉尔。更便宜、更好、更快、更强:2024 年,继续推动人工智能前沿,让所有人都能使用它。 URL https://mistral.ai/news/mixtral-8x22b
  • OpenAI (2022) 开放人工智能 (2022) OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.
    开放人工智能。 ChatGPT 简介,2022。URL https://openai.com/blog/chatgpt
  • OpenAI (2023) 开放人工智能 (2023) OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023.
    开放人工智能。 GPT4 技术报告。 arXiv 预印本 arXiv:2303.08774,2023年。
  • Ouyang et al. (2022) 欧阳等人。 (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
    L. 欧阳、J. 吴、X. 江、D. Almeida、C. Wainwright、P. Mishkin、C. 张、S. Agarwal、K. Slama、A. Ray 等。训练语言模型遵循人类反馈的指令。神经信息处理系统的进展,35:27730–27744,2022。
  • Peng et al. (2023) 彭等人。 (2023) B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
    B. Peng、J. Quesnelle、H. Fan 和 E. Shippole。 Yarn:大型语言模型的高效上下文窗口扩展。 arXiv 预印本 arXiv: 2309.00071,2023 年。
  • Qi et al. (2023) 齐等人。 (2023) P. Qi, X. Wan, G. Huang, and M. Lin. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023.
    P. Qi、X. Wan、G. Huang 和 M. Lin。零气泡管道并行性。 arXiv 预印本 arXiv:2401.10241,2023年。
  • Rajbhandari et al. (2020)
    拉杰班达里等人。 (2020)
    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
    S. Rajbhandari、J. Rasley、O. Ruwase 和 Y. He。零:针对训练万亿参数模型的内存优化。SC20:高性能计算、网络、存储和分析国际会议,第 1-16 页。 IEEE,2020。
  • Riquelme et al. (2021) 里克尔梅等人。 (2021) C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pages 8583–8595, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/48237d9f2dea8c74c2a72126cf63d933-Abstract.html.
    C. Riquelme、J. Puigcerver、B. Mustafa、M. Neumann、R. Jenatton、AS Pinto、D. Keysers 和 N. Houlsby。通过稀疏的专家组合来扩展愿景。神经信息处理系统进展 34:2021 年神经信息处理系统年会,NeurIPS 2021 ,第 8583–8595 页,2021。URL https://proceedings.neurips.cc/paper/2021/hash/48237d9f2dea8c74c2a72126cf63d933-Abstract.html
  • Sakaguchi et al. (2019) 坂口等人。 (2019) K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019.
    K. Sakaguchi、R​​L Bras、C. Bhagavatula 和 Y. Choi。 Winogrande:2019 年大规模的对抗性 winograd 模式挑战。
  • Shao et al. (2024) 邵等人。 (2024) Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
    Z. Shao、P. Wang、Q. Zhu、R. Xu、J. Song、M. 张、Y. Li、Y. Wu 和 D.Guo。 Deepseekmath:在开放语言模型中突破数学推理的极限。 arXiv 预印本 arXiv:2402.03300,2024年。
  • Shazeer (2019) 沙泽尔 (2019) N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019. URL http://arxiv.org/abs/1911.02150.
    N.沙泽尔。快速变压器解码:您只需要一个写入头。 CoRR ,abs/1911.02150,2019。网址http://arxiv.org/abs/1911.02150
  • Shazeer et al. (2017) 沙泽尔等人。 (2017) N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.
    N. Shazeer、A. Mirhoseini、K. Maziarz、A. Davis、QV Le、GE Hinton 和 J. Dean。极其庞大的神经网络:稀疏门控的专家混合层。第五届学习表征国际会议,ICLR 2017 。 OpenReview.net,2017。网址https://openreview.net/forum?id=B1ckMDqlg
  • Su et al. (2024) 苏等人。 (2024) J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
    J. Su、M. Ahmed、Y. Lu、S. Pan、W. Bo 和 Y. Liu。 Roformer:具有旋转位置嵌入的增强型变压器。神经计算,568:127063,2024。
  • Sun et al. (2019) 孙等人。 (2019) K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019.
    K. Sun、D. Yu、D. Yu 和 C. Cardie。调查挑战中文机器阅读理解的先验知识,2019。
  • Suzgun et al. (2022) 苏兹贡等人。 (2022) M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
    M. Suzgun、N. Scales、N. Schärli、S. Gehrmann、Y. Tay、HW Chung、A. Chowdhery、QV Le、EH Chi、D. Zhou 等。具有挑战性的大工作台任务以及思维链是否可以解决它们。 arXiv 预印本 arXiv:2210.09261,2022年。
  • Vaswani et al. (2017) 瓦斯瓦尼等人。 (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
    A. Vaswani、N. Shazeer、N. Parmar、J. Uszkoreit、L. Jones、AN Gomez、Ł。凯撒和 I.波洛苏欣。您所需要的就是关注。神经信息处理系统的进展,2017 年 30 月。
  • Wei et al. (2022) 魏等人。 (2022) J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
    J. Wei、Y. Tay、R. Bommasani、C. Raffel、B. Zoph、S. Borgeaud、D. Yogatama、M. Bosma、D. Zhou、D. Metzler 等人。大型语言模型的新兴能力。 arXiv 预印本 arXiv:2206.07682,2022年。
  • Wei et al. (2023) 魏等人。 (2023) T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023.
    T. Wei、J. Luan、W. Liu、S. Dong 和 B. Wang。 Cmath:你的语言模型能通过中国小学数学考试吗?,2023。
  • Xu et al. (2020) 徐等人。 (2020) L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A chinese language understanding evaluation benchmark. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4762–4772. International Committee on Computational Linguistics, 2020. 10.18653/V1/2020.COLING-MAIN.419. URL https://doi.org/10.18653/v1/2020.coling-main.419.
    徐丽、胡红、张X、李丽、曹春、李Y、徐Y、孙坤、于大、于C、田Y、董Q、W.刘,B.Shi,Y.Cui,J.Li,J.Zeng,R.Wang,W.Xie,Y.Li,Y.Patterson,Z.Tian,Y.Zhang,H.Zhou,S.Liu, Z. 赵、Q. 赵、C. Yue、X. 张、Z. 杨、K. Richardson 和 Z. Lan。线索:汉语理解评估基准。 D. Scott、N. Bel 和 C. Zong,编辑,第 28 届国际计算语言学会议论文集,COLING 2020,西班牙巴塞罗那(在线),2020 年 12 月 8-13 日,第 4762-4772 页。国际计算语言学委员会,2020。10.18653 /V1/2020.COLING-MAIN.419网址https://doi.org/10.18653/v1/2020.coling-main.419
  • Young et al. (2024) 杨等人。 (2024) A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024.
    A.Young,B.Chen,C.Li,C.Huang,G.Zhang,G.Zhang,H.Li,J.Zhu,J.Chen,J.Chang,等。 Yi:01.ai 开放基础模型。 arXiv 预印本 arXiv:2403.04652,2024年。
  • Zellers et al. (2019) 泽勒斯等人。 (2019) R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics, 2019. 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.
    R. Zellers、A. Holtzman、Y. Bisk、A. Farhadi 和 Y. Choi。 HellaSwag:机器真的能完成你的句子吗? A. Korhonen、DR Traum 和 L. Màrquez,编辑,第 57 届计算语言学协会会议记录,ACL 2019,意大利佛罗伦萨,2019 年 7 月 28 日至 8 月 2 日,第 1 卷:长论文,第 4791 页–4800。计算语言学协会,2019。10.18653 /v1/p19-1472网址https://doi.org/10.18653/v1/p19-1472
  • Zhao et al. (2023) 赵等人。 (2023) Y. Zhao, C. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci. Atom: Low-bit quantization for efficient and accurate LLM serving. CoRR, abs/2310.19102, 2023. URL https://doi.org/10.48550/arXiv.2310.19102.
    Y. 赵、C. Lin、K. Zhu、Z. Ye、L. Chen、S. Cheng、L. Ceze、A. Krishnamurthy、T. Chen 和 B. Kasikci。 Atom:低位量化,实现高效、准确的LLM服务。 CoRR ,abs/2310.19102,2023。URL https://doi.org/10.48550/arXiv.2310.19102
  • Zheng et al. (2019) 郑等人。 (2019) C. Zheng, M. Huang, and A. Sun. Chid: A large-scale chinese idiom dataset for cloze test. In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 778–787. Association for Computational Linguistics, 2019. 10.18653/V1/P19-1075. URL https://doi.org/10.18653/v1/p19-1075.
    C. 郑、M. 黄和 A. 孙。 Chid:用于完形填空测试的大规模汉语成语数据集。 A. Korhonen、DR Traum 和 L. Màrquez,编辑,第 57 届计算语言学协会会议记录,ACL 2019,意大利佛罗伦萨,2019 年 7 月 28 日至 8 月 2 日,第 1 卷:长论文,第 778 页–787。计算语言学协会,2019。10.18653 /V1/P19-1075网址https://doi.org/10.18653/v1/p19-1075
  • Zheng et al. (2023) 郑等人。 (2023) L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
    L.郑,W.-L。蒋,Y.盛,S.庄,Z.吴,Y.庄,Z.Lin,Z.Li,D.Li,EP Xing,H.Zhang,JE Gonzalez,和I. Stoica。使用 mt-bench 和聊天机器人竞技场来评判llm硕士,2023 年。
  • Zhong et al. (2023) 钟等人。 (2023) W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.
    W.Zhong、R.Cui、Y.Guo、Y.Liang、S.Lu、Y.Wang、A.Saied、W.Chen 和 N.Duan。 AGIEval:用于评估基础模型的以人为本的基准。 CoRR ,abs/2304.06364,2023。10.48550 /arXiv.2304.06364网址https://doi.org/10.48550/arXiv.2304.06364
  • Zhou et al. (2024) 周等人。 (2024) C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
    C. Zhou,P. Liu,P. Xu,S. Iyer,J. Sun,Y. Mao,X. Ma,A. Efrat,P. Yu,L. Yu,等。利马:对于协调来说,少即是多。神经信息处理系统的进展,36,2024。
  • Zhou et al. (2023) 周等人。 (2023) J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
    J. Zhou、T. Lu、S. Mishra、S. Brahma、S. Basu、Y. Luan、D. Zhou 和 L. Hou。大型语言模型的指令跟踪评估。 arXiv 预印本 arXiv:2311.07911 , 2023。

Appendix 

Appendix A Contributions and Acknowledgments

Research & Engineering 
Aixin Liu  刘爱新
Bingxuan Wang  王秉轩
Bo Liu  刘波
Chenggang Zhao  赵成刚
Chengqi Deng  邓成琪
Chong Ruan  阮崇
Damai Dai  达麦岱
Daya Guo  郭大亚
Dejian Yang  杨德建
Deli Chen  陈德利
Erhang Li  李二航
Fangyun Lin  林芳云
Fuli Luo  罗芙丽
Guangbo Hao  郝广博
Guanting Chen  陈冠廷
Guowei Li  李国伟
H. Zhang  张宏
Hanwei Xu  徐汉伟
Hao Yang  浩阳
Haowei Zhang  张浩伟
Honghui Ding  丁洪辉
Huajian Xin  辛华建
Huazuo Gao  高华作
Hui Qu  曲慧
Jianzhong Guo  郭建中
Jiashi Li  李家士
Jingyang Yuan  景阳园
Junjie Qiu  邱俊杰
Junxiao Song  宋俊晓
Kai Dong  董凯
Kaige Gao  高凯歌
Kang Guan  康官
Lean Wang  王立安
Lecong Zhang  张乐从
Liang Zhao  梁昭
Liyue Zhang  张丽月
Mingchuan Zhang  张铭传
Minghua Zhang  张明华
Minghui Tang  唐明慧
Panpan Huang  黄盼盼
Peiyi Wang  王佩仪
Qihao Zhu  朱启浩
Qinyu Chen  陈勤宇
Qiushi Du  杜秋实
Ruiqi Ge  葛瑞琪
Ruizhe Pan  潘瑞哲
Runxin Xu  徐润新
Shanghao Lu  上好路
Shangyan Zhou  周尚彦
Shanhuang Chen  陈山皇
Shengfeng Ye  叶胜峰
Shirong Ma  马世荣
Shiyu Wang  王世玉
Shuiping Yu  于水平
Shunfeng Zhou  周顺风
Size Zheng  尺寸郑
Tian Pei  田培
Wangding Zeng  曾旺丁
Wen Liu  刘文
Wenfeng Liang  梁文峰
Wenjun Gao  高文俊
Wentao Zhang  张文涛
Xiao Bi  小毕
Xiaohan Wang  王晓涵
Xiaodong Liu  刘晓东
Xiaokang Chen  陈小康
Xiaotao Nie  聂晓涛
Xin Liu  刘鑫
Xin Xie  谢欣
Xingkai Yu  于兴凯
Xinyu Yang  杨新宇
Xuan Lu  宣路
Xuecheng Su  苏学成
Y. Wu  吴勇
Y.K. Li  李玉康
Y.X. Wei  魏永新
Yanhong Xu  徐艳红
Yao Li  姚莉
Yao Zhao  赵耀
Yaofeng Sun  孙耀峰
Yaohui Wang  王耀辉
Yichao Zhang  张一超
Yiliang Xiong  熊宜良
Yilong Zhao  赵一龙
Ying He  何瑛
Yishi Piao  一世朴
Yixin Dong  董益欣
Yixuan Tan  谭一轩
Yiyuan Liu  刘怡媛
Yongji Wang  王永吉
Yongqiang Guo  郭永强
Yuduan Wang  王玉端
Yuheng Zou  邹宇恒
Yuxiang You  游宇翔
Yuxuan Liu  刘玉轩
Z.Z. Ren  任正中
Zehui Ren  任泽慧
Zhangli Sha  张丽莎
Zhe Fu  付哲
Zhenda Xie  谢振达
Zhewen Hao  郝哲文
Zhihong Shao  邵志宏
Zhuoshu Li  李卓书
Zihan Wang  王子涵
Zihui Gu  顾子慧
Zilin Li  李子霖
Ziwei Xie  谢紫薇

Data Annotation 
Bei Feng  北风
Hui Li  李慧
J.L. Cai  蔡金良
Jiaqi Ni  倪嘉琪
Lei Xu  徐雷
Meng Li  李萌
Ning Tian  宁天
R.J. Chen  陈荣杰
R.L. Jin  金瑞荣
Ruyi Chen  陈如意
S.S. Li  李SS
Shuang Zhou  周双
Tian Yuan  天元
Tianyu Sun  孙天宇
X.Q. Li  李新庆
Xiangyue Jin  金香月
Xiaojin Shen  沉晓金
Xiaosha Chen  陈晓莎
Xiaowen Sun  孙晓文
Xiaoxiang Wang  王潇湘
Xinnan Song  宋新南
Xinyi Zhou  周欣怡
Y.X. Zhu  朱永新
Yanhong Xu  徐艳红
Yanping Huang  黄艳萍
Yaohui Li  李耀辉
Yi Zheng  郑毅
Yuchen Zhu  朱雨辰
Yunxian Ma  马云贤
Zhen Huang  黄震
Zhipeng Xu  徐志鹏
Zhongyu Zhang  张中玉

Business & Compliance 
Bin Wang  王斌
Dongjie Ji  季东杰
Jian Liang  梁建
Jin Chen  陈金
Leyi Xia  夏乐一
Miaojun Wang  王妙君
Mingming Li  李明明
Peng Zhang  张鹏
Shaoqing Wu  吴少卿
Shengfeng Ye  叶胜峰
T. Wang  王涛
W.L. Xiao  肖文良
Wei An  伟安
Xianzu Wang  王显祖
Ying Tang  唐英
Yukun Zha  查玉坤
Yuting Yan  严玉婷
Zhen Zhang  张震
Zhiniu Wen  温志牛

Within each role, authors are listed alphabetically by first name. In particular, Huazuo Gao and Wangding Zeng have made key innovations in the research of the MLA architecture. Furthermore, we’d like to thank Jianlin Su for his helpful discussion on position embedding. We thank all those who have contributed to DeepSeek-V2 but are not mentioned in the paper. DeepSeek believes that innovation, novelty, and curiosity are essential in the path to AGI.

Appendix B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE

B.1 Model Description

Architectures.

DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It also employs MLA and has 16 attention heads, where each head has a dimension of 128. Its KV compression dimension is 512, but slightly different from DeepSeek-V2, it does not compress the queries. For the decoupled queries and key, it has a per-head dimension of 64. DeepSeek-V2-Lite also employs DeepSeekMoE, and all FFNs except for the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 64 routed experts, where the intermediate hidden dimension of each expert is 1408. Among the routed experts, 6 experts will be activated for each token. Under this configuration, DeepSeek-V2-Lite comprises 15.7B total parameters, of which 2.4B are activated for each token.
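For readers who prefer to see these hyper-parameters in one place, the following is a minimal, illustrative Python sketch that simply collects the values stated above; the class name and field names are our own and do not correspond to the released code.

from dataclasses import dataclass

@dataclass
class DeepSeekV2LiteConfig:
    """Hypothetical container for the DeepSeek-V2-Lite hyper-parameters listed above."""
    num_layers: int = 27
    hidden_size: int = 2048
    num_attention_heads: int = 16
    head_dim: int = 128               # per-head dimension of the content queries/keys/values
    kv_compression_dim: int = 512     # dimension of the compressed KV latent
    rope_head_dim: int = 64           # per-head dimension of the decoupled (RoPE) queries and key
    query_compression: bool = False   # unlike DeepSeek-V2, queries are not compressed
    num_dense_ffn_layers: int = 1     # all FFNs except the first layer are MoE layers
    num_shared_experts: int = 2
    num_routed_experts: int = 64
    num_activated_routed_experts: int = 6
    expert_intermediate_size: int = 1408
    total_params: str = "15.7B"
    activated_params_per_token: str = "2.4B"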

Benchmark DeepSeek 7B DeepSeekMoE 16B DeepSeek-V2-Lite
Architecture MHA+Dense MHA+MoE MLA+MoE
Context Length 4K 4K 32K
# Activated Params 6.9B 2.8B 2.4B
# Total Params 6.9B 16.4B 15.7B
# Training Tokens 2T 2T 5.7T
English MMLU 48.2 45.0 58.3
 BBH 39.5 38.9 44.1
 TriviaQA 59.7 64.8 64.2
 NaturalQuestions 22.2 25.5 26.0
 ARC-Easy 67.9 68.1 70.9
 ARC-Challenge 48.1 49.8 51.2
 AGIEval 26.4 17.4 33.2
Code HumanEval 26.2 26.8 29.9
 MBPP 39.0 39.2 43.2
Math GSM8K 17.4 18.8 41.1
 MATH 3.3 4.3 17.1
 CMath 34.5 40.4 58.4
Chinese CLUEWSC 73.1 72.1 74.3
 C-Eval 45.0 40.6 60.3
 CMMLU 47.2 42.5 64.3
Table 6: Performance of DeepSeek-V2-Lite, DeepSeekMoE 16B, and DeepSeek 7B.
Training Details.

DeepSeek-V2-Lite is also trained from scratch on the same pre-training corpus of DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer with hyper-parameters set to $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\mathrm{weight\_decay} = 0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy. Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training about 80% of tokens, and again by 0.316 after training about 90% of tokens. The maximum learning rate is set to $4.2 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We do not employ the batch size scheduling strategy for it, and it is trained with a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence length to 4K, and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to deploy different layers of it on different devices, but for each layer, all experts will be deployed on the same device. Therefore, we only employ a small expert-level balance loss with $\alpha_1 = 0.001$, and do not employ device-level balance loss and communication balance loss for it. After pre-training, we also perform long context extension and SFT for DeepSeek-V2-Lite and get a chat model called DeepSeek-V2-Lite Chat.
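As an illustration of the warmup-and-step-decay schedule described above, the sketch below returns the learning rate at a given step; the function name and the step-based bookkeeping are assumptions (the paper specifies the decay points in terms of trained tokens, which coincide with step fractions only under a constant batch size and sequence length).

def lr_at_step(step: int, total_steps: int,
               max_lr: float = 4.2e-4,
               warmup_steps: int = 2000,
               decay_factor: float = 0.316,
               decay_points: tuple = (0.80, 0.90)) -> float:
    """Linear warmup from 0 to max_lr over the first warmup_steps steps, then
    multiply by decay_factor after ~80% and again after ~90% of training."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    lr = max_lr
    for frac in decay_points:
        if step >= frac * total_steps:
            lr *= decay_factor
    return lr

# Example: near the end of training the rate is max_lr * 0.316 * 0.316 ≈ 0.1 * max_lr.
final_lr = lr_at_step(step=99_000, total_steps=100_000)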

Benchmark DeepSeek DeepSeekMoE DeepSeek-V2-Lite
  7B Chat 16B Chat Chat
Architecture MHA+Dense MHA+MoE MLA+MoE
Context Length 4K 4K 32K
# Activated Params 6.9B 2.8B 2.4B
# Total Params 6.9B 16.4B 15.7B
# Training Tokens 2T 2T 5.7T
English MMLU 49.7 47.2 55.7
 BBH 43.1 42.2 48.1
 TriviaQA 59.5 63.3 65.2
 NaturalQuestions 32.7 35.1 35.5
 ARC-Easy 70.2 69.9 74.3
 ARC-Challenge 50.2 50.0 51.5
 AGIEval 17.6 19.7 42.8
Code HumanEval 45.1 45.7 57.3
 MBPP 39.0 46.2 45.8
Math GSM8K 62.6 62.2 72.0
 MATH 14.7 15.2 27.9
 CMath 66.4 67.9 71.7
Chinese CLUEWSC 66.2 68.2 80.0
 C-Eval 44.7 40.0 60.1
 CMMLU 51.2 49.3 62.5
Table 7: Performance of DeepSeek-V2-Lite Chat, DeepSeekMoE 16B Chat, and DeepSeek 7B Chat.

B.2 Performance Evaluation

Base Model.

We evaluate the performance of DeepSeek-V2-Lite and compare it with our previous small-size base models in Table 6. DeepSeek-V2-Lite exhibits overwhelming performance advantages, especially in reasoning, coding, and math.

Chat Model.

We evaluate the performance of DeepSeek-V2-Lite Chat and compare it with our previous small-size chat models in Table 7. DeepSeek-V2-Lite Chat also outperforms our previous small-size chat models by a large margin.

Appendix C Full Formulas of MLA

In order to demonstrate the complete computation process of MLA, we provide its full formulas in the following:

\begin{align*}
\mathbf{c}_{t}^{Q} &= W^{DQ} \mathbf{h}_{t}, \tag{37} \\
[\mathbf{q}_{t,1}^{C}; \mathbf{q}_{t,2}^{C}; \dots; \mathbf{q}_{t,n_{h}}^{C}] = \mathbf{q}_{t}^{C} &= W^{UQ} \mathbf{c}_{t}^{Q}, \tag{38} \\
[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \dots; \mathbf{q}_{t,n_{h}}^{R}] = \mathbf{q}_{t}^{R} &= \operatorname{RoPE}(W^{QR} \mathbf{c}_{t}^{Q}), \tag{39} \\
\mathbf{q}_{t,i} &= [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}], \tag{40} \\
\boxed{\mathbf{c}_{t}^{KV}} &= W^{DKV} \mathbf{h}_{t}, \tag{41} \\
[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \dots; \mathbf{k}_{t,n_{h}}^{C}] = \mathbf{k}_{t}^{C} &= W^{UK} \mathbf{c}_{t}^{KV}, \tag{42} \\
\boxed{\mathbf{k}_{t}^{R}} &= \operatorname{RoPE}(W^{KR} \mathbf{h}_{t}), \tag{43} \\
\mathbf{k}_{t,i} &= [\mathbf{k}_{t,i}^{C}; \mathbf{k}_{t}^{R}], \tag{44} \\
[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \dots; \mathbf{v}_{t,n_{h}}^{C}] = \mathbf{v}_{t}^{C} &= W^{UV} \mathbf{c}_{t}^{KV}, \tag{45} \\
\mathbf{o}_{t,i} &= \sum_{j=1}^{t} \operatorname{Softmax}_{j}\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_{h} + d_{h}^{R}}}\right) \mathbf{v}_{j,i}^{C}, \tag{46} \\
\mathbf{u}_{t} &= W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_{h}}], \tag{47}
\end{align*}

where the boxed vectors need to be cached for generation. During inference, the naive formula needs to recover $\mathbf{k}_t^{C}$ and $\mathbf{v}_t^{C}$ from $\mathbf{c}_t^{KV}$ for attention. Fortunately, due to the associative law of matrix multiplication, we can absorb $W^{UK}$ into $W^{UQ}$, and $W^{UV}$ into $W^{O}$. Therefore, we do not need to compute keys and values out for each query. Through this optimization, we avoid the computational overhead for recomputing $\mathbf{k}_t^{C}$ and $\mathbf{v}_t^{C}$ during inference.
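To make the absorption concrete, the following minimal numerical sketch (our own illustration rather than the released implementation; it assumes a single head, omits the decoupled RoPE path, and uses hypothetical dimensions) checks that folding $W^{UK}$ into the query side and $W^{UV}$ into the output side gives the same result as first recovering the keys and values from the cached latent:

import numpy as np

# Minimal sketch (not the released implementation): absorbing W^{UK} into W^{UQ}
# and W^{UV} into W^{O} via the associative law. One head, no RoPE part, and
# hypothetical dimensions are assumed.
rng = np.random.default_rng(0)
d_c, d_h, d_model, T = 64, 32, 128, 5        # latent dim, head dim, model dim, context length

W_UQ = rng.standard_normal((d_h, d_c))       # up-projection for queries
W_UK = rng.standard_normal((d_h, d_c))       # up-projection for keys
W_UV = rng.standard_normal((d_h, d_c))       # up-projection for values
W_O = rng.standard_normal((d_model, d_h))    # output projection

c_q = rng.standard_normal(d_c)               # compressed query latent of the current token
c_kv = rng.standard_normal((T, d_c))         # cached compressed KV latents

# Naive path: explicitly recover per-token keys and values from the latents.
q = W_UQ @ c_q
k = c_kv @ W_UK.T
v = c_kv @ W_UV.T
scores = k @ q / np.sqrt(d_h)
attn = np.exp(scores - scores.max()); attn /= attn.sum()
u_naive = W_O @ (attn @ v)

# Absorbed path: attention is computed directly against the cached latents,
# so the keys and values are never materialized.
q_abs = (W_UK.T @ W_UQ) @ c_q                # W^{UK} absorbed into the query side
scores_abs = c_kv @ q_abs / np.sqrt(d_h)
attn_abs = np.exp(scores_abs - scores_abs.max()); attn_abs /= attn_abs.sum()
u_abs = (W_O @ W_UV) @ (attn_abs @ c_kv)     # W^{UV} absorbed into the output side

assert np.allclose(u_naive, u_abs)

Because attention is evaluated directly against the cached latents, only $\mathbf{c}_{t}^{KV}$ (together with the decoupled RoPE key) ever needs to be stored.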

Appendix D Ablation of Attention Mechanisms

D.1 Ablation of MHA, GQA, and MQA

We show the evaluation results for 7B dense models with MHA, GQA, and MQA on four hard benchmarks in Table 8. All three models are trained on 1.33T tokens and share the same architecture except for the attention mechanisms. In addition, for a fair comparison, we align their parameter counts to around 7B by adjusting the number of layers. From the table, we find that MHA demonstrates significant advantages over GQA and MQA on these benchmarks.

Benchmark (Metric)   # Shots   Dense 7B w/ MQA   Dense 7B w/ GQA (8 Groups)   Dense 7B w/ MHA
# Params             -         7.1B              6.9B                         6.9B
BBH (EM)             3-shot    33.2              35.6                         37.0
MMLU (Acc.)          5-shot    37.9              41.2                         45.2
C-Eval (Acc.)        5-shot    30.0              37.7                         42.9
CMMLU (Acc.)         5-shot    34.6              38.4                         43.5
Table 8: Comparison among 7B dense models with MHA, GQA, and MQA, respectively. MHA demonstrates significant advantages over GQA and MQA on hard benchmarks.

D.2 Comparison Between MLA and MHA
D.2 MLA 和 MHA 之间的比较

In Table 9, we show the evaluation results on four hard benchmarks for MoE models equipped with MLA and MHA, respectively. To draw a solid conclusion, we train and evaluate models at two scales. The two small MoE models comprise about 16B total parameters each, and we train them on 1.33T tokens. The two large MoE models comprise about 250B total parameters each, and we train them on 420B tokens. Apart from the attention mechanisms, the two small MoE models share the same architecture, as do the two large MoE models. From the table, we can observe that MLA shows better performance than MHA. More importantly, MLA requires a significantly smaller KV cache (14% for the small MoE models and 4% for the large MoE models) than MHA.

Benchmark (Metric)               # Shots   Small MoE w/ MHA   Small MoE w/ MLA   Large MoE w/ MHA   Large MoE w/ MLA
# Activated Params               -         2.5B               2.4B               25.0B              21.5B
# Total Params                   -         15.8B              15.7B              250.8B             247.4B
KV Cache per Token (# Element)   -         110.6K             15.6K              860.2K             34.6K
BBH (EM)                         3-shot    37.9               39.0               46.6               50.7
MMLU (Acc.)                      5-shot    48.7               50.0               57.5               59.0
C-Eval (Acc.)                    5-shot    51.6               50.9               57.9               59.2
CMMLU (Acc.)                     5-shot    52.3               53.4               60.7               62.5
Table 9: Comparison between MLA and MHA on hard benchmarks. MLA shows better performance than MHA, while requiring a significantly smaller amount of KV cache.
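For intuition about the KV cache row in Table 9, the following back-of-the-envelope sketch (ours, with hypothetical layer, head, and latent dimensions rather than the actual configurations behind the table) counts the elements cached per token: MHA stores the keys and values of every head in every layer, whereas MLA stores only the compressed latent plus the shared decoupled RoPE key per layer.

# Rough per-token KV cache element counts (hypothetical dimensions,
# not the actual model configurations behind Table 9).
def mha_kv_elements(num_layers, num_heads, head_dim):
    # keys and values for every head in every layer
    return num_layers * 2 * num_heads * head_dim

def mla_kv_elements(num_layers, latent_dim, rope_head_dim):
    # compressed latent c^KV plus the shared decoupled RoPE key per layer
    return num_layers * (latent_dim + rope_head_dim)

num_layers, num_heads, head_dim = 27, 16, 128            # hypothetical values
latent_dim, rope_head_dim = 4 * head_dim, head_dim // 2  # assumed ratios
print("MHA elements per token:", mha_kv_elements(num_layers, num_heads, head_dim))
print("MLA elements per token:", mla_kv_elements(num_layers, latent_dim, rope_head_dim))

With these illustrative settings, MLA needs only about one-seventh of the MHA cache per token, in line with the gaps reported above.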

Appendix E Discussion About Pre-Training Data Debiasing

During pre-training data preparation, we identify and filter out contentious content, such as values influenced by regional cultures, to avoid our model exhibiting unnecessary subjective biases on these controversial topics. Consequently, we observe that DeepSeek-V2 performs slightly worse on test sets that are closely associated with specific regional cultures. For example, when evaluated on MMLU, although DeepSeek-V2 achieves comparable or superior performance on the majority of test sets compared with competitors such as Mixtral 8x22B, it still lags behind on the Humanity-Moral subset, which is mainly associated with American values.

Further, we conduct a manual analysis of this subset. Three well-educated human annotators independently annotate 420 moral scenarios from the MMLU Humanity-Moral subset. Then, we compute the pairwise agreement among their annotations and the ground-truth labels. As shown in Table 10, the three human annotators and the ground-truth labels exhibit low agreement with each other. Therefore, we attribute the abnormal performance of DeepSeek-V2 on these value-sensitive test sets to our efforts to debias the pre-training corpus.

Agreement            Ground-Truth Label   Annotator 1   Annotator 2   Annotator 3
Ground-Truth Label   100.0%               66.7%         59.8%         42.1%
Annotator 1          66.7%                100.0%        57.9%         69.0%
Annotator 2          59.8%                57.9%         100.0%        65.5%
Annotator 3          42.1%                69.0%         65.5%         100.0%
Table 10: Three well-educated human annotators conduct independent annotations on 420 moral scenarios from the MMLU Humanity-Moral subset, on which DeepSeek-V2 and its competitors demonstrate inconsistent performance. The three annotators and the ground-truth label exhibit low agreement with each other, indicating that the answers to the Humanity-Moral subset can be contentious depending on specific regional cultures.
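The agreement figures in Table 10 read as pairwise match rates over the 420 scenarios; a minimal sketch of such a computation (with fabricated toy labels standing in for the actual annotations) is:

from itertools import combinations

# Pairwise agreement as the fraction of scenarios on which two label sequences match.
# The labels below are fabricated placeholders, not the real annotations.
def agreement(a, b):
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

labels = {
    "ground_truth": ["A", "B", "C", "D", "A"],
    "annotator_1": ["A", "B", "D", "D", "A"],
    "annotator_2": ["B", "B", "C", "A", "A"],
    "annotator_3": ["A", "C", "C", "D", "B"],
}
for (name_x, x), (name_y, y) in combinations(labels.items(), 2):
    print(f"{name_x} vs {name_y}: {agreement(x, y):.1%}")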

Appendix F Additional Evaluations on Math and Code

The evaluation employs the SC-Math6 corpus, which consists of thousands of Chinese math problems. As shown in Table 11, DeepSeek-V2 Chat (RL) outperforms all Chinese LLMs, including both open-source and closed-source models.

Model Name              R Level   Comp. Score   Reas. Steps Score   OvrAcc Score
GPT-4-1106-Preview      5         90.71         91.65               89.77
GPT-4                   5         88.40         89.10               87.71
DeepSeek-V2 Chat (RL)   5         83.35         85.73               84.54
Ernie-bot 4.0           5         85.60         86.82               84.38
Qwen-110B-Chat          5         83.25         84.93               84.09
GLM-4                   5         84.24         85.72               82.77
Xinghuo 3.5             5         83.73         85.37               82.09
Qwen-72B-Chat           4         78.42         80.07               79.25
ChatGLM-Turbo           4         57.70         60.32               55.09
GPT-3.5-Turbo           4         57.05         59.61               54.50
Qwen-14B-Chat           4         53.12         55.99               50.26
ChatGLM3-6B             3         40.90         44.20               37.60
Xinghuo 3.0             3         40.08         45.27               34.89
Baichuan2-13B-Chat      3         39.40         42.63               36.18
Ernie-3.5-turbo         2         25.19         27.70               22.67
Chinese-Alpaca2-13B     2         20.55         22.52               18.58
Table 11: SC-Math6 Model Reasoning Level. “R Level” stands for Reasoning Level, “Comp. Score” stands for Comprehensive Score, “Reas. Steps Score” stands for Reasoning Steps Score, and “OvrAcc Score” stands for Overall Accuracy Score.

We share additional results on HumanEval and LiveCodeBench in Figure 5, where the LiveCodeBench questions are selected from the period between September 1st, 2023 and April 1st, 2024. As shown in the figure, DeepSeek-V2 Chat (RL) demonstrates considerable proficiency on LiveCodeBench, achieving a Pass@1 score that even surpasses some much larger models. This performance highlights the strong capability of DeepSeek-V2 Chat (RL) in tackling live coding tasks.
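For reference, Pass@1 follows the standard pass@k convention with k = 1; a minimal sketch of the commonly used unbiased estimator (not necessarily the exact evaluation script behind Figure 5) is:

import math

# Unbiased pass@k: the probability that at least one of k samples is correct,
# given n generated samples per problem of which c pass the unit tests.
def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 3 of 20 generated samples pass, so pass@1 = 3/20 = 0.15.
print(pass_at_k(n=20, c=3, k=1))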

Figure 5: Evaluation results on HumanEval and LiveCodeBench. The questions of LiveCodeBench are selected from the period between September 1st, 2023 and April 1st, 2024.

Appendix G Evaluation Formats

We present our evaluation format for each benchmark in Tables 12-37.

PROMPT
以下是一道中国高考生物选择题,请选择正确的答案。
问题:下列有关高尔基体、线粒体和叶绿体的叙述, 正确的是 选项:(A)三者都存在于蓝藻中 (B)三者都含有 DNA (C)三者都是 ATP 合成的场所 (D)三者的膜结构中都含有蛋白质
答案:从A到D, 我们应选择
Table 12: An example of AGIEval.
PROMPT
Question: A sample in a cylindrical container has a cylindrical shape and a fixed volume. The state of matter of the sample _
A. must be solid
B. could be either solid or liquid
C. must be liquid
D. could be either liquid or gas
Answer: B
Question: The speed of sound is generally greatest in _
A. solids and lowest in liquids
B. solids and lowest in gases
C. gases and lowest in liquids
D. gases and lowest in solids
Answer: B
Question: When oil and water are mixed together, they form a _
A. gas
B. solid
C. compound
D. suspension
Answer: D
Question: A container of liquid water was placed outside during the day when the temperature was 3°C. At night the outside temperature dropped to -2°C. This temperature change most likely caused the water to _
A. condense
B. evaporate
C. remain a liquid
D. become a solid
Answer:
Table 13: An example of ARC.
PROMPT
Evaluate the result of a random Boolean expression.
Q: not ( ( not not True ) ) is
A: Let’s think step by step.
Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is "not", "and", "or", respectively. We first simplify this expression "Z" as follows: "Z = not ( ( not not True ) ) = not ( ( A ) )" where "A = not not True". Let’s evaluate A: A = not not True = not (not True) = not False = True. Plugging in A, we get: Z = not ( ( A ) ) = not ( ( True ) ) = not True = False. So the answer is False.
Q: True and False and not True and True is
A: Let’s think step by step.
Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is "not", "and", "or", respectively. We first simplify this expression "Z" as follows: "Z = True and False and not True and True = A and B" where "A = True and False" and "B = not True and True". Let’s evaluate A: A = True and False = False. Let’s evaluate B: B = not True and True = not (True and True) = not (True) = False. Plugging in A and B, we get: Z = A and B = False and False = False. So the answer is False.
Q: not not ( not ( False ) ) is
A: Let’s think step by step.
Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is "not", "and", "or", respectively. We first simplify this expression "Z" as follows: "Z = not not ( not ( False ) ) = not not ( A )" where "A = not ( False )". Let’s evaluate A: A = not ( False ) = not False = True. Plugging in A, we get: Z = not not ( A ) = not not (True) = not not False = True. So the answer is True.
Q: False and False and False or not False is
A: Let’s think step by step.
Table 14: An example of BBH.
PROMPT
以下是中国关于教育学考试的单项选择题,请选出其中的正确答案。
根据我国心理学家冯忠良教授的学习分类,培养学生品德要通过____。
A. 知识的学习
B. 技能的学习
C. 行为规范的学习
D. 态度的学习
答案: C
开设跨学科课程或建立跨学科专业体现了高等教育课程发展的____。
A. 综合化趋势
B. 多样化趋势
C. 人文化趋势
D. 科学化趋势
答案: A
心智技能的特点有____。
A. 物质性、外显性、简缩性
B. 观念性、内潜性、简缩性
C. 物质性、外显性、展开性
D. 观念性、内潜性、展开性
答案: B
下列关于大学生的情绪与理智关系的说法中正确的是____。
A. 能冷静控制自己情绪
B. 感情用事,难以用理智控制情绪
C. 遇事能坚持自己正确认识
D. 已发展到不为小事而发怒和怄气
答案: B
在学完一篇逻辑结构严密的课文以后,勾画出课文的论点论据的逻辑关系图以帮助理解和记忆。这种学习方法属于____。
A. 精细加工策略
B. 组织策略
C. 复述策略
D. 做笔记策略
答案: B
有学者强调,教育要根据一个民族固有的特征来定,这种观点体现了____
A. 生产力对教育的影响和制约
B. 政治制度对教育的影响和制约
C. 文化对教育的影响和制约
D. 经济制度对教育的影响和制约
答案:
OPTIONS
- A
- B
- C
- D
Table 15: An example of C-Eval.
PROMPT
女:这些药怎么吃?
男:一天三次,一次两片。
请根据上文回答问题:
他们在哪儿?
答案:
OPTIONS
- 商店
- 饭店
- 医院
- 教室
Table 16: An example of C3.
PROMPT
以下是将某句古诗文翻译而成的现代表述:春天已至,万物复苏,春风如一位美丽而又心灵手巧的姑娘,迈着纤纤细步款款而来,她挥舞剪刀,尽情地展示那高超的女工技巧,她先裁出了柳叶,随着柳条袅袅依依地舞蹈,又裁出杏叶,桃叶。
该翻译所对应的古诗文是:
OPTIONS
- 春风骋巧如翦刀
- 剪裁无巧似春风
- 风吹怨恨快如刀
- 春风欲擅秋风巧
Table 17: An example of CCPM.
PROMPT
Q: 某小学在“献爱心–为汶川地震区捐款”活动中,六年级五个班共捐款8000元,其中一班捐款1500元,二班比一班多捐款200元,三班捐款1600元,四班与五班捐款数之比是3:5.四班捐款多少元?
A: 一班捐款1500元,而二班比一班多捐200元,所以二班捐款1500+200=1700元,又知道六年级五个班一共捐款8000元,所以四班和五班捐款之和 = 一共捐款 - 一班和二班和三班捐款之和,即8000-1500-1700-1600=3200元,而题目说四班与五班捐款数之比是3:5,则四班捐款了3200/(3+5)*3=1200元。所以答案是:1200。
Q: 小俊在东西大道上跑步,若规定向东为正。他先向东跑了800米,然后又跑了一段之后,他位于出发点西边100米处,小俊第二段跑了多少米?
A: 小俊第二段跑完后位于出发点西边,所以第二段应该是向西跑,第二段跑的长度-第一段跑的长度=100,第二段跑了100+800=900米。所以答案是:900。
Q: A车和B车同时从甲、乙两地相向开出,经过5小时相遇.然后,它们又各自按原速原方向继续行驶3小时,这时A车离乙地还有135千米,B车离甲地还有165千米.甲、乙两地相距多少千米?
A: 假设A车的速度为x千米每小时,B车的速度为y千米每小时,根据而A、B相遇时A车行驶了5小时,A车行驶3小时后离乙地还有135千米,B车行驶3小时后距离甲地还有165千米,可以得到甲乙两地相距=5x+5y=135+8x=165+8y,变换得到:10(x+y)=300+8(x+y),于是x+y=150,甲乙两地相距5(x+y)=750千米。所以答案是:750。
Q: 在一个底面半径为10厘米的圆柱形容器内,倒入10厘米深的水,然后将一个底面直径4厘米,高6厘米的圆锥形铅锤放入水中,容器中水面上升多少厘米?
A:
Table 18: An example of CMATH.
PROMPT
以下是关于解剖学的单项选择题,请直接给出正确答案的选项。
题目:壁胸膜的分部不包括
A. 肋胸膜
B. 肺胸膜
C. 膈胸膜
D. 胸膜顶
答案是: B
题目:属于蝶骨上的结构为
A. 垂体窝
B. 棘孔
C. 破裂孔
D. 视神经管
答案是: B
题目:属于右心房的结构是
A. 肉柱
B. 室上嵴
C. 乳头肌
D. 梳状肌
答案是: D
题目:咽的分部
A. 咽隐窝
B. 口咽部
C. 鼻咽部
D. 喉咽部
答案是: C
题目:舌下神经核位于
A. 间脑
B. 延髓
C. 中脑
D. 脑挢
答案是: B
题目:从脑干背侧出脑的脑神经是
A. 副神经
B. 三叉神经
C. 舌下神经
D. 滑车神经
答案是:
OPTIONS
- A
- B
- C
- D
Table 19: An example of CMMLU.
PROMPT
文章:英雄广场(Heldenplatz)是奥地利首都维也纳的一个广场。在此曾发生许多重要事件 — 最著名的是1938年希特勒在此宣告德奥合并。英雄广场是霍夫堡皇宫的外部广场,兴建于皇帝弗朗茨·约瑟夫一世统治时期,是没有完全建成的所谓“帝国广场”(Kaiserforum)的一部分。 其东北部是霍夫堡皇宫的 Leopoldinian Tract,东南方是新霍夫堡,西南方的内环路,将其与“城门外”(Äußeres Burgtor)隔开。西北部没有任何建筑物,可以很好地眺望内环路、国会大厦、市政厅,以及城堡剧院。广场上有2尊军事领袖的骑马像:欧根亲王和卡尔大公。
根据上文回答下面的问题。
问题:英雄广场是哪个皇宫的外部广场?
答案:霍夫堡皇宫
问题:广场上有哪两位军事领袖的骑马像?
答案:
Table 20: An example of CMRC2018.
PROMPT
Passage: The median age in the city was 22.1 years. 10.1% of residents were under the age of 18; 56.2% were between the ages of 18 and 24; 16.1% were from 25 to 44; 10.5% were from 45 to 64; and 7% were 65 years of age or older. The gender makeup of the city was 64.3% male and 35.7% female.
Answer the following questions based on the above passage, please calculate carefully if calculation is necessary.
Q: How many percent were not from 25 to 44?
A: The answer type is number. So according to above Passage, the answer is 83.9.
Q: How many in percent weren’t 25 to 44?
A: The answer type is number. So according to above Passage, the answer is
Table 21: An example of DROP.
PROMPT
中新网12月7日电 综合外媒6日报道,在美国得克萨斯州,负责治疗新冠肺炎患者的医生约瑟夫·瓦隆(Joseph Varon)已连续上班超260天,每天只睡不超过2小时。瓦隆日前接受采访时呼吁,美国民众应遵从防疫规定,一线的医护人员“已
OPTIONS
- 神清气爽”。
- 诡计多端”。
- 精疲力竭”。
- 分工合作”。
- 寅吃卯粮”。
- 土豪劣绅”。
- 芸芸众生”。
Table 22: An example of CHID.
PROMPT
胡雪岩离船登岸,坐轿进城,等王有龄到家,他接着也到了他那里,脸上是掩抑不住的笑容,王有龄夫妇都觉得奇怪,问他什么事这么高兴。
上面的句子中的"他"指的是
胡雪岩
渐渐地,汤中凝结出一团团块状物,将它们捞起放进盆里冷却,肥皂便出现在世上了。
上面的句子中的"它们"指的是
块状物
“她序上明明引着JulesTellier的比喻,说有个生脱发病的人去理发,那剃头的对他说不用剪发,等不了几天,头毛压儿全掉光了;大部分现代文学也同样的不值批评。这比喻还算俏皮。”
上面的句子中的"他"指的是
生脱发病的人
在洛伦佐大街的尽头处,矗立着著名的圣三一大教堂。它有着巨大的穹顶,还有明亮的彩色玻璃窗,上面描绘着《旧约》和《新约》的场景。
上面的句子中的"它"指的是
圣三一大教堂
他伯父还有许多女弟子,大半是富商财主的外室;这些财翁白天忙着赚钱,怕小公馆里的情妇长日无聊,要不安分,常常叫她们学点玩艺儿消遣。
上面的句子中的"她们"指的是
情妇
赵雨又拿出了一个杯子,我们热情地请老王入座,我边给他倒酒边问:1962年的哪次记得吗?“
上面的句子中的"他"指的是
Table 23: An example of CLUEWSC.
PROMPT
Q: Max can mow the lawn in 40 minutes. If it takes him twice that long to fertilize the lawn, how long will it take him to both mow and fertilize the lawn?
A: Let’s think step by step. It takes Max 2 * 40 minutes = 80 minutes to fertilize the lawn. In total, Max takes 80 minutes + 40 minutes = 120 minutes to both mow and fertilize the lawn. The answer is 120.
Q: The bagels cost $2.25 each, or a dozen for $24. How much is saved, per bagel, in cents, by buying a dozen at a time?
A: Let’s think step by step. They cost 2.25*100=225 cents each. At the bulk rate, they are 24/12=2 dollar each. They cost 2*100=200 cents each. 225-200=25 cents are saved per bagel. The answer is 25.
Q: Tim is 5 years old. His cousin, Rommel, is thrice as old as he is. His other cousin, Jenny, is 2 years older than Rommel. How many years younger is Tim than Jenny?
A: Let’s think step by step. Rommel is 5 x 3 = 15 years old. Jenny is 15 + 2 = 17 years old. So, Tim is 17 - 5 = 12 years younger than Jenny. The answer is 12.
Q: The school has 14 boys and 10 girls. If 4 boys and 3 girls drop out, how many boys and girls are left?
A: Let’s think step by step. There are 14 boys - 4 boys = 10 boys left. There are 10 girls - 3 girls = 7 girls left. In total there are 10 boys + 7 girls = 17 boys and girls left. The answer is 17.
Q: Building one birdhouse requires 7 planks and 20 nails. If 1 nail costs 0.05, and one plank costs 3, what is the cost, in dollars, to build 4 birdhouses?
A: Let’s think step by step. The cost of the planks for one birdhouse is 7 * 3 = 21. And the nails are a cost of 20 * 0.05 = 1 for each birdhouse. So to build one birdhouse one will need 21 + 1 = 22. So the cost of building 4 birdhouses is at 4 * 22 = 88. The answer is 88.
Q: Danny brings 3 watermelons to his family picnic. He cuts each watermelon into 10 slices. His sister brings 1 watermelon to the family picnic, and she cuts the watermelon into 15 slices. How many watermelon slices are there in total at the picnic?
A: Let’s think step by step. From Danny, there are 3 * 10 = 30 watermelon slices. From his sister, there are 1 * 15 = 15 watermelon slices. There are a total of 30 + 15 = 45 watermelon slices. The answer is 45.
Q: Angela is a bike messenger in New York. She needs to deliver 8 times as many packages as meals. If she needs to deliver 27 meals and packages combined, how many meals does she deliver?
A: Let’s think step by step. Let p be the number of packages Angela delivers and m be the number of meals. We know that p + m = 27 and p = 8m. Substituting the second equation into the first equation, we get 8m + m = 27. Combining like terms, we get 9m = 27. Dividing both sides by 9, we get m = 3. The answer is 3.
Q: Cori is 3 years old today. In 5 years, she will be one-third the age of her aunt. How old is her aunt today?
A: Let’s think step by step. In 5 years, Cori will be 3 + 5 = 8 years old. In 5 years, Cori’s aunt will be 8 x 3 = 24 years old. Today, her aunt is 24 - 5 = 19 years old. The answer is 19.
Q: Indras has 6 letters in her name. Her sister’s name has 4 more letters than half of the letters in Indras’ name. How many letters are in Indras and her sister’s names?
A: Let’s think step by step.
Table 24: An example of GSM8K.
PROMPT
Playing piano: A man is seated at a piano. He
OPTIONS
- is playing the piano with his hands and his face.
- bigins to play a song by timbaland on the piano.
- plays slowly, and pauses to snap his fingers.
- is playing a song in front of him.
Table 25: An example of HellaSwag.
PROMPT
def starts_one_ends(n):
    """
    Given a positive integer n, return the count of the numbers of n-digit
    positive integers that start or end with 1.
    """
Table 26: An example of HumanEval.
PROMPT
Problem:
Find the domain of the expression $\frac{\sqrt{x-2}}{\sqrt{5-x}}$.}
Solution:
The expressions inside each square root must be non-negative.
Therefore, $x-2 \ge 0$, so $x\ge2$, and $5 - x \ge 0$, so $x \le 5$.
Also, the denominator cannot be equal to zero, so $5-x>0$, which gives $x<5$.
Therefore, the domain of the expression is $\boxed{[2,5)}$.
Final Answer: The final answer is $[2,5)$. I hope it is correct.
Problem:
If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det (\mathbf{A} \mathbf{B}).$
Solution:
We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) = \boxed{24}.$
Final Answer: The final answer is $24$. I hope it is correct.
Problem:
Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights instead, how many times must Terrell lift them in order to lift the same total weight?
Solution:
If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot 12\cdot20=480$ pounds of weight. If he lifts two 15-pound weights instead for $n$ times, he will lift a total of $2\cdot15\cdot n=30n$ pounds of weight. Equating this to 480 pounds, we can solve for $n$: \begin{align*}
30n&=480\\
\Rightarrow\qquad n&=480/30=\boxed{16}
\end{align*}
Final Answer: The final answer is $16$. I hope it is correct.
Problem:
If the system of equations
\begin{align*}
6x-4y&=a,\\
6y-9x &=b.
\end{align*}has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find $\frac{a}{b},$ assuming $b$ is nonzero.
Solution:
If we multiply the first equation by $-\frac{3}{2}$, we obtain
$$6y-9x=-\frac{3}{2}a.$$Since we also know that $6y-9x=b$, we have
$$-\frac{3}{2}a=b\Rightarrow\frac{a}{b}=\boxed{-\frac{2}{3}}.$$
Final Answer: The final answer is $-\frac{2}{3}$. I hope it is correct.
Problem: Evaluate $\log_21$.
Solution:
Table 27: An example of MATH.
PROMPT
You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:
assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)
assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4)
assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14)
[BEGIN]
def similar_elements(test_tup1, test_tup2):
    res = tuple(set(test_tup1) & set(test_tup2))
    return (res)
[DONE]
You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:
assert is_not_prime(2) == False
assert is_not_prime(10) == True
assert is_not_prime(35) == True
[BEGIN]
import math
def is_not_prime(n):
    result = False
    for i in range(2,int(math.sqrt(n)) + 1):
        if n % i == 0:
            result = True
    return result
[DONE]
You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:
assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65]
assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75]
assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35]
[BEGIN]
import heapq as hq
def heap_queue_largest(nums,n):
    largest_nums = hq.nlargest(n, nums)
    return largest_nums
[DONE]
You are an expert Python programmer, and here is your task: Write a function to return the sum of all divisors of a number. Your code should pass these tests:
assert sum_div(8)==7
assert sum_div(12)==16
assert sum_div(7)==1
[BEGIN]
Table 28: An example of MBPP.
PROMPT
The following are multiple choice questions (with answers) about miscellaneous.
How many axles does a standard automobile have?
A. one
B. two
C. four
D. eight
Answer: B
What place is named in the title of the 1979 live album by rock legends Cheap Trick?
A. Budapest
B. Budokan
C. Bhutan
D. Britain
Answer: B
Who is the shortest man to ever win an NBA slam dunk competition?
A. Anthony ’Spud’ Webb
B. Michael ’Air’ Jordan
C. Tyrone ’Muggsy’ Bogues
D. Julius ’Dr J’ Erving
Answer: A
What is produced during photosynthesis?
A. hydrogen
B. nylon
C. oxygen
D. light
Answer: C
Which of these songs was a Top 10 hit for the rock band The Police?
A. ’Radio Ga-Ga’
B. ’Ob-la-di Ob-la-da’
C. ’De Do Do Do De Da Da Da’
D. ’In-a-Gadda-Da-Vida’
Answer: C
Which of the Three Stooges was not related to the others?
A. Moe
B. Larry
C. Curly
D. Shemp
Answer:
OPTIONS
- A
- B
- C
- D
Table 29: An example of MMLU.
PROMPT
Answer these questions:
Q: Who is hosting the fifa world cup in 2022?
A: Qatar
Q: Who won the first women ’s fifa world cup?
A: United States
Q: When did miami vice go off the air?
A: 1989
Q: Who wrote the song shout to the lord?
A: Darlene Zschech
Q: Who was thrown in the lion ’s den?
A: Daniel
Q: What is the meaning of the name habib?
A:
Table 30: An example of NaturalQuestions.
PROMPT
A woman notices that she is depressed every autumn, and wonders why. A friend suggests to her that perhaps certain changes that take place as seasons move from warm to cold may be having an effect on her. When pressed for an example of these changes, the friend cites
OPTIONS
- flowers blooming
- grass turning brown
- trees growing
- blossoms blooming
Table 31: An example of OpenBookQA.
PROMPT
To make it easier to push the reset button of the garbage disposable machine which is located underneath the machine,
OPTIONS
- place a wall mirror on the floor of the cabinet
- hold a hand mirror under the garbage disposable machine
Table 32: An example of PIQA.
PROMPT
Article:
When you read an article you will understand and remember it better if you can work out how the writer has put the ideas together. Sometimes a writer puts ideas together by asking questions and then answering them.For example, if the article is about groundhogs, the set of questions in the writer’s head might be:
What does a groundhog look like?
Where do groundhogs live?
What do they eat?…
In the article,the author might answer those questions.
Sometimes an author writes out her questions in the article.These questions give you signals.They tell you what the author is going to write next.Often an author has a question in her head but she doesn’t write it out for you.You have to work out her question for yourself.Here’s a sample reading for you to practice this method.
Earthworms
Do you know how many kinds of earthworms there are?There are about 1800 kinds in the world! They can be brown,purple,green.They can be as small as 3 cm long and as large as 3 m long.
The best time to see earthworms is at night,especially a cool,damp night.That’s when they come up from their burrows to hunt for food.Earthworms don’t like to be in the sun.That’s because they breathe through their skin,and they can’t breathe if their skin gets too dry.Earthworms must come out of the earth if it rains a lot,because they can’t breathe in their flooded burrows.What a dangerous life!
Earthworms don’t have eyes,so how can they tell when it’s dark? They have special places on their skin that are sensitive to light.These spots tell whether it’s light or dark.If you shine a flashlight on an earthworm at night,it will quickly disappear into the ground.
Earthworms don’t have ears either,but they can hear by feeling movements in the earth.If you want to hear like an earthworm,lie on the ground with your fingers in your ears.Then have a friend stamp his or her feet near you.This is how earthworms feel birds and people walking,and moles digging,near them.
Earthworms are useful.Farmers and gardeners like having lots of earthworms in their land because the worms help to make better soil when they dig.That digging keeps the soil loose and airy .In one year earthworms can pile up as much as 23,000 kg of castings in an area about the size of a football field.
Q: What’s the purpose of reading Earthworms?
A: To put the writer’s idea into real use.
Q: Which question CANNOT be answered in the passage?
A: Why can human listen like earthworms?
Q: How can you understand Earthworms better according to this passage?
A: Read to work out all the questions in the writer’s head while reading.
Q: What’s the best title for the passage?
A:
OPTIONS
- One way to help with understanding
- One way to practice with a new idea
- One way to learn to be a wise writer
- One way to be clearer about worms
Table 33: An example of RACE.
PROMPT
Answer these questions:
Q: A Jayhawker was a term applied to anti-slavery militant bands from a certain US state that clashed with pro-slavery factions from Missouri. Which state is this, sometimes referred to as the Jayhawk State?
A: Kans.
Q: Which Swedish DJ and record producer had a UK Number One single in 2013 with ’Wake Me Up’?
A: Tim Bergling
Q: Who is the MP for Sheffield Hallam?
A: Nick clegg
Q: A case that riveted the nation, the case of The State of Tennessee v. John Thomas Scopes concluded on July 21, 1925, with the jury finding Mr. Scopes guilty of teaching what?
A: Survival of species
Q: What cartoon series featured a character called Little My?
A: Muumi
Q: "What English model, with her short-haired androgynous look, born Lesley Hornby, was discovered in 1966 by Nigel Davies when she was 16 and weighed 6 stone (41 kg, 91 lbs), and became ""The Face of ’66"" with her high fashion mod look created by Mary Quant?"
A:
Table 34: An example of TriviaQA.
PREFIXES
- So Monica
- So Jessica
COMPLETION
avoids eating carrots for their eye health because Emily needs good eyesight while Monica doesn’t.
Table 35: An example of WinoGrande. Note that there are multiple prefixes and only one completion for WinoGrande, and we choose the predicted prefix with the lowest perplexity of the completion.
Prompt
You will be given a function f and an output in the form f(??) == output. Find any input such that executing f on the input leads to the given output. There may be multiple answers, but you should only output one. In [ANSWER] and [/ANSWER] tags, complete the assertion with one such input that will produce the output when executing the function.
[PYTHON]
def f(my_list):
    count = 0
    for i in my_list:
        if len(i) % 2 == 0:
            count += 1
    return count
assert f(??) == 3
[/PYTHON]
[ANSWER]
assert f( ["mq", "px", "zy"]) == 3
[/ANSWER]
[PYTHON]
def f(s1, s2):
    return s1 + s2
assert f(??) == "banana"
[/PYTHON]
[ANSWER]
assert f("ba", "nana") == "banana"
[/ANSWER]
[PYTHON]
def f(a, b, c):
    result = {}
    for d in a, b, c:
        result.update(dict.fromkeys(d))
    return result
assert f(??) == {1: None, 2: None}
[/PYTHON]
[ANSWER]
Table 36: An example of CRUXEval-I.
Prompt
You are given a Python function and an assertion containing an input to the function. Complete the assertion with a literal (no unsimplified expressions, no function calls) containing the output when executing the provided code on the given input, even if the function is incorrect or incomplete. Do NOT output any extra information. Provide the full assertion with the correct output in [ANSWER] and [/ANSWER] tags, following the examples.
[PYTHON]
def f(n):
    return n
assert f(17) == ??
[/PYTHON]
[ANSWER]
assert f(17) == 17
[/ANSWER]
[PYTHON]
def f(s):
    return s + "a"
assert f("x9j") == ??
[/PYTHON]
[ANSWER]
assert f("x9j") == "x9ja"
[/ANSWER]
[PYTHON]
def f(nums):
    output = []
    for n in nums:
        output.append((nums.count(n), n))
    output.sort(reverse=True)
    return output
assert f( [1, 1, 3, 1, 3, 1]) == ??
[/PYTHON]
[ANSWER]
Table 37: An example of CRUXEval-O.