DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.
Figure 1 | (a) MMLU accuracy vs. activated parameters, among different open-source models. (b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2.
Contents

1 Introduction
2 Architecture
2.1 Multi-Head Latent Attention: Boosting Inference Efficiency
2.1.1 Preliminaries: Standard Multi-Head Attention
2.1.2 Low-Rank Key-Value Joint Compression
2.1.3 Decoupled Rotary Position Embedding
2.1.4 Comparison of Key-Value Cache
2.2 DeepSeekMoE: Training Strong Models at Economical Costs
2.2.1 Basic Architecture
2.2.2 Device-Limited Routing
2.2.3 Auxiliary Loss for Load Balance
2.2.4 Token-Dropping Strategy
3 Pre-Training
3.1 Experimental Setups
3.1.1 Data Construction
3.1.2 Hyper-Parameters
3.1.3 Infrastructures
3.1.4 Long Context Extension
3.2 Evaluations
3.2.1 Evaluation Benchmarks
3.2.2 Evaluation Results
3.2.3 Training and Inference Efficiency
4 Alignment
4.1 Supervised Fine-Tuning
4.2 Reinforcement Learning
4.3 Evaluation Results
4.4 Discussion
5 Conclusion, Limitation, and Future Work
A Contributions and Acknowledgments
B DeepSeek-V2-Lite: A 16B Model Equipped with MLA and DeepSeekMoE
B.1 Model Description
B.2 Performance Evaluation
C Full Formulas of MLA
D Ablation of Attention Mechanisms
D.1 Ablation of MHA, GQA, and MQA
D.2 Comparison Between MLA and MHA
E Discussion About Pre-Training Data Debiasing
F Additional Evaluations on Math and Code
G Evaluation Formats
1. Introduction
In the past few years, Large Language Models (LLMs) (Anthropic, 2023; Google, 2023; OpenAI, 2022, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework (Vaswani et al., 2017) with our proposed Multi-head Latent Attention (MLA) and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache of the Multi-Head Attention (MHA) (Vaswani et al., 2017) poses a significant obstacle to the inference efficiency of LLMs. Various approaches have been explored to address this issue, including Grouped-Query Attention (GQA) (Ainslie et al., 2023) and Multi-Query Attention (MQA) (Shazeer, 2019). However, these methods often compromise performance in their attempt to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, and meanwhile significantly reduces the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1(a)), economical training costs, and efficient inference throughput (Figure 1(b)), simultaneously.
We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect 1.5M conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 Chat (SFT). Finally, we follow DeepSeekMath (Shao et al., 2024) to employ Group Relative Policy Optimization (GRPO) to further align the model with human preference and produce DeepSeek-V2 Chat (RL).
We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and compare it with representative open-source models. Evaluation results show that even with only 21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source models and becomes the strongest open-source MoE language model. Figure 1(a) highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1(b), compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and
Figure 2 | Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.
DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves a 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), an 8.97 overall score on MT-Bench (Zheng et al., 2023), and a 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that, in Chinese, DeepSeek-V2 Chat (RL) outperforms all open-source models and even beats most closed-source models.
In order to facilitate further research and development on MLA and DeepSeekMoE, we also release DeepSeek-V2-Lite, a smaller model equipped with MLA and DeepSeekMoE, for the open-source community. It has a total of 15.7B parameters, of which 2.4B are activated for each token. Detailed descriptions of DeepSeek-V2-Lite can be found in Appendix B.
In the rest of this paper, we first provide a detailed description of the model architecture of DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the training data construction, hyper-parameter settings, infrastructures, long context extension, and the evaluation of model performance and efficiency (Section 3). Following this, we demonstrate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future work (Section 5).
2. Architecture
By and large, DeepSeek-V2 is still based on the Transformer architecture (Vaswani et al., 2017), where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other minor details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024).
2.1. Multi-Head Latent Attention: Boosting Inference Efficiency

Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache becomes a bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) were proposed. They require a smaller amount of KV cache, but their performance does not match MHA (we provide an ablation of MHA, GQA, and MQA in Appendix D.1).
For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, while requiring a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix D.2.
2.1.1. Preliminaries: Standard Multi-Head Attention
We first introduce the standard MHA mechanism as background. Let $d$ denote the embedding dimension, $n_h$ the number of attention heads, $d_h$ the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$, respectively:

$$\mathbf{q}_t = W^{Q} \mathbf{h}_t, \qquad \mathbf{k}_t = W^{K} \mathbf{h}_t, \qquad \mathbf{v}_t = W^{V} \mathbf{h}_t.$$
Figure 3 | Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference.
Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:

$$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t, \qquad [\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; \ldots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t, \qquad [\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; \ldots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t,$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i},$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$$
where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively, and $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate inference, so MHA needs to cache $2 n_h d_h l$ elements for each token, where $l$ denotes the number of layers. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.
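To make the cache arithmetic concrete, the following is a minimal NumPy sketch of one MHA layer in decode mode with an explicit per-layer KV cache. The dimensions, weight names (`W_Q`, `W_K`, `W_V`, `W_O`), and the `mha_decode_step` helper are illustrative choices for this sketch, not DeepSeek-V2's actual configuration or code; the point is simply that each generated token appends $2 n_h d_h$ numbers to the cache of every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h = 1024, 16, 64                # illustrative sizes, not DeepSeek-V2's
W_Q = rng.standard_normal((n_h * d_h, d))
W_K = rng.standard_normal((n_h * d_h, d))
W_V = rng.standard_normal((n_h * d_h, d))
W_O = rng.standard_normal((d, n_h * d_h))

def mha_decode_step(h_t, k_cache, v_cache):
    """One decoding step of standard MHA with a per-layer KV cache."""
    q = (W_Q @ h_t).reshape(n_h, d_h)
    k = (W_K @ h_t).reshape(n_h, d_h)
    v = (W_V @ h_t).reshape(n_h, d_h)
    k_cache.append(k)                     # the cache grows by 2 * n_h * d_h
    v_cache.append(v)                     # numbers per token for this layer
    K = np.stack(k_cache, axis=1)         # (n_h, t, d_h)
    V = np.stack(v_cache, axis=1)
    scores = np.einsum('hd,htd->ht', q, K) / np.sqrt(d_h)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    o = np.einsum('ht,htd->hd', attn, V)
    return W_O @ o.reshape(-1)

k_cache, v_cache = [], []
for _ in range(5):                        # decode five tokens
    _ = mha_decode_step(rng.standard_normal(d), k_cache, v_cache)

print(2 * n_h * d_h)                      # 2048 cached numbers per token per layer
```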
2.1.2. Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce the KV cache:

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t,$$
$$\mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV},$$
$$\mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV},$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$ and $W^{UV}$ can be absorbed into $W^{O}$, we do not even need to compute keys and values explicitly for attention. Figure 3 intuitively illustrates how the KV joint compression in MLA reduces the KV cache.
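The absorption claim can be checked numerically. The sketch below, under the same illustrative conventions (invented dimensions and weight names, not the paper's implementation), verifies that per-head attention logits computed from explicitly reconstructed keys $W^{UK}\mathbf{c}_j^{KV}$ equal the logits computed with $W^{UK}$ folded into the query side, so only the $d_c$-dimensional latent ever has to be cached.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h, d_c = 1024, 16, 64, 128           # illustrative; d_c << n_h * d_h

W_DKV = rng.standard_normal((d_c, d))          # down-projection
W_UK  = rng.standard_normal((n_h * d_h, d_c))  # key up-projection
W_Q   = rng.standard_normal((n_h * d_h, d))

h_j, h_t = rng.standard_normal(d), rng.standard_normal(d)
c_j = W_DKV @ h_j                              # the only thing MLA caches for token j
q = (W_Q @ h_t).reshape(n_h, d_h)
k = (W_UK @ c_j).reshape(n_h, d_h)             # keys reconstructed from the latent

# Per-head logits computed the naive way, with keys materialized ...
naive = np.einsum('hd,hd->h', q, k)

# ... and with W_UK absorbed into the query side, so keys are never materialized.
W_UK_heads = W_UK.reshape(n_h, d_h, d_c)
absorbed_q = np.einsum('hdc,hd->hc', W_UK_heads, q)   # (n_h, d_c)
absorbed = absorbed_q @ c_j

print(np.allclose(naive, absorbed))            # True
print(d_c, "cached numbers per token per layer vs", 2 * n_h * d_h, "for MHA")
```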
Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even though it cannot reduce the KV cache:

$$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t,$$
$$\mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q},$$
where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.
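For completeness, a brief sketch of the query path under the same illustrative conventions (assumed dimensions and names); it merely restates the two formulas above and is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h, d_c_q = 1024, 16, 64, 192         # illustrative; d_c' << n_h * d_h

W_DQ = rng.standard_normal((d_c_q, d))         # query down-projection
W_UQ = rng.standard_normal((n_h * d_h, d_c_q)) # query up-projection

h_t = rng.standard_normal(d)
c_q = W_DQ @ h_t                               # compressed query latent (activation)
q_C = (W_UQ @ c_q).reshape(n_h, d_h)           # per-head content queries

# Queries are not cached at inference, so this compression leaves the KV cache
# unchanged; it only shrinks the query activation kept during training.
print(c_q.shape, q_C.shape)                    # (192,) (16, 64)
```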
2.1.3. Decoupled Rotary Position Embedding
Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression. To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys $\mathbf{k}_t^{C}$, then $W^{UK}$ in the key up-projection will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ can no longer be absorbed into $W^{Q}$ during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$, and matrix multiplication does not obey the commutative law. As a result, we would have to recompute the keys for all the prefix tokens during inference, which would significantly hinder inference efficiency.
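The incompatibility can be illustrated with a toy rotation. The sketch below uses a simplified block-diagonal rotation in the spirit of RoPE (not the exact RoPE layout) to show that the product $R(\mathrm{pos})\,W^{UK}$ changes with position, so no single position-independent matrix can stand in for the absorbed product.

```python
import numpy as np

def rope_rotation(pos, dim, base=10000.0):
    """Simplified block-diagonal rotation in the spirit of RoPE (dim must be even)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos / (base ** (2 * i / dim))
        c, s = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d_h, d_c = 8, 16                               # tiny illustrative sizes
W_UK_i = rng.standard_normal((d_h, d_c))       # one head's key up-projection

# With RoPE applied to the keys, a score involves q^T R(j) W_UK_i c_j.  Folding
# W_UK_i into the query side would require one fixed matrix equal to
# R(j) @ W_UK_i for every position j, but that product depends on j:
print(np.allclose(rope_rotation(3, d_h) @ W_UK_i,
                  rope_rotation(7, d_h) @ W_UK_i))   # False
```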
As a solution, we propose the decoupled RoPE strategy, which uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^R}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^R}$ to carry RoPE, where $d_h^R$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

$$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}),$$
$$\mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_t),$$
$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],$$
$$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}],$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^{C},$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$$
where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ are the matrices that produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot\,;\cdot]$ denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing $(d_c + d_h^R) l$ elements.
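Putting the pieces together, here is a sketch of a single decoupled-RoPE MLA step under the same illustrative conventions. The `rope` helper is a placeholder (identity) so the sketch stays runnable, and all dimensions are assumptions rather than DeepSeek-V2's settings; the final line counts the $(d_c + d_h^R)$ numbers that actually need to be cached per token per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h = 1024, 16, 64
d_c, d_c_q, d_h_R = 128, 192, 32                # illustrative compression / RoPE dims

W_DKV = rng.standard_normal((d_c, d))
W_UK  = rng.standard_normal((n_h * d_h, d_c))
W_DQ  = rng.standard_normal((d_c_q, d))
W_UQ  = rng.standard_normal((n_h * d_h, d_c_q))
W_QR  = rng.standard_normal((n_h * d_h_R, d_c_q))   # decoupled query projection
W_KR  = rng.standard_normal((d_h_R, d))              # shared decoupled key projection

def rope(x, pos):
    """Placeholder for RoPE: identity here, only to keep the sketch runnable."""
    return x

h_t, pos = rng.standard_normal(d), 5
c_kv, c_q = W_DKV @ h_t, W_DQ @ h_t

q_C = (W_UQ @ c_q).reshape(n_h, d_h)                 # content queries
q_R = rope(W_QR @ c_q, pos).reshape(n_h, d_h_R)      # decoupled (RoPE) queries
k_C = (W_UK @ c_kv).reshape(n_h, d_h)                # content keys
k_R = rope(W_KR @ h_t, pos)                          # shared decoupled key

q = np.concatenate([q_C, q_R], axis=-1)              # (n_h, d_h + d_h_R)
k = np.concatenate([k_C, np.broadcast_to(k_R, (n_h, d_h_R))], axis=-1)
logits = np.einsum('hd,hd->h', q, k) / np.sqrt(d_h + d_h_R)

# Only c_kv and k_R are cached: (d_c + d_h_R) numbers per token per layer.
print(logits.shape, c_kv.size + k_R.size)            # (16,) 160
```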
In order to demonstrate the complete computation process of MLA, we also organize and provide its full formulas in Appendix C.
2.1.4. Comparison of Key-Value Cache
We demonstrate a comparison of the per-token KV cache among different attention mechanisms in Table 1. MLA requires only a small amount of KV cache, equal to that of GQA with only 2.25 groups, yet achieves stronger performance than MHA.
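As a sanity check on the 2.25-group figure, the snippet below compares per-token, per-layer KV cache element counts for the four mechanisms. The choices $d_c = 4 d_h$ and $d_h^R = d_h / 2$ are assumptions that reproduce the reported equivalence ($4.5\, d_h = 2.25 \times 2 d_h$); the head count and head dimension are likewise illustrative.

```python
# Per-token, per-layer KV cache element counts for the attention variants compared
# in Table 1.  n_g is the number of GQA groups; d_c = 4*d_h and d_h_R = d_h // 2 are
# the assumed MLA settings that yield the "2.25 groups" equivalence.
def kv_cache_per_token(n_h, d_h, n_g, d_c, d_h_R):
    return {
        "MHA": 2 * n_h * d_h,
        "GQA": 2 * n_g * d_h,
        "MQA": 2 * d_h,
        "MLA": d_c + d_h_R,
    }

n_h, d_h = 128, 128                                  # illustrative head count / dim
sizes = kv_cache_per_token(n_h, d_h, n_g=8, d_c=4 * d_h, d_h_R=d_h // 2)
print(sizes)
print("MLA equals GQA with", sizes["MLA"] / (2 * d_h), "groups")   # 2.25
```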