
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI
research@deepseek.com

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.

Figure 1 | (a) MMLU accuracy vs. activated parameters, among different open-source models. (b) Training costs and inference efficiency of DeepSeek 67B (Dense) and DeepSeek-V2.

Contents

1 Introduction
2 Architecture
2.1 Multi-Head Latent Attention: Boosting Inference Efficiency
2.1.1 Preliminaries: Standard Multi-Head Attention
2.1.2 Low-Rank Key-Value Joint Compression
2.1.3 Decoupled Rotary Position Embedding
2.1.4 Comparison of Key-Value Cache
2.2 DeepSeekMoE: Training Strong Models at Economical Costs
2.2.1 Basic Architecture
2.2.2 Device-Limited Routing
2.2.3 Auxiliary Loss for Load Balance
2.2.4 Token-Dropping Strategy
3 Pre-Training
3.1 Experimental Setups
3.1.1 Data Construction
3.1.2 Hyper-Parameters
3.1.3 Infrastructures
3.1.4 Long Context Extension
3.2 Evaluations
3.2.1 Evaluation Benchmarks
3.2.2 Evaluation Results
3.2.3 Training and Inference Efficiency
4 Alignment
4.1 Supervised Fine-Tuning
4.2 Reinforcement Learning
4.3 Evaluation Results
4.4 Discussion
5 Conclusion, Limitation, and Future Work
A Contributions and Acknowledgments
B Full Formulas of MLA
C Ablation of Attention Mechanisms
C.1 Ablation of MHA, GQA, and MQA
C.2 Comparison Between MLA and MHA
D Discussion About Pre-Training Data Debiasing
E Additional Evaluations on Math and Code
F Evaluation Formats

1. Introduction

In the past few years, Large Language Models (LLMs) (Anthropic, 2023; Google, 2023; OpenAI, 2022, 2023) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks (Wei et al., 2022). However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework (Vaswani et al., 2017) with our proposed Multi-head Latent Attention (MLA) and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache of the Multi-Head Attention (MHA) (Vaswani et al., 2017) poses a significant obstacle to the inference efficiency of LLMs. Various approaches have been explored to address this issue, including Grouped-Query Attention (GQA) (Ainslie et al., 2023) and Multi-Query Attention (MQA) (Shazeer, 2019). However, these methods often compromise performance in their attempt to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, and meanwhile significantly reduces the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward Networks (FFNs), we follow the DeepSeekMoE architecture (Dai et al., 2024), which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard (Lepikhin et al., 2021), enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance (Figure 1(a)), economical training costs, and efficient inference throughput (Figure 1(b)), simultaneously.
We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 Chat (SFT). Finally, we follow DeepSeekMath (Shao et al., 2024) to employ Group Relative Policy Optimization (GRPO) to further align the model with human preference and produce DeepSeek-V2 Chat (RL).
We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and compare it with representative open-source models. Evaluation results show that even with only 21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source models and becomes the strongest open-source MoE language model. Figure 1(a) highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1(b), compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves 38.9 length-controlled win rate on AlpacaEval 2.0 (Dubois et al., 2024), 8.97 overall score on MT-Bench (Zheng et al., 2023), and 7.91 overall score on AlignBench (Liu et al., 2023). The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that in Chinese, DeepSeek-V2 Chat (RL) outperforms all of open-source models, and even beats most of closed-source models.

Figure 2 | Illustration of the architecture of DeepSeek-V2. MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.
In the rest of this paper, we first provide a detailed description of the model architecture of DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the training data construction, hyper-parameter settings, infrastructures, long context extension, and the evaluation of model performance and efficiency (Section 3). Following this, we demonstrate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future work (Section 5).

2. Architecture

By and large, DeepSeek-V2 is still based on the Transformer architecture (Vaswani et al., 2017), where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture (Dai et al., 2024), a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other tiny details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B (DeepSeek-AI, 2024).

2.1. Multi-Head Latent Attention: Boosting Inference Efficiency

Conventional Transformer models usually adopt Multi-Head Attention (MHA) (Vaswani et al., 2017), but during generation, its heavy Key-Value (KV) cache becomes the bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023) have been proposed. They require a smaller magnitude of KV cache, but their performance does not match MHA (we provide the ablation of MHA, GQA, and MQA in Appendix C.1).
For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix C.2.

2.1.1. Preliminaries: Standard Multi-Head Attention

We first introduce the standard MHA mechanism as background. Let $d$ be the embedding dimension, $n_h$ be the number of attention heads, $d_h$ be the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ be the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$, respectively:

$$\mathbf{q}_t = W^{Q} \mathbf{h}_t, \quad \mathbf{k}_t = W^{K} \mathbf{h}_t, \quad \mathbf{v}_t = W^{V} \mathbf{h}_t.$$

Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:

$$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t, \quad [\mathbf{k}_{t,1}; \ldots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t, \quad [\mathbf{v}_{t,1}; \ldots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t,$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i},$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$$
where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively, and $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate inference, so MHA needs to cache $2 n_h d_h l$ elements for each token. In model deployment, this heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.

Figure 3 | Simplified illustration of Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA). Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference.

2.1.2. Low-Rank Key-Value Joint Compression

The core of MLA is the low-rank joint compression for keys and values to reduce the KV cache:

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t,$$
$$\mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV},$$
$$\mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV},$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. In addition, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$, and $W^{UV}$ can be absorbed into $W^{O}$, we even do not need to compute keys and values out for attention. Figure 3 intuitively illustrates how the joint compression in MLA reduces the KV cache.
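To make the compression step concrete, the following NumPy sketch applies the down- and up-projections above to a single token. The dimensions mirror the hyper-parameters reported in Section 3.1.2, but the weight matrices are random stand-ins; only the shapes and the caching behavior are meant to be illustrative.

```python
# Minimal NumPy sketch of MLA's low-rank key-value joint compression.
# d = 5120, n_h = 128, d_h = 128, d_c = 512 follow Section 3.1.2;
# the weights here are random stand-ins, not trained parameters.
import numpy as np

d, n_h, d_h, d_c = 5120, 128, 128, 512

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d)) * 0.006          # down-projection
W_UK = rng.standard_normal((n_h * d_h, d_c)) * 0.006   # up-projection for keys
W_UV = rng.standard_normal((n_h * d_h, d_c)) * 0.006   # up-projection for values

h_t = rng.standard_normal(d)        # attention input of one token

c_kv = W_DKV @ h_t                  # compressed latent vector: only this is cached
k_c = W_UK @ c_kv                   # keys, reconstructed on the fly
v_c = W_UV @ c_kv                   # values, reconstructed on the fly

print(c_kv.shape)                   # (512,)   -> cached per token per layer
print(k_c.shape, v_c.shape)         # (16384,) each -> never stored in the KV cache
```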
Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even if it cannot reduce the KV cache:

$$\mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t,$$
$$\mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q},$$

where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.

2.1.3. Decoupled Rotary Position Embedding

Following DeepSeek 67B (DeepSeek-AI, 2024), we intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression.
Attention Mechanism | KV Cache per Token (# Element) | Capability
Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong
Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate
Multi-Query Attention (MQA) | $2 d_h l$ | Weak
MLA (Ours) | $(d_c + d_h^R) l \approx \frac{9}{2} d_h l$ | Stronger

Table 1 | Comparison of the KV cache per token among different attention mechanisms. $n_h$ denotes the number of attention heads, $d_h$ denotes the dimension per attention head, $l$ denotes the number of layers, $n_g$ denotes the number of groups in GQA, and $d_c$ and $d_h^R$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively. The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_c$ is set to $4 d_h$ and $d_h^R$ is set to $\frac{d_h}{2}$. So, its KV cache is equal to GQA with only 2.25 groups, but its performance is stronger than MHA.
To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys $\mathbf{k}_t^{C}$, $W^{UK}$ will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ cannot be absorbed into $W^{Q}$ any more during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$, and matrix multiplication does not obey a commutative law. As a result, we must recompute the keys for all the prefix tokens during inference, which will significantly hinder the inference efficiency.
As a solution, we propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^R}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^R}$ to carry RoPE, where $d_h^R$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:

$$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} = \operatorname{RoPE}(W^{QR} \mathbf{c}_t^{Q}),$$
$$\mathbf{k}_t^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_t),$$
$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],$$
$$\mathbf{k}_{j,i} = [\mathbf{k}_{j,i}^{C}; \mathbf{k}_j^{R}],$$
$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^{C},$$
$$\mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$$

where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ are matrices to produce the decoupled queries and key, respectively; $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes the concatenation operation. During inference, the decoupled key should also be cached. Therefore, DeepSeek-V2 requires a total KV cache containing $(d_c + d_h^R) l$ elements.
In order to demonstrate the complete computation process of MLA, we also organize and provide its full formulas in Appendix B.

2.1.4. Comparison of Key-Value Cache

We demonstrate a comparison of the KV cache per token among different attention mechanisms in Table 1. MLA requires only a small amount of KV cache, equal to GQA with only 2.25 groups, but can achieve stronger performance than MHA.
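As a rough illustration of Table 1, the snippet below evaluates the per-token cache formulas with DeepSeek-V2's attention settings from Section 3.1.2; the GQA group count n_g = 8 is a hypothetical value chosen only for comparison and is not from the paper.

```python
# Per-token KV cache sizes (number of cached elements summed over all l layers)
# for the four attention variants in Table 1, using DeepSeek-V2-like settings
# (n_h = 128, d_h = 128, l = 60, d_c = 4 * d_h, d_h_R = d_h / 2).
n_h, d_h, l = 128, 128, 60
d_c, d_h_R = 4 * d_h, d_h // 2
n_g = 8  # hypothetical number of GQA groups, for comparison only

cache = {
    "MHA": 2 * n_h * d_h * l,
    "GQA": 2 * n_g * d_h * l,
    "MQA": 2 * d_h * l,
    "MLA": (d_c + d_h_R) * l,
}
for name, elems in cache.items():
    print(f"{name:>4}: {elems:>10,d} elements/token "
          f"({elems / cache['MHA']:.1%} of MHA)")
# MLA stores (4 + 0.5) * d_h * l elements, i.e. the same as GQA with 2.25 groups.
```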

2.2. DeepSeekMoE: Training Strong Models at Economical Costs

2.2.1. Basic Architecture

For FFNs, we employ the DeepSeekMoE architecture (Dai et al., 2024). DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization and more accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge redundancy among routed experts. With the same number of activated and total expert parameters, DeepSeekMoE can outperform conventional MoE architectures like GShard (Lepikhin et al., 2021) by a large margin.
Let $\mathbf{u}_t$ be the FFN input of the $t$-th token, we compute the FFN output $\mathbf{h}_t'$ as follows:

$$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \operatorname{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}_i^{(r)}(\mathbf{u}_t),$$
$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \leq j \leq N_r\}, K_r), \\ 0, & \text{otherwise}, \end{cases}$$
$$s_{i,t} = \operatorname{Softmax}_i(\mathbf{u}_t^{T} \mathbf{e}_i),$$

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}_i^{(s)}(\cdot)$ and $\operatorname{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gate value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid of the $i$-th routed expert in this layer; and $\operatorname{Topk}(\cdot, K_r)$ denotes the set comprising the $K_r$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts.
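The gating above can be sketched in a few lines of Python. The toy dimensions, random weights, and linear "experts" are illustrative assumptions; only the softmax affinity, the top-K selection, and the always-active shared experts follow the formulation above.

```python
# Sketch of the DeepSeekMoE gating: shared experts are always applied, and K_r
# routed experts are selected per token by token-to-expert affinity.
import numpy as np

def moe_layer(u_t, shared_experts, routed_experts, centroids, k_r):
    """u_t: (d,) token input; centroids: (N_r, d); returns h_t'."""
    scores = np.exp(centroids @ u_t)
    s = scores / scores.sum()                      # softmax token-to-expert affinity
    topk = np.argsort(s)[-k_r:]                    # indices of the K_r highest affinities
    out = u_t.copy()                               # residual connection
    for expert in shared_experts:                  # shared experts: always active
        out += expert(u_t)
    for i in topk:                                 # routed experts: gated by s_i
        out += s[i] * routed_experts[i](u_t)
    return out

rng = np.random.default_rng(0)
d, n_shared, n_routed, k_r = 16, 2, 8, 3
make_expert = lambda: (lambda x, W=rng.standard_normal((d, d)) * 0.1: W @ x)
shared = [make_expert() for _ in range(n_shared)]
routed = [make_expert() for _ in range(n_routed)]
centroids = rng.standard_normal((n_routed, d))
print(moe_layer(rng.standard_normal(d), shared, routed, centroids, k_r).shape)  # (16,)
```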

2.2.2. Device-Limited Routing

We design a device-limited routing mechanism to bound MoE-related communication costs. When expert parallelism is employed, the routed experts will be distributed across multiple devices. For each token, its MoE-related communication frequency is proportional to the number of devices covered by its target experts. Due to the fine-grained expert segmentation in DeepSeekMoE, the number of activated experts can be large, so the MoE-related communication will be more costly if we apply expert parallelism.
For DeepSeek-V2, beyond the naive top-K selection of routed experts, we additionally ensure that the target experts of each token will be distributed on at most $M$ devices. To be specific, for each token, we first select the $M$ devices that have experts with the highest affinity scores in them. Then, we perform top-K selection among experts on these $M$ devices. In practice, we find that when $M \geq 3$, the device-limited routing can achieve a good performance roughly aligned with the unrestricted top-K routing.
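A minimal sketch of this two-step selection is given below; the experts-per-device layout and the scoring of a device by its best-ranked expert are simplifying assumptions for illustration, not details specified above.

```python
# Sketch of device-limited routing: first pick the M devices holding the
# highest-affinity experts, then do top-K expert selection inside those devices.
import numpy as np

def device_limited_topk(affinity, expert_to_device, m, k):
    """affinity: (N_r,) token-to-expert scores; returns indices of selected experts."""
    n_devices = expert_to_device.max() + 1
    # Score each device by the best expert it hosts, keep the top-M devices.
    device_best = np.array([affinity[expert_to_device == dev].max()
                            for dev in range(n_devices)])
    allowed_devices = np.argsort(device_best)[-m:]
    # Restrict the candidate set to experts on those devices, then take top-K.
    mask = np.isin(expert_to_device, allowed_devices)
    candidates = np.where(mask)[0]
    return candidates[np.argsort(affinity[candidates])[-k:]]

rng = np.random.default_rng(0)
n_experts, n_devices, m, k = 160, 8, 3, 6        # DeepSeek-V2-like sizes
expert_to_device = np.repeat(np.arange(n_devices), n_experts // n_devices)
affinity = rng.random(n_experts)
print(device_limited_topk(affinity, expert_to_device, m, k))
```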

2.2.3. Auxiliary Loss for Load Balance

We take the load balance into consideration for automatically learned routing strategies. Firstly, unbalanced load will raise the risk of routing collapse (Shazeer et al., 2017), preventing some experts from being fully trained and utilized. Secondly, when expert parallelism is employed, unbalanced load will diminish computation efficiency. During the training of DeepSeek-V2, we design three kinds of auxiliary losses, for controlling expert-level load balance ($\mathcal{L}_{\mathrm{ExpBal}}$), device-level load balance ($\mathcal{L}_{\mathrm{DevBal}}$), and communication balance ($\mathcal{L}_{\mathrm{CommBal}}$), respectively.
Expert-Level Balance Loss. We use an expert-level balance loss (Fedus et al., 2021; Lepikhin et al., 2021) to mitigate the risk of routing collapse:

$$\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i,$$
$$f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ selects Expert } i),$$
$$P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t},$$

where $\alpha_1$ is a hyper-parameter called expert-level balance factor; $\mathbb{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence.
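A NumPy sketch of this loss follows; the batch of affinity scores is synthetic and the balance factor value is only illustrative, so the snippet merely shows how the load fraction f_i and the mean affinity P_i interact.

```python
# Sketch of the expert-level balance loss: f_i tracks how often expert i is
# selected and P_i its average affinity over a sequence of T tokens.
import numpy as np

def expert_balance_loss(s, topk_idx, k_r, alpha_1):
    """s: (T, N_r) affinity scores; topk_idx: (T, K_r) selected experts per token."""
    T, n_r = s.shape
    selected = np.zeros((T, n_r))
    np.put_along_axis(selected, topk_idx, 1.0, axis=1)   # 1(token t selects expert i)
    f = (n_r / (k_r * T)) * selected.sum(axis=0)         # load fraction per expert
    P = s.mean(axis=0)                                   # mean affinity per expert
    return alpha_1 * np.sum(f * P)

rng = np.random.default_rng(0)
T, n_r, k_r = 1024, 160, 6
logits = rng.standard_normal((T, n_r))
s = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
topk_idx = np.argsort(s, axis=1)[:, -k_r:]
print(expert_balance_loss(s, topk_idx, k_r, alpha_1=0.003))  # illustrative factor
```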
Device-Level Balance Loss. In addition to the expert-level balance loss, we additionally design a device-level balance loss to ensure balanced computation across different devices. In the training process of DeepSeek-V2, we partition all routed experts into $D$ groups $\{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_D\}$, and deploy each group on a single device. The device-level balance loss is computed as follows:

$$\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i',$$
$$f_i' = \frac{1}{|\mathcal{E}_i|} \sum_{j \in \mathcal{E}_i} f_j,$$
$$P_i' = \sum_{j \in \mathcal{E}_i} P_j,$$

where $\alpha_2$ is a hyper-parameter called device-level balance factor.
Communication Balance Loss. Finally, we introduce a communication balance loss to ensure that the communication of each device is balanced. Although the device-limited routing mechanism guarantees that the sending communication of each device is bounded, if a certain device receives more tokens than other devices, the practical communication efficiency will also be affected. In order to mitigate this issue, we design a communication balance loss as follows:

$$\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i'' P_i'',$$
$$f_i'' = \frac{D}{M T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ is sent to Device } i),$$
$$P_i'' = \sum_{j \in \mathcal{E}_i} P_j,$$

where $\alpha_3$ is a hyper-parameter called communication balance factor. The device-limited routing mechanism operates on the principle of ensuring that each device transmits at most $M T$ hidden states to other devices. Simultaneously, the communication balance loss is employed to encourage each device to receive around $M T$ hidden states from other devices. The communication balance loss guarantees a balanced exchange of information among devices, promoting efficient communications.

2.2.4. Token-Dropping Strategy

While balance losses aim to encourage a balanced load, it is important to acknowledge that they cannot guarantee a strict load balance. In order to further mitigate the computation wastage caused by unbalanced load, we introduce a device-level token-dropping strategy during training. This approach first computes the average computational budget for each device, which means that the capacity factor for each device is equivalent to 1.0. Then, inspired by Riquelme et al. (2021), we drop tokens with the lowest affinity scores on each device until reaching the computational budget. In addition, we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped. In this way, we can flexibly decide whether to drop tokens during inference according to the efficiency requirements, and always ensure consistency between training and inference.
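A simplified single-device version of this dropping rule is sketched below; the protected mask standing in for the never-drop sequences and the capacity value are illustrative assumptions.

```python
# Sketch of the device-level token-dropping strategy: each device keeps at most
# its average computational budget (capacity factor 1.0) and drops the tokens
# with the lowest routing affinity beyond that budget.
import numpy as np

def drop_tokens(affinity, capacity, protected):
    """affinity: (T,) routing scores of tokens sent to this device;
    protected: boolean mask of tokens that must never be dropped."""
    keep = np.zeros(len(affinity), dtype=bool)
    keep[protected] = True
    budget_left = max(capacity - keep.sum(), 0)
    droppable = np.where(~protected)[0]
    # Keep the highest-affinity droppable tokens until the budget is exhausted.
    keep[droppable[np.argsort(affinity[droppable])[::-1][:budget_left]]] = True
    return keep

rng = np.random.default_rng(0)
T = 16
affinity = rng.random(T)
protected = np.zeros(T, dtype=bool)
protected[:2] = True                    # e.g. tokens from never-drop sequences
print(drop_tokens(affinity, capacity=10, protected=protected))
```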

3. Pre-Training

3.1. Experimental Setups

3.1.1. Data Construction

While maintaining the same data processing stages as for DeepSeek 67B (DeepSeek-AI, 2024), we extend the amount of data and elevate the data quality. In order to enlarge our pre-training corpus, we delve deeper into the potential of the internet data and optimize our cleaning processes, thus recovering a large amount of mistakenly deleted data. Moreover, we incorporate more Chinese data, aiming to better leverage the corpus available on the Chinese internet. In addition to the amount of data, we also focus on the data quality. We enrich our pre-training corpus with high-quality data from various sources, and meanwhile improve the quality-based filtering algorithm. The improved algorithm ensures that a large amount of non-beneficial data will be removed, while the valuable data will be mostly retained. In addition, we filter out the contentious content from our pre-training corpus to mitigate the data bias introduced from specific regional cultures. A detailed discussion about the influence of this filtering strategy is presented in Appendix D.
We adopt the same tokenizer as used in DeepSeek 67B, which is built based on the Byte-level Byte-Pair Encoding (BBPE) algorithm and has a vocabulary size of 100K. Our tokenized pre-training corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.

3.1.2. Hyper-Parameters

Model Hyper-Parameters. We set the number of Transformer layers to 60 and the hidden dimension to 5120. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads $n_h$ to 128 and the per-head dimension $d_h$ to 128. The KV compression dimension $d_c$ is set to 512, and the query compression dimension $d_c'$ is set to 1536. For the decoupled queries and key, we set the per-head dimension $d_h^R$ to 64. Following Dai et al. (2024), we substitute all FFNs except for the first layer with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated for each token. In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training. Under this configuration, DeepSeek-V2 comprises 236B total parameters, of which 21B are activated for each token.
Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight_decay $= 0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy (DeepSeek-AI, 2024). Initially, the learning rate linearly increases from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training about 60% of tokens, and again by 0.316 after training about 90% of tokens. The maximum learning rate is set to $2.4 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We also use a batch size scheduling strategy, where the batch size is gradually increased from 2304 to 9216 in the training of the first 225B tokens, and then keeps 9216 in the remaining training. We set the maximum sequence length to 4K, and train DeepSeek-V2 on 8.1T tokens. We leverage pipeline parallelism to deploy different layers of a model on different devices, and for each layer, the routed experts will be uniformly deployed on 8 devices ($D = 8$). As for the device-limited routing, each token will be sent to at most 3 devices ($M = 3$). As for balance losses, we set $\alpha_1$ to 0.003, $\alpha_2$ to 0.05, and $\alpha_3$ to 0.02. We employ the token-dropping strategy during training for acceleration, but do not drop any tokens for evaluation.
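The warmup-and-step-decay schedule can be sketched as a small function of the step index. The total step count below is an arbitrary placeholder; only the linear warmup, the constant phase, and the two 0.316 decays mirror the schedule described above.

```python
# Sketch of the warmup-and-step-decay learning-rate schedule: linear warmup,
# then the rate is multiplied by 0.316 at roughly 60% and 90% of training.
def learning_rate(step, max_lr=2.4e-4, warmup_steps=2000, total_steps=100_000):
    if step < warmup_steps:
        return max_lr * step / warmup_steps      # linear warmup from 0
    if step < 0.6 * total_steps:
        return max_lr                            # constant phase
    if step < 0.9 * total_steps:
        return max_lr * 0.316                    # first decay
    return max_lr * 0.316 * 0.316                # second decay

for s in (0, 1000, 2000, 50_000, 70_000, 95_000):
    print(s, f"{learning_rate(s):.2e}")
```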

3.1.3. Infrastructures

DeepSeek-V2 is trained based on the HAI-LLM framework (High-flyer 2023), an efficient and light-weight training framework developed internally by our engineers. It employs a 16-way zero-bubble pipeline parallelism (Qi et al., 2023), an 8-way expert parallelism (Lepikhin et al. 2021), and ZeRO-1 data parallelism (Rajbhandari et al., 2020). Given that DeepSeek-V2 has relatively few activated parameters, and a portion of the operators are recomputed to save activation memory, it can be trained without the necessity of tensor parallelism, thereby decreasing the communication overhead. Moreover, in order to further improve the training efficiency, we overlap the computation of shared experts with the expert parallel all-to-all communication. We also customize faster CUDA kernels for communications, routing algorithms, and fused linear computations across different experts. In addition, MLA is also optimized based on an improved version of FlashAttention-2 (Dao, 2023).
We conduct all experiments on a cluster equipped with NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected using NVLink and NVSwitch within nodes. Across nodes, InfiniBand interconnects are utilized to facilitate communications.

3.1.4. Long Context Extension

After the initial pre-training of DeepSeek-V2, we employ YaRN (Peng et al., 2023) to extend the default context window length from 4K to 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}_t^{R}$ as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy. The factor $\sqrt{t}$ is computed as $\sqrt{t} = 0.0707 \ln s + 1$, aiming at minimizing the perplexity.
We additionally train the model for 1000 steps, with a sequence length of 32K and a batch size of 576 sequences. Although the training is conducted solely at the sequence length of 32K, the model still demonstrates robust performance when being evaluated at a context length of 128K. As shown in Figure 4, the results on the "Needle In A Haystack" (NIAH) tests indicate that DeepSeek-V2 performs well across all context window lengths up to 128K.

Figure 4 | Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.

3.2. Evaluations

3.2.1. Evaluation Benchmarks

DeepSeek-V2 is pretrained on a bilingual corpus, so we evaluate it on a series of benchmarks in English and Chinese. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Included benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese:
Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).
Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).
Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).
Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019), and CMRC (Cui et al., 2019).
Reference disambiguation datasets include WinoGrande (Sakaguchi et al., 2019) and CLUEWSC (Xu et al., 2020).
Language modeling datasets include Pile (Gao et al., 2020).
Chinese understanding and culture datasets include CHID (Zheng et al., 2019) and CCPM (Li et al., 2021).
Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023).
Code datasets include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).
Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.
Following our previous work (DeepSeek-AI, 2024), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, ARC-Easy, ARC-Challenge, CHID, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, HumanEval, MBPP, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models with different tokenizers.
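For reference, the snippet below shows the standard way Bits-Per-Byte can be computed from a summed negative log-likelihood: the usual nats-to-bits conversion normalized by byte count, which is what makes the metric tokenizer-agnostic. The toy loss value and text are made up.

```python
# Sketch of the Bits-Per-Byte (BPB) metric: normalize the model's total NLL by
# the UTF-8 byte length of the text so different tokenizers compare fairly.
import math

def bits_per_byte(total_nll_nats, text):
    """total_nll_nats: sum of per-token negative log-likelihoods in nats."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

# Toy example: a 3-token continuation with an average loss of 2.0 nats/token.
print(bits_per_byte(total_nll_nats=3 * 2.0, text="hello world"))
```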
For an intuitive overview of these benchmarks, we additionally provide our evaluation formats for each benchmark in Appendix F.

3.2.2. Evaluation Results

In Table 2, we compare DeepSeek-V2 with several representative open-source models, including DeepSeek 67B (DeepSeek-AI, 2024) (our previous release), Qwen1.5 72B (Bai et al., 2023), LLaMA3 70B (AI@Meta, 2024), and Mixtral 8x22B (Mistral, 2024). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Overall, with only 21B activated parameters, DeepSeek-V2 significantly outperforms DeepSeek 67B on almost all benchmarks, and achieves top-tier performance among open-source models.
Further, we elaborately compare DeepSeek-V2 with its open-source counterparts one by one. (1) Compared with Qwen1.5 72B, another model that supports both Chinese and English, DeepSeek-V2 demonstrates overwhelming advantages on the majority of English, code, and math benchmarks. As for Chinese benchmarks, Qwen1.5 72B shows better performance on multi-subject multiple-choice tasks while DeepSeek-V2 is comparable or better on others. Note that for the CHID benchmark, the tokenizer of Qwen1.5 72B will encounter errors in our evaluation framework, so we leave the CHID score blank for Qwen1.5 72B. (2) Compared with Mixtral 8x22B, DeepSeek-V2 achieves comparable or better English performance, except for TriviaQA, NaturalQuestions, and HellaSwag, which are closely related to English commonsense knowledge. Notably, DeepSeek-V2 outperforms Mixtral 8x22B on MMLU. On code and math benchmarks, DeepSeek-V2 demonstrates comparable performance with Mixtral 8x22B. Since Mixtral 8x22B is not specifically trained on Chinese data, its Chinese capability lags far behind DeepSeek-V2. (3) Compared with LLaMA3 70B, DeepSeek-V2 is trained on fewer than a quarter of English tokens. Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B. However, even with much fewer training tokens and activated parameters, DeepSeek-V2 still demonstrates comparable code and math capability with LLaMA3 70B. Also, as a bilingual language model, DeepSeek-V2 outperforms LLaMA3 70B overwhelmingly on Chinese benchmarks.
Finally, it is worth mentioning that certain prior studies (Hu et al., 2024) incorporate SFT data during the pre-training stage, whereas DeepSeek-V2 has never been exposed to SFT data during pre-training.
Benchmark (Metric) | # Shots | DeepSeek 67B | Qwen1.5 72B | Mixtral 8x22B | LLaMA3 70B | DeepSeek-V2
Architecture - Dense Dense MoE Dense MoE
# Activated Params - 67B 39B 70B 21B
# Total Params - 67B 141B 70B
English Pile-test (BPB) - 0.642 0.637 0.623 0.602 0.606
BBH (EM) 3-shot 68.7 59.9 78.9 81.0 78.9
MMLU (Acc.) 5-shot 71.3 77.2 78.9
DROP (F1) 3-shot 69.7 71.5 80.4 82.5
ARC-Easy (Acc.) 25-shot 95.3 97.1 97.9
ARC-Challenge (Acc.) 25-shot 86.4 93.3 92.4
HellaSwag (Acc.) 10-shot 86.3 86.6 87.9 84.2
PIQA (Acc.) 0-shot 83.3 85.0 83.7
WinoGrande (Acc.) 5-shot 84.9 82.4 85.7
RACE-Middle (Acc.) 5-shot 69.9 63.4 73.3 73.3
RACE-High (Acc.) 5-shot 50.7 47.0 56.7 57.9 52.7
TriviaQA (EM) 5-shot 78.9 73.1 81.6 79.9
NaturalQuestions (EM) 5-shot 36.6 35.6 39.6 38.7
AGIEval (Acc.) 0-shot 41.3 64.4 49.8 51.2
Code HumanEval (Pass@1) 0-shot 45.1 43.9 53.1 48.2 48.8
MBPP (Pass@1) 3-shot 57.4 53.6 64.2 68.6
CRUXEval-I (Acc.) 2-shot 42.5 44.3 52.4 49.4
CRUXEval-O (Acc.) 2-shot 41.0 42.3 52.8 54.3 49.8
Math GSM8K (EM) 8-shot 63.4 77.9 80.3 83.0 79.2
MATH (EM) 4-shot 18.7 41.4 42.2 43.6
CMath (EM) 3-shot 63.0 72.3 73.9 78.7
Chinese CLUEWSC (EM) 5-shot 81.0 80.5 77.5 78.3 82.2
C-Eval (Acc.) 5-shot 83.7 59.6 67.5 81.7
CMMLU (Acc.) 5-shot 84.3 60.0 69.3
CMRC (EM) 1-shot 66.6 73.1 73.3 77.5
C3 (Acc.) 0-shot 78.2 77.4
CHID (Acc.) 0-shot 92.1 - 57.0 83.2
CCPM (Acc.) 0-shot 88.1 61.0 68.1 93.1

Table 2 | Comparison among DeepSeek-V2 and other representative open-source models. All models are evaluated in our internal framework and share the same evaluation setting. Bold denotes the best and underline denotes the second-best. Scores with a gap smaller than 0.3 are regarded as at the same level. With only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models.

3.2.3. Training and Inference Efficiency

Training Costs. Since DeepSeek-V2 activates fewer parameters for each token and requires fewer FLOPs than DeepSeek 67B, training DeepSeek-V2 will be more economical than training DeepSeek 67B theoretically. Although training an MoE model will introduce additional communication overheads, through our operator and communication optimizations, the training for DeepSeek-V2 can attain a relatively high Model FLOPs Utilization (MFU). During our practical training on the H800 cluster, for training on each trillion tokens, DeepSeek 67B requires 300.6K GPU hours, while DeepSeek-V2 needs only 172.8K GPU hours, i.e., sparse DeepSeek-V2 can save 42.5% of training costs compared with dense DeepSeek 67B.
Inference Efficiency. In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average. Benefiting from MLA and these optimizations, actually deployed DeepSeek-V2 requires significantly less KV cache than DeepSeek 67B, and thus can serve a much larger batch size. We evaluate the generation throughput of DeepSeek-V2 based on the prompt and generation length distribution from the actually deployed DeepSeek 67B service. On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.
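The deployed quantization scheme is not detailed here; as a hedged illustration only, the sketch below shows a generic 6-bit uniform quantizer and the round-trip error it introduces, which is in the spirit of, but not identical to, the KV cache quantization cited above.

```python
# Hedged sketch of simple per-tensor uniform quantization to 6 bits; the actual
# deployed scheme (grouping, scales, rounding) is not specified and will differ.
import numpy as np

def quantize_6bit(x):
    levels = 2**6 - 1                              # 63 representable steps
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

kv = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, lo, scale = quantize_6bit(kv)
print(np.abs(kv - dequantize(q, lo, scale)).max())  # small reconstruction error
```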

4. Alignment

4.1. Supervised Fine-Tuning

Building upon our prior research (DeepSeek-AI, 2024), we curate our instruction tuning datasets to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory responses and enhance writing proficiency. We fine-tune DeepSeek-V2 with 2 epochs, and the learning rate is set to $5 \times 10^{-6}$. For the evaluation of DeepSeek-V2 Chat (SFT), we mainly include generation-based benchmarks, except for several representative multiple-choice tasks (MMLU and ARC). We also conduct an instruction-following evaluation (IFEval) (Zhou et al., 2023) for DeepSeek-V2 Chat (SFT), using prompt-level loose accuracy as the metric. Moreover, we employ LiveCodeBench (Jain et al., 2024) questions from September 1st, 2023 to April 1st, 2024 to evaluate chat models. In addition to the standard benchmarks, we further evaluate our model on open-ended conversation benchmarks including MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and AlignBench (Liu et al., 2023). For comparison, we also evaluate Qwen1.5 72B Chat, LLaMA3 70B Instruct, and Mixtral 8x22B Instruct in our evaluation framework and settings. As for DeepSeek 67B Chat, we directly refer to the evaluation results reported in our previous release.

4.2. Reinforcement Learning

In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we conduct Reinforcement Learning (RL) to adjust its preference.
Reinforcement Learning Algorithm. In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\!\left[q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]
\frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i, \ \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}, 1 - \varepsilon, 1 + \varepsilon \right) A_i \right) - \beta \, \mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right) \right),$$

where $\varepsilon$ and $\beta$ are hyper-parameters; and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \ldots, r_G\})}{\operatorname{std}(\{r_1, r_2, \ldots, r_G\})}.$$
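The group-relative advantage is the part that replaces the learned critic; a minimal sketch, with made-up rewards, is shown below.

```python
# Sketch of GRPO's group-relative advantage: sample G outputs per question,
# score them with the reward model, and normalize rewards within the group so
# no learned critic/value model is needed.
import numpy as np

def group_relative_advantages(rewards):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)       # A_i for each sampled output

rewards = [0.1, 0.9, 0.4, 0.7]                      # one group of G = 4 outputs (toy values)
print(group_relative_advantages(rewards))
# Outputs scoring above the group mean get positive advantages and are
# reinforced; below-mean outputs are penalized, matching the objective above.
```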
Training Strategy. In our preliminary experiments, we find that the RL training on reasoning data, such as code and math prompts, exhibits unique characteristics that are distinct from the training on general data. For example, the mathematical and coding abilities of our model can keep improving over a longer period of training steps. Therefore, we employ a two-stage RL training strategy, which first performs reasoning alignment, and then performs human preference alignment. In the first reasoning alignment stage, we train a reward model $RM_{reasoning}$ for code and math reasoning tasks, and optimize the policy model with the feedback of $RM_{reasoning}$:

$$r_i = RM_{reasoning}(o_i).$$
In the second human preference alignment stage, we adopt a multi-reward framework, which acquires rewards from a helpful reward model $RM_{helpful}$, a safety reward model $RM_{safety}$, and a rule-based reward model $RM_{rule}$. The final reward of a response $o_i$ is

$$r_i = c_1 \cdot RM_{helpful}(o_i) + c_2 \cdot RM_{safety}(o_i) + c_3 \cdot RM_{rule}(o_i),$$

where $c_1$, $c_2$, and $c_3$ are corresponding coefficients.
In order to obtain reliable reward models that play crucial roles in the RL training, we carefully collect preference data, and meticulously conduct quality filtering and proportion adjustments. We obtain code preference data based on compiler-feedback, and mathematical preference data based on the ground-truth labels. For reward model training, we initialize the reward models with DeepSeek-V2 Chat (SFT) and train them with either a point-wise or a pair-wise loss. In our experiments, we observe that the RL training can fully tap into and activate the potential of our model, enabling it to select the correct and satisfactory answer from possible responses.
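As an illustration of the pair-wise option, the sketch below implements a Bradley-Terry-style loss on two scalar reward-model scores; the function name and scores are hypothetical, and the actual training recipe (point-wise variant, batching, regularization) is not specified here.

```python
# Sketch of a pair-wise reward-model loss: the reward of the preferred response
# should exceed that of the rejected one; loss = -log sigmoid(r_chosen - r_rejected).
import math

def pairwise_loss(r_chosen, r_rejected):
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

print(pairwise_loss(r_chosen=1.3, r_rejected=0.2))   # small when the ranking is right
print(pairwise_loss(r_chosen=0.2, r_rejected=1.3))   # large when the ranking is wrong
```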
Optimizations for Training Efficiency. Conducting RL training on extremely large models places high demands on the training framework. It requires careful engineering optimization to manage the GPU memory and RAM pressure, and meanwhile maintain a fast training speed. For this goal, we implement the following engineering optimizations. (1) Firstly, we propose a hybrid engine that adopts different parallel strategies for training and inference respectively to achieve higher GPU utilization. (2) Secondly, we leverage vLLM (Kwon et al., 2023) with large batch sizes as our inference backend to accelerate the inference speed. (3) Thirdly, we carefully design a scheduling strategy for offloading models to CPUs and loading models back to GPUs, which achieves a near-optimal balance between the training speed and memory consumption.

4.3. Evaluation Results

Evaluations on Standard Benchmarks. Initially, we evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on standard benchmarks. Notably, DeepSeek-V2 Chat (SFT) demonstrates substantial improvements in GSM8K, MATH, and HumanEval evaluations compared with its base version. This progress can be attributed to the inclusion of our SFT data, which comprises a considerable volume of math and code related content. In addition, DeepSeek-V2 Chat (RL) further boosts the performance on math and code benchmarks. We show more code and math evaluations in Appendix E.
Benchmark | # Shots | DeepSeek 67B Chat | Qwen1.5 72B Chat | LLaMA3 70B Inst. | Mixtral 8x22B Inst. | DeepSeek-V2 Chat (SFT) | DeepSeek-V2 Chat (RL)
Context Length -
Architecture - Dense Dense Dense MoE MoE
# Activated Params - 67B 39B
# Total Params - 67B
English TriviaQA 5-shot 81.5 79.6 69.1 80.0 85.4 86.7
NaturalQuestions 5-shot 47.0 46.9 44.6 54.9 53.4
MMLU 5-shot 71.1 76.2 80.3 77.8 78.4
ARC-Easy 25-shot 96.6 96.8 96.9 97.1 98.1
ARC-Challenge 25-shot 88.9 91.7 92.6 90.0 92.3
BBH 3-shot 71.7 80.1 78.4 81.3 79.7
AGIEval 0-shot 46.4 62.8 41.4 63.2 61.4
IFEval 0-shot 55.5 57.3 79.7 64.1 63.8
Code HumanEval 0-shot 73.8 68.9 76.2 75.0 76.8 81.1
MBPP 3-shot 61.4 52.2 69.8 64.4 72.0
CRUXEval-I-COT 2-shot 49.1 51.4 61.1 59.4 61.5
CRUXEval-O-COT 2-shot 50.9 56.5 63.6 60.7 63.0
LiveCodeBench 0-shot 18.3 18.8 25.0 28.7
Math GSM8K 8-shot 84.1 81.9 93.2 87.9 90.8 92.2
MATH 4-shot 32.6 40.6 48.5 49.8 52.7
CMath 0-shot 80.3 82.8 79.2 75.1
Chinese CLUEWSC 5-shot 78.5 90.1 85.4 75.8 88.6 89.9
C-Eval 5-shot 65.2 82.2 67.9 60.0 78.0
CMMLU 5-shot 67.8 82.9 70.7 61.0 82.4 81.6

Table 3 | Comparison among DeepSeek-V2 Chat (SFT), DeepSeek-V2 Chat (RL), and other representative open-source chat models. Regarding TriviaQA and NaturalQuestions, it is worth noting that chat models, such as LLaMA3 70B Instruct, might not strictly adhere to the format constraints typically specified in the few-shot setting. Consequently, this can lead to underestimation of certain models in our evaluation framework.
Chat (RL) further boosts the performance on math and code benchmarks. We show more code and math evaluations in Appendix E.
聊天 (RL) 进一步提高了数学和代码基准测试的性能。我们在附录 E 中展示了更多的代码和数学评估。
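The format-adherence caveat in the caption of Table 3 is easiest to see from a minimal few-shot evaluation harness such as the sketch below; the helper names and the strict first-line exact-match rule are our own illustrative assumptions, not the harness used to produce these numbers. A chat model that answers conversationally instead of completing the "Answer:" slot can fail the exact-match check even when its answer is correct, which is why TriviaQA and NaturalQuestions scores for some chat models may be underestimated.

    # Minimal, illustrative few-shot QA evaluation; not the evaluation code used
    # in this work.
    def build_few_shot_prompt(demonstrations, question):
        # demonstrations: list of (question, answer) pairs shown in-context.
        shots = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in demonstrations)
        return shots + f"Question: {question}\nAnswer:"

    def exact_match(completion, references):
        # Score only the first line of the completion, lower-cased and stripped:
        # a strict format assumption that penalizes verbose, chat-style answers.
        pred = completion.strip().split("\n")[0].strip().lower()
        return any(pred == ref.strip().lower() for ref in references)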
As for the comparisons with other models, we first compare DeepSeek-V2 Chat (SFT) with Qwen1.5 72B Chat, and find that DeepSeek-V2 Chat (SFT) surpasses Qwen1.5 72B Chat on almost all of English, math, and code benchmarks. On Chinese benchmarks, DeepSeek-V2 Chat (SFT) demonstrates slightly lower scores than Qwen1.5 72B Chat on multi-subject multiple-choice tasks, consistent with the performance observed from their base versions. When compared with the state-of-the-art open-source MoE model, Mixtral 8x22B Instruct, DeepSeek-V2 Chat (SFT) exhibits better performance on most benchmarks, except for NaturalQuestions and IFEval. Furthermore, in comparison to the state-of-the-art open-source model LLaMA3 70B Chat, DeepSeek-V2 Chat (SFT) shows similar performance in code and math related benchmarks. LLaMA3 70B Chat exhibits better performance on MMLU and IFEval, while DeepSeek-V2 Chat (SFT) showcases stronger performance on Chinese tasks. Ultimately, DeepSeek-V2 Chat (RL) demonstrates further enhanced performance in both mathematical and coding tasks compared with DeepSeek-V2 Chat (SFT). These comparisons highlight the strengths of DeepSeek-V2 Chat in relation to other language models in various domains and languages.
至于与其他型号的比较,我们首先将 DeepSeek-V2 Chat (SFT) 与 Qwen1.5 72B Chat 进行比较,发现 DeepSeek-V2 Chat (SFT) 在几乎所有的英语、数学和代码基准测试中都超过了 Qwen1.5 72B Chat。在中国的基准测试中,DeepSeekV2 Chat (SFT) 在多主题多项选择任务上的得分略低于 Qwen1.5 72B Chat,这与从其基本版本中观察到的性能一致。与最先进的开源 MoE 模型 Mixtral 8x22B Intit 相比,DeepSeekV2 Chat (SFT) 在大多数基准测试中表现出更好的性能,但 NaturalQuestions 和 IFEval 除外。此外,与最先进的开源模型 LLaMA3 70B Chat 相比,DeepSeek-V2 Chat (SFT) 在代码和数学相关基准测试中表现出相似的性能。LLaMA3 70B Chat 在 MMLU 和 IFEval 上表现出更好的性能,而 DeepSeek-V2 Chat (SFT) 在中文任务上表现出更强的性能。最终,与 DeepSeek-V2 聊天 (SFT) 相比,DeepSeek-V2 聊天 (RL) 在数学和编码任务方面表现出进一步增强的性能。这些比较突出了 DeepSeek-V2 Chat 相对于不同领域和语言中的其他语言模型的优势。
Evaluations on Open-Ended Generation. We proceed with additional evaluations of our models on open-ended conversation benchmarks. For English open-ended conversation generation, we utilize MT-Bench and AlpacaEval 2.0 as the benchmarks.
对开放式生成的评估。我们继续根据开放式对话基准对我们的模型进行额外的评估。对于英语开放式会话通用
Model | MT-Bench | AlpacaEval 2.0
DeepSeek 67B Chat | 8.35 | 16.6
Mistral 8x22B Instruct v0.1 | 8.66 | 30.9
Qwen1.5 72B Chat | 8.61 | 36.6
LLaMA3 70B Instruct | 8.95 | 34.4
DeepSeek-V2 Chat (SFT) | 8.62 | 30.0
DeepSeek-V2 Chat (RL) | 8.97 | 38.9
Table 4 | English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.
表4 |英语开放式会话评估。对于 AlpacaEval 2.0,我们使用长度控制的胜率作为指标。
Evaluation results presented in Table 4 demonstrate a significant performance advantage of DeepSeek-V2 Chat (RL) over DeepSeek-V2 Chat (SFT). This outcome showcases the effectiveness of our RL training in achieving improved alignment. In comparison to other open-source models, DeepSeek-V2 Chat (RL) demonstrates superior performance over Mistral 8x22B Instruct and Qwen1.5 72B Chat on both benchmarks. When compared with LLaMA3 70B Instruct, DeepSeek-V2 Chat (RL) showcases competitive performance on MT-Bench and notably outperforms it on AlpacaEval 2.0. These results highlight the strong performance of DeepSeek-V2 Chat (RL) in generating high-quality and contextually relevant responses, particularly in instruction-based conversation tasks.
因此,我们使用 MT-Bench 和 AlpacaEval 2.0 作为基准。表 4 中的评估结果表明,DeepSeek-V2 聊天 (RL) 比 DeepSeek-V2 聊天 (SFT) 具有显著的性能优势。这一结果展示了我们的 RL 培训在改善对齐方面的有效性。与其他开源模型相比,DeepSeek-V2 Chat (RL) 在两个基准测试中都表现出优于 Mistral 8x22B Instruct 和 Qwen1.5 72B Chat 的性能。与 LLaMA3 70B Install 相比,DeepSeek-V2 Chat (RL) 在 MT-Bench 上表现出了极强的竞争力,在 AlpacaEval 2.0 上的表现明显优于它。这些结果凸显了 DeepSeek-V2 聊天 (RL) 在生成高质量和上下文相关响应方面的强大性能,尤其是在基于指令的对话任务中。
In addition, we evaluate the Chinese open-ended generation capability based on AlignBench. As presented in Table 5, DeepSeek-V2 Chat (RL) exhibits a slight advantage over DeepSeek-V2 Chat (SFT). Notably, DeepSeek-V2 Chat (SFT) surpasses all open-source Chinese models by a significant margin. It significantly outperforms the second-best open-source model, Qwen1.5 72B Chat on both Chinese reasoning and language. Moreover, both DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) outperform GPT-4-0613 and ERNIEBot 4.0, solidifying the position of our models in the top-tier LLMs that support Chinese. Specifically, DeepSeek-V2 Chat (RL) shows remarkable performance in Chinese language understanding, which outperforms all models including GPT-4-Turbo-1106-Preview. On the other hand, the reasoning capability of DeepSeek-V2 Chat (RL) still lags behind giant models, such as Erniebot-4.0 and GPT-4s.
此外,我们还基于AlignBench评估了中国的开放式发电能力。如表 5 所示,DeepSeek-V2 聊天 (RL) 比 DeepSeek-V2 聊天 (SFT) 略有优势。值得注意的是,DeepSeek-V2 Chat (SFT) 远远超过了所有开源的中国模型。它在中文推理和语言方面都明显优于第二好的开源模型 Qwen1.5 72B Chat。此外,DeepSeek-V2 Chat (SFT) 和 DeepSeek-V2 Chat (RL) 的表现都优于 GPT-4-0613 和 ERNIEBot 4.0,巩固了我们模型在支持中文的顶级LLMs模型中的地位。具体来说,DeepSeek-V2 Chat (RL) 在中文理解方面表现出显着的表现,优于 GPT-4-Turbo-1106-Preview 等所有型号。另一方面,DeepSeek-V2 Chat (RL) 的推理能力仍然落后于 Erniebot-4.0 和 GPT-4s 等巨型模型。

4.4. Discussion 4.4. 讨论

Amount of SFT Data. The discussion surrounding the necessity of a large SFT corpus has been a topic of intense debate. Previous works (Young et al., 2024; Zhou et al., 2024) argue that fewer than 10K instances of SFT data are enough to produce satisfactory results. However, in our experiments, we observe a significant performance decline on the IFEval benchmark if we use fewer than 10K instances. A possible explanation is that a language model necessitates a certain amount of data to develop specific skills. Although the requisite data amount may diminish with the model size increasing, it cannot be entirely eliminated. Our observation underscores the critical need for sufficient data to equip an LLM with desired capabilities. Moreover, the quality of SFT data is also crucial, especially for tasks involving writing or open-ended questions.
SFT 数据量。围绕大型SFT语料库的必要性的讨论一直是激烈争论的话题。以前的工作(Young et al., 2024;周 等人,2024 年)认为,少于 实例的 SFT 数据足以产生令人满意的结果。但是,在我们的实验中,我们观察到,如果我们使用的实例少于 实例,则 IFEval 基准测试的性能会显着下降。一种可能的解释是,语言模型需要一定数量的数据来发展特定的技能。尽管必要的数据量可能会随着模型大小的增加而减少,但不能完全消除。我们的观察强调了对足够数据的迫切需要,以配备LLM所需的能力。此外,SFT 数据的质量也至关重要,尤其是对于涉及写作或开放式问题的任务。
Alignment Tax of Reinforcement Learning. During human preference alignment, we observe a significant performance enhancement on the open-ended generation benchmarks, in terms of the scores rated by both AI and human evaluators. However, we also notice a phenomenon of "alignment tax" (Ouyang et al., 2022), i.e., the alignment process can negatively impact the
强化学习的对齐税。在人类偏好调整过程中,我们观察到开放式生成基准的性能显着提高,就人类评估者和人类评估者评分 的分数而言。然而,我们也注意到一种“对齐税”现象(Ouyang et al., 2022),即对齐过程会对
Model | Overall 总分 | Reasoning 中文推理: Avg. 推理总分 / Math. 数学计算 / Logi. 逻辑推理 | Language 中文语言: Avg. 语言总分 / Fund. 基本任务 / Chi. 中文理解 / Open. 综合问答 / Writ. 文本写作 / Role. 角色扮演 / Pro. 专业能力
GPT-4-1106-Preview 8.01 7.73 7.80 7.66 8.29 7.99 7.33 8.61 8.67 8.47 8.65
DeepSeek-V2 Chat (RL) DeepSeek-V2 聊天 (RL) 7.91 7.45 7.77 7.14 8.36 8.10 8.28 8.37 8.53 8.33 8.53
ERNIEBot-4.0-202404*(文心一言) 7.89 7.61 7.81 7.41 8.17 7.56 8.53 8.13 8.45 8.24 8.09
DeepSeek-V2 Chat (SFT) DeepSeek-V2 聊天 (SFT) 7.74 7.30 7.34 7.26 8.17 8.04 8.26 8.13 8.00 8.10 8.49
GPT-4-0613 7.53 7.47 7.56 7.37 7.59 7.81 6.93 7.42 7.93 7.51 7.94
ERNIEBot-4.0-202312*(文心一言) 7.36 6.84 7.00 6.67 7.88 7.47 7.88 8.05 8.19 7.84 7.85
Moonshot-v1-32k-202404*(月之暗面) 7.22 6.42 6.41 6.43 8.02 7.82 7.58 8.00 8.22 8.19 8.29
Qwen1.5-72B-Chat 7.19 6.45 6.58 6.31 7.93 7.38 7.77 8.15 8.02 8.05 8.24
DeepSeek-67B-Chat 6.43 5.75 5.71 5.79 7.11 7.12 6.52 7.58 7.20 6.91 7.37
ChatGLM-Turbo(智谱清言) 6.24 5.00 4.74 5.26 7.49 6.82 7.17 8.16 7.77 7.76 7.24
ERNIEBot-3.5(文心一言) 6.14 5.15 5.03 5.27 7.13 6.62 7.60 7.26 7.56 6.83 6.90
Yi-34B-Chat* 6.12 4.86 4.97 4.74 7.38 6.72 7.28 7.76 7.44 7.58 7.53
GPT-3.5-Turbo-0613 6.08 5.35 5.68 5.02 6.82 6.71 5.81 7.29 7.03 7.28 6.77
ChatGLM-Pro(智谱清言) 5.83 4.65 4.54 4.75 7.01 6.51 6.76 7.47 7.07 7.34 6.89
SparkDesk-V2(讯飞星火) 5.74 4.73 4.71 4.74 6.76 5.84 6.97 7.29 7.18 6.92 6.34
Qwen-14B-Chat 5.72 4.81 4.91 4.71 6.63 6.90 6.36 6.74 6.64 6.59 6.56
Baichuan2-13B-Chat 5.25 3.92 3.76 4.07 6.59 6.22 6.05 7.11 6.97 6.75 6.43
ChatGLM3-6B 4.97 3.85 3.55 4.14 6.10 5.75 5.29 6.71 6.83 6.28 5.73
Baichuan2-7B-Chat 4.97 3.66 3.56 3.75 6.28 5.81 5.50 7.13 6.84 6.53 5.84
InternLM-20B 4.96 3.66 3.39 3.92 6.26 5.96 5.50 7.18 6.19 6.49 6.22
Qwen-7B-Chat 4.91 3.73 3.62 3.83 6.09 6.40 5.74 6.26 6.31 6.19 5.66
ChatGLM2-6B 4.48 3.39 3.16 3.61 5.58 4.91 4.52 6.66 6.25 6.08 5.08
InternLM-Chat-7B 3.65 2.56 2.45 2.66 4.75 4.34 4.09 5.82 4.89 5.32 4.06
Chinese-LLaMA-2-7B-Ch 3.57 2.68 2.29 3.07 4.46 4.31 4.26 4.50 4.63 4.91 4.13
LLaMA-2-13B-Chinese-Chat 3.35 2.47 2.21 2.73 4.23 4.13 3.31 4.79 3.93 4.53 4.71
Table 5 | AlignBench leaderboard rated by GPT-4-0613. Models are ranked in descending order based on the overall score. Models marked with * represent that we evaluate them through their API service or open-weighted model, instead of referring to the results reported in their original papers. Suffixes of Erniebot-4.0 and Moonshot denote the timestamps when we called their API.
表5 |由 GPT-4-0613 评级的 AlignBench 排行榜。模型根据总分按降序排列。标有 * 的模型表示我们通过他们的 API 服务或开放加权模型来评估它们,而不是参考他们原始论文中报告的结果。Erniebot-4.0 和 Moonshot 的后缀表示我们调用它们的 API 时的时间戳。
performance on some standard benchmarks such as BBH. In order to alleviate the alignment tax, during the RL stage, we make significant efforts in data processing and improving training strategies, finally achieving a tolerable trade-off between the performance on standard and open-ended benchmarks. Exploring how to align a model with human preferences without compromising its general performance presents a valuable direction for future research.
在一些标准基准测试(如 BBH)上的表现。为了减轻对齐税,在RL阶段,我们在数据处理和改进训练策略方面做出了重大努力,最终在标准基准和开放式基准测试之间的性能之间实现了可容忍的权衡。探索如何在不影响其总体性能的情况下使模型与人类偏好保持一致,为未来的研究提供了一个有价值的方向。
Online Reinforcement Learning. In our preference alignment experiments, we find that the online approach significantly outperforms the offline approach. Therefore, we invest tremendous efforts in implementing an online RL framework for aligning DeepSeek-V2. The conclusion about online or offline preference alignment can vary in different contexts, and we reserve a more thorough comparison and analysis between them for future work.
在线强化学习。在我们的偏好对齐实验中,我们发现在线方法明显优于离线方法。因此,我们投入了巨大的精力来实施一个在线RL框架,以对齐DeepSeek-V2。关于线上或线下偏好一致性的结论在不同的情况下可能会有所不同,我们保留了对它们进行更彻底的比较和分析,以备将来的工作使用。
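Schematically, the two settings differ as sketched below; the function names are illustrative stubs rather than the interfaces of our alignment framework. The key difference is that the online loop scores responses sampled from the current policy, so the feedback distribution shifts together with the policy, whereas the offline loop reuses a fixed preference dataset.

    # Schematic contrast between offline and online preference alignment; all
    # helpers (update_fn, reward_model, policy.generate) are hypothetical stubs.
    from itertools import islice

    def offline_alignment(policy, preference_batches, update_fn, steps):
        # Offline: preference pairs were collected in advance; the policy never
        # receives judgments on its own current outputs.
        for batch in islice(preference_batches, steps):
            update_fn(policy, batch)

    def online_alignment(policy, prompts, reward_model, update_fn, steps):
        # Online: each step samples fresh responses from the *current* policy and
        # scores them with the reward model before updating.
        for _ in range(steps):
            responses = [policy.generate(p) for p in prompts]
            rewards = [reward_model(p, r) for p, r in zip(prompts, responses)]
            update_fn(policy, prompts, responses, rewards)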

5. Conclusion, Limitation, and Future Work
5. 结论、限制和未来工作

In this paper, we introduce DeepSeek-V2, a large MoE language model that supports 128K context length. In addition to strong performance, it is also characterized by economical training and efficient inference, benefiting from its innovative architecture including MLA and DeepSeekMoE. In practice, compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache
在本文中,我们介绍了 DeepSeek-V2,这是一个支持 上下文长度的大型 MoE 语言模型。除了强大的性能外,它还具有经济的训练和高效的推理特点,得益于其创新架构,包括 MLA 和 DeepSeekMoE。在实践中,与DeepSeek 67B相比,DeepSeek-V2实现了明显更强的性能,同时节省 了训练成本,降低 了缓存

by 93.3%, and boosts the maximum generation throughput to 5.76 times. Evaluation results further demonstrate that with only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models and becomes the strongest open-source MoE model.
by ,并将最大发电吞吐量提高到 5.76 倍。评估结果进一步表明,仅激活21B参数,DeepSeek-V2就实现了开源模型中的顶级性能,成为最强的开源MoE模型。
DeepSeek-V2 and its chat versions share the acknowledged limitations commonly found in other LLMs, including the lack of ongoing knowledge updates after pre-training, the possibility of generating non-factual information such as unverified advice, and a chance to produce hallucinations. In addition, since our data primarily consist of Chinese and English content, our model may exhibit limited proficiency in other languages. In scenarios beyond Chinese and English, it should be used with caution.
DeepSeek-V2 及其聊天版本具有其他LLMs版本中常见的公认局限性,包括预训练后缺乏持续的知识更新、生成非事实信息(例如未经验证的建议)的可能性,以及产生幻觉的机会。此外,由于我们的数据主要由中文和英文内容组成,我们的模型可能表现出对其他语言的熟练程度有限。在中英文以外的场景中,应谨慎使用。
DeepSeek will continuously invest in open-source large models with longtermism, aiming to progressively approach the goal of artificial general intelligence.
DeepSeek将持续投资具有长期精神的开源大模型,旨在逐步接近通用人工智能的目标。
  • In our ongoing exploration, we are dedicated to devising methods that enable further scaling up MoE models while maintaining economical training and inference costs. The goal of our next step is to achieve performance on par with GPT-4 in our upcoming release.
    在我们不断的探索中,我们致力于设计能够进一步扩大 MoE 模型规模的方法,同时保持经济的训练和推理成本。我们下一步的目标是在即将发布的版本中实现与 GPT-4 相当的性能。
  • Our alignment team continuously strives to enhance our models, aiming to develop a model that is not only helpful but also honest and safe for worldwide users. Our ultimate objective is to align the values of our model with human values, while minimizing the need for human supervision. By prioritizing ethical considerations and responsible development, we are dedicated to creating a positive and beneficial impact on society.
    我们的对齐团队不断努力改进我们的模型,旨在开发一个不仅对全球用户有用而且诚实和安全的模型。我们的最终目标是使我们模型的价值观与人类价值观保持一致,同时最大限度地减少对人类监督的需求。通过优先考虑道德考虑和负责任的发展,我们致力于为社会创造积极和有益的影响。
  • Currently, DeepSeek-V2 is designed to support the text modality exclusively. In our forward-looking agenda, we intend to enable our model to support multiple modalities, enhancing its versatility and utility in a wider range of scenarios.
    目前,DeepSeek-V2 仅支持文本模态。在我们的前瞻性议程中,我们打算使我们的模型能够支持多种模式,从而增强其在更广泛场景中的多功能性和实用性。

References 引用

AI@Meta. Llama 3 model card, 2024. URLhttps://github.com/meta-llama/llama3/bl ob/main/MODEL_CARD.md.
AI@Meta。骆驼 3 模型卡,2024 年。URLhttps://github.com/meta-llama/llama3/bl ob/main/MODEL_CARD.md.
J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
J. Ainslie、J. Lee-Thorp、M. de Jong、Y. Zemlyanskiy、F. Lebrón 和 S. Sanghai。Gqa:从多头检查点训练广义多查询转换器模型。arXiv 预印本 arXiv:2305.13245, 2023.
Anthropic. Introducing Claude, 2023. URLhttps://www.anthropic.com/index/introd ucing-claude
人为的。介绍克劳德,2023 年。URLhttps://www.anthropic.com/index/introd 乌辛-克劳德
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. 江, C. Cai, M. Terry, Q. Le, et al.使用大型语言模型进行程序综合。arXiv 预印本 arXiv:2108.07732, 2021.
J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. 邓, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. 马, R. Men, X. 任, X. 任, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. 周, J. 周, X. 周, 和 T. Zhu.Qwen技术报告。arXiv 预印本 arXiv:2309.16609, 2023.
Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI
Y. Bisk、R. Zellers、RL Bras、J. Gao 和 Y. Choi。PIQA:自然语言中关于物理常识的推理。第三十四届AAAI人工智能会议,AAAI 2020,第三十二届人工智能创新应用会议,IAAI 2020,第十届AAAI人工智能教育进展研讨会,EAAI
2020, New York, NY, USA, February 7-12, 2020, pages 7432-7439. AAAI Press, 2020. doi: 10.1609/aaai.v34i05.6239. URLhttps://doi.org/10.1609/aaai.v34i05.6239.
2020 年,美国纽约州纽约市,2020 年 2 月 7 日至 12 日,第 7432-7439 页。AAAI出版社,2020年。doi: 10.1609/aaai.v34i05.6239.URLhttps://doi.org/10.1609/aaai.v34i05.6239。
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URLhttps://arxiv.org/abs/2107.03374.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D.卡明斯,M.普拉珀特,F.钱齐斯,E.巴恩斯,A.赫伯特-沃斯,W.H.古斯,A.尼科尔,A.派诺,N.特扎克,J.唐,I.巴布施金,S.巴拉吉,S.耆那教,W.桑德斯,C.黑塞,A.N.卡尔,J.莱克,J.阿奇亚姆,V.米斯拉,E.森川,A.拉德福德,M.奈特,M.布伦戴奇,M.穆拉蒂,K.梅耶,P.韦林德,B.麦格鲁, D. Amodei、S. McCandlish、I. Sutskever 和 W. Zaremba。评估在代码上训练的大型语言模型。CoRR,abs/2107.03374,2021 年。URLhttps://arxiv.org/abs/2107.03374。
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
P. Clark、I. Cowhey、O. Etzioni、T. Khot、A. Sabharwal、C. Schoenik 和 O. Tafjord。认为您已经解决了问题答案?尝试 arc,AI2 推理挑战。CoRR,abs/1803.05457,2018 年。URL http://arxiv.org/abs/1803.05457。
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al.培训验证者解决数学单词问题。arXiv 预印本 arXiv:2110.14168, 2021.
Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu. A span-extraction dataset for Chinese machine reading comprehension. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5883-5889, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1600. URLhttps://aclanthology.org/D19-1 600 ,
Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. 马, S. Wang, 和 G. 胡.用于中文机器阅读理解的跨度提取数据集.K. Inui, J. 江, V. Ng, and X. Wan, 编辑, 2019年自然语言处理实证方法会议论文集 暨第九届国际自然语言处理联合会议(EMNLP-IJCNLP),第5883-5889页,中国香港,2019年11月。计算语言学协会。doi: 10.18653/v1/D19-1600.URLhttps://aclanthology.org/D19-1 600 ,
D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024. URL https://doi.org/10.48550/arXiv.2401.06066
戴磊, 邓, 赵建, 徐瑞旭, 高华, 陈东, 李建, 曾伟, 余旭, 吴彦, 谢志强, 李玉玲, 黄炳, 罗福, 阮春, 隋志伟, 梁伟.Deepseekmoe:在专家混合语言模型中实现最终的专家专业化。CoRR,abs/2401.06066,2024 年。网址 https://doi.org/10.48550/arXiv.2401.06066
T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
T.道。FlashAttention-2:更快的注意力,更好的并行性和工作分区,2023 年。
DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. URLhttps://doi.org/10.48550/arXiv. 2401.02954
DeepSeek-AI的。Deepseek LLM:以长期主义扩展开源语言模型。CoRR,abs/2401.02954,2024 年。URLhttps://doi.org/10.48550/arXiv。2401.02954
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 23682378. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1246. URL https://doi.org/10.18653/v1/n19-1246.
D. Dua、Y. Wang、P. Dasigi、G. Stanovsky、S. Singh 和 M. Gardner。DROP:一个阅读理解基准,需要对段落进行离散推理。J. Burstein、C. Doran 和 T. Solorio,编辑,计算语言学协会北美分会 2019 年会议论文集:人类语言技术,NAACL-HLT 2019,美国明尼苏达州明尼阿波利斯,2019 年 6 月 2 日至 7 日,第 1 卷(长篇和短篇论文),第 23682378 页。计算语言学协会, 2019.doi: 10.18653/V1/N19-1246.URL https://doi.org/10.18653/v1/n19-1246。
Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
Y. Dubois、B. Galambosi、P. Liang 和 T. B. Hashimoto。长度控制羊驼:一种消除自动评估器偏差的简单方法。arXiv 预印本 arXiv:2404.04475, 2024.
W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URLhttps://arxiv.org/ abs/2101.03961.
W. Fedus、B. Zoph 和 N. Shazeer。开关转换器:以简单高效的稀疏性扩展到万亿参数模型。CoRR,abs/2101.03961,2021 年。URLhttps://arxiv.org/ abs/2101.03961。
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al.The Pile:一个 800GB 的数据集,其中包含用于语言建模的各种文本。arXiv 预印本 arXiv:2101.00027, 2020.
Google. Introducing gemini: our largest and most capable ai model, 2023. URL https: //blog.google/technology/ai/google-gemini-ai/
谷歌。介绍 Gemini:我们最大、最有能力的 AI 模型,2023 年。网址 https://blog.google/technology/ai/google-gemini-ai/
A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang. Cruxeval: A benchmark for code reasoning, understanding and execution, 2024.
A. Gu、B. Rozière、H. Leather、A. Solar-Lezama、G. Synnaeve 和 S. I. Wang。Cruxeval:代码推理、理解和执行的基准,2024 年。
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
D. Hendrycks、C. Burns、S. Basart、A. Zou、M. Mazeika、D. Song 和 J. Steinhardt。测量大规模的多任务语言理解。arXiv 预印本 arXiv:2009.03300, 2020.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
D. Hendrycks、C. Burns、S. Kadavath、A. Arora、S. Basart、E. Tang、D. Song 和 J. Steinhardt。使用数学数据集衡量数学问题的解决。arXiv 预印本 arXiv:2103.03874, 2021.
High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.c n/en/blog/hai-llm
高飞者。Hai-llm: 高效且轻量的大模型训练工具, 2023.网址 https://www.high-flyer.c n/en/blog/hai-llm
C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami. Kvquant: Towards 10 million context length LLM inference with KV cache quantization. CoRR, abs/2401.18079, 2024. URLhttps://doi.org/10.48550/arXiv.2401.18079
C. Hooper、S. Kim、H. Mohammadzadeh、MW Mahoney、Y. S. Shao、K. Keutzer 和 A. Gholami。Kvquant:通过 KV 缓存量化实现 1000 万个上下文长度LLM推理。CoRR,abs/2401.18079,2024 年。URLhttps://doi.org/10.48550/arXiv.2401.18079
S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
S. 胡, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al. Minicpm:揭示具有可扩展训练策略的小型语言模型的潜力。arXiv 预印本 arXiv:2404.06395, 2024.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval:基础模型的多层次多学科中文评估套件。arXiv 预印本 arXiv:2305.08322, 2023.
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
N. Jain, K. Han, A. Gu, W.-D.Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, 和 I. Stoica.Livecodebench:对大型代码语言模型进行整体且无污染的评估。arXiv 预印本 arXiv:2403.07974, 2024.
M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601-1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URLhttps://aclanthology.org/P17-1147.
M. Joshi、E. Choi、D. Weld 和 L. Zettlemoyer。TriviaQA:用于阅读理解的大规模远程监督挑战数据集。在 R. Barzilay 和 M.-Y.Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),第1601-1611页,加拿大温哥华,2017年7月。计算语言学协会。doi: 10.18653/v1/P17-1147.URLhttps://aclanthology.org/P17-1147。
T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452-466, 2019. doi: 10.1162/tacl_a . URL https://doi.org/10.1162/tacl_a_00276
T. Kwiatkowski、J. Palomaki、O. Redfield、M. Collins、AP Parikh、C. Alberti、D. Epstein、I. Polosukhin、J. Devlin、K. Lee、K. Toutanova、L. Jones、M. Kelcey、M. Chang、AM Dai、J. Uszkoreit、Q. Le 和 S. Petrov。自然问题:问答研究的基准。Trans. Assoc. Comput.语言学, 7:452-466, 2019.doi: 10.1162/tacl_a .网址 https://doi.org/10.1162/tacl_a_00276
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, 和 I. Stoica.为大型语言模型提供高效的内存管理,并提供 pagedattention。在 ACM SIGOPS 第 29 届操作系统原理研讨会论文集,2023 年。
G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. RACE: large-scale reading comprehension dataset from examinations. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of
G. Lai, Q. Xie, H. Liu, Y. Yang, 和 E. H. Hovy.RACE:来自考试的大规模阅读理解数据集。在 M. Palmer、R. Hwa 和 S. Riedel 主编的《会议记录》中

the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 785-794. Association for Computational Linguistics, 2017. doi: 10.18653/V1/D17-1082. URLhttps://doi.org/10.18653/v1/d1 . 
D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021. URLhttps://openreview.net/forum?id=qrwe7XHTmYb 
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023. 
W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021. 
X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. Alignbench: Benchmarking chinese alignment of large language models. CoRR, abs/2311.18743, 2023. doi: 10.48550/A RXIV.2311.18743. URLhttps://doi.org/10.48550/arXiv.2311.18743 
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 
Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of ai and making it accessible to all, 2024. URLhttps://mistral.ai/news/mixtral-8x22b. 
OpenAI. Introducing ChatGPT, 2022. URLhttps://openai.com/blog/chatgpt. 
OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023. 
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730-27744, 2022. 
B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023. 
P. Qi, X. Wan, G. Huang, and M. Lin. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023. 
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. IEEE, 2020. 
C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pages 8583-8595, 2021. URLhttps://proceedings.neurips.cc/paper /2021/hash/48237d9f2dea8c74c2a72126cf63d933-Abstract.html. 
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019. 
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 
N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019. URLhttp://arxiv.org/abs/1911.02150. 
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017. URL https: //openreview.net/forum?id=B1ckMDqlg. 
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 
K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019. 
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022. 
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 
J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022. 
T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023. 
L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A chinese language understanding evaluation benchmark. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4762-4772. International Committee on Computational Linguistics, 2020. doi: 10.18653/V1/2020.COLING-MAIN.419. URL https://doi.org/10.18653/v1/2020.coling-main.419. 
A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024. 
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4791-4800. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1472. URLhttps://doi.org/10.18653/v1/p1 9-1472, 
Y. Zhao, C. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci. Atom: Low-bit quantization for efficient and accurate LLM serving. CoRR, abs/2310.19102, 2023. URLhttps://doi.org/10.48550/arXiv.2310.19102. 
C. Zheng, M. Huang, and A. Sun. Chid: A large-scale chinese idiom dataset for cloze test. In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 778-787. Association for Computational Linguistics, 2019. doi: 10.18653/V1/P19-1075. URL https://doi.org/10.18653/v1/p19-1075 
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi: 10.48550/arXiv.2304.06364. URLhttps://doi.org/10.48550/arXiv. 2304.06364. 
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024. 
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.