License: arXiv.org perpetual non-exclusive license
arXiv:2412.19437v1 [cs.CL] 27 Dec 2024

DeepSeek-V3 Technical Report

DeepSeek-AI
research@deepseek.com
Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.


Figure 1: Benchmark performance of DeepSeek-V3 and its counterparts.

1 Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including DeepSeek series (DeepSeek-AI, 2024b, c; Guo et al., 2024; DeepSeek-AI, 2024a), LLaMA series (Touvron et al., 2023a, b; AI@Meta, 2024a, b), Qwen series (Qwen, 2023, 2024a, 2024b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

In order to achieve efficient training, we support the FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b; Dettmers et al., 2022), its evolution being closely tied to advancements in hardware capabilities (Micikevicius et al., 2022; Luo et al., 2024; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

Training Costs        Pre-Training    Context Extension    Post-Training    Total
in H800 GPU Hours     2664K           119K                 5K               2788K
in USD                $5.328M         $0.238M              $0.01M           $5.576M

Table 1: Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
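As a quick sanity check on the arithmetic behind Table 1 (assuming, as the paper does, a rental price of $2 per H800 GPU hour), the short snippet below reproduces the totals; it is illustrative only.

```python
# Sanity check of the training-cost arithmetic in Table 1,
# assuming the stated rental price of $2 per H800 GPU hour.
gpu_hours = {"pre-training": 2_664_000, "context extension": 119_000, "post-training": 5_000}
price_per_gpu_hour = 2.0  # USD, the assumption stated in the paper

total_hours = sum(gpu_hours.values())           # 2,788,000 GPU hours
total_cost = total_hours * price_per_gpu_hour   # 5,576,000 USD

print(f"total GPU hours: {total_hours:,}")      # 2,788,000
print(f"total cost: ${total_cost:,.0f}")        # $5,576,000

# Per-trillion-token pre-training rate quoted in the text:
# 180K GPU hours on 2048 GPUs is roughly 3.7 days of wall-clock time.
print(180_000 / 2048 / 24)                      # ~3.66 days
```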

Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective

  • On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.


  • We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.



Pre-Training: Towards Ultimate Training Efficiency

  • We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.


  • Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.


  • At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.



Post-Training: Knowledge Distillation from DeepSeek-R1

  • We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.



Summary of Core Evaluation Results

  • Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.


  • Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.



In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design (Section 3). Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).

2 Architecture

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

Figure 2: Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.

2.1 Basic Architecture

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.

2.1.1 Multi-Head Latent Attention

For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:

$\mathbf{c}_{t}^{KV} = W^{DKV} \mathbf{h}_{t},$  (1)
$[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \ldots; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_{t}^{C} = W^{UK} \mathbf{c}_{t}^{KV},$  (2)
$\mathbf{k}_{t}^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_{t}),$  (3)
$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_{t}^{R}],$  (4)
$[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \ldots; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_{t}^{C} = W^{UV} \mathbf{c}_{t}^{KV},$  (5)

where $\mathbf{c}_{t}^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes concatenation. Note that for MLA, only $\mathbf{c}_{t}^{KV}$ and $\mathbf{k}_{t}^{R}$ need to be cached during generation, which results in a significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).

For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

$\mathbf{c}_{t}^{Q} = W^{DQ} \mathbf{h}_{t},$  (6)
$[\mathbf{q}_{t,1}^{C}; \mathbf{q}_{t,2}^{C}; \ldots; \mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_{t}^{C} = W^{UQ} \mathbf{c}_{t}^{Q},$  (7)
$[\mathbf{q}_{t,1}^{R}; \mathbf{q}_{t,2}^{R}; \ldots; \mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_{t}^{R} = \operatorname{RoPE}(W^{QR} \mathbf{c}_{t}^{Q}),$  (8)
$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}],$  (9)

where $\mathbf{c}_{t}^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ is the matrix to produce the decoupled queries that carry RoPE.

Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^{C}$) are combined to yield the final attention output $\mathbf{u}_{t}$:

$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \operatorname{Softmax}_{j}\!\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^{C},$  (10)
$\mathbf{u}_{t} = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}],$  (11)

where $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix.
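To ground the notation, the following is a shape-level PyTorch sketch of an MLA layer following Equations (1)-(11). It is our own simplified code rather than DeepSeek's implementation: the `rope` function is an identity placeholder for Rotary Positional Embedding, KV caching and any inference-time optimizations are omitted, and all module and variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

def rope(x: torch.Tensor) -> torch.Tensor:
    """Placeholder for RoPE; a real implementation rotates channel pairs by
    position-dependent angles. Kept as identity for shape bookkeeping only."""
    return x

class MLASketch(nn.Module):
    """Minimal sketch of Multi-head Latent Attention (Eqs. 1-11), without KV caching."""
    def __init__(self, d, n_h, d_h, d_c, d_c_q, d_h_r):
        super().__init__()
        self.n_h, self.d_h, self.d_h_r = n_h, d_h, d_h_r
        # key/value path (Eqs. 1-5)
        self.w_dkv = nn.Linear(d, d_c, bias=False)             # W^{DKV}
        self.w_uk = nn.Linear(d_c, n_h * d_h, bias=False)      # W^{UK}
        self.w_uv = nn.Linear(d_c, n_h * d_h, bias=False)      # W^{UV}
        self.w_kr = nn.Linear(d, d_h_r, bias=False)            # W^{KR}, one decoupled key shared by all heads
        # query path (Eqs. 6-9)
        self.w_dq = nn.Linear(d, d_c_q, bias=False)            # W^{DQ}
        self.w_uq = nn.Linear(d_c_q, n_h * d_h, bias=False)    # W^{UQ}
        self.w_qr = nn.Linear(d_c_q, n_h * d_h_r, bias=False)  # W^{QR}, per-head decoupled queries
        # output projection (Eq. 11)
        self.w_o = nn.Linear(n_h * d_h, d, bias=False)         # W^{O}

    def forward(self, h):                                      # h: [B, T, d]
        B, T, _ = h.shape
        # keys / values: only c_kv and k_r would need to be cached at inference time
        c_kv = self.w_dkv(h)                                                   # (1)
        k_c = self.w_uk(c_kv).view(B, T, self.n_h, self.d_h)                   # (2)
        k_r = rope(self.w_kr(h)).view(B, T, 1, self.d_h_r)                     # (3)
        k = torch.cat([k_c, k_r.expand(B, T, self.n_h, self.d_h_r)], dim=-1)   # (4)
        v = self.w_uv(c_kv).view(B, T, self.n_h, self.d_h)                     # (5)
        # queries
        c_q = self.w_dq(h)                                                     # (6)
        q_c = self.w_uq(c_q).view(B, T, self.n_h, self.d_h)                    # (7)
        q_r = rope(self.w_qr(c_q)).view(B, T, self.n_h, self.d_h_r)            # (8)
        q = torch.cat([q_c, q_r], dim=-1)                                      # (9)
        # causal scaled dot-product attention (Eq. 10) and output projection (Eq. 11)
        scores = torch.einsum("bihd,bjhd->bhij", q, k) / math.sqrt(self.d_h + self.d_h_r)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        o = torch.einsum("bhij,bjhd->bihd", scores.softmax(dim=-1), v)
        return self.w_o(o.reshape(B, T, self.n_h * self.d_h))
```

At inference time, only `c_kv` (width $d_c$) and `k_r` (width $d_h^R$) would be stored per token, which is the source of the KV-cache reduction relative to caching full per-head keys and values.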

2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE.

For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:

$\mathbf{h}_{t}^{\prime} = \mathbf{u}_{t} + \sum_{i=1}^{N_s} \operatorname{FFN}_{i}^{(s)}(\mathbf{u}_{t}) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}_{i}^{(r)}(\mathbf{u}_{t}),$  (12)
$g_{i,t} = \frac{g_{i,t}^{\prime}}{\sum_{j=1}^{N_r} g_{j,t}^{\prime}},$  (13)
$g_{i,t}^{\prime} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \leqslant j \leqslant N_r\}, K_r), \\ 0, & \text{otherwise}, \end{cases}$  (14)
$s_{i,t} = \operatorname{Sigmoid}(\mathbf{u}_{t}^{T} \mathbf{e}_{i}),$  (15)

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}_{i}^{(s)}(\cdot)$ and $\operatorname{FFN}_{i}^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_{i}$ is the centroid vector of the $i$-th routed expert; and $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
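As a shape-level illustration of the routing arithmetic in Equations (12)-(15), the sketch below (our own simplified code, with hypothetical expert modules and centroids, not DeepSeek's implementation) computes sigmoid affinities, selects the top-$K_r$ routed experts per token, and normalizes the selected scores into gating values; shared experts are always applied.

```python
import torch

def deepseekmoe_ffn(u, shared_experts, routed_experts, centroids, k_r):
    """Illustrative gating of Eqs. (12)-(15).
    u: [T, d] FFN inputs; centroids: [N_r, d], rows are the expert centroids e_i;
    shared_experts / routed_experts: lists of callables mapping [d] (or [T, d]) -> same shape."""
    s = torch.sigmoid(u @ centroids.t())                          # (15): affinities s_{i,t}, [T, N_r]
    topk_scores, topk_idx = s.topk(k_r, dim=-1)                   # (14): keep the K_r largest affinities
    gates = topk_scores / topk_scores.sum(dim=-1, keepdim=True)   # (13): normalize into g_{i,t}
    h = u + sum(ffn(u) for ffn in shared_experts)                 # (12): residual plus shared experts
    for t in range(u.size(0)):                                    # (12): add the gated routed experts
        for gate, idx in zip(gates[t], topk_idx[t]):
            h[t] = h[t] + gate * routed_experts[int(idx)](u[t])
    return h
```

A production implementation would instead group tokens by expert and dispatch them in batched GEMMs (and, with expert parallelism, in all-to-all exchanges), but the gating arithmetic is the same.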

Auxiliary-Loss-Free Load Balancing.

For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores $s_{i,t}$ to determine the top-K routing:

$g_{i,t}^{\prime} = \begin{cases} s_{i,t}, & s_{i,t} + b_{i} \in \operatorname{Topk}(\{s_{j,t} + b_{j} \mid 1 \leqslant j \leqslant N_r\}, K_r), \\ 0, & \text{otherwise}. \end{cases}$  (16)

Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score $s_{i,t}$. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by $\gamma$ if its corresponding expert is overloaded, and increase it by $\gamma$ if its corresponding expert is underloaded, where $\gamma$ is a hyper-parameter called the bias update speed. Through this dynamic adjustment, DeepSeek-V3 keeps a balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.
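A minimal sketch of how such a bias-adjusted routing step might look is given below. The overload criterion (comparing each expert's batch load to the mean) and the value of $\gamma$ are our illustrative assumptions; the paper only specifies the decrease/increase-by-$\gamma$ rule and that gating still uses the original affinities.

```python
import torch

def route_with_bias(s, b, k_r):
    """Eq. (16): experts are *selected* by the biased score s_{i,t} + b_i,
    but the gating values still come from the original affinity s_{i,t}."""
    topk_idx = (s + b).topk(k_r, dim=-1).indices          # selection uses s + b
    gate_scores = s.gather(-1, topk_idx)                  # gating uses the original s
    gates = gate_scores / gate_scores.sum(-1, keepdim=True)
    return topk_idx, gates

def update_bias(b, topk_idx, n_experts, gamma=1e-3):
    """End-of-step adjustment: decrease b_i for overloaded experts and increase it for
    underloaded ones. The mean-load threshold and gamma value are illustrative choices."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    b = torch.where(load > load.mean(), b - gamma, b)     # overloaded -> lower bias
    b = torch.where(load < load.mean(), b + gamma, b)     # underloaded -> raise bias
    return b
```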

Complementary Sequence-Wise Auxiliary Loss.

Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

$\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_{i} P_{i},$  (17)
$f_{i} = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathds{1}\!\left(s_{i,t} \in \operatorname{Topk}(\{s_{j,t} \mid 1 \leqslant j \leqslant N_r\}, K_r)\right),$  (18)
$s_{i,t}^{\prime} = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}},$  (19)
$P_{i} = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}^{\prime},$  (20)

where the balance factor $\alpha$ is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; $\mathds{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.
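For concreteness, a sketch of Equations (17)-(20) for a single sequence follows; the value of `alpha` is a placeholder, since the paper only states that the balance factor is set to an extremely small value.

```python
import torch

def sequence_balance_loss(s, k_r, alpha=1e-4):
    """Sequence-wise balance loss of Eqs. (17)-(20).
    s: [T, N_r] affinity scores s_{i,t} for one sequence; alpha is a placeholder value."""
    T, n_r = s.shape
    topk_idx = s.topk(k_r, dim=-1).indices
    selected = torch.zeros_like(s).scatter_(-1, topk_idx, 1.0)   # indicator term in Eq. (18)
    f = selected.sum(dim=0) * n_r / (k_r * T)                    # (18): scaled selection frequency
    s_norm = s / s.sum(dim=-1, keepdim=True)                     # (19): normalized affinities
    p = s_norm.mean(dim=0)                                       # (20): average routing probability
    return alpha * (f * p).sum()                                 # (17)
```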

Node-Limited Routing.

Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most $M$ nodes, which are selected according to the sum of the highest $\frac{K_r}{M}$ affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.
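One way such node-limited selection could be expressed is sketched below; the contiguous expert-to-node layout and all names are our assumptions, and a real system would fuse this logic with the dispatch kernels rather than run it as standalone tensor ops.

```python
import torch

def node_limited_topk(s, experts_per_node, m, k_r):
    """Pick at most `m` nodes per token, then the top-K_r experts on those nodes.
    s: [T, N_r] affinities; experts are assumed to be laid out contiguously per node."""
    T, n_r = s.shape
    n_nodes = n_r // experts_per_node
    per_node = s.view(T, n_nodes, experts_per_node)
    # Score each node by the sum of its highest K_r / M expert affinities.
    node_scores = per_node.topk(k_r // m, dim=-1).values.sum(dim=-1)            # [T, n_nodes]
    top_nodes = node_scores.topk(m, dim=-1).indices                             # [T, M]
    node_mask = torch.zeros(T, n_nodes, device=s.device).scatter_(-1, top_nodes, 1.0) > 0
    expert_mask = node_mask.repeat_interleave(experts_per_node, dim=-1)         # [T, N_r]
    # Restrict the usual top-K_r selection to experts on the chosen nodes.
    return s.masked_fill(~expert_mask, float("-inf")).topk(k_r, dim=-1).indices
```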

No Token-Dropping.

Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.

Figure 3: Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.

2.2 Multi-Token Prediction

Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which predicts $D$ additional tokens in parallel using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.

MTP Modules.

To be specific, our MTP implementation uses $D$ sequential modules to predict $D$ additional tokens. The $k$-th MTP module consists of a shared embedding layer $\operatorname{Emb}(\cdot)$, a shared output head $\operatorname{OutHead}(\cdot)$, a Transformer block $\operatorname{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$. For the $i$-th input token $t_i$, at the $k$-th prediction depth, we first combine the representation of the $i$-th token at the $(k-1)$-th depth $\mathbf{h}_i^{k-1} \in \mathbb{R}^{d}$ and the embedding of the $(i+k)$-th token $\operatorname{Emb}(t_{i+k}) \in \mathbb{R}^{d}$ with the linear projection:

$\mathbf{h}_{i}^{\prime k} = M_{k} [\operatorname{RMSNorm}(\mathbf{h}_{i}^{k-1}); \operatorname{RMSNorm}(\operatorname{Emb}(t_{i+k}))],$  (21)

where $[\cdot;\cdot]$ denotes concatenation. Especially, when $k=1$, $\mathbf{h}_{i}^{k-1}$ refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined $\mathbf{h}_{i}^{\prime k}$ serves as the input of the Transformer block at the $k$-th depth to produce the output representation at the current depth $\mathbf{h}_{i}^{k}$:

$\mathbf{h}_{1:T-k}^{k} = \operatorname{TRM}_{k}(\mathbf{h}_{1:T-k}^{\prime k}),$  (22)

where $T$ represents the input sequence length and $_{i:j}$ denotes the slicing operation (inclusive of both the left and right boundaries). Finally, taking $\mathbf{h}_{i}^{k}$ as the input, the shared output head will compute the probability distribution for the $k$-th additional prediction token $P_{i+1+k}^{k} \in \mathbb{R}^{V}$, where $V$ is the vocabulary size:

$P_{i+k+1}^{k} = \operatorname{OutHead}(\mathbf{h}_{i}^{k}).$  (23)

The output head $\operatorname{OutHead}(\cdot)$ linearly maps the representation to logits and subsequently applies the $\operatorname{Softmax}(\cdot)$ function to compute the prediction probabilities of the $k$-th additional token. Also, for each MTP module, its output head is shared with the main model. Our principle of maintaining the causal chain of predictions is similar to that of EAGLE (Li et al., 2024b), but its primary objective is speculative decoding (Xia et al., 2023; Leviathan et al., 2023), whereas we utilize MTP to improve training.

MTP Training Objective.

For each prediction depth, we compute a cross-entropy loss $\mathcal{L}_{\text{MTP}}^{k}$:

$\mathcal{L}_{\text{MTP}}^{k} = \operatorname{CrossEntropy}(P_{2+k:T+1}^{k}, t_{2+k:T+1}) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P_{i}^{k}[t_{i}],$  (24)

where $T$ denotes the input sequence length, $t_{i}$ denotes the ground-truth token at the $i$-th position, and $P_{i}^{k}[t_{i}]$ denotes the corresponding prediction probability of $t_{i}$, given by the $k$-th MTP module. Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor $\lambda$ to obtain the overall MTP loss $\mathcal{L}_{\text{MTP}}$, which serves as an additional training objective for DeepSeek-V3:

$\mathcal{L}_{\text{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_{\text{MTP}}^{k}.$  (25)
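To make the objective concrete, here is a shape-level PyTorch sketch of Equations (21)-(25) for a batch of sequences. The module and variable names are ours, `nn.TransformerEncoderLayer` (with causal masking omitted) merely stands in for the Transformer block $\operatorname{TRM}_k$, the embedding layer and output head are assumed to be the ones shared with the main model, and `lam` is a placeholder for $\lambda$; this is not DeepSeek's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPSketch(nn.Module):
    """Shape-level sketch of the MTP objective (Eqs. 21-25)."""
    def __init__(self, d, depth, shared_embed, shared_head, n_heads=8):
        super().__init__()
        self.depth = depth
        self.embed = shared_embed   # Emb(.), shared with the main model
        self.head = shared_head     # OutHead(.), shared with the main model
        self.proj = nn.ModuleList(nn.Linear(2 * d, d, bias=False) for _ in range(depth))  # M_k
        self.blocks = nn.ModuleList(   # stand-in for TRM_k(.); causal masking omitted in this sketch
            nn.TransformerEncoderLayer(d, nhead=n_heads, batch_first=True) for _ in range(depth))
        self.norm = nn.RMSNorm(d)      # RMSNorm as in Eq. (21), available in recent PyTorch

    def forward(self, h0, tokens, lam=0.3):
        """h0: [B, T, d] main-model output states; tokens: [B, T+1] ground-truth ids."""
        B, T, _ = h0.shape
        h_prev, losses = h0, []
        for k in range(1, self.depth + 1):
            h_in = h_prev[:, : T - k]                 # h_i^{k-1} for positions i = 1 .. T-k
            emb = self.embed(tokens[:, k:T])          # Emb(t_{i+k})
            h_comb = self.proj[k - 1](
                torch.cat([self.norm(h_in), self.norm(emb)], dim=-1))      # Eq. (21)
            h_k = self.blocks[k - 1](h_comb)                               # Eq. (22)
            logits = self.head(h_k)                                        # Eq. (23)
            target = tokens[:, k + 1 : T + 1]         # t_{i+k+1}, the token predicted at depth k
            loss_k = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                     target.reshape(-1), reduction="sum") / (B * T)  # Eq. (24)
            losses.append(loss_k)
            h_prev = h_k
        return lam / self.depth * sum(losses)         # Eq. (25)
```

The returned value would simply be added to the main next-token-prediction loss during training.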
MTP in Inference.

Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. Additionally, we can also repurpose these MTP modules for speculative decoding to further improve the generation latency.

3 Infrastructures

3.1 Compute Clusters

DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected by NVLink and NVSwitch within nodes. Across different nodes, InfiniBand (IB) interconnects are utilized to facilitate communications.

3.2 Training Framework

The training of DeepSeek-V3 is supported by the HAI-LLM framework, an efficient and lightweight training framework crafted by our engineers from the ground up. On the whole, DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a), 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020).

In order to facilitate efficient training of DeepSeek-V3, we implement meticulous engineering optimizations. Firstly, we design the DualPipe algorithm for efficient pipeline parallelism. Compared with existing PP methods, DualPipe has fewer pipeline bubbles. More importantly, it overlaps the computation and communication phases across forward and backward processes, thereby addressing the challenge of heavy communication overhead introduced by cross-node expert parallelism. Secondly, we develop efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors (SMs) dedicated to communication. Finally, we meticulously optimize the memory footprint during training, thereby enabling us to train DeepSeek-V3 without using costly Tensor Parallelism (TP).

3.2.1 DualPipe and Computation-Communication Overlap

Refer to caption
Figure 4: Overlapping strategy for a pair of individual forward and backward chunks (the boundaries of the transformer blocks are not aligned). Orange denotes forward, green denotes "backward for input", blue denotes "backward for weights", purple denotes PP communication, and red denotes barriers. Both all-to-all and PP communication can be fully hidden.

For DeepSeek-V3, the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles.

The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specifically, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, as in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4, for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation. In this overlapping strategy, we can ensure that both all-to-all and PP communication are fully hidden during execution. Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5. It employs a bidirectional pipeline schedule, which feeds micro-batches from both ends of the pipeline simultaneously, so that a significant portion of communications can be fully overlapped. This overlap also ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.

Figure 5: Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.

In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages. In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by $\frac{1}{PP}$ times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.

Method | Bubble | Parameter | Activation
1F1B | $(PP-1)(F+B)$ | $1\times$ | $PP$
ZB1P | $(PP-1)(F+B-2W)$ | $1\times$ | $PP$
DualPipe (Ours) | $(\frac{PP}{2}-1)(F\&B+B-3W)$ | $2\times$ | $PP+1$
Table 2: Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. $F$ denotes the execution time of a forward chunk, $B$ denotes the execution time of a full backward chunk, $W$ denotes the execution time of a "backward for weights" chunk, and $F\&B$ denotes the execution time of two mutually overlapped forward and backward chunks.
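For intuition, the short sketch below evaluates the bubble formulas from Table 2 for hypothetical chunk timings; the variable names and the numbers are ours and purely illustrative, not measured values.

```python
# Illustrative only: evaluate the Table 2 pipeline-bubble formulas for made-up
# chunk timings. F, B, W are per-chunk execution times; fb is the time of two
# mutually overlapped forward/backward chunks (F&B); pp is the number of stages.
def bubble_1f1b(pp, F, B):
    return (pp - 1) * (F + B)

def bubble_zb1p(pp, F, B, W):
    return (pp - 1) * (F + B - 2 * W)

def bubble_dualpipe(pp, fb, B, W):
    return (pp // 2 - 1) * (fb + B - 3 * W)

pp, F, B, W, fb = 16, 1.0, 2.0, 1.0, 2.5  # hypothetical timings
print(bubble_1f1b(pp, F, B),        # 45.0
      bubble_zb1p(pp, F, B, W),     # 15.0
      bubble_dualpipe(pp, fb, B, W))  # 10.5
```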

3.2.2 Efficient Implementation of Cross-Node All-to-All Communication

In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to the specific GPUs that host its target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes $\times$ 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
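The bandwidth arithmetic behind the "up to 13 experts" claim can be checked with a small worked example; the numbers come from the text above and the variable names are ours.

```python
# Worked example of the bandwidth arithmetic above (illustrative, not production
# code). With NVLink roughly 3.2x faster than IB, each token sent to a node over
# IB can be fanned out to about 3.2 experts on that node via NVLink at no extra
# IB cost, so the expert budget at equal IB cost is nodes * 3.2.
ib_bw_gbps = 50          # IB bandwidth per GPU
nvlink_bw_gbps = 160     # intra-node NVLink bandwidth
max_nodes_per_token = 4  # node-limited routing

experts_per_node = nvlink_bw_gbps / ib_bw_gbps            # 3.2
max_experts_same_cost = max_nodes_per_token * experts_per_node
print(experts_per_node, max_experts_same_cost)            # 3.2, 12.8 -> up to 13 experts
```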

In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.

3.2.3 Extreme Memory Saving with Minimal Overhead

In order to reduce the memory footprint during training, we employ the following techniques.

Recomputation of RMSNorm and MLA Up-Projection.

We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.

Exponential Moving Average in CPU.

During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.

Shared Embedding and Output Head for Multi-Token Prediction.

With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and deepest layers (including the output head) of the model on the same PP rank. This arrangement enables the physical sharing of the parameters and gradients of the shared embedding and output head between the MTP module and the main model. This physical sharing mechanism further enhances our memory efficiency.

3.3 FP8 Training 

Figure 6: The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.

Inspired by recent advances in low-precision training (Peng et al., 2023b; Dettmers et al., 2022; Noune et al., 2022), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Sun et al., 2024; He et al.; Fishman et al., 2024). Although significant progress has been made in inference quantization (Xiao et al., 2023; Frantar et al., 2022), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model pre-training (Fishman et al., 2024). To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with $1 \times N_c$ elements or block-wise grouping with $N_c \times N_c$ elements. The associated dequantization overhead is largely mitigated under our increased-precision accumulation process, a critical aspect for achieving accurate FP8 General Matrix Multiplication (GEMM). Moreover, to further reduce memory and communication overhead in MoE training, we cache and dispatch activations in FP8, while storing low-precision optimizer states in BF16. We validate the proposed FP8 mixed precision framework on two model scales similar to DeepSeek-V2-Lite and DeepSeek-V2, training for approximately 1 trillion tokens (see more details in Appendix B.1). Notably, compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.

3.3.1 Mixed Precision Framework

Building upon widely adopted techniques in low-precision training (Kalamkar et al., 2019; Narang et al., 2017), we propose a mixed precision framework for FP8 training. In this framework, most compute-intensive operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6.

Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption.

Despite the efficiency advantage of the FP8 format, certain operators still require a higher precision due to their sensitivity to low-precision computations. Besides, some low-cost operators can also utilize a higher precision with a negligible overhead to the overall training cost. For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. These targeted retentions of high precision ensure stable training dynamics for DeepSeek-V3. To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. While these high-precision components incur some memory overheads, their impact can be minimized through efficient sharding across multiple DP ranks in our distributed training system.

Figure 7: (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of $N_C = 128$ elements MMA for the high-precision accumulation.

3.3.2 Improved Precision from Quantization and Multiplication

Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.

Fine-Grained Quantization.

In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8 (Narang et al., 2017). This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy. To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level. As illustrated in Figure 7 (a), (1) for activations, we group and scale elements on a 1x128 tile basis (i.e., per token per 128 channels); and (2) for weights, we group and scale elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels). This approach ensures that the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements. In Appendix B.2, we further discuss the training instability when we group and scale activations on a block basis in the same way as weights quantization.
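As a rough illustration of this grouping, the NumPy sketch below derives one scaling factor per 1x128 activation tile and per 128x128 weight block by dividing each group's maximum absolute value by the FP8 (E4M3) maximum of 448; the FP8 cast itself is only emulated with a clip, and the function names are ours, not part of the training framework.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def quant_activation_tilewise(x, tile=128):
    """Per-token, per-128-channel (1x128) scaling; x has shape [tokens, channels]."""
    t, c = x.shape
    xr = x.reshape(t, c // tile, tile)
    amax = np.maximum(np.abs(xr).max(axis=-1, keepdims=True), 1e-12)
    scale = amax / FP8_E4M3_MAX                       # one scale per 1x128 tile
    q = np.clip(xr / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # stand-in for the FP8 cast
    return q.reshape(t, c), scale

def quant_weight_blockwise(w, block=128):
    """Per-128x128-block scaling; w has shape [in_channels, out_channels]."""
    i, o = w.shape
    wr = w.reshape(i // block, block, o // block, block)
    amax = np.maximum(np.abs(wr).max(axis=(1, 3), keepdims=True), 1e-12)
    scale = amax / FP8_E4M3_MAX                       # one scale per 128x128 block
    q = np.clip(wr / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return wr / scale if False else (q.reshape(i, o), scale)

x_q, x_s = quant_activation_tilewise(np.random.randn(4, 256).astype(np.float32))
w_q, w_s = quant_weight_blockwise(np.random.randn(256, 256).astype(np.float32))
```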

One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be efficiently implemented.

Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced the support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.

Increasing Accumulation Precision.

Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in an FP32 precision (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. This problem will become more pronounced when the inner dimension K is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased. Taking GEMM operations of two random matrices with K = 4096 for example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly 2%. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.

In order to address this issue, we adopt the strategy of promotion to CUDA Cores for higher precision (Thakkar et al., 2023). The process is illustrated in Figure 7 (b). To be specific, during MMA (Matrix Multiply-Accumulate) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of $N_C$ is reached, these partial results will be copied to FP32 registers on CUDA Cores, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension K. These scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process with minimal additional computational cost.

It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMA to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations, maintaining high utilization of Tensor Cores. Based on our experiments, setting $N_C = 128$ elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.
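A minimal software sketch of the promotion idea is given below: partial sums over each $N_C$-wide slice of the inner dimension are dequantized with the per-group scales and accumulated in FP32. It only emulates the behavior at the NumPy level under a simplified scale layout (one scale per K-group for each operand), and is not the Tensor Core / CUDA Core implementation.

```python
import numpy as np

def gemm_with_promotion(a_q, b_q, a_scale, b_scale, n_c=128):
    """Emulate FP8 GEMM with periodic promotion: partial sums over each
    N_C-wide slice of the inner dimension K are dequantized with the
    per-group scales and accumulated in FP32.
    a_q: [M, K], b_q: [K, N]; a_scale: [M, K // n_c], b_scale: [K // n_c, N]."""
    M, K = a_q.shape
    N = b_q.shape[1]
    out = np.zeros((M, N), dtype=np.float32)
    for g in range(K // n_c):
        sl = slice(g * n_c, (g + 1) * n_c)
        # "Tensor Core" side: limited-width accumulation within one N_C slice.
        partial = a_q[:, sl].astype(np.float32) @ b_q[sl, :].astype(np.float32)
        # "CUDA Core" side: dequantize with this group's scales, accumulate in FP32.
        out += partial * a_scale[:, g:g + 1] * b_scale[g:g + 1, :]
    return out
```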

Mantissa over Exponents.

In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile and block-wise scaling. By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range.

Online Quantization.

Delayed quantization is employed in tensor-wise quantization frameworks (NVIDIA, 2024b; Peng et al., 2023b), which maintains a history of the maximum absolute values across prior iterations to infer the current value. In order to ensure accurate scales and simplify the framework, we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block. Based on it, we derive the scaling factor and then quantize the activation or weight online into the FP8 format.

3.3.3 Low-Precision Storage and Communication

In conjunction with our FP8 training framework, we further reduce the memory consumption and communication overhead by compressing cached activations and optimizer states into lower-precision formats.

Low-Precision Optimizer States.

We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.

Low-Precision Activation.

As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce the memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator. However, special considerations are taken on several operators for low-cost high-precision training:

(1) Inputs of the Linear after the attention operator. These activations are also used in the backward pass of the attention operator, which makes it sensitive to precision. We adopt a customized E5M6 data format exclusively for these activations. Additionally, these activations will be converted from a 1x128 quantization tile to a 128x1 tile in the backward pass. To avoid introducing extra quantization error, all the scaling factors are rounded to an integral power of 2.

(2) Inputs of the SwiGLU operator in MoE. To further reduce the memory cost, we cache the inputs of the SwiGLU operator and recompute its output in the backward pass. These activations are also stored in FP8 with our fine-grained quantization method, striking a balance between memory efficiency and computational accuracy.

Low-Precision Communication.

Communication bandwidth is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral powers of 2. A similar strategy is applied to the activation gradient before MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.
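A small sketch of restricting a quantization scale to an integral power of 2, as used for the attention-output and dispatch activations above, is shown below; the function name is ours, and the rounding direction (here, toward zero via floor) is an assumption, since the text only requires the scale to be an integral power of 2.

```python
import math

def round_scale_to_power_of_two(scale: float) -> float:
    """Restrict a positive quantization scale to an integral power of 2, so that
    re-quantizing between tile layouts (e.g., 1x128 -> 128x1) adds no rounding
    error beyond the FP8 cast itself. Floor rounding is an assumption here."""
    return 2.0 ** math.floor(math.log2(scale))

print(round_scale_to_power_of_two(0.013))  # 0.0078125 = 2**-7
```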

3.4 Inference and Deployment

We deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB. To simultaneously ensure both the Service-Level Objective (SLO) for online services and high throughput, we employ the following deployment strategy that separates the prefilling and decoding stages.

3.4.1 Prefilling

The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8). Its small TP size of 4 limits the overhead of TP communication. For the MoE part, we use 32-way Expert Parallelism (EP32), which ensures that each expert processes a sufficiently large batch size, thereby enhancing computational efficiency. For the MoE all-to-all communication, we use the same method as in training: first transferring tokens across nodes via IB, and then forwarding among the intra-node GPUs via NVLink. In particular, we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TP communication.

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. Each GPU, in addition to the original 8 experts it hosts, also hosts one additional redundant expert.
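The per-GPU expert count for this prefilling deployment follows from simple arithmetic; the short check below uses only numbers stated in the text, and the variable names are ours.

```python
# Worked arithmetic for the prefilling deployment above (illustrative).
routed_experts = 256      # routed experts per MoE layer
prefill_gpus = 4 * 8      # minimum prefilling unit: 4 nodes x 8 GPUs
redundant_experts = 32    # duplicated high-load experts

original_per_gpu = routed_experts // prefill_gpus        # 8
redundant_per_gpu = redundant_experts // prefill_gpus    # 1
print(original_per_gpu + redundant_per_gpu)              # 9 experts hosted per GPU
```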

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another. 

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible. 

3.4.2 Decoding 

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency. 

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead. 

Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are also exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage. Therefore, we overlap the attention of one micro-batch with the dispatch+MoE+combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch+MoE+combine. 

3.5 Suggestions on Hardware Design

Based on our implementation of the all-to-all communication and FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

3.5.1 Communication Hardware

In DeepSeek-V3, we implement the overlap between computation and communication to hide the communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which will limit the computational throughput. Moreover, using SMs for communication results in significant inefficiencies, as tensor cores remain entirely unutilized.

Currently, the SMs primarily perform the following tasks for all-to-all communication:

  • Forwarding data between the IB (InfiniBand) and NVLink domain while aggregating IB traffic destined for multiple GPUs within the same node from a single GPU.

  • Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.

  • Executing reduce operations for all-to-all combine.

  • Managing fine-grained memory layout during chunked data transferring to multiple experts across the IB and NVLink domain.

We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP Graham et al. (2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units. With this unified interface, computation units can easily accomplish operations such as read, write, multicast, and reduce across the entire IB-NVLink-unified domain via submitting communication requests based on simple primitives.

3.5.2 Compute Hardware

Higher FP8 GEMM Accumulation Precision in Tensor Cores.

In the current Tensor Core implementation of the NVIDIA Hopper architecture, FP8 GEMM (General Matrix Multiply) employs fixed-point accumulation, aligning the mantissa products by right-shifting based on the maximum exponent before addition. Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, for example, to achieve precise FP32 results from the accumulation of 32 FP8×\times×FP8 multiplications, at least 34-bit precision is required. Thus, we recommend that future chip designs increase accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms. This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.

Support for Tile- and Block-Wise Quantization.

Current GPUs only support per-tensor quantization, lacking the native support for fine-grained quantization like our tile- and block-wise quantization. In the current implementation, when the $N_C$ interval is reached, the partial results will be copied from Tensor Cores to CUDA cores, multiplied by the scaling factors, and added to FP32 registers on CUDA cores. Although the dequantization overhead is significantly mitigated combined with our precise FP32 accumulation strategy, the frequent data movements between Tensor Cores and CUDA cores still limit the computational efficiency. Therefore, we recommend future chips to support fine-grained quantization by enabling Tensor Cores to receive scaling factors and implement MMA with group scaling. In this way, the whole partial sum accumulation and dequantization can be completed directly inside Tensor Cores until the final result is produced, avoiding frequent data movements.

Support for Online Quantization.

The current implementations struggle to effectively support online quantization, despite its effectiveness demonstrated in our research. In the existing process, we need to read 128 BF16 activation values (the output of the previous computation) from HBM (High Bandwidth Memory) for quantization, and the quantized FP8 values are then written back to HBM, only to be read again for MMA. To address this inefficiency, we recommend that future chips integrate FP8 cast and TMA (Tensor Memory Accelerator) access into a single fused operation, so quantization can be completed during the transfer of activations from global memory to shared memory, avoiding frequent memory reads and writes. We also recommend supporting a warp-level cast instruction for speedup, which further facilitates the better fusion of layer normalization and FP8 cast. Alternatively, a near-memory computing approach can be adopted, where compute logic is placed near the HBM. In this case, BF16 elements can be cast to FP8 directly as they are read from HBM into the GPU, reducing off-chip memory access by roughly 50%.

Support for Transposed GEMM Operations.

The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into 1x128 FP8 tiles and stored. During the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized into 128x1 tiles, and stored in HBM. To reduce memory operations, we recommend future chips to enable direct transposed reads of matrices from shared memory before MMA operation, for those precisions required in both training and inference. Combined with the fusion of FP8 format conversion and TMA access, this enhancement will significantly streamline the quantization workflow.

4 Pre-Training

4.1 Data Construction

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.

In the training process of DeepSeekCoder-V2 (DeepSeek-AI, 2024a), we observe that the Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues. In alignment with DeepSeekCoder-V2, we also incorporate the FIM strategy in the pre-training of DeepSeek-V3. To be specific, we employ the Prefix-Suffix-Middle (PSM) framework to structure data as follows:

<|fim_begin|>f_pre<|fim_hole|>f_suf<|fim_end|>f_middle<|eos_token|>

This structure is applied at the document level as a part of the pre-packing process. The FIM strategy is applied at a rate of 0.1, consistent with the PSM framework.
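A rough sketch of how a PSM-formatted FIM sample could be assembled at the document level, applied at a 0.1 rate, is given below; the sentinel strings follow the template above, while the split points, the eos handling, and the function name are illustrative assumptions rather than the actual data pipeline.

```python
import random

def build_fim_sample(doc: str, fim_rate: float = 0.1) -> str:
    """Sketch of PSM-style FIM packing: with probability `fim_rate`, split a
    document into prefix/middle/suffix and rearrange it following the
    <|fim_begin|> prefix <|fim_hole|> suffix <|fim_end|> middle template;
    otherwise keep the document as plain next-token-prediction text."""
    if random.random() >= fim_rate or len(doc) < 3:
        return doc + "<|eos_token|>"
    a, b = sorted(random.sample(range(1, len(doc)), 2))
    prefix, middle, suffix = doc[:a], doc[a:b], doc[b:]
    return ("<|fim_begin|>" + prefix + "<|fim_hole|>" + suffix +
            "<|fim_end|>" + middle + "<|eos_token|>")
```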

The tokenizer for DeepSeek-V3 employs Byte-level BPE (Shibata et al., 1999) with an extended vocabulary of 128K tokens. The pretokenizer and training data for our tokenizer are modified to optimize multilingual compression efficiency. In addition, compared with DeepSeek-V2, the new pretokenizer introduces tokens that combine punctuations and line breaks. However, this trick may introduce the token boundary bias (Lundberg, 2023) when the model processes multi-line prompts without terminal line breaks, particularly for few-shot evaluation prompts. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.

4.2 Hyper-Parameters

Model Hyper-Parameters.

We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads $n_h$ to 128 and the per-head dimension $d_h$ to 128. The KV compression dimension $d_c$ is set to 512, and the query compression dimension $d_c'$ is set to 1536. For the decoupled queries and key, we set the per-head dimension $d_h^R$ to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes. The multi-token prediction depth $D$ is set to 1, i.e., besides the exact next token, each token will predict one additional token. As in DeepSeek-V2, DeepSeek-V3 also employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.

Training Hyper-Parameters.

We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\mathrm{weight\_decay} = 0.1$. We set the maximum sequence length to 4K during pre-training, and pre-train DeepSeek-V3 on 14.8T tokens. As for the learning rate scheduling, we first linearly increase it from 0 to $2.2 \times 10^{-4}$ during the first 2K steps. Then, we keep a constant learning rate of $2.2 \times 10^{-4}$ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to $2.2 \times 10^{-5}$ in 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of $2.2 \times 10^{-5}$ in the first 333B tokens, and switch to another constant learning rate of $7.3 \times 10^{-6}$ in the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed experts will be uniformly deployed on 64 GPUs belonging to 8 nodes. As for the node-limited routing, each token will be sent to at most 4 nodes (i.e., $M = 4$). For auxiliary-loss-free load balancing, we set the bias update speed $\gamma$ to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens. For the balance loss, we set $\alpha$ to 0.0001, just to avoid extreme imbalance within any single sequence. The MTP loss weight $\lambda$ is set to 0.3 for the first 10T tokens, and to 0.1 for the remaining 4.8T tokens.
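The learning-rate schedule above can be summarized as a piecewise function of tokens consumed; the sketch below follows the stated breakpoints, while the conversion of the 2K-step warmup into a token count is an approximation of ours, since it depends on the ramping batch size.

```python
import math

def lr_at(tokens_T, warmup_T=0.01):
    """Sketch of the pre-training LR schedule described above. `tokens_T` is the
    number of tokens consumed so far, in trillions. The 2K-step linear warmup is
    approximated by `warmup_T` trillions of tokens (an assumption)."""
    peak, low, final = 2.2e-4, 2.2e-5, 7.3e-6
    if tokens_T < warmup_T:
        return peak * tokens_T / warmup_T                  # linear warmup
    if tokens_T < 10.0:
        return peak                                        # constant 2.2e-4 until 10T tokens
    if tokens_T < 14.3:
        frac = (tokens_T - 10.0) / 4.3                     # cosine decay to 2.2e-5 over 4.3T tokens
        return low + 0.5 * (peak - low) * (1 + math.cos(math.pi * frac))
    if tokens_T < 14.3 + 0.333:
        return low                                         # constant 2.2e-5 for the next 333B tokens
    return final                                           # 7.3e-6 for the remaining 167B tokens
```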

Figure 8: Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

4.3 Long Context Extension

We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key $\mathbf{k}_t^R$. The hyper-parameters remain identical across both phases, with the scale $s = 40$, $\alpha = 1$, $\beta = 32$, and the scaling factor $\sqrt{t} = 0.1 \ln s + 1$. In the first phase, the sequence length is set to 32K, and the batch size is 1920. During the second phase, the sequence length is increased to 128K, and the batch size is reduced to 480. The learning rate for both phases is set to $7.3 \times 10^{-6}$, matching the final learning rate from the pre-training stage.
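Plugging the stated hyper-parameters into the scaling-factor formula gives a concrete value; the one-line check below is purely arithmetic.

```python
import math

# Evaluate the YaRN scaling factor sqrt(t) = 0.1 * ln(s) + 1 for s = 40,
# the scale used in both extension phases (a quick check, not training code).
s = 40
sqrt_t = 0.1 * math.log(s) + 1
print(sqrt_t)  # ~1.369
```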

Through this two-phase extension training, DeepSeek-V3 is capable of handling inputs up to 128K in length while maintaining strong performance. Figure 8 illustrates that DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the "Needle In A Haystack" (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.

4.4 Evaluations

4.4.1 Evaluation Benchmarks

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones:

Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024b), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).

Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).

Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).

Reading comprehension datasets include RACE Lai et al. (2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019).

Reference disambiguation datasets include CLUEWSC (Xu et al., 2020) and WinoGrande Sakaguchi et al. (2019).

Language modeling datasets include Pile (Gao et al., 2020).

Chinese understanding and culture datasets include CCPM (Li et al., 2021).

Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).

Code datasets include HumanEval (Chen et al., 2021), LiveCodeBench-Base (0801-1101) (Jain et al., 2024), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).

Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.

Following our previous work (DeepSeek-AI, 2024b, c), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.
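For reference, Bits-Per-Byte in its standard form normalizes the corpus log-likelihood by the corpus size in bytes rather than in tokens; the sketch below uses that standard definition, and the exact bookkeeping inside our internal framework may differ.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Standard Bits-Per-Byte: the corpus negative log-likelihood (in nats,
    summed over tokens) converted to bits and divided by the corpus size in
    bytes, which removes the dependence on the tokenizer."""
    return total_nll_nats / (math.log(2) * total_bytes)

# e.g., an average loss of 0.8 nats/token over 1M tokens spanning 4.2M bytes:
print(bits_per_byte(0.8 * 1_000_000, 4_200_000))  # ~0.275 BPB
```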

Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base
Architecture - MoE Dense Dense MoE
# Activated Params - 21B 72B 405B 37B
# Total Params - 236B 72B 405B 671B
English Pile-test (BPB) - 0.606 0.638 0.542 0.548
BBH (EM) 3-shot 78.8 79.8 82.9 87.5
MMLU (EM) 5-shot 78.4 85.0 84.4 87.1
MMLU-Redux (EM) 5-shot 75.6 83.2 81.3 86.2
MMLU-Pro (EM) 5-shot 51.4 58.3 52.8 64.4
DROP (F1) 3-shot 80.4 80.6 86.0 89.0
ARC-Easy (EM) 25-shot 97.6 98.4 98.4 98.9
ARC-Challenge (EM) 25-shot 92.2 94.5 95.3 95.3
HellaSwag (EM) 10-shot 87.1 84.8 89.2 88.9
PIQA (EM) 0-shot 83.9 82.6 85.9 84.7
WinoGrande (EM) 5-shot 86.3 82.3 85.2 84.9
RACE-Middle (EM) 5-shot 73.1 68.1 74.2 67.1
RACE-High (EM) 5-shot 52.6 50.3 56.8 51.3
TriviaQA (EM) 5-shot 80.0 71.9 82.7 82.9
NaturalQuestions (EM) 5-shot 38.6 33.2 41.5 40.0
AGIEval (EM) 0-shot 57.5 75.8 60.6 79.6
Code HumanEval (Pass@1) 0-shot 43.3 53.0 54.9 65.2
MBPP (Pass@1) 3-shot 65.0 72.6 68.4 75.4
LiveCodeBench-Base (Pass@1) 3-shot 11.6 12.9 15.5 19.4
CRUXEval-I (EM) 2-shot 52.5 59.1 58.5 67.3
CRUXEval-O (EM) 2-shot 49.8 59.9 59.9 69.8
Math GSM8K (EM) 8-shot 81.6 88.3 83.5 89.3
MATH (EM) 4-shot 43.4 54.4 49.0 61.6
MGSM (EM) 8-shot 63.6 76.2 69.9 79.8
CMath (EM) 3-shot 78.7 84.5 77.3 90.7
Chinese CLUEWSC (EM) 5-shot 82.0 82.5 83.0 82.7
C-Eval (EM) 5-shot 81.4 89.2 72.5 90.1
CMMLU (EM) 5-shot 84.0 89.5 73.7 88.8
CMRC (EM) 1-shot 77.4 75.8 76.0 76.3
C3 (EM) 0-shot 77.4 76.7 79.7 78.6
CCPM (EM) 0-shot 93.0 88.5 78.6 92.0
Multilingual MMMLU-non-English (EM) 5-shot 64.0 74.8 73.8 79.4
Table 3: Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.

4.4.2 Evaluation Results

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base (DeepSeek-AI, 2024c) (our previous release), Qwen2.5 72B Base (Qwen, 2024b), and LLaMA-3.1 405B Base (AI@Meta, 2024b). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Note that due to the changes in our evaluation framework over the past months, the performance of DeepSeek-V2-Base exhibits a slight difference from our previously reported results. Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base in the majority of benchmarks, essentially becoming the strongest open-source model.

From a more detailed perspective, we compare DeepSeek-V3-Base with the other open-source base models individually. (1) Compared with DeepSeek-V2-Base, due to the improvements in our model architecture, the scale-up of the model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance as expected. (2) Compared with Qwen2.5 72B Base, the state-of-the-art Chinese open-source model, with only half of the activated parameters, DeepSeek-V3-Base also demonstrates remarkable advantages, especially on English, multilingual, code, and math benchmarks. As for Chinese benchmarks, except for CMMLU, a Chinese multi-subject multiple-choice task, DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, MMLU-series, DROP, C-Eval, CMMLU, and CCPM.

Due to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 achieves extremely high training efficiency. Under our training framework and infrastructures, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, which is much cheaper than training 72B or 405B dense models.

Benchmark (Metric) | # Shots | Small MoE Baseline | Small MoE w/ MTP | Large MoE Baseline | Large MoE w/ MTP
# Activated Params (Inference) | - | 2.4B | 2.4B | 20.9B | 20.9B
# Total Params (Inference) | - | 15.7B | 15.7B | 228.7B | 228.7B
# Training Tokens | - | 1.33T | 1.33T | 540B | 540B
Pile-test (BPB) | - | 0.729 | 0.729 | 0.658 | 0.657
BBH (EM) | 3-shot | 39.0 | 41.4 | 70.0 | 70.7
MMLU (EM) | 5-shot | 50.0 | 53.3 | 67.5 | 66.6
DROP (F1) | 1-shot | 39.2 | 41.3 | 68.5 | 70.6
TriviaQA (EM) | 5-shot | 56.9 | 57.7 | 67.0 | 67.3
NaturalQuestions (EM) | 5-shot | 22.7 | 22.3 | 27.2 | 28.5
HumanEval (Pass@1) | 0-shot | 20.7 | 26.8 | 44.5 | 53.7
MBPP (Pass@1) | 3-shot | 35.8 | 36.8 | 61.6 | 62.2
GSM8K (EM) | 8-shot | 25.4 | 31.4 | 72.3 | 74.0
MATH (EM) | 4-shot | 10.7 | 12.6 | 38.6 | 39.8
Table 4: Ablation results for the MTP strategy. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.

4.5 Discussion

4.5.1 Ablation Studies for Multi-Token Prediction

In Table 4, we show the ablation results for the MTP strategy. To be specific, we validate the MTP strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 540B tokens. On top of them, keeping the training data and the other architectures the same, we append a 1-depth MTP module onto them and train two models with the MTP strategy for comparison. Note that during inference, we directly discard the MTP module, so the inference costs of the compared models are exactly the same. From the table, we can observe that the MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.

4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy

In Table 5, we show the ablation results for the auxiliary-loss-free balancing strategy. We validate this strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 578B tokens. Both of the baseline models purely use auxiliary losses to encourage load balance, and use the sigmoid gating function with top-K affinity normalization. Their hyper-parameters to control the strength of auxiliary losses are the same as DeepSeek-V2-Lite and DeepSeek-V2, respectively. On top of these two baseline models, keeping the training data and the other architectures the same, we remove all auxiliary losses and introduce the auxiliary-loss-free balancing strategy for comparison. From the table, we can observe that the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.

Benchmark (Metric) | # Shots | Small MoE Aux-Loss-Based | Small MoE Aux-Loss-Free | Large MoE Aux-Loss-Based | Large MoE Aux-Loss-Free
# Activated Params - 2.4B 2.4B 20.9B 20.9B
# Total Params - 15.7B 15.7B 228.7B 228.7B
# Training Tokens - 1.33T 1.33T 578B 578B
Pile-test (BPB) - 0.727 0.724 0.656 0.652
BBH (EM) 3-shot 37.3 39.3 66.7 67.9
MMLU (EM) 5-shot 51.0 51.8 68.3 67.2
DROP (F1) 1-shot 38.1 39.0 67.1 67.1
TriviaQA (EM) 5-shot 58.3 58.5 66.7 67.7
NaturalQuestions (EM) 5-shot 23.2 23.4 27.1 28.1
HumanEval (Pass@1) 0-shot 22.0 22.6 40.2 46.3
MBPP (Pass@1) 3-shot 36.6 35.8 59.2 61.2
GSM8K (EM) 8-shot 27.1 29.6 70.7 74.5
MATH (EM) 4-shot 10.9 11.1 37.2 39.6
Table 5: Ablation results for the auxiliary-loss-free balancing strategy. Compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.

4.5.3 Batch-Wise Load Balance VS. Sequence-Wise Load Balance

The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.

To further investigate the correlation between this flexibility and the advantage in model performance, we additionally design and validate a batch-wise auxiliary loss that encourages load balance on each training batch instead of on each sequence. The experimental results show that, when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve similar model performance to the auxiliary-loss-free method. To be specific, in our experiments with 1B MoE models, the validation losses are: 2.258 (using a sequence-wise auxiliary loss), 2.253 (using the auxiliary-loss-free method), and 2.253 (using a batch-wise auxiliary loss). We also observe similar results on 3B MoE models: the model using a sequence-wise auxiliary loss achieves a validation loss of 2.085, and the models using the auxiliary-loss-free method or a batch-wise auxiliary loss achieve the same validation loss of 2.080.

In addition, although the batch-wise load balancing methods show consistent performance advantages, they also face two potential challenges in efficiency: (1) load imbalance within certain sequences or small batches, and (2) domain-shift-induced load imbalance during inference. The first challenge is naturally addressed by our training framework that uses large-scale expert parallelism and data parallelism, which guarantees a large size of each micro-batch. For the second challenge, we also design and implement an efficient inference framework with redundant expert deployment, as described in Section 3.4, to overcome it.

Figure 9: Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.

5 Post-Training

5.1 Supervised Fine-Tuning

We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements.

Reasoning Data.

For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>.

The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing its overall performance.

Upon completing the RL training phase, we implement rejection sampling to curate high-quality SFT data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective.
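A minimal sketch of the rejection-sampling step is given below, assuming a hypothetical `expert_model.generate` interface and a caller-supplied `reward_fn`; the actual sample counts, scoring rules, and filtering thresholds are not specified here.

```python
def rejection_sample(expert_model, prompt, reward_fn, n_candidates=16):
    """Keep only the highest-reward response among several samples drawn from
    the domain expert model (illustrative helper; not the paper's pipeline)."""
    candidates = [expert_model.generate(prompt, temperature=1.0)
                  for _ in range(n_candidates)]
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    best_reward, best_response = max(scored, key=lambda x: x[0])
    # Drop the instance entirely if no candidate clears the quality bar.
    return best_response if best_reward > 0 else None
```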

Non-Reasoning Data.

For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.

SFT Settings.

We fine-tune DeepSeek-V3-Base for two epochs using the SFT dataset, with a cosine decay learning rate schedule that starts at $5\times 10^{-6}$ and gradually decreases to $1\times 10^{-6}$. During training, each single sequence is packed from multiple samples. However, we adopt a sample masking strategy to ensure that these examples remain isolated and mutually invisible.
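The following sketch illustrates the two SFT settings described above: the cosine decay from $5\times 10^{-6}$ to $1\times 10^{-6}$, and a block-diagonal causal mask that keeps packed samples mutually invisible. Warmup behavior and the exact masking implementation are assumptions made for illustration.

```python
import math
import torch

def sft_lr(step, total_steps, lr_max=5e-6, lr_min=1e-6):
    """Cosine decay from 5e-6 to 1e-6 over the SFT run (any warmup phase is
    unspecified in the report and omitted here)."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

def packed_attention_mask(sample_lengths):
    """Block-diagonal causal mask so that samples packed into one sequence
    cannot attend to each other (illustrative implementation)."""
    total = sum(sample_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in sample_lengths:
        block = torch.tril(torch.ones(n, n)).bool()   # causal within the sample
        mask[start:start + n, start:start + n] = block
        start += n
    return mask  # True = attention allowed
```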

5.2 Reinforcement Learning

5.2.1 Reward Model

We employ a rule-based Reward Model (RM) and a model-based RM in our RL process.

Rule-Based RM.

For questions that can be validated using specific rules, we adopt a rule-based reward system to determine the feedback. For instance, certain math problems have deterministic results, and we require the model to provide the final answer within a designated format (e.g., in a box), allowing us to apply rules to verify the correctness. Similarly, for LeetCode problems, we can utilize a compiler to generate feedback based on test cases. By leveraging rule-based validation wherever possible, we ensure a higher level of reliability, as this approach is resistant to manipulation or exploitation.
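As an illustration of such a rule, the sketch below rewards a response only if its final answer appears inside a \boxed{...} block and matches the ground truth; a real checker would additionally normalize equivalent expressions, and code problems would instead run the compiler and test cases.

```python
import re

def boxed_answer_reward(response, ground_truth):
    """Rule-based reward for math problems whose answers must appear in a
    \\boxed{...} block (a minimal sketch; exact-match comparison only)."""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0  # no answer in the required format
    answer = match.group(1).strip()
    return 1.0 if answer == str(ground_truth).strip() else 0.0

# For LeetCode-style problems, the analogous rule is "compile and run the unit
# tests", with the pass/fail outcome used as the reward signal.
```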

Model-Based RM.

For questions with free-form ground-truth answers, we rely on the reward model to determine whether the response matches the expected ground-truth. Conversely, for questions without a definitive ground-truth, such as those involving creative writing, the reward model is tasked with providing feedback based on the question and the corresponding answer as inputs. The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.
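A possible record layout for such preference data is sketched below; the field names are assumptions for illustration rather than the schema actually used.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    """Illustrative layout for model-based RM training data: a pairwise
    preference whose label is accompanied by the chain-of-thought that leads
    to it (hypothetical field names)."""
    question: str
    chosen_response: str
    rejected_response: str
    judgement_cot: str  # reasoning that justifies why `chosen` is preferred
```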

5.2.2 Group Relative Policy Optimization

Similar to DeepSeek-V2 (DeepSeek-AI, 2024c), we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy model $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:

\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\big[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)\big]
\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}A_i,\ \text{clip}\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\varepsilon,\ 1+\varepsilon\right)A_i\right) - \beta\,\mathbb{D}_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right)\right),    (26)
\mathbb{D}_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right) = \frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log\frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1,    (27)

where $\varepsilon$ and $\beta$ are hyper-parameters; $\pi_{ref}$ is the reference model; and $A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \cdots, r_G\})}{\operatorname{std}(\{r_1, r_2, \cdots, r_G\})}.    (28)
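A minimal sketch of this objective for a single question is given below, treating the log-probabilities as sequence-level quantities and using placeholder values for $\varepsilon$ and $\beta$; it mirrors Equations (26)-(28) but is not the training implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """GRPO surrogate for one question with G sampled outputs.

    logp_new / logp_old / logp_ref: (G,) log-probabilities of each output under
    the current, old, and reference policies (sequence-level simplification).
    rewards: (G,) scalar rewards. eps and beta are placeholder values.
    """
    # Group-relative advantage (Eq. 28).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Clipped importance-weighted term (Eq. 26).
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Unbiased KL estimator against the reference policy (Eq. 27).
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1
    # Maximizing J is equivalent to minimizing the negated group mean.
    return -(surrogate - beta * kl).mean()
```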

We incorporate prompts from diverse domains, such as coding, math, writing, role-playing, and question answering, during the RL process. This approach not only aligns the model more closely with human preferences but also enhances performance on benchmarks, especially in scenarios where available SFT data are limited.

5.3 Evaluations

5.3.1 Evaluation Settings

Evaluation Benchmarks.

Apart from the benchmarks we used for base model testing, we further evaluate instructed models on IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), LongBench v2 (Bai et al., 2024), GPQA (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider (https://aider.chat), LiveCodeBench (Jain et al., 2024) (questions from August 2024 to November 2024), Codeforces (https://codeforces.com), the Chinese National High School Mathematics Olympiad (CNMO 2024, https://www.cms.org.cn/Home/comp/comp/cid/12.html), and the American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024).

Compared Baselines.

We conduct comprehensive evaluations of our chat model against several strong baselines, including DeepSeek-V2-0506, DeepSeek-V2.5-0905, Qwen2.5 72B Instruct, LLaMA-3.1 405B Instruct, Claude-Sonnet-3.5-1022, and GPT-4o-0513. For the DeepSeek-V2 model series, we select the most representative variants for comparison. For closed-source models, evaluations are performed through their respective APIs.

Detailed Evaluation Configurations.

For standard benchmarks including MMLU, DROP, GPQA, and SimpleQA, we adopt the evaluation prompts from the simple-evals framework (https://github.com/openai/simple-evals). We utilize the Zero-Eval prompt format (Lin, 2024) for MMLU-Redux in a zero-shot setting. For other datasets, we follow their original evaluation protocols with default prompts as provided by the dataset creators. For code and math benchmarks, the HumanEval-Mul dataset includes 8 mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash) in total. We use CoT and non-CoT methods to evaluate model performance on LiveCodeBench, where the data are collected from August 2024 to November 2024. The Codeforces dataset is measured using the percentage of competitors. SWE-Bench Verified is evaluated using the agentless framework (Xia et al., 2024). We use the “diff” format to evaluate the Aider-related benchmarks. For mathematical assessments, AIME and CNMO 2024 are evaluated with a temperature of 0.7, and the results are averaged over 16 runs, while MATH-500 employs greedy decoding. We allow all models to output a maximum of 8192 tokens for each benchmark.
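For instance, the AIME/CNMO protocol described above can be summarized by the following sketch, where `generate` and `is_correct` are caller-supplied stand-ins for sampling and answer checking.

```python
def avg_pass_at_1(generate, is_correct, problems, n_runs=16, temperature=0.7):
    """One sample per problem per run at temperature 0.7, with accuracy
    averaged over 16 runs (grading details are omitted in this sketch)."""
    accs = []
    for _ in range(n_runs):
        correct = sum(is_correct(generate(prompt, temperature), answer)
                      for prompt, answer in problems)
        accs.append(correct / len(problems))
    return sum(accs) / n_runs
```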

Benchmark (Metric)  DeepSeek-V2-0506  DeepSeek-V2.5-0905  Qwen2.5-72B-Inst.  LLaMA-3.1-405B-Inst.  Claude-3.5-Sonnet-1022  GPT-4o-0513  DeepSeek-V3
Architecture  MoE  MoE  Dense  Dense  -  -  MoE
# Activated Params  21B  21B  72B  405B  -  -  37B
# Total Params  236B  236B  72B  405B  -  -  671B
English  MMLU (EM)  78.2  80.6  85.3  88.6  88.3  87.2  88.5
MMLU-Redux (EM)  77.9  80.3  85.6  86.2  88.9  88.0  89.1
MMLU-Pro (EM)  58.5  66.2  71.6  73.3  78.0  72.6  75.9
DROP (3-shot F1)  83.0  87.8  76.7  88.7  88.3  83.7  91.6
IF-Eval (Prompt Strict)  57.7  80.6  84.1  86.0  86.5  84.3  86.1
GPQA-Diamond (Pass@1)  35.3  41.3  49.0  51.1  65.0  49.9  59.1
SimpleQA (Correct)  9.0  10.2  9.1  17.1  28.4  38.2  24.9
FRAMES (Acc.)  66.9  65.4  69.8  70.0  72.5  80.5  73.3
LongBench v2 (Acc.)  31.6  35.4  39.4  36.1  41.0  48.1  48.7
Code  HumanEval-Mul (Pass@1)  69.3  77.4  77.3  77.2  81.7  80.5  82.6
LiveCodeBench (Pass@1-CoT)  18.8  29.2  31.1  28.4  36.3  33.4  40.5
LiveCodeBench (Pass@1)  20.3  28.4  28.7  30.1  32.8  34.2  37.6
Codeforces (Percentile)  17.5  35.6  24.8  25.3  20.3  23.6  51.6
SWE Verified (Resolved)  -  22.6  23.8  24.5  50.8  38.8  42.0
Aider-Edit (Acc.)  60.3  71.6  65.4  63.9  84.2  72.9  79.7
Aider-Polyglot (Acc.)  -  18.2  7.6  5.8  45.3  16.0  49.6
Math  AIME 2024 (Pass@1)  4.6  16.7  23.3  23.3  16.0  9.3  39.2
MATH-500 (EM)  56.3  74.7  80.0  73.8  78.3  74.6  90.2
CNMO 2024 (Pass@1)  2.8  10.8  15.9  6.8  13.1  10.8  43.2
Chinese  CLUEWSC (EM)  89.9  90.4  91.4  84.7  85.4  87.9  90.9
C-Eval (EM)  78.6  79.5  86.1  61.5  76.7  76.0  86.5
C-SimpleQA (Correct)  48.5  54.1  48.4  50.4  51.3  59.3  64.8
Table 6: Comparison between DeepSeek-V3 and other representative chat models. All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.

5.3.2 Standard Evaluation

Table 6 presents the evaluation results, showcasing that DeepSeek-V3 stands as the best-performing open-source model. Additionally, it is competitive against frontier closed-source models like GPT-4o and Claude-3.5-Sonnet.

English Benchmarks.

MMLU is a widely recognized benchmark designed to assess the performance of large language models across diverse knowledge domains and tasks. DeepSeek-V3 demonstrates competitive performance, standing on par with top-tier models such as LLaMA-3.1-405B, GPT-4o, and Claude-Sonnet 3.5, while significantly outperforming Qwen2.5 72B. Moreover, DeepSeek-V3 excels in MMLU-Pro, a more challenging educational knowledge benchmark, where it closely trails Claude-Sonnet 3.5. On MMLU-Redux, a refined version of MMLU with corrected labels, DeepSeek-V3 surpasses its peers. In addition, on GPQA-Diamond, a PhD-level evaluation testbed, DeepSeek-V3 achieves remarkable results, ranking just behind Claude 3.5 Sonnet and outperforming all other competitors by a substantial margin.

In long-context understanding benchmarks such as DROP, LongBench v2, and FRAMES, DeepSeek-V3 continues to demonstrate its position as a top-tier model. It achieves an impressive 91.6 F1 score in the 3-shot setting on DROP, outperforming all other models in this category. On FRAMES, a benchmark requiring question-answering over 100k token contexts, DeepSeek-V3 closely trails GPT-4o while outperforming all other models by a significant margin. This demonstrates the strong capability of DeepSeek-V3 in handling extremely long-context tasks. The long-context capability of DeepSeek-V3 is further validated by its best-in-class performance on LongBench v2, a dataset that was released just a few weeks before the launch of DeepSeek V3. On the factual knowledge benchmark, SimpleQA, DeepSeek-V3 falls behind GPT-4o and Claude-Sonnet, primarily due to its design focus and resource allocation. DeepSeek-V3 assigns more training tokens to learn Chinese knowledge, leading to exceptional performance on the C-SimpleQA. On the instruction-following benchmark, DeepSeek-V3 significantly outperforms its predecessor, DeepSeek-V2-series, highlighting its improved ability to understand and adhere to user-defined format constraints.

Code and Math Benchmarks.

Coding is a challenging and practical task for LLMs, encompassing engineering-focused tasks like SWE-Bench-Verified and Aider, as well as algorithmic tasks such as HumanEval and LiveCodeBench. In engineering tasks, DeepSeek-V3 trails behind Claude-Sonnet-3.5-1022 but significantly outperforms open-source models. The open-source DeepSeek-V3 is expected to foster advancements in coding-related engineering tasks. By providing access to its robust capabilities, DeepSeek-V3 can drive innovation and improvement in areas such as software engineering and algorithm development, empowering developers and researchers to push the boundaries of what open-source models can achieve in coding tasks. In algorithmic tasks, DeepSeek-V3 demonstrates superior performance, outperforming all baselines on benchmarks like HumanEval-Mul and LiveCodeBench. This success can be attributed to its advanced knowledge distillation technique, which effectively enhances its code generation and problem-solving capabilities in algorithm-focused tasks.

On math benchmarks, DeepSeek-V3 demonstrates exceptional performance, significantly surpassing baselines and setting a new state-of-the-art for non-o1-like models. Specifically, on AIME, MATH-500, and CNMO 2024, DeepSeek-V3 outperforms the second-best model, Qwen2.5 72B, by approximately 10% in absolute scores, which is a substantial margin for such challenging benchmarks. This remarkable capability highlights the effectiveness of the distillation technique from DeepSeek-R1, which has been proven highly beneficial for non-o1-like models.

Chinese Benchmarks.

Qwen and DeepSeek are two representative model series with robust support for both Chinese and English. On the factual benchmark Chinese SimpleQA, DeepSeek-V3 surpasses Qwen2.5-72B by 16.4 points, despite Qwen2.5 being trained on a larger corpus comprising 18T tokens, which are 20% more than the 14.8T tokens that DeepSeek-V3 is pre-trained on.

On C-Eval, a representative benchmark for Chinese educational knowledge evaluation, and CLUEWSC (Chinese Winograd Schema Challenge), DeepSeek-V3 and Qwen2.5-72B exhibit similar performance levels, indicating that both models are well-optimized for challenging Chinese-language reasoning and educational tasks.

Model  Arena-Hard  AlpacaEval 2.0
DeepSeek-V2.5-0905  76.2  50.5
Qwen2.5-72B-Instruct  81.2  49.1
LLaMA-3.1 405B  69.3  40.5
GPT-4o-0513  80.4  51.1
Claude-Sonnet-3.5-1022  85.2  52.0
DeepSeek-V3  85.5  70.0
Table 7: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.

5.3.3 Open-Ended Evaluation

In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges, with the results shown in Table 7. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024a), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. On Arena-Hard, DeepSeek-V3 achieves an impressive win rate of over 86% against the baseline GPT-4-0314, performing on par with top-tier models like Claude-Sonnet-3.5-1022. This underscores the robust capabilities of DeepSeek-V3, especially in dealing with complex prompts, including coding and debugging tasks. Furthermore, DeepSeek-V3 achieves a groundbreaking milestone as the first open-source model to surpass 85% on the Arena-Hard benchmark. This achievement significantly bridges the performance gap between open-source and closed-source models, setting a new standard for what open-source models can accomplish in challenging domains.

Similarly, DeepSeek-V3 showcases exceptional performance on AlpacaEval 2.0, outperforming both closed-source and open-source models. This demonstrates its outstanding proficiency in writing tasks and handling straightforward question-answering scenarios. Notably, it surpasses DeepSeek-V2.5-0905 by a significant margin of 20%, highlighting substantial improvements in tackling simple tasks and showcasing the effectiveness of its advancements.

5.3.4 DeepSeek-V3 as a Generative Reward Model

We compare the judgment ability of DeepSeek-V3 with state-of-the-art models, namely GPT-4o and Claude-3.5. Table 8 presents the performance of these models in RewardBench (Lambert et al., 2024). DeepSeek-V3 achieves performance on par with the best versions of GPT-4o-0806 and Claude-3.5-Sonnet-1022, while surpassing other versions. Additionally, the judgment ability of DeepSeek-V3 can also be enhanced by the voting technique. Therefore, we employ DeepSeek-V3 along with voting to offer self-feedback on open-ended questions, thereby improving the effectiveness and robustness of the alignment process.
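The maj@6 entry in Table 8 corresponds to majority voting over repeated judgments; a sketch of this procedure is shown below, with an illustrative prompt template and a hypothetical `judge_model.generate` interface.

```python
from collections import Counter

def judge_with_voting(judge_model, question, answer_a, answer_b, n_votes=6):
    """Majority voting over repeated pairwise judgments (illustrative prompt
    and sampling settings; not the exact evaluation harness)."""
    votes = []
    for _ in range(n_votes):
        verdict = judge_model.generate(
            f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
            "Which answer is better? Reply with 'A' or 'B'.",
            temperature=0.7,
        )
        votes.append(verdict.strip()[:1])        # keep only the 'A'/'B' label
    return Counter(votes).most_common(1)[0][0]   # majority verdict
```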

Model  Chat  Chat-Hard  Safety  Reasoning  Average
GPT-4o-0513  96.6  70.4  86.7  84.9  84.7
GPT-4o-0806  96.1  76.1  88.1  86.6  86.7
GPT-4o-1120  95.8  71.3  86.2  85.2  84.6
Claude-3.5-sonnet-0620  96.4  74.0  81.6  84.7  84.2
Claude-3.5-sonnet-1022  96.4  79.7  91.1  87.6  88.7
DeepSeek-V3  96.9  79.8  87.0  84.3  87.0
DeepSeek-V3 (maj@6)  96.9  82.6  89.5  89.2  89.6
Table 8: Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench.

5.4 Discussion

5.4.1 Distillation from DeepSeek-R1

We ablate the contribution of distillation from DeepSeek-R1 based on DeepSeek-V2.5. The baseline is trained on short CoT data, whereas its competitor uses data generated by the expert checkpoints described above.

Table 9 demonstrates the effectiveness of the distillation data, showing significant improvements in both LiveCodeBench and MATH-500 benchmarks. Our experiments reveal an interesting trade-off: the distillation leads to better performance but also substantially increases the average response length. To maintain a balance between model accuracy and computational efficiency, we carefully selected optimal settings for DeepSeek-V3 in distillation.

Our research suggests that knowledge distillation from reasoning models presents a promising direction for post-training optimization. While our current work focuses on distilling data from mathematics and coding domains, this approach shows potential for broader applications across various task domains. The effectiveness demonstrated in these specific areas indicates that long-CoT distillation could be valuable for enhancing model performance in other cognitive tasks requiring complex reasoning. Further exploration of this approach across different domains remains an important direction for future research.

Model  LiveCodeBench-CoT  MATH-500
  Pass@1  Length  Pass@1  Length
DeepSeek-V2.5 Baseline  31.1  718  74.6  769
DeepSeek-V2.5 +R1 Distill  37.4  783  83.2  1510
Table 9: The contribution of distillation from DeepSeek-R1. The evaluation settings of LiveCodeBench and MATH-500 are the same as in Table 6.

5.4.2 Self-Rewarding

Rewards play a pivotal role in RL, steering the optimization process. In domains where verification through external tools is straightforward, such as some coding or mathematics scenarios, RL demonstrates exceptional efficacy. However, in more general scenarios, constructing a feedback mechanism through hard coding is impractical. During the development of DeepSeek-V3, for these broader contexts, we employ the constitutional AI approach (Bai et al., 2022), leveraging the voting evaluation results of DeepSeek-V3 itself as a feedback source. This method has produced notable alignment effects, significantly enhancing the performance of DeepSeek-V3 in subjective evaluations. By integrating additional constitutional inputs, DeepSeek-V3 can optimize towards the constitutional direction. We believe that this paradigm, which combines supplementary information with LLMs as a feedback source, is of paramount importance. The LLM serves as a versatile processor capable of transforming unstructured information from diverse scenarios into rewards, ultimately facilitating the self-improvement of LLMs. Beyond self-rewarding, we are also dedicated to uncovering other general and scalable rewarding methods to consistently advance the model capabilities in general scenarios.

5.4.3 Multi-Token Prediction Evaluation

Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model. A natural question arises concerning the acceptance rate of the additionally predicted token. Based on our evaluation, the acceptance rate of the second token prediction ranges between 85% and 90% across various generation topics, demonstrating consistent reliability. This high acceptance rate enables DeepSeek-V3 to achieve a significantly improved decoding speed, delivering 1.8 times TPS (Tokens Per Second).
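The relationship between the acceptance rate and the speedup can be seen with a small calculation: with one additional MTP draft token per decoding step, the expected number of tokens emitted per step is 1 plus the acceptance rate, which at 85-90% acceptance is consistent with the observed roughly 1.8x TPS once verification overhead is accounted for.

```python
def expected_tokens_per_step(acceptance_rate):
    """Each step emits the guaranteed next token plus the single MTP draft
    token whenever it is accepted."""
    return 1 + acceptance_rate

for p in (0.85, 0.90):
    print(f"acceptance {p:.0%}: {expected_tokens_per_step(p):.2f} tokens/step")
# ~1.85-1.90 tokens per step, i.e. close to the reported ~1.8x TPS gain.
```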

6 Conclusion, Limitations, and Future Directions

In this paper, we introduce DeepSeek-V3, a large MoE language model with 671B total parameters and 37B activated parameters, trained on 14.8T tokens. In addition to the MLA and DeepSeekMoE architectures, it also pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. The training of DeepSeek-V3 is cost-effective due to the support of FP8 training and meticulous engineering optimizations. The post-training also makes a success in distilling the reasoning capability from the DeepSeek-R1 series of models. Comprehensive evaluations demonstrate that DeepSeek-V3 has emerged as the strongest open-source model currently available, and achieves performance comparable to leading closed-source models like GPT-4o and Claude-3.5-Sonnet. Despite its strong performance, it also maintains economical training costs. It requires only 2.788M H800 GPU hours for its full training, including pre-training, context length extension, and post-training.

While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment. Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.

DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence). In the future, we plan to strategically invest in research across the following directions.

  • We will consistently study and refine our model architectures, aiming to further improve both the training and inference efficiency, striving to approach efficient support for infinite context length. Additionally, we will try to break through the architectural limitations of Transformer, thereby pushing the boundaries of its modeling capabilities.


  • We will continuously iterate on the quantity and quality of our training data, and explore the incorporation of additional training signal sources, aiming to drive data scaling across a more comprehensive range of dimensions.


  • We will consistently explore and iterate on the deep thinking capabilities of our models, aiming to enhance their intelligence and problem-solving abilities by expanding their reasoning length and depth.


  • We will explore more comprehensive and multi-dimensional model evaluation methods to prevent the tendency towards optimizing a fixed set of benchmarks during research, which may create a misleading impression of the model capabilities and affect our foundational assessment.



References

  • AI@Meta (2024a) AI@Meta. Llama 3 model card, 2024a. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  • AI@Meta (2024b) AI@Meta. Llama 3.1 model card, 2024b. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md.
  • Anthropic (2024) Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
  • Austin et al. (2021) J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  • Bai et al. (2022) Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
  • Bai et al. (2024) Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024.
  • Bauer et al. (2014) M. Bauer, S. Treichler, and A. Aiken. Singe: leveraging warp specialization for high performance on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, pages 119–130, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450326568. 10.1145/2555243.2555258. URL https://doi.org/10.1145/2555243.2555258.
  • Bisk et al. (2020) Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7432–7439. AAAI Press, 2020. 10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.
  • Chen et al. (2021) M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
  • Clark et al. (2018) P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
  • Cobbe et al. (2021) K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Cui et al. (2019) Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu. A span-extraction dataset for Chinese machine reading comprehension. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5883–5889, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. 10.18653/v1/D19-1600. URL https://aclanthology.org/D19-1600.
  • Dai et al. (2024) D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024. URL https://doi.org/10.48550/arXiv.2401.06066.
  • DeepSeek-AI (2024a) DeepSeek-AI. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. CoRR, abs/2406.11931, 2024a. URL https://doi.org/10.48550/arXiv.2406.11931.
  • DeepSeek-AI (2024b) DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024b. URL https://doi.org/10.48550/arXiv.2401.02954.
  • DeepSeek-AI (2024c)  DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024c. URL https://doi.org/10.48550/arXiv.2405.04434.  
  • Dettmers et al. (2022)  T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.  
  • Ding et al. (2024)  H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. Fewer truncations improve language modeling. arXiv preprint arXiv:2404.10830, 2024.  
  • Dua et al. (2019) D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2368–2378. Association for Computational Linguistics, 2019. 10.18653/V1/N19-1246. URL https://doi.org/10.18653/v1/n19-1246.
  • Dubois et al. (2024)  Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.  
  • Fedus et al. (2021)  W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.  
  • Fishman et al. (2024)  M. Fishman, B. Chmiel, R. Banner, and D. Soudry. Scaling FP8 training to trillion-token llms. arXiv preprint arXiv:2409.12517, 2024.  
  • Frantar et al. (2022)  E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.  
  • Gao et al. (2020)  L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.  
  • Gema et al. (2024)